Author |
Topic |
MarcelG
Retired Support Moderator
Netherlands
2625 Posts |
Posted - 13 June 2005 : 08:45:26
|
Hi there, I've been having some problems with GoogleBot for the past three days... They've pulled over 14,000 pageviews totalling 1.11 gigabytes since last Saturday, and have been ignoring my robots.txt file (indexing my post.asp file for about 836 megs).... (See here for a more detailed description) I was just curious whether some of you have also experienced this behaviour over the last couple of days.... |
portfolio - linkshrinker - oxle - twitter |
Edited by - MarcelG on 13 June 2005 08:46:44 |
|
Podge
Support Moderator
Ireland
3775 Posts |
|
pdrg
Support Moderator
United Kingdom
2897 Posts |
Posted - 13 June 2005 : 09:11:02
|
Try challenging Google - they may deny it, and it may be an impostor as Podge suggests, but an impostor to what end is a mystery :(
Can you firewall the rogue IP block out? |
|
|
wii
Free ASP Hosts Moderator
Denmark
2632 Posts |
Posted - 13 June 2005 : 09:49:53
|
Yeah, several years ago I had problems with this - I contacted Google and they removed my forum from the spiders, which is fine in my case, since it's a private forum anyway. |
|
|
pdrg
Support Moderator
United Kingdom
2897 Posts |
Posted - 13 June 2005 : 10:09:25
|
haha
Yep, algorithms change frequently - they're the only way a search company keeps any edge over another. They really ought to respect robots.txt and 'play nice' though, so it's worth challenging Google anyway, in case they're inadvertently 'playing nasty'. |
|
|
pdrg
Support Moderator
United Kingdom
2897 Posts |
Posted - 13 June 2005 : 10:37:23
|
agreed :) |
|
|
MarcelG
Retired Support Moderator
Netherlands
2625 Posts |
Posted - 13 June 2005 : 10:47:41
|
At this moment I'm seeing 66.249.66.12 crawling the site. That's 100% Google.
Tonight I'll have full access to my logfiles, so I'll dive in then and see what's been pulling so much lately. I'll let you know what I come up with, and what response I get from Google. |
portfolio - linkshrinker - oxle - twitter |
Edited by - MarcelG on 13 June 2005 10:54:54 |
|
|
MarcelG
Retired Support Moderator
Netherlands
2625 Posts |
Posted - 13 June 2005 : 14:48:25
|
No, it's mainly post.asp that's been indexed so many times by them.... Diving into the logfiles at this moment.
[edit]Dived into the logfiles... It's Google all right: stats with IP. Grr... Now, how can I 'mod' my inc_header to redirect all incoming GoogleBot traffic to post.asp to an empty file? (And prevent image parsing?) |
portfolio - linkshrinker - oxle - twitter |
Edited by - MarcelG on 13 June 2005 15:13:35 |
|
|
MarcelG
Retired Support Moderator
Netherlands
2625 Posts |
Posted - 13 June 2005 : 15:51:54
|
I think I've done it. I've implemented this neat piece of code in config.asp:

'Spider check
' Normally the browser isn't a spider...
dim isSpider, agent
isSpider = 0
' Force the isSpider behaviour for testing purposes
' (note: Request returns a string, so compare against "1")
if request("spider") = "1" then isSpider = 1
' Take the current UserAgent and lower-case it for comparison
agent = lcase(Request.ServerVariables("HTTP_USER_AGENT"))
' Most bots refer to themselves as libwww, java, perl, crawl or bot
if instr(agent, "bot") > 0 then isSpider = 1
if instr(agent, "perl") > 0 then isSpider = 1
if instr(agent, "java") > 0 then isSpider = 1
if instr(agent, "libw") > 0 then isSpider = 1
if instr(agent, "crawl") > 0 then isSpider = 1
'End spider check

Now I've got the isSpider value on every page. So I edited inc_func_common.asp; in the function FormatStr(fString) I added this part:

if strIMGInPosts = "1" and isSpider = 0 then
    fString = ReplaceImageTags(fString)
end if

So, no more image parsing for the GoogleBots. And everywhere I didn't want GoogleBot to go, I inserted this piece of code in the header, redirecting to an empty file. For example in post.asp, right after the include of config.asp:

if isSpider = 1 then
    server.transfer("empty.asp")
end if

And I changed the linkshrinker to stop redirecting GoogleBot requests (edited my 404.asp). Now we'll just have to wait and see. |
portfolio - linkshrinker - oxle - twitter |
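The per-substring checks above could also be collapsed into a loop over a keyword list, so that adding a new bot signature only takes one extra array entry. A sketch in classic ASP/VBScript; the extra keywords "spider" and "slurp" are illustrative additions, not part of the original check:

```vbscript
<%
' Sketch: same spider detection as above, driven by a keyword list.
dim isSpider, agent, botWords, i
isSpider = 0
agent = lcase(Request.ServerVariables("HTTP_USER_AGENT"))
' Substrings commonly found in crawler user-agent strings
botWords = Array("bot", "perl", "java", "libw", "crawl", "spider", "slurp")
for i = 0 to ubound(botWords)
    if instr(agent, botWords(i)) > 0 then isSpider = 1
next
%>
```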
|
|
MarcelG
Retired Support Moderator
Netherlands
2625 Posts |
Posted - 14 June 2005 : 03:39:46
|
quote: Originally posted by Podge
Nicely done. I tested it with Firefox's user agent switcher and it works flawlessly.
Nice plugin! I used it in Mozilla to test this too. I'm not sure if Google accepts this 'cloaking', but at least they won't be slurping my bandwidth now. If anyone thinks my spider-detection script needs improvement, please feel free to suggest changes.
quote: Originally posted by Podge
One thing (off topic) I thought I would bring to your attention: the Flash Oxle logo on the top left of your forums behaves like a link but doesn't work when clicked.
Thanks for the tip; I'm looking into it later today. |
portfolio - linkshrinker - oxle - twitter |
|
|
pdrg
Support Moderator
United Kingdom
2897 Posts |
Posted - 14 June 2005 : 04:39:25
|
Splendid stuff - it'll be interesting to see what Google says about why they didn't respect your robots.txt!
Out of interest, how is your robots.txt configured? |
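For comparison, a robots.txt that asks well-behaved crawlers to skip a forum's action scripts would typically look something like this (the paths are illustrative assumptions based on a default Snitz install, not MarcelG's actual file; Googlebot falls under the `User-agent: *` group unless a Googlebot-specific group exists):

```
User-agent: *
Disallow: /forum/post.asp
Disallow: /forum/search.asp
Disallow: /forum/active.asp
```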
|
|