I'm using a webcrawler to index my site for Google. The forum has the "Event Calendar" mod which spans about a hundred years - and each day is seen as a link by the crawler.
How do I keep the crawler from looking in the forum?
Are you sure the webcrawler obeys robots.txt files? And are you sure that "webcrawler" is the user-agent?
Yes... there was a big improvement once I cleared out the webcrawler's memory, erased all the files it had generated, erased the project, restarted the application and set up a new project.
It seems the crawler would not "forget" URLs it had already visited, so it kept starting down the same path again.
Now there are only 8,900 URLs on my site being counted - so yes, it appears that the webcrawler is obeying the robots.txt. But that's still way too many URLs.
The "Event Calendar MOD" will absolutely add several thousand URLs to a site map - all pointing to empty daily dates for the next 100 years! Thus the need for an effective robots.txt.
By the way, the webcrawler used was recommended by Google for creating a site index: SOFTplus GSiteCrawler
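For anyone wanting to check a robots.txt before letting a crawler loose, here is a rough sketch using Python's standard urllib.robotparser. The /forum/ path, the calendar.php script name, the example.com host and the "GSiteCrawler" user-agent string are all assumptions made for illustration; substitute whatever your own site and crawler actually use. The rule is written for "User-agent: *" so it does not depend on guessing the crawler's exact user-agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the /forum/ path is an assumption, not taken
# from the actual site discussed in this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs: one ordinary page, one calendar day inside the forum.
    urls = (
        "http://www.example.com/index.html",
        "http://www.example.com/forum/calendar.php?day=1",
    )
    for url in urls:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "GSiteCrawler", url) else "blocked"
        print(verdict, url)
```

This only shows what a crawler that honours robots.txt should do with those rules; it cannot tell you whether a particular crawler really obeys them, and it will not clear out URLs a crawler has already stored in an existing project.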