Your site will get crawled one day. Probably not as fast as you or your business would like, but unless you have a high-ranking site to put a link back to the site you want crawled, or you pay to have a link put there, you are at the mercy of the search engines… or rather, of your knowledge of search!
So what should you expect when spiders, robots, crawlers, and the unknown come creeping around your site looking for yummy content and links to gobble up? Most often you want every gizmo, whamming bot, and creepy crawler to ding and ping your site. But NOT always…
Huh….?
Why not, you ask?
Well, some bots will crawl and crawl, duplicate their crawls, and chew the hell out of your bandwidth. That is one reason. Another may be to tell certain engines, scraper-type stuff, to go away. I have seen bots head out, collect a mess of stuff from forums and websites, and plop it right onto their own site as content! That’s some nerve, eh! Some of them do offer ways to block them out: they reference the robots.txt file and tell you what code to add to keep them away. You should be prepared to allow and disallow whomever you want into your site.
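For example, if a scraper identifies itself as, say, ExampleScraperBot (a made-up name just for illustration), a couple of lines will shut it out. And if bandwidth is the worry, some crawlers, Bing’s and Yahoo’s among them, honor a non-standard Crawl-delay directive, though Google ignores it:
User-agent: ExampleScraperBot
Disallow: /

User-agent: bingbot
Crawl-delay: 10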
The general rule is:
IF YOU DO NOT SPECIFY WHAT IS AND ISN’T ALLOWED TO BE INDEXED… IT’S ALL FAIR GAME.
Here are some basics to learn about your robots.txt file so you are somewhat prepared when the time comes.
—–
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
Or create an empty “/robots.txt” file.
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot (and exclude all others)
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
To exclude all files except one
There is no “Allow” field in the original robots.txt standard, so the easiest way to do this is to put all the files to be disallowed into a separate directory, say “docs”, and leave the one file in the level above this directory:
User-agent: *
Disallow: /~my/docs/
Alternatively, you can explicitly disallow each page you want kept out:
User-agent: *
Disallow: /~my/private.html
Disallow: /~my/diary.html
Disallow: /~my/addresses.html
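Worth noting: many of the major crawlers, Googlebot and Bingbot included, do honor an Allow directive these days, even though it is not part of the original standard. Treat it as a convenience rather than something every bot will respect. A sketch like this (file names made up for illustration) keeps one page open and shuts the rest of the directory:
User-agent: *
Allow: /~my/index.html
Disallow: /~my/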
Hope that helps out!