Robots.txt and why you should have one

Your site will get crawled one day. Probably not as fast as you or your business would like, but unless you have a high-ranking site where you can drop a link back to the site you want crawled, or you pay to have a link placed there, you are at the mercy of the search engines… or rather, of your own knowledge of search!
So what should you expect when spiders, robots, crawlers, and who knows what else come creeping around your site looking for yummy content and links to gobble up? Most of the time you want every gizmo, bot, and creepy crawler to ding and ping your site. But not always…
Huh…?

Why not, you ask?

Well, some bots will crawl and crawl, run duplicate crawls, and chew the hell out of your bandwidth. That is one reason. Another is to tell certain engines, scraper-type stuff, to go away. I have seen bots head out, collect a mess of material from forums and websites, and plop it right onto their own site as content! That’s some nerve, eh? Some of them do offer a way to block them out: they read the robots.txt file and tell you what code to add to keep them away. You should be prepared to allow and disallow whomever you want into your site.
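For example, a well-behaved bot can be shut out with a couple of lines like these (the user-agent name here is just a made-up placeholder; a real bot will publish the exact name it answers to):

# keep this one scraper out of everything
User-agent: SomeScraperBot
Disallow: /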

The general rule is:
IF YOU DO NOT SPECIFY WHAT IS AND ISN’T ALLOWED TO BE INDEXED, IT’S ALL FAIR GAME.

Here are some basics to learn about your robots.txt file so you are somewhat prepared when the time comes.

—–

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

Or create an empty “/robots.txt” file.
To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /private/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot (and exclude all the others)

User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

There is no “Allow” field in the original robots.txt standard, so the easiest way to do this is to put all the files to be disallowed into a separate directory, say “docs”, and leave the one file in the level above this directory:

User-agent: *
Disallow: /~my/docs/

Alternatively, you can explicitly disallow each of the pages you want kept out:

User-agent: *
Disallow: /~my/private.html
Disallow: /~my/diary.html
Disallow: /~my/addresses.html
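If you want to sanity-check your rules before the real crawlers show up, most languages can parse a robots.txt for you. Here is a quick sketch using Python’s standard-library urllib.robotparser, fed the “docs” example from above; the bot name MyTestBot and the file index.html are just made-up stand-ins:

import urllib.robotparser

# The same rules as the "exclude all files except one" example above
rules = [
    "User-agent: *",
    "Disallow: /~my/docs/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# The one file left above the docs directory is still crawlable...
print(rp.can_fetch("MyTestBot", "/~my/index.html"))         # True
# ...but anything inside /~my/docs/ is off limits
print(rp.can_fetch("MyTestBot", "/~my/docs/private.html"))  # False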

Hope that helps out!
