When a web crawler such as Googlebot creeps around the web, it starts sucking up information and reporting it back to the search engine. In an effort to keep bots out of certain parts of a website (for whatever reason), a guy by the name of Martijn Koster came up with an idea:
Put a file in the root directory of the site that tells robots what not to look at!
From there, the Robots Exclusion Standard was born. Basically, you create a text file named robots.txt in your root directory (example.com/robots.txt), and it tells crawlers which parts of your website to stay away from. You can read about it in more detail here or by performing a Google search.
What’s the problem?
A sample robots.txt file might look something like this:
```
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
```
In this instance, the file is telling all bots (via the * wildcard character) that they're not allowed to look in the /images/ or /cgi-bin/ folders. This is reasonable enough, and most legitimate web crawlers honor the robots.txt file. However, anyone can plainly view the file in a browser (see http://www.facebook.com/robots.txt), and nothing prevents malicious or poorly coded bots from ignoring your wishes. The robots.txt file is essentially a sign that reads “I have data in these folders that I don’t want anyone to know about. Please don’t look there and please don’t tell anyone.”
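To see how advisory the file really is, here's a short sketch using Python's standard-library `urllib.robotparser`: a well-behaved client checks the rules before fetching, but there's no enforcement stopping a rude one from skipping the check entirely. (The example.com URLs are placeholders.)

```python
import urllib.robotparser

# Parse the sample rules above the way a compliant crawler would.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
""".splitlines())

print(rp.can_fetch("*", "http://example.com/images/photo.jpg"))  # False: disallowed
print(rp.can_fetch("*", "http://example.com/index.html"))        # True: allowed

# Nothing enforces this check -- a malicious bot can simply request
# /images/photo.jpg without ever reading robots.txt.
```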
If I’m snooping around a website, one of the first things I look at is the robots.txt file. It’s usually a huge list of things that people don’t want you to look at – which, of course, makes me all the more interested in looking for them. Here’s an example:
```
User-agent: *
Disallow: /admin/
Disallow: /members/
Disallow: /webmail/
Disallow: /personaldata/
```
I hope you see the problem.
Originally, the robots.txt standard only allowed a Disallow directive, but many search engines now also support an Allow directive, along with some basic pattern matching.
I leveraged the Allow directive to write a “backwards” robots.txt:
```
User-agent: *
Disallow: /*
Allow: /$
Allow: /articles/
Allow: /files/
Allow: /txt/
Allow: /tor/
Allow: /tools/
Allow: /about
Allow: /anon-sopa
Allow: /cards
Allow: /computers
Allow: /crypto
Allow: /cryptographic-hashes
Allow: /documents
Allow: /ems-home
Allow: /ems-videos
Allow: /index
Allow: /links
Allow: /medicine
Allow: /misc
Allow: /software
Allow: /voynich
Allow: /zombies
```
To break this down line-by-line:
- User-agent: * tells all bots that the rules which follow apply to them
- Disallow: /* tells the bot not to crawl any part of the site
- Allow: /$ makes use of Googlebot’s pattern matching, and allows http://cmattoon.com/ to be crawled, as the URI ends in a slash. (The $ marks the end of the URI.) This overrides the Disallow: /* directive on the line before it.
- As you can see, the file goes on to grant permission for the public parts of the site, rather than announcing the parts I want to remain hidden.
The big question becomes whether to Disallow a directory (in my case, the entire site) and then grant explicit permission (General => Specific), or to Allow files before issuing a Disallow for the directory. I can’t find a solid answer on this, so I’m modeling mine on Google’s own robots.txt (I’ve heard they know a thing or two about search engines). Google follows the (logical) General => Specific pattern, which was my first intuition. Mark the calendar: I did something right on the first try!
As a warning, this could easily cause a conflict with any of the myriad crawlers out there. There is no uniform standard, and nobody (including you!) is required to adhere to the recommendations that do exist.
That being said, a quick test of my site with the new backwards robots.txt (conducted using this tool) showed that it works for the major search engines. I’m not very concerned about my search engine ranking, so I’d rather be a geek and play with the file than fret over my page rank. If page rank and SEO are important to you, this may not be the best way to go.
Finally, for people who are really worried about this, I recommend looking into metadata (such as the robots meta tag), or playing with the X-Robots-Tag HTTP header. There’s also an article on .htaccess and SEO that discusses the canonicalization of the HTTPS vs. HTTP versions of your site.
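The appeal of the X-Robots-Tag header is that the "don't index this" instruction travels with the response itself, so the URL never has to appear in a public robots.txt. As a minimal sketch, here's Python's built-in http.server sending the header — the handler name and page content are made up for the demo:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # The header tells compliant crawlers not to index this response,
        # without the URL ever being advertised in a public robots.txt.
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Not advertised anywhere.</body></html>")

    def log_message(self, *args):
        pass  # keep the demo quiet

# Serve on an ephemeral port and fetch the page once to show the header.
server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
with urllib.request.urlopen("http://127.0.0.1:%d/" % server.server_address[1]) as resp:
    tag = resp.headers["X-Robots-Tag"]
server.shutdown()
print(tag)  # noindex, nofollow
```

The same header can be set in Apache or Nginx config for static files, which is usually the practical route.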