Email Address Validation in PHP

Of all the input to validate, email addresses seem to be among the trickiest. At first glance, you might try validating the address with a simple regular expression based on the usual requirements of an email provider: say, two or more characters, followed by an ‘@’ sign, followed by two or more characters, then a period and two or more characters. Two characters seems like a good lower limit because of addresses like ‘me@domain.com’ or ‘admin@site.co.uk’. But here’s where the problems start to crop up.

[a-z0-9_]{2,}\@[a-z0-9]{2,}\.[a-z0-9]{2,}

If we allow only alphanumeric characters, plus maybe an underscore, a domain like “.co.uk” fails outright, or the pattern only matches “user@site.co”. We could add an optional extra part to the TLD portion to allow domains like that, but then it turns out we’ve forgotten about users like “user.name@mail.com”, so maybe we should go back and expand the username portion as well. While we’re at it, we might as well incorporate the entire RFC spec, which results in a truly monstrous pattern. And while most mail providers (Gmail, Hotmail, etc.) won’t let you register an address containing a plus sign, plenty of users still rely on plus-addressing (firstname+lastname@mail.com) for various reasons. When your validation is too strict, chances are you’ve overlooked something.
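To make the problem concrete, here’s a quick check of the naive pattern above (wrapped in delimiters and anchored so the whole string has to match) against a few perfectly legitimate addresses:

<?php
// The naive pattern from above, anchored so the entire string must match.
$pattern = '/^[a-z0-9_]{2,}@[a-z0-9]{2,}\.[a-z0-9]{2,}$/';

var_dump(preg_match($pattern, 'user.name@mail.com'));          // int(0): dot in the local part
var_dump(preg_match($pattern, 'firstname+lastname@mail.com')); // int(0): plus sign
var_dump(preg_match($pattern, 'admin@site.co.uk'));            // int(0): multi-part TLD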

I won’t continue to drag on about the inadequacy of regular expressions for validating email addresses. If you’ve ever looked at a fully RFC-compliant regular expression (which is a bit of overkill, but illustrates the point nicely), you get the message. Time to move on.

So how can we make sure the user is entering a valid email address? The most practical way is to simply send them an email. Sanitize the input, then fire off a validation message with a “confirm account” link. No worries about regular expressions, no frustrated users with unusual addresses, and no fake emails. If you’re on a shared host that limits the number of emails you can send, you can strip the domain off the address and validate the domain before sending anything. This should stop addresses at nonexistent domains from getting through, but it will happily pass along “aksjdflljklasdflj@gmail.com”. You can do this with the following bit of code:

// True only if the syntax passes PHP's built-in filter AND the domain after the "@" has an MX record
$is_valid = filter_var($email, FILTER_VALIDATE_EMAIL) ? checkdnsrr(substr(strrchr($email, "@"), 1), "MX") : false;
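For the confirmation-link approach itself, a minimal sketch might look something like the following (assuming PHP 7+ for random_bytes(), an existing PDO connection in $pdo, and a hypothetical pending_users table and confirm.php endpoint; adapt the names to your own schema and mailer):

<?php
// Reject anything that isn't even syntactically valid.
$email = filter_var($_POST['email'] ?? '', FILTER_VALIDATE_EMAIL);
if ($email === false) {
    exit('Please enter a valid email address.');
}

// Generate an unguessable, single-use token for the confirmation link.
$token = bin2hex(random_bytes(32));

// Store the unconfirmed address alongside its token.
$stmt = $pdo->prepare('INSERT INTO pending_users (email, confirm_token) VALUES (?, ?)');
$stmt->execute([$email, $token]);

// Fire off the "confirm account" email.
$link = 'https://example.com/confirm.php?token=' . $token;
mail($email, 'Confirm your account', "Click this link to confirm your account:\n" . $link);

Until the user clicks the link and confirm.php matches the token, the address stays unverified.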


A Backwards Robots.txt File

When a web crawler such as Googlebot creeps around the web, it starts sucking up information and reporting it back to the search engine. In an effort to keep bots out of certain parts of a website (for whatever reason), a guy by the name of Martijn Koster came up with an idea:

Put a file in the root directory of the site that tells robots what not to look at!

From there, the Robots Exclusion Standard was born. Basically, you create a text file named robots.txt in your root directory (example.com/robots.txt), and it tells crawlers which parts of your website to stay away from. You can read about it in more detail with a quick Google search.

What’s the problem?

A sample robots.txt file might look something like this:

User-agent: *
Disallow: /images/
Disallow: /cgi-bin/

In this instance, the file is telling all bots (via the * wildcard character) that they’re not allowed to look in the /images/ or /cgi-bin/ folders. This is reasonable enough, and most legitimate web crawlers follow the robots.txt file. However, anyone can plainly view the file in a browser (see: http://www.facebook.com/robots.txt), and it does nothing to prevent malicious or poorly-coded bots from ignoring your wishes. The robots.txt file is essentially a sign that reads “I have data in these folders that I don’t want anyone to know about. Please don’t look there and please don’t tell anyone.”

[Do not throw stones at this sign.]

If I’m snooping around a website, one of the first things I look at is the robots.txt file. It’s usually a huge list of things that people don’t want you to look at – which, of course, makes me all the more interested in looking for them. Here’s an example:

User-agent: *
Disallow: /admin/
Disallow: /members/
Disallow: /webmail/
Disallow: /personaldata/

I hope you see the problem.

Originally, the robots.txt standard allowed only a Disallow directive, but most major search engines now also support an Allow directive, as well as some basic pattern matching.

I leveraged the Allow directive to write a “backwards” robots.txt:

User-agent: *
Disallow: /*
Allow: /$
Allow: /articles/
Allow: /files/
Allow: /txt/
Allow: /tor/
Allow: /tools/

Allow: /about
Allow: /anon-sopa
Allow: /cards
Allow: /computers
Allow: /crypto
Allow: /cryptographic-hashes
Allow: /documents
Allow: /ems-home
Allow: /ems-videos
Allow: /index
Allow: /links
Allow: /medicine
Allow: /misc
Allow: /software
Allow: /voynich
Allow: /zombies

To break this down line-by-line:

  • User-agent: * tells all bots that the rules which follow apply to them
  • Disallow: /* tells the bots not to crawl anything on the site
  • Allow: /$ makes use of Googlebot’s pattern matching and allows http://cmattoon.com/ itself to be crawled, since the URI ends in a slash. (The $ marks the end of the URI.) This takes precedence over the Disallow: /* directive on the line before it.
  • As you can see, the file goes on to grant permission for the public parts of the site, rather than announcing the parts I want to remain hidden; a few example URLs are worked through below.
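For example, assuming Googlebot-style longest-match rules (the exact behavior can vary between crawlers), a few hypothetical URLs would resolve like this:

http://cmattoon.com/             -> crawled (Allow: /$ matches the bare root URL)
http://cmattoon.com/articles/foo -> crawled (Allow: /articles/ is more specific than Disallow: /*)
http://cmattoon.com/admin/       -> not crawled (only Disallow: /* matches)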

The big question becomes whether to Disallow a directory (in my case, the entire site) and then grant explicit permission (General => Specific), or whether to Allow files before issuing a Disallow for the directory. I couldn’t find a solid answer on this, so I’m modeling mine on Google’s own robots.txt (I’ve heard they know a thing or two about search engines). Google follows the (logical) General => Specific pattern, which was my first intuition. Mark the calendar: I did something right on the first try!

As a warning, this could easily cause a conflict with any of the myriad crawlers out there. There is no uniform standard, and nobody (including you!) is required to adhere to the recommendations that do exist.

That being said, a quick test of my site with the new backwards robots.txt (run through an online robots.txt checker) showed that it works for the major search engines. I’m not very concerned about my search engine ranking, so I’d rather be a geek and play with the file than fret over my page rank. If page rank and SEO are important to you, this may not be the best way to go.

Finally, for people who are really worried about this, I recommend looking into the robots meta tag, or playing with things like the X-Robots-Tag HTTP header. There’s also an article on .htaccess and SEO that discusses the canonicalization of the HTTPS vs. HTTP versions of your site.
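As a rough sketch (assuming the page is served through PHP), keeping an individual page out of the index with the X-Robots-Tag header is a one-liner, sent before any output:

<?php
// Ask compliant crawlers not to index this page or follow its links.
// Like any header() call, this must run before any output is sent.
header('X-Robots-Tag: noindex, nofollow');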
