E-mails, E-mails Everywhere and Not a One to Spam

It was brought to my attention the other day that there are some concerns about e-mail addresses published on our college’s web site and the effect they have on spam.  It turns out the filters here run through about 10,000,000 emails a day, about 7% of which are passed on as being actual, legitimate messages.  We are not a huge campus, but I’m going to guess that many of you would see a similar ratio.  Naturally, this has brought up the conversation of obfuscating e-mail addresses.  We’ll set aside the “closing the gate after the horse got out” metaphor for now, because these techniques can always help prevent spam from hitting new addresses, so at least that way we can lighten the load for our new users.

Of course, the ultimate e-mail obfuscation problem is how to do it accessibly.  By its very nature, if you are making information accessible to those without JavaScript, or with screen readers, etc., then you are publishing data in a fashion that can be picked up by spammers.  There are plenty of methods that work great, and I’d be happy to use them on my personal blog or such, but they simply aren’t feasible for a college trying to be 508 compliant (or otherwise, depending on whether your state has its own guidelines as well).  If you have done any research on this topic, you have undoubtedly come across A List Apart’s article on this subject.  In a lot of ways, the conversation can start and end there, because they’ve broken the issue down to the atomic level, and reconstructed it as gold.  But there are some other methods, and some other considerations I want to point out, especially because I’m in a non-PHP environment now, so I had to go another route to find a solution.

CSS (Code Direction)

This technique came to me by way of Silvan Mühlemann’s blog.  Of all the methods, I think this is both the easiest and the coolest, and it works in Firefox and IE6.  The problem is, it’s also the worst.  It relies on the idea that you can write a string backwards and use a CSS property to reverse the flow of the text inside the selector so that it reads normally.  So, when you write the address, you write moc.elpmaxe@liame, and CSS reverses it to display email@example.com.  This is bad for two reasons.  First, you can’t make it clickable.  The CSS only works on content within the selector, so you can’t manipulate an href, and obviously putting the email in as a plain href is as bad as having it in the page as plain text in the first place.  Second, it breaks copy and paste, because copying the text copies from the source, which is backwards, so pasting it produces the original moc.elpmaxe@liame.  If you make the address not clickable, you darn sure better not break copying.  The frustrating part is that Mühlemann’s blog reported a 0% spam rate over a year and a half on an address encoded in this manner, so it appears to be great at stopping spam.
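Here is a rough sketch of how the reversed markup could be generated.  I’m assuming the CSS in question is the unicode-bidi and direction pair, and the function names are my own invention for illustration.

```javascript
// Reverse an address so it can be stored backwards in the page source.
function reverseText(text) {
  return Array.from(text).reverse().join("");
}

// Hypothetical helper: wraps the reversed address in a span whose inline
// CSS (assumed: unicode-bidi + direction) flips it back for the reader.
function reversedAddressSpan(address) {
  const style = "unicode-bidi: bidi-override; direction: rtl;";
  return '<span style="' + style + '">' + reverseText(address) + "</span>";
}

console.log(reversedAddressSpan("email@example.com"));
// <span style="unicode-bidi: bidi-override; direction: rtl;">moc.elpmaxe@liame</span>
```

The source still contains only the backwards string, which is exactly why copy and paste hands the visitor moc.elpmaxe@liame.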

CSS (display:none)

This faces pretty much all the same problems as the other CSS technique, but instead relies on using a span inside an email address to hide a part and make it human readable: email@<span style="display:none;">I hate spam</span>example.com.  A user can read the address without issue, but still can’t copy it, and you still can’t make it a link.
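As a small illustration, the markup could be assembled like this.  The helper names are hypothetical, and the tag stripper is only my guess at how a naive crawler might read the page.

```javascript
// Hypothetical helper: hide a decoy phrase inside the address.
function hiddenSpanAddress(user, domain, decoy) {
  return user + "@" + '<span style="display:none;">' + decoy + "</span>" + domain;
}

// Assumption: a naive crawler strips tags and keeps all of the text,
// including the text a browser would never display.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

const markup = hiddenSpanAddress("email", "example.com", "I hate spam");
console.log(markup);
// email@<span style="display:none;">I hate spam</span>example.com
console.log(stripTags(markup));
// email@I hate spamexample.com
```

Copying from the rendered page can pick up the hidden text in the same way, which is the copy problem described above.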

Character Entity Encoding

This is the practice of taking an email address and encoding all the characters into HTML entity values, so email@example.com becomes &#101;&#109;&#097;&#105;&#108;&#064;&#101;&#120;&#097;&#109;&#112;&#108;&#101;&#046;&#099;&#111;&#109;.  This is better than having an email in plain text (it cut spam volume by 62% compared to plain text), and it allows you to make the address clickable.  However, it’s straightforward enough that it comes in second behind plain text as the easiest to get past.
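A minimal sketch of the encoding step, with function names of my own choosing.  It pads each code to three digits only to match the sample above; browsers decode unpadded entities the same way.

```javascript
// Turn every character into a decimal HTML entity, e.g. "a" -> "&#097;".
function encodeAsEntities(text) {
  return Array.from(text)
    .map((ch) => "&#" + String(ch.codePointAt(0)).padStart(3, "0") + ";")
    .join("");
}

// Hypothetical helper: a clickable link with both the href and the
// visible text entity encoded.
function entityMailtoLink(address) {
  const encoded = encodeAsEntities(address);
  return '<a href="mailto:' + encoded + '">' + encoded + "</a>";
}

console.log(encodeAsEntities("email@example.com"));
// &#101;&#109;&#097;&#105;&#108;&#064;&#101;&#120;&#097;&#109;&#112;&#108;&#101;&#046;&#099;&#111;&#109;
```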

A similar alternative that appears to reduce spam load by 92% over plain text is to encode only the “@” and “.” as entities, producing a mailto like email&#64;example&#46;com.  This is probably because the crawlers are set to ignore single occurrences of encoded entities, and with them there, the email doesn’t match an email pattern (at least until they get smart enough to match this pattern).
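The selective version is even shorter.  Again, the function name is just something I made up for the example.

```javascript
// Encode only the "@" and "." characters, leaving the letters readable.
function encodeSeparators(address) {
  return address.replace(/@/g, "&#64;").replace(/\./g, "&#46;");
}

console.log(encodeSeparators("email@example.com"));
// email&#64;example&#46;com
```

The result can go straight into a mailto href and into the link text, since the browser decodes both.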

Both of these methods can be considered viable for accessibility purposes, and they make a big enough impact that one could seriously consider employing them full time.

Insert Comments

Inserting comments results in addresses like email<!-- -->@<!-- @ -->example<!-- -->.<!-- . -->com.  This, however, fails the test of making the address clickable.  It is more effective than fully character encoding the address, but less so than selectively encoding the “@” and “.”, receiving about 444% more spam than that method.  Comments decrease spam by 11% over full-on entity encoding of the address.
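If you wanted to generate that pattern rather than type it by hand, something like this would do it.  This is only a sketch, and the function name is mine.

```javascript
// Surround each "@" and "." with HTML comments, matching the pattern
// above: an empty comment before the separator, a labeled one after it.
function insertComments(address) {
  return address.replace(
    /[@.]/g,
    (sep) => "<!-- -->" + sep + "<!-- " + sep + " -->"
  );
}

console.log(insertComments("email@example.com"));
// email<!-- -->@<!-- @ -->example<!-- -->.<!-- . -->com
```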

Build with JavaScript

The process of using JavaScript to concatenate the components of an email string is almost foolproof in its ability to trick spiders.  This relies on setting a couple of variables and combining them all in a fashion similar to document.write("<a href=" + "mail" + "to:" + string1 + string2 + string3 + ">" + string4 + "</a>");.  But naturally this is a problem for those not using JavaScript.  They would simply get no output where this is used.  In other words, it doesn’t degrade gracefully.
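Fleshed out a little, the idea might look like the following.  The function and its parameters are hypothetical, and the document.write call only runs when there is a browser document to write to.

```javascript
// Assemble the link from fragments so the full address never appears
// as one string in the page source.
function buildMailtoLink(user, domain, tld, label) {
  const address = user + "@" + domain + "." + tld;
  return '<a href="' + "mail" + "to:" + address + '">' + (label || address) + "</a>";
}

// In a browser, write the link where the script tag sits. Visitors
// without JavaScript get nothing here, which is the drawback.
if (typeof document !== "undefined") {
  document.write(buildMailtoLink("email", "example", "com"));
}

console.log(buildMailtoLink("email", "example", "com"));
// <a href="mailto:email@example.com">email@example.com</a>
```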

Use “name AT site DOT com”

If you look around on blogs and forums, there is a growing trend to type out an email in the fashion of “username AT website DOT com,” or some variation thereof.  First, this doesn’t address clickability, and second, it’s not really a trick.  All spammers have to do is Google a phrase, like “AT gmail DOT com” (I got 10.3 million hits) and start saving matches.  Oddly enough though, this appears to produce less spam than building with JavaScript, but the click problem combined with almost inevitable circumvention makes this pretty useless to us.  And personally, I’m not a fan of making a visitor do extra work to fix a deliberately tweaked address if it is at all avoidable.
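To show how thin this protection is, here is about all it takes to undo the uppercase “AT” and “DOT” variant.  The function name is made up, and other variations would need their own patterns.

```javascript
// Convert "name AT site DOT com" back into a normal address.
function normalizeSpelledOut(text) {
  return text.replace(/\s+AT\s+/g, "@").replace(/\s+DOT\s+/g, ".");
}

console.log(normalizeSpelledOut("username AT website DOT com"));
// username@website.com
```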

ROT-13 Encoding

ROT-13 is a basic substitution cipher that rotates each letter 13 places through the alphabet.  Because 13 is half of 26, applying it a second time restores the original text, so the same operation both encodes and decodes.  Using this to process email addresses appears to be one of the foolproof means of avoiding spam crawlers (along with the CSS techniques).  Here’s a basic tool that you can test the technique on.  PHP readily includes the str_rot13() function that can be used for the encoding.  But one last time, the decoding happens in the browser, so you’re limited to people using JavaScript.
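For the browser side, a ROT-13 function is only a few lines.  This is a sketch with names of my own, not the code behind any particular tool.

```javascript
// Rotate each ASCII letter 13 places; everything else passes through.
// Running it twice returns the original text.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97; // char code of "A" or "a"
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

console.log(rot13("email@example.com"));
// rznvy@rknzcyr.pbz
console.log(rot13("rznvy@rknzcyr.pbz"));
// email@example.com
```

The page would carry only the encoded string, and a script would call rot13() on it before building the mailto link.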

A List Apart Method

Rather than explain this, go read their tutorial.  It’s very clever, and is probably the best alternative out there, but only if you are using PHP and can write some custom .htaccess URI rewrite rules.

So, given this boatload of information, where does it leave us? I think many of us in the educational circles can use A List Apart’s system for any of our emails that show up in dynamically generated listings.  Email addresses added to a page by an editor or such would have to be handled manually though (you can get around this with some additional work using Apache’s mod_substitute).  My solution is a combination of techniques.  Our CMS is Java based, so A List Apart’s methodology doesn’t exactly work.  But what I can do is combine ROT-13 encoding with a <noscript> alternative that incorporates an image generator and a character encoded link to make it clickable.  This would create an image representation of the address with proper alt text, so that screen readers can still interpret the address and users can still click it.  I think this is a good blend in my case.  There is a URIRewrite application on our server as well that would allow me to do some of the A List Apart system in the future.  The point is that you don’t have to use only one solution.  You can combine different options to try to get the best of every world.  But there is no magic bullet if you are trying not to break accessibility.
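To make the combination concrete, here is one way the pieces might fit together.  Treat it as a sketch under assumptions: the /email-image path, the rot13-mail class, the data-rot13 attribute, and every function name are placeholders I invented, not parts of our CMS.

```javascript
// ROT-13 is its own inverse, so one function encodes and decodes.
function rot13Text(text) {
  return text.replace(/[a-zA-Z]/g, (ch) => {
    const base = ch <= "Z" ? 65 : 97;
    return String.fromCharCode(((ch.charCodeAt(0) - base + 13) % 26) + base);
  });
}

// Decimal character entities for every character.
function toEntities(text) {
  return Array.from(text)
    .map((ch) => "&#" + ch.codePointAt(0) + ";")
    .join("");
}

// Hypothetical markup builder: an empty placeholder link for script users,
// plus a <noscript> image link with entity encoded href and alt text.
function obfuscatedEmailMarkup(address, imageId) {
  const entities = toEntities(address);
  return (
    '<a class="rot13-mail" data-rot13="' + rot13Text(address) + '"></a>' +
    "<noscript>" +
    '<a href="mailto:' + entities + '">' +
    '<img src="/email-image?id=' + encodeURIComponent(imageId) +
    '" alt="' + entities + '">' +
    "</a></noscript>"
  );
}

// Browser-side step: decode each placeholder once scripts are running.
function decodeRot13Links(root) {
  root.querySelectorAll("a.rot13-mail").forEach((link) => {
    const address = rot13Text(link.getAttribute("data-rot13"));
    link.href = "mailto:" + address;
    link.textContent = address;
  });
}

if (typeof document !== "undefined") {
  decodeRot13Links(document);
}
```

The readable address never appears in the page source.  Script users get a decoded link, and everyone else gets the clickable image with entity encoded alt text.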

For many of us, the horse may already be out of the gates, so closing the gate now might not do much.  But we can at least try to ease the load on newly published addresses, and make the spammers’ job harder (and make email admins less likely to gripe at you).  There’s no good excuse for handing over emails as plain text when we have tools to easily avoid it.  And ultimately, if a human can read it, it’s inevitable that spammers will crack through it.  For the time being, though, that process isn’t cost effective for them, so we might as well take advantage of it.