E-mails, E-mails Everywhere and Not a One to Spam

It was brought to my attention the other day that there are some concerns about e-mail addresses published on our college’s web site and the effect they have on spam.  It turns out the filters here run through about 10,000,000 emails a day, about 7% of which are passed on as actual, legitimate messages.  We are not a huge campus, but I’m going to guess that many of you would see a similar ratio.  Naturally, this has brought up conversation about obfuscating e-mail addresses.  We’ll set aside the “closing the gate after the horse got out” metaphor for now, because these techniques can still keep spam from hitting new addresses, and that way we can at least lighten the load for our new users.

Of course, the ultimate e-mail obfuscation problem is how to do it accessibly.  By its very nature, if you are making information accessible to those without JavaScript, or with screen readers, etc., then you are publishing data in a fashion that can be picked up by spammers.  There are plenty of methods that work great, and I’d be happy to use them on my personal blog or such, but they simply aren’t feasible for a college trying to be 508 compliant (or otherwise, depending on whether your state has its own guidelines as well).  If you have done any research on this topic, you have undoubtedly come across A List Apart’s article on this subject.  In a lot of ways, the conversation can start and end there, because they’ve broken the issue down to the atomic level and reconstructed it as gold.  But there are some other methods, and some other considerations I want to point out, especially because I’m in a non-PHP environment now, so I had to go another route to find a solution.

CSS (Code Direction)

This technique came to me by way of Silvan Mühlemann’s blog.  Of any method, I think this is both the easiest and the coolest, and it works in Firefox and IE6.  The problem is, it’s also the worst.  It relies on the idea that you can take a string and, with a CSS rule, reverse the flow of the text inside the selector to make it readable.  So, when you write the address, you type moc.elpmaxe@liame, and the CSS reverses it to display as email@example.com.  The reason this is bad is twofold.  First, you can’t make it clickable.  The CSS only works on content within the selector, so you can’t manipulate an href, and obviously putting the email in as a plain href is as bad as having it in the page normally in the first place.  Secondly, it breaks copy and paste, because copying the text copies from the source, which is backwards, so pasting it gives you the original moc.elpmaxe@liame.  If you make the link not clickable, you darn sure better not break copying.  The frustrating part is that Mühlemann’s blog reported a 0% spam rate over a year and a half on an address encoded in this manner, so it appears to be great at stopping spam.
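For reference, a minimal sketch of the reversed-text idea looks something like this (the class name is my own, just for illustration):

    <style>
      /* Lay the characters out right-to-left, reversing the displayed order */
      .backwards {
        unicode-bidi: bidi-override;
        direction: rtl;
      }
    </style>

    <!-- Renders as email@example.com, but the source (and the clipboard) holds the reversed string -->
    <span class="backwards">moc.elpmaxe@liame</span>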

CSS (display:none)

This faces pretty much all the same problems as the other CSS technique, but instead relies on inserting some junk text into the address and hiding it with a span, so that it stays human readable: email@<span style="display:none;">I hate spam</span>example.com.  A user can read the address without issue, but still can’t copy it, and you still can’t make it a link.

Character Entity Encoding

This is the practice of taking an email address and encoding all the characters into HTML entity values, so email@example.com becomes &#101;&#109;&#097;&#105;&#108;&#064;&#101;&#120;&#097;&#109;&#112;&#108;&#101;&#046;&#099;&#111;&#109;.  This is better than having the email in plain text (effecting a 62% decrease in spam volume over plain text), and it allows you to make the address clickable.  However, it’s straightforward enough to decode that it comes in second behind plain text as the easiest to get past, though the decrease in spam volume was fairly significant.

A similar but alternative method, which appears to reduce spam load by 92% over plain text, is to mix in entities for just the “@” and “.”, producing a mailto like email&#64;example&#46;com.  This is probably because crawlers are set to ignore single occurrences of encoded entities, and with those two characters encoded, the address no longer matches an email pattern (at least until the crawlers get smart enough to match this pattern too).
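If you’d rather generate these encoded strings yourself than lean on an online converter, a quick throwaway script will do it.  This is just a sketch (the function name is mine); run it once, say in a browser console, and paste the output into your markup:

    <script>
      // Turn every character of an address into a zero-padded HTML entity.
      function encodeEntities(address) {
        var out = '';
        for (var i = 0; i < address.length; i++) {
          out += '&#' + ('00' + address.charCodeAt(i)).slice(-3) + ';';
        }
        return out;
      }
      // encodeEntities('email@example.com')
      //   -> "&#101;&#109;&#097;&#105;&#108;&#064;..."
    </script>

Drop the result into both the href and the link text; the browser decodes it for the visitor while the page source stays encoded.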

Both of these methods can be considered viable for accessibility purposes, and they make a big enough impact that one could seriously consider employing them full time.

Insert Comments

Inserting comments results in addresses like email<!-- -->@<!-- @ -->example<!-- -->.<!-- . -->com.  This, however, fails the clickability test.  It is more effective than fully character encoding the address, but less so than selectively encoding the “@” and “.”, receiving about 444% more spam than that method.  Comments decrease spam by 11% over full-on entity encoding of the address.

Build with JavaScript

The process of using JavaScript to concatenate the components of an email string is almost foolproof in its ability to trick spiders.  It relies on setting a couple of variables and combining them all in a fashion similar to document.write("<a href=" + "mail" + "to:" + string1 + string2 + string3 + ">" + string4 + "</a>");.  But naturally this is a problem for those not using JavaScript: they would simply get no output where this is used; in other words, it doesn’t degrade gracefully.
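Fleshed out a bit, a working version might look like the following sketch (the variable names are placeholders; split the pieces however you like):

    <script>
      // No complete address ever appears in the page source; it only exists
      // once the pieces are glued together at render time.
      var user = 'email';
      var domain = 'example.com';
      document.write('<a href="mai' + 'lto:' + user + '@' + domain + '">'
        + user + '@' + domain + '</a>');
    </script>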

Use “name AT site DOT com”

If you look around on blogs and forums, there is a growing trend to type out an email in the fashion of “username AT website DOT com,” or some variation thereof.  First, this doesn’t address clickability, and second, it’s not really a trick.  All spammers have to do is Google a phrase like “AT gmail DOT com” (I got 10.3 million hits) and start saving matches.  Oddly enough, this appears to produce less spam than building with JavaScript, but the click problem combined with almost inevitable circumvention makes this pretty useless to us.  And personally, I’m not a fan of making a visitor do extra work to fix a deliberately tweaked address if it is at all avoidable.

ROT-13 Encoding

ROT-13 is a basic substitution cipher that rotates each letter 13 places; because the alphabet has 26 letters, the same operation both encodes and decodes, which makes it very easy to work with.  Using it to process email addresses appears to be one of the foolproof means of avoiding spam crawlers (along with the CSS techniques).  Here’s a basic tool that you can test the technique on.  PHP readily includes the str_rot13() function that can be used for this.  But one last time, you’re limited to people using JavaScript.
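JavaScript has no built-in equivalent of str_rot13(), so you end up writing a small decoder yourself.  A minimal sketch, where the encoded string is just email@example.com rotated 13 places:

    <script>
      // Rotate every letter 13 places; everything else (@, ., digits) passes through untouched.
      function rot13(s) {
        return s.replace(/[a-zA-Z]/g, function (c) {
          var base = c <= 'Z' ? 65 : 97;
          return String.fromCharCode((c.charCodeAt(0) - base + 13) % 26 + base);
        });
      }
      // 'rznvy@rknzcyr.pbz' is 'email@example.com' run through ROT-13
      var address = rot13('rznvy@rknzcyr.pbz');
      document.write('<a href="mailto:' + address + '">' + address + '</a>');
    </script>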

A List Apart Method

Rather than explain this, go read their tutorial.  It’s very clever, and is probably the best alternative out there, but only if you are using PHP and can write some custom .htaccess URI rewrite rules.

So, given this boatload of information, where does it leave us? I think many of us in educational circles can use A List Apart’s system for any of our emails that show up in dynamically generated listings.  Email addresses added to a page by an editor would have to be handled manually, though (you can get around this with some additional work using Apache’s mod_substitute).  My solution is a combination of techniques.  Our CMS is Java based, so A List Apart’s methodology doesn’t exactly work.  But what I can do is combine ROT-13 encoding with a <noscript> alternative that incorporates an image generator and a character-encoded link to keep it clickable.  This creates an image representation of the address that is properly alt-tagged, so screen readers can still interpret the address and users can still click it.  I think this is a good blend in my case.  There is a URIRewrite application on our server as well that would allow me to do some of the A List Apart system in the future.  The point being, you don’t have to use only one solution; you can combine different options to try to get the best of every world.  But there is no magic bullet if you are trying not to break accessibility.
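To make that concrete, here is a rough sketch of the markup I have in mind.  The /email-image endpoint and its query parameter are hypothetical stand-ins for whatever image generator you use, and rot13() is the same sort of decoder sketched above:

    <script>
      // JavaScript users get a normal, clickable mailto built from the ROT-13'd string
      var address = rot13('rznvy@rknzcyr.pbz');
      document.write('<a href="mailto:' + address + '">' + address + '</a>');
    </script>
    <noscript>
      <!-- Everyone else gets a server-generated image of the address, alt-tagged
           for screen readers and wrapped in an entity-encoded mailto so it stays clickable -->
      <a href="mailto:email&#64;example&#46;com">
        <img src="/email-image?id=1234" alt="email at example dot com" />
      </a>
    </noscript>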

For many of us, the horse may already be out of the gate, so closing it now might not do much.  But we can at least try to ease the load on new addresses as they get published, and make the spammers’ job harder (and make email admins less likely to gripe at you).  There’s no good excuse for handing over emails as plain text when we have tools to easily avoid it.  And ultimately, if a human can read it, it’s inevitable that spammers will crack through it.  For the time being, though, that process isn’t cost-effective for them, so we might as well take advantage of it.

Is Hosted Search Really Ready for Prime Time?

In the years I’ve now spent in higher education, one universal truth I have found is that nothing quite moves a project along like someone much more important and much less web savvy than you deeming an issue worth addressing.  Such was the case only a couple months after I had started at the university, when the Director of Marketing noticed that new information she had put up on the site wasn’t coming up in search results, and the results that were hitting weren’t particularly relevant to the topic in the first place.  Thus, a mission was born: find a way to make our search better, and do it NOW.  That’s the other thing about people higher up than you: when they say jump, generally you jump.

At the time (approximately three years ago), we had been using the pretty straightforward Google search for web sites.  It amounted to putting a box on your page that submitted to Google, restricting results to your domain.  You couldn’t really do anything else with it then besides add a banner to the top.  So began the odyssey.  Most of the major players offered a basic site search back then, all of which were fairly equally crippled.  The Google Search Appliance was (and still is) crazy expensive and total overkill for our site.  The IBM/Yahoo product OmniFind was still a few months from launch (nor did we have hardware to run it on at the time).  The Thunderstone Parametric Search Appliance just looked a little… well, no one I know had ever heard of them, and their website wasn’t (and still isn’t) something that inspires my confidence.  The Google Mini, on the other hand, was cheap, more than adequate for our site size, and was getting good reviews.  Not to mention that the money to get it was ready, willing, and able.  All that made the choice pretty easy for us, so we dove in.

Now, fast forward a couple years.  We are still using our Mini.  In fact, I just upgraded to 5.0.4 on Monday.  I’ve never had a lick of trouble with it and became a pretty quick fan.  This year at eduWeb I had the good fortune to share my experience with a couple people, and the conversation generally drifted towards: “Why is that better than Google Site Search?”  Originally, the Mini offered a ton of unique features, such as custom collections, theming, and the ability to export results as XML.  The past year has seen a growth in the availability and features of free, hosted search solutions.  Yahoo BOSS looks to be an API that wants to take a serious swing at the hosted search crown.  Google’s Custom Search Business Edition (CSBE), AKA Google Site Search, is also offering businesses and schools search with many of the features of the Mini, like the ability to remove branding and ads and to pull results as XML (note: Google Site Search is free for universities).

With all these new options, is the Mini even a worthwhile investment now?  We’re coming up on the end of our support term, so I figured this was a prime time to evaluate the field.  My short answer is: Yes, it still is.  My long answer also happens to be yes.  See, search is important. Search is doubly important for universities because we have so much crap out there, and so many different topics to address (many of which also happen to be crap, but you can’t tell that to the people putting it out there).  A Mini now costs $3000 with two years of support; CSBE prices out at $500 a year for 50,000 pages, so that same money would buy six years of equivalent CSBE service (assuming you had to pay).  Obviously Google isn’t trying to mothball its own products, so where does the Mini make up that cost?

First, I think there’s huge value in crawling.  Remember our original problem?  Content was not making it into the search results fast enough.  With the Mini I can schedule crawls, or just set it on continuous mode and let it go nuts.  Using nightly scheduled crawls, I ensure that any content added to the web site shows up in search within 24 hours, and usually faster than that (unless some crazy person is up adding content to the site at 12:01 AM).  Going through Webmaster Tools, I can only tell Google to crawl our site at a Normal or Slower rate; we don’t even rate high enough to get the Faster crawl rate option.  So users of Site Search are pretty well cornered on the matter.  Once I crawl our site with the Mini, I can have it output a sitemap that I feed to Google’s spider to help with their indexing as well, so the benefit becomes twofold.

Next up, raise your hand if you have an intranet, or otherwise secured information not available to the public.  All of you can pretty well scratch CSBE/SiteSearch off your short list if you’re looking for a way to dig through it.  If you want to index any kind of protected content, you’ll have to go with an actual hardware solution, as both the Mini and GSA support mechanisms to crawl and serve content that is behind a security layer.  This is a great option if you buy a Mini, use up the initial two years of support, then buy a second one: use one for internet and the other for intranet.

You’re also going to find that you can pull more valuable metrics out of the Mini than what you get with CSBE/SiteSearch.  Granted, the standard “what are people searching for” question is easily enough answered.  But what about “what are people searching for that isn’t returning results?”  That can be equally valuable in a lot of cases.  And while Site Search allows for search numbers by month and day, the Mini can go down to the hour, as well as show you your current queries per minute.  It’ll even keep tabs on how many pages it’s currently crawling and how many errors it found, and email you about it all.  All the reports can be saved out as XML, naturally, so you can mix and match datasets as you need for custom reports.

And I have one word for you: OneBox.  The Mini has it, thanks to a trickle-down effect from the GSA; hosted Google options do not.  The OneBox essentially allows you to add in custom search results based on query syntax, and to tailor the styling of those results.  You see this all the time at Google, for instance when you type in a phone number or a FedEx tracking number.  As you can see, these results need not come from your Google Mini search index; they can come from other collections, or from other sources entirely.  In the screenshot to the right, you can see a mockup of a OneBox result that matches a name format and returns contact information along with the standard search results.  Uses for this are many, and can span anything you might store in databases, such as course listings, book ISBNs, names, weather (if you have campuses in different cities), room information, etc.  Anything that you can define some kind of search pattern for.

On a quasi-similar note, you can also link certain searches (or parts of searches) to KeyMatches.  These are akin to the ads on Google that appear at the very top of search results (usually highlighted light yellow with the “Sponsored Link” caption), but you can use them to highlight a link that goes right to the automotive department when someone searches for something containing the word “auto.”  This is another feature unique to the Mini and GSA, and one more way to make sure searchers are presented with relevant links.  It is very useful in cases where a department’s site isn’t well optimized and doesn’t show up first in a search for that department.

Ultimately, it’s a judgment call whether or not these features are worth the money to you.  At $3000, you’re basically paying $1000 for the box itself and $1000 for each year of support.  You can’t buy the unit without support, though, so that notwithstanding, you’re getting a full-featured search box with support for about twice the cost of a good PC.  If you have more than 50,000 pages to index, you’ll find that price goes up.  At the same time, if you do have over 50,000 pages, there are a lot of other reasons not to go hosted, such as control over results, index freshness, result relevance, etc.  All of these are always important, but they become even more so the bigger your site is.  Consider: if you have half a million pages on your site, and you need to make sure people find the needle they need in that haystack, would you rather have some control over that, or cross your fingers and hope Google gets it right?

My end impression is that Google’s Site Search is a great little tool for small businesses that are dealing in a few thousand pages, who can’t afford a server, or who don’t have the resources to maintain one.  Keeping up the server isn’t an involved job at all, but it does require someone capable of checking in on it monthly or so, at least.  But as universities, we generally have the resources for such a tool, both financially and manpower-wise.  We’re also large enough to justify a dedicated box for such an important task.

If you’re still researching what’s right for you in hosted search, it might well be worth keeping an eye on Yahoo BOSS; it’s making some pretty cool claims about functionality.  OmniFind is also great free software if you already have the resources to run it in place (like a VMware cluster or other virtualized environment) and can function within its limitations (only having up to five collections being the big one).  Just remember: search is possibly the single biggest tool on your website behind maybe your portal, and it deserves the treatment and attention your users expect and deserve.