Is Hosted Search Really Ready for Prime Time?

In the years I’ve now spent in higher education, one universal truth I have found is that nothing moves a project along quite like someone much more important and much less web savvy than you deeming an issue worth addressing.  Such was the case only a couple of months after I had started at the university, when the Director of Marketing noticed that new information she had put up on the site wasn’t coming up in search results, and the results that were coming up weren’t particularly relevant to the topic in the first place.  Thus, a mission was born: to find a way to make our search better, and to do it NOW.  That’s the other thing about people higher up than you: when they say jump, generally you jump.

At the time (approximately three years ago), we had been using the pretty straightforward Google search for web sites.  It amounted to putting a box on your page that submitted to Google, restricting results to your domain.  You couldn’t really do anything else with it then besides add a banner to the top.  So began the odyssey.  Most of the major players offered a basic site search back then, all of which were fairly equally crippled.  The Google Search Appliance (GSA) was (and still is) crazy expensive and totally overkill for our site.  The IBM/Yahoo product OmniFind was still a few months from launch (nor did we have hardware to run it on at the time).  The Thunderstone Parametric Search Appliance just looked a little… well, no one I know had ever heard of them, and their website wasn’t (and still isn’t) something that inspires my confidence.  The Google Mini, on the other hand, was cheap, more than adequate for our site size, and was getting good reviews.  Not to mention that the money to get it was ready, willing, and able.  All that made the choice pretty easy for us, so we dove in.

Now, fast forward a couple of years.  We are still using our Mini.  In fact, I just upgraded to 5.0.4 on Monday.  I’ve never had a lick of trouble with it and quickly became a fan.  This year at eduWeb I had the good fortune to share my experience with a couple of people, and the conversation generally drifted towards: “Why is that better than Google Site Search?”  Originally, the Mini offered a ton of unique features, such as custom collections, theming, and the ability to export results as XML.  The past year has seen growth in the availability and features of free, hosted search solutions.  Yahoo BOSS looks to be an API that wants to take a serious swing at the hosted search crown.  Google’s Custom Search Business Edition (CSBE), AKA Google Site Search, is also offering businesses and schools the opportunity for search with many of the features of the Mini, like the ability to remove branding and ads and to retrieve results as XML (note: Google Site Search is free for universities).

With all these new options, is the Mini even a worthwhile investment now?  We’re coming up on the end of our support term, so I figured this was a prime time to evaluate the field.  My short answer is: Yes, it still is.  My long answer also happens to be yes.  See, search is important.  Search is doubly important for universities because we have so much crap out there, and so many different topics to address (many of which also happen to be crap, but you can’t tell that to the people putting it out there).  A Mini now costs $3000 with 2 years of support.  The equivalent CSBE service prices out at $500 a year for 50,000 pages, so (assuming you had to pay) the price of a Mini would buy six years of it.  Obviously Google isn’t trying to mothball its own products, so where does the Mini make up that cost?
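For anyone who wants to rerun that arithmetic with their own quotes, here is a minimal sketch in Python.  The dollar figures are simply the ones quoted above, and the function names are mine, made up for illustration.

```python
# Figures quoted in this post; treat them as inputs, not an official price list.
MINI_PRICE = 3000            # dollars, includes 2 years of support
MINI_SUPPORT_YEARS = 2
HOSTED_PRICE_PER_YEAR = 500  # dollars per year for 50,000 pages


def hosted_years_for_mini_price(mini_price=MINI_PRICE,
                                hosted_per_year=HOSTED_PRICE_PER_YEAR):
    """How many years of hosted service the Mini's purchase price would buy."""
    return mini_price / hosted_per_year


def mini_premium_over_support_term(mini_price=MINI_PRICE,
                                   hosted_per_year=HOSTED_PRICE_PER_YEAR,
                                   years=MINI_SUPPORT_YEARS):
    """Extra dollars spent on the Mini versus paying for hosted search
    over the Mini's included support term."""
    return mini_price - hosted_per_year * years


if __name__ == "__main__":
    print(hosted_years_for_mini_price())      # years of hosted service
    print(mini_premium_over_support_term())   # dollars of premium
```

The second number is the one the rest of this post tries to justify: what you pay on top of hosted search during the two supported years.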

First, I think there’s huge value in crawling.  Remember our original problem?  Content was not making it into the search results fast enough.  With the Mini I can schedule crawls, or just set it on continuous mode and let it go nuts.  Using nightly scheduled crawls, I ensure that any content added to the web site shows up in search within 24 hours, and usually faster than that (unless some crazy person is up and adding content to the site at 12:01 AM).  Going through Google’s Webmaster Tools, I can only tell Google to crawl our site at a Normal or Slower rate.  We don’t even rate high enough to get the Faster crawl rate option.  So users of Site Search are pretty well cornered on the matter.  Once I crawl our site with the Mini, I can have it output a sitemap that I feed to Google’s spider to help with their indexing as well, so the benefit becomes twofold.
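If you have never looked inside a sitemap, it is just a small XML file listing URLs.  The sketch below is not the Mini’s export feature; it only shows, using Python’s standard library, roughly what such a file contains under the sitemaps.org format.  The example.edu URLs, the dates, and the function name are all made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a 'YYYY-MM-DD' string.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    print(build_sitemap([
        ("http://www.example.edu/", "2008-08-01"),
        ("http://www.example.edu/admissions/", "2008-08-02"),
    ]))
```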

Next up, raise your hand if you have an intranet, or otherwise secured information not available to the public.  All of you can pretty well scratch CSBE/SiteSearch off your short list if you’re looking for a way to dig through it.  If you want to index any kind of protected content, you’ll have to go with an actual hardware solution, as both the Mini and GSA support mechanisms to crawl and serve content that is behind a security layer.  This is a great option if you buy a Mini, use up the initial two years of support, then buy a second one: use one for internet and the other for intranet.

You’re also going to find that you are capable of pulling more valuable metrics out of the Mini than what you get with CSBE/SiteSearch.  Granted, the standard “what are people searching for” question is easily enough answered.  But what about “what are people searching for that isn’t returning results?”  That can be equally valuable in a lot of cases.  And while Site Search allows for search numbers by month and day, the Mini can go down to the hour, as well as show you your current queries per minute.  It’ll even keep tabs on how many pages it’s currently crawling, how many errors it found, and email you about it all.  All the reports can be saved out as XML, naturally, so you can mix and match datasets as you need for custom reports.
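To make the “searches that return nothing” idea concrete, here is a rough Python sketch.  It assumes you have already pulled query and result-count pairs out of a report; that input shape and the sample queries are my own invention, not the Mini’s report format.

```python
from collections import Counter


def zero_result_queries(log):
    """Tally queries that returned no results, most frequent first.

    log is an iterable of (query, result_count) pairs.
    """
    misses = Counter(
        query.strip().lower()
        for query, result_count in log
        if result_count == 0
    )
    return misses.most_common()


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample_log = [
        ("parking permit", 0),
        ("Parking Permit", 0),
        ("admissions", 42),
        ("transcripts", 0),
    ]
    for query, count in zero_result_queries(sample_log):
        print(count, query)
```

A list like this is a decent to-do list of content to write, or of keymatches to add.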

And I have one word for you: OneBox.  The Mini has it, thanks to a trickle-down effect from the GSA; hosted Google options do not.  The OneBox essentially allows you to add in custom search results based on query syntax, and tailor the styling of the results.  You see this all the time at Google, for instance when you type in a phone number or a FedEx tracking number.  As you can see, these results need not come from your Google Mini search index.  They can come from other collections, or other sources entirely.  In the screenshot to the right, you can see a mock-up of a OneBox result that matches a name format and returns contact information along with the standard search results.  Uses for this are many, and can span anything you might store in databases, such as course listings, book ISBNs, names, weather (if you have campuses in different cities), room information, etc.  Anything that you can define some kind of search pattern for.
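The idea behind that name example boils down to “if the query matches a pattern, go look something up somewhere else.”  Here is a minimal Python sketch of just that logic.  It is not how a OneBox module is actually configured on the appliance; the directory data, the pattern, and the function name are all hypothetical.

```python
import re

# Hypothetical directory, standing in for a real staff database.
DIRECTORY = {
    "jane doe": {"phone": "555-0100", "office": "Hall 101"},
}

# Trigger: exactly two alphabetic words, e.g. "Jane Doe".
NAME_PATTERN = re.compile(r"^[A-Za-z]+ [A-Za-z]+$")


def onebox_lookup(query):
    """Return contact info if the query looks like a name we know, else None."""
    cleaned = query.strip()
    if not NAME_PATTERN.match(cleaned):
        return None  # query does not trigger the lookup at all
    return DIRECTORY.get(cleaned.lower())


if __name__ == "__main__":
    print(onebox_lookup("Jane Doe"))
    print(onebox_lookup("financial aid deadlines"))
```

Swap the dictionary for a course catalog or a room list and the pattern for a course code or room number, and you have the other uses mentioned above.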

On a quasi-similar note, you can also link certain searches (or parts of searches) to keymatches.  These are commonly used for ads on Google that appear at the very top of search results (usually highlighted light yellow with the “Sponsored Link” caption), but you can use them to highlight a link that goes right to the automotive department when someone searches for something containing the word “auto.”  This is another feature unique to the Mini and GSA, and one more way to make sure searchers are presented with relevant links.  This is very useful in cases where a department’s site isn’t well optimized and doesn’t show up first in a search for that department.
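Conceptually, a keymatch is nothing more than a lookup table from a trigger word to a promoted link.  The Python sketch below shows only that concept, matching on whole words.  The table contents and the example.edu URL are invented, and this is not the appliance’s own configuration format.

```python
# Hypothetical keymatch table: trigger word -> (link title, URL).
KEYMATCHES = {
    "auto": ("Automotive Technology Department",
             "http://www.example.edu/automotive/"),
}


def keymatches_for(query):
    """Return promoted links for every trigger word found in the query."""
    words = query.lower().split()
    return [KEYMATCHES[word] for word in words if word in KEYMATCHES]


if __name__ == "__main__":
    print(keymatches_for("Auto repair classes"))
    print(keymatches_for("automatic enrollment"))
```

Matching on whole words is a deliberate choice here so that “automatic” does not drag in the automotive department.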

Ultimately, it’s a judgment call whether or not these features are worth the money to you.  At $3000, you’re basically paying $1000 for the server itself and $1000 for each of the two years of support.  You can’t buy the unit without support, though, so that notwithstanding, you’re getting a full-featured search box with support for about twice the cost of a good PC.  If you have more than 50,000 pages to index, though, you’ll find that price goes up.  At the same time, if you do have over 50,000 pages, there are a lot of other reasons not to go hosted, such as control over results, index freshness, result relevance, etc.  All these are always important, but they become even more so the bigger your site is.  Consider: if you have half a million pages on your site, and you need to make sure people find the needle they need in that haystack, would you rather have some control over that, or cross your fingers and hope Google gets it right?

My end impression is that Google’s Site Search is a great little tool for small businesses that are dealing in a few thousand pages, who can’t afford a server, or who don’t have the resources to maintain it.  Keeping up the server isn’t an involved job at all, but it does require someone capable of checking in on it monthly or so, at least.  But, as universities, we generally have the resources for such a tool, both financially and manpower-wise.  We’re also large enough to justify a dedicated box for such an important task.

If you’re still researching what’s right for you in hosted search, it might well be worth keeping an eye on Yahoo BOSS, though; it’s making some pretty cool claims on functionality.  OmniFind is also great free software if you already have the resources in place to run it (like a VMware cluster or other virtualized environment) and can function within its limitations (only having up to five collections being the big one).  Just remember, search is possibly the single biggest tool on your website behind maybe your portal, and it deserves due process to get the treatment and attention your users expect and deserve.