Classic Search (Google, HTML-based) v. Syndication-based Search (Google BlogSearch, RSS-based)


Google has a new search product in beta that I think everyone working with websites, portals, etc. will be interested in.  This new search approach isn't all good (it isn't all bad either), but it offers a very different cost/benefit proposition from how most of us currently think of web search tools working. 
In my opinion, any professional working in the website, portal, etc. space will want to know about this option. Even systems deploying their own portal-based search solution (e.g. the USDA site uses a portal and not Google) should consider this alternative approach to managing their content, even if they aren't currently using the exact product/solution described below...
The new Google product is currently called "BlogSearch" but that is really a misnomer (far worse than the "Java" in "JavaScript") and I suspect it will be re-named when released to general availability.  It is really an XML-based search of any syndicated (RSS, Atom, RDF, etc.) web content - what the short, catchy name for that will be is beyond me, but they have to do better than "BlogSearch" because that grossly understates the actual capabilities.
How it works is explained on the Google website but I'll try my hand at the summary version....
The basic use case for search is:
  1. Someone publishes the content they have
  2. Someone searches for the content they want
The realization in "classic" web search (Google, etc.) works like this:
  1. Someone publishes the content they have - that content is formatted as HTML pages with links to other HTML pages, PDF documents, etc.
  2. Search engines "crawl" websites looking for content and index what they find and can use - not all engines can read PDF documents, for example. Also, that "crawl" can take a while (more on this in a bit; a rough sketch of this crawl-and-index loop follows this list).
  3. People looking for content use the search engine (on the public Internet, in an intranet appliance, thru a portal, etc.) to find links to the content they want and follow those links (hopefully) to that content.
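Just to make that crawl-and-index loop concrete, here is a minimal sketch (my own illustration, not any engine's actual code) of what a crawler does: fetch a page, queue up the links it finds for a later visit, strip the HTML and add the words to an inverted index.  The starting URL and the regex-based link extraction are purely illustrative assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // word -> set of URLs where it appears (the "index" the engine builds)
    private static final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    public static void main(String[] args) {
        Queue<String> toVisit = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        toVisit.add("http://www.example.gov/");   // hypothetical starting point

        while (!toVisit.isEmpty() && visited.size() < 50) {
            String page = toVisit.poll();
            if (!visited.add(page)) continue;     // already crawled this URL
            String html;
            try {
                html = fetch(page);
            } catch (Exception e) {
                continue;                         // skip pages that can't be read
            }

            // queue links for a *later* visit - this delay is why indices go stale
            Matcher links = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
            while (links.find()) {
                toVisit.add(links.group(1));
            }

            // crude "parse": drop the tags, split on non-letters, index the words
            String[] words = html.replaceAll("<[^>]*>", " ").toLowerCase().split("[^a-z]+");
            for (String word : words) {
                if (word.length() < 3) continue;
                Set<String> pages = index.get(word);
                if (pages == null) {
                    pages = new TreeSet<String>();
                    index.put(word, pages);
                }
                pages.add(page);
            }
        }

        // the "search" side: a query is just a lookup in the index built above
        System.out.println(index.get("java"));
    }

    private static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        return sb.toString();
    }
}
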
The syndication-based search realization works a bit differently:
  1. Someone publishes the content they have.  That raw underlying content may be in any format (HTML, PDF, etc.) but it must be summarized in one of the popular syndication formats (RSS, Atom, etc.) as an index.  Blog engines do this automatically (hence the current working name for this product) but syndication is widely used for other content as well - most news services provide RSS syndication, and there are syndication add-ons for SharePoint (both publish and subscribe).  People have been coding syndication facades for databases, but the native XML in IBM DB2 v9 ("Viper") screams for syndication of database content.
  2. The search engine is notified that updated content is available.  In the Google solution, this is done via WebLogs. The search engine consumes the syndication feed from the original website and indexes based upon the feed content (a rough sketch of this step follows the list).
  3. People looking for content query the search engine that has been consuming syndication feeds - that search has the same "feel" as regular search (search terms, date ranges, etc.).  The links in those results lead them to the content they are looking for....with a few "twists" I'll explain below.
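Here is a minimal sketch of step 2 from the search engine's side, assuming the feed is plain RSS 2.0: instead of crawling pages, the engine just pulls the feed and records each item's title, link and date.  The feed URL below is only an assumed example location - any RSS 2.0 feed would do.

import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedIndexer {
    public static void main(String[] args) throws Exception {
        // the "notification" step would hand the engine this URL; here it is hard-coded
        String feedUrl = "http://www.example.gov/news/rss.xml";   // assumed feed location

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new URL(feedUrl).openStream());

        // each <item> in an RSS 2.0 feed summarizes one piece of content
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = text(item, "title");
            String link  = text(item, "link");
            String date  = text(item, "pubDate");
            // a real engine would add these to its index; this sketch just prints them
            System.out.println(date + "  " + title + "  ->  " + link);
        }
    }

    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }
}
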
Those realizations look pretty similar, so here are the key differences:
Shifts some of the indexing work back onto the publisher.  HTML-based "index" files have become a real misnomer - they aren't indices at all (from an information engineering perspective) but just part of the presentation layer for the human viewer of web content.  They are also increasingly difficult sources of information for machine tools like search engines (SharePoint, most portals and rich media sites are the worst).  XML-based search tools like BlogSearch don't read HTML though - they read the XML-based syndication feeds instead.  The benefits they offer (see below) are the reward for the publisher describing their content in this machine-friendly format.  That isn't a difficult or laborious task though... blog engines like Blogger do it "behind the scenes" and so do more serious web-apps like NOAA's Weather Alert.  Once you get the hang of the RSS format, it is pretty easy to work with and it is widely supported in an increasing number of web-oriented products and tools.
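If your content doesn't come out of a blog engine that already does this for you, producing the feed by hand is not much work.  Below is a minimal sketch that writes an RSS 2.0 "index" describing a couple of made-up content items (the agency, URLs and documents are all hypothetical) - a real site would generate the item list from its database or content management system.

import java.io.PrintWriter;

public class FeedWriter {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("rss.xml", "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<rss version=\"2.0\">");
        out.println("  <channel>");
        out.println("    <title>Example Agency Publications</title>");
        out.println("    <link>http://www.example.gov/pubs/</link>");
        out.println("    <description>Recently published content</description>");
        // the raw content being described can be PDF, HTML, anything with a URL
        writeItem(out, "FY07 Budget Summary",
                "http://www.example.gov/pubs/fy07-budget.pdf",
                "Summary of the FY07 budget request.");
        writeItem(out, "Grant Program FAQ",
                "http://www.example.gov/grants/faq.html",
                "Frequently asked questions about the grant program.");
        out.println("  </channel>");
        out.println("</rss>");
        out.close();
    }

    // each <item> is one entry in the machine-readable "index" of the site's content
    private static void writeItem(PrintWriter out, String title, String link, String desc) {
        out.println("    <item>");
        out.println("      <title>" + title + "</title>");
        out.println("      <link>" + link + "</link>");
        out.println("      <guid>" + link + "</guid>");
        out.println("      <description>" + desc + "</description>");
        out.println("    </item>");
    }
}
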
Dramatically changes the timing of how content is indexed and available for search.   Classic web search relies upon the search engine "crawling" some defined webspace (public or private) and indexing what it finds.  Crawling is slow to update, and indices get out of date, because as the volume of web content expands (often dramatically) the time it takes to get back and re-index a particular URL gets longer and longer.  On the public Internet, search engines focus on the web equivalent of the Fortune 1000 and ignore the periphery - Google crawls the CNN website all the time but probably doesn't get around to the Comprehensive Minority Biomedical Branch of NIH website quite as often.  Even when using an intranet search appliance, crawling is resource intensive, so most organizations schedule it (nightly, weekly, etc.) - that means there is still a lag until new content gets re-indexed. 
With "BlogSearch" things work a bit differently.  In this approach the search engine doesn't crawl anything but waits for notifications (via the syndication feeds) that new content is available.  The search engine doesn't even index the content but only the feed.  This means that indices on content can be kept much more current and maintaining them is much less resource intensive.  If you point regular Google at my personal website (not quite as popular as CNN!)  with the search phrase "java site:www.brianlawler.net/blog/" you get very few (and old) results because Google doesn't get around to my little corner of the Internet very often. If you point Google BlogSearch (beta) at the same content with the search phrase "java blogurl:www.brianlawler.net/blog" you get many times as many results (all the Java content available) including any posts made as recently as the past hour.  My reward for ensuring the availability of an RSS feed is almost up-to-the-minute indexing of even my remote outpost on the web. 
Requires a change in thinking about change management.    One reason that this search approach doesn't "crawl" web content is that it is based upon the assumption that the URLs for that content never go away. If (or when) this is true, the links in your search results are never dead-ends - which is the proverbial "good thing".  What is the thinking behind this?  The genesis of this approach is more from email and blogs than from web pages (at least as web pages are most often used).  With an email, once I hit the "send" button, I can never delete or change the email - I can send out a subsequent email that says something different but the original is still floating around out there.  Blogs generally work the same way - once you make a post (original URL) you are generally stuck with it and have to clarify any changes in your thinking with an additional post (new URL).  Blogs are a bit different from email in that you can change the content of the post even if the post (and its URL) has to be preserved - a common correction/revision technique is to update the original page with a link to the new page.  The key is that content is always additive and URLs for old pages never go away. 
As a result, syndication feeds are usually only indices of the most recent content on a site.  As the site content expands, the syndication feed "moves forward" to cover the new entries and drops some of the old ones out of the feed.  This is configurable and each publisher can make their own decision about what counts as "new", but the key point is that the search engine doesn't need to revisit old pages.  The URLs for those pages are never supposed to go away and even if the content has changed (e.g. with a link to the new page) the URL still logically applies to some topic and the search engine already "knows" about the new content when that shows up in subsequent revisions of the syndication feed.
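A small sketch of what that means on the indexing side (my own illustration, not Google's actual design): the index is keyed by URL, every pass over a feed adds or refreshes entries, and nothing is removed just because it has scrolled out of the feed window.

import java.util.HashMap;
import java.util.Map;

public class AdditiveIndex {
    // URL -> latest known summary of the content at that URL
    private final Map<String, String> entries = new HashMap<String, String>();

    // called every time a fresh copy of a site's feed is consumed
    public void consumeFeed(Map<String, String> feedItems) {
        for (Map.Entry<String, String> item : feedItems.entrySet()) {
            // new URLs are added; URLs seen before get their summary refreshed
            entries.put(item.getKey(), item.getValue());
        }
        // note what does NOT happen: URLs that have scrolled out of the feed
        // window are never deleted - old content stays searchable
    }

    public int size() { return entries.size(); }

    public static void main(String[] args) {
        AdditiveIndex index = new AdditiveIndex();

        Map<String, String> monday = new HashMap<String, String>();
        monday.put("http://example.gov/post/1", "First post");
        monday.put("http://example.gov/post/2", "Second post");
        index.consumeFeed(monday);

        // by Friday the feed "window" has moved forward and only shows newer items
        Map<String, String> friday = new HashMap<String, String>();
        friday.put("http://example.gov/post/2", "Second post (updated with a correction link)");
        friday.put("http://example.gov/post/3", "Third post");
        index.consumeFeed(friday);

        System.out.println(index.size());   // prints 3 - post/1 is still indexed
    }
}
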
Dramatically changes how search results can be personalized or "branded".   In classic search you are generally very aware that you are using search - putting the Google logo in front of you over and over again helps with their branding.....doesn't do much for your enterprise.  This is true even if you are using an intranet appliance - the fact that a search is going on isn't exactly subtle. 
Many organizations would like to incorporate search with a bit more nuance - both to keep the site visitor thinking about the organization (and not Google) and to give better service to the customer.  With BlogSearch everything is XML-based (including the search results) and more suitable for personalization.  So, using a dynamic web page technology like JSP/PHP/ASP/etc., sites can call the search engine but consume the XML results themselves, keep the presentation with the organization's brand (not Google's) and make the organization more "sticky" with the customer.  This also helps with personalization in that, behind the scenes, the website can weave in search results to suggest content that might be interesting - e.g. as I search the NASA site for information about Jupiter's moon Europa, the site can search itself for content about other moons (Callisto, Ganymede, etc.) and weave that into the pages I am seeing.
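Here is a rough sketch of that idea, assuming the search service can hand back its results as an RSS feed (the endpoint and parameters below are placeholders, not Google's actual API): a server-side page fetches the XML results, pulls out the titles and links, and drops them into the organization's own markup so the visitor never sees anyone else's branding.

import java.net.URL;
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BrandedResults {
    public static void main(String[] args) throws Exception {
        // placeholder endpoint - substitute whatever XML results URL your search service exposes
        String query = "europa";
        String resultsUrl = "http://search.example.com/results?format=rss&q="
                + URLEncoder.encode(query, "UTF-8");

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new URL(resultsUrl).openStream());

        // build an HTML fragment in the organization's own markup - no search-engine branding
        StringBuilder html = new StringBuilder("<div class=\"related-content\">\n");
        html.append("  <h3>More from our site</h3>\n  <ul>\n");
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength() && i < 5; i++) {
            Element item = (Element) items.item(i);
            html.append("    <li><a href=\"").append(text(item, "link")).append("\">")
                .append(text(item, "title")).append("</a></li>\n");
        }
        html.append("  </ul>\n</div>");

        // in a JSP or servlet this fragment would be written into the page the visitor is viewing
        System.out.println(html);
    }

    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }
}
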
Organizations using a Google search appliance (not the public service) have always gotten search results in XML format by default - you have to format that content (with the "proxystylesheet" argument) to get HTML results.  This differentiator applies more to small organizations (e.g. the CMBB at NIH) that don't use an appliance or a portal but still want their site well-indexed.
 

- Brian
