Can search engines (Google et al) index web pages returned from filling in a form or similar dynamic web page?

The question is whether a search engine such as Google would index the hundreds (even thousands) of possible pages that result from using an HTML FORM as the entry point into a web site - e.g. a search of documents, a query of a database, etc.
The answer is "yes" if the following conditions are present....

It depends upon whether POST or GET is used.  With a GET the form data is encoded into the URL and with a POST it is not.  So, with the GET and URL encoding you get a whole bunch of different URLs and pages of content - each of which is separately identifiable and can be indexed.  With the POST the query data is in the body of the HTTP request and not visible in the URL - the only page visible to the search engine is the main page (URL) and then only as if no data had been filled in.  The W3C specs on this specifically point out the benefit of being able to save form queries and results as URLs, and say that GET should be used when calling that URL doesn't change the state of the data (just a query and not processing a transaction).
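To make the GET vs. POST difference concrete, here is a minimal sketch using Python's standard library. The form fields (state, cuisine) are made up for illustration and the base URL is just the placeholder used in this post. It only shows what URL a crawler would get to see in each case; it doesn't talk to any server.

```python
from urllib.parse import urlencode

# Placeholder URL from this post; the field names below are hypothetical.
BASE_URL = "http://www.domain.com/path/page"


def get_url(form_data):
    """URL requested when the form uses GET: the data is encoded into the URL."""
    return BASE_URL + "?" + urlencode(form_data)


def post_url(form_data):
    """URL requested when the form uses POST: the data rides in the request
    body, so the URL is the same no matter what was filled in."""
    return BASE_URL


queries = [
    {"state": "OH", "cuisine": "thai"},
    {"state": "NY", "cuisine": "thai"},
]

get_urls = {get_url(q) for q in queries}    # one distinct URL per query
post_urls = {post_url(q) for q in queries}  # collapses to a single URL

for url in sorted(get_urls):
    print(url)
```

Two different queries give two different, linkable URLs with GET, but only the one bare form URL with POST.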

Do other pages have links to those long, URL encoded, FORM GET requests?   Google in particular is an exercise in the tyranny of democracy.  So, the likelihood of a page being indexed is based (in part) upon how often links to that page appear on other pages.  This creates one of the misunderstandings about whether Google indexes dynamically generated pages.  It isn't that Google can't index a dynamically generated page but that so few people will capture links on their own pages where the link format is:
        http://www.domain.com/path/page?var1=value1&var2=value2&var3=value3.....

People don't do that but will capture links just to the form (or domain/path) and let the browsing human go from there.  As a result the "page" from a long URL isn't very popular as a link and doesn't get indexed by Google.
The URL alternative of http://www.domain.com/path/page/value1/value2/value3 or (as del.icio.us does it) http://www.domain.com/path/page/value1+value2+value3 gets indexed more often, not so much because of the URL itself (it really is all just strings) but because that URL is easier, and so more likely, to appear as a link on a referring page.
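As a rough illustration of the three URL shapes mentioned above, this sketch builds each one from the same values. The function names are my own, and it assumes the server is set up to map the path-style URLs back to the same query (that part isn't shown).

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.domain.com/path/page"  # placeholder from this post


def query_style(params):
    """The long form: page?var1=value1&var2=value2..."""
    return BASE_URL + "?" + urlencode(params)


def path_style(values):
    """Link-friendly form: page/value1/value2/value3"""
    return BASE_URL + "/" + "/".join(quote(v, safe="") for v in values)


def plus_style(values):
    """del.icio.us-like form: page/value1+value2+value3"""
    return BASE_URL + "/" + "+".join(quote(v, safe="") for v in values)


params = {"var1": "value1", "var2": "value2", "var3": "value3"}
print(query_style(params))
print(path_style(list(params.values())))
print(plus_style(list(params.values())))
```

All three carry the same information; the last two are just easier for a person to copy into a link.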

Webmasters may intentionally request that search engines skip dynamically generated pages.   Dynamic pages often get skipped because webmasters configure their sites so that they will be skipped.  Some engines decide on their own to exclude dynamic pages but others (incl. Google) will index dynamic pages.  Still, even those that do index dynamic pages are often "polite" about it in that they will skip your pages if you want them to.  So, for example, a webmaster may configure the site's robots.txt file to ask search engines to skip portions of the site - certain pages, directories, etc.  While this cuts down on the depth of the search index for the site, it has several benefits: it avoids the server processing needed to generate the dynamic pages (with no human benefit), cuts down on bandwidth usage (which helps contain costs under most ISP pricing plans), helps protect copyright or intellectual property, affects how web statistics are calculated for your site, etc.
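Here is a small sketch of how a "polite" crawler might honor such a request, using Python's standard urllib.robotparser. The robots.txt content, the /search/ directory and the "ExampleBot" user agent are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to skip the dynamic
# search pages under /search/ while leaving the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

dynamic_page = "http://www.domain.com/search/results?state=OH"
static_page = "http://www.domain.com/about.html"

# A polite crawler checks before fetching each URL.
print(parser.can_fetch("ExampleBot", dynamic_page))
print(parser.can_fetch("ExampleBot", static_page))
```

The first check comes back False and the second True, so the dynamic pages never get requested, let alone indexed.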

Is the dynamically generated content in HTML/XHTML format?   Google and most search engines will only index the content of your page that is in HTML/XHTML (or supported document formats like DOC, PDF, etc.).   SharePoint sites don't get indexed very well by search engines because they push lots of JavaScript to the browser - you may see HTML text, links, etc. but if that is the result of JavaScript and not HTML/XHTML as it leaves the site, then the search engines won't index that information.  Same goes for ActiveX controls, applets, Flash plug-ins, etc.   This stuff ends up in the HTML/XHTML stream more often than you might think.  If you hand code a simple form (using ASP, PHP, etc.) and carefully code the returned content to be HTML/XHTML then you are on the road to index nirvana.  However, people get away from that pretty quickly.  There are plenty of ASP.Net controls that produce only HTML/XHTML but others that embed plenty of JavaScript, graphics, etc. into the response.  Some of the web page design tools (FrontPage, DreamWeaver, NetObjects, WebSphere Studio, GoLive, etc.) may also be injecting this non-HTML/XHTML content into your pages.
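To see the difference, here is a sketch of what an indexer that does not run JavaScript (the kind described above) would pull out of two responses. It uses Python's standard html.parser; the two sample pages and the class name are invented for the example.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text present in the markup itself, ignoring anything
    inside <script> or <style> (scripts are never executed)."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    extractor = StaticTextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Same menu text, delivered two ways (both pages are hypothetical).
PLAIN = "<html><body><h1>Thai Palace</h1><p>Pad thai and green curry.</p></body></html>"
SCRIPTED = (
    "<html><body><div id='menu'></div><script>"
    "document.getElementById('menu').innerHTML = 'Pad thai and green curry.';"
    "</script></body></html>"
)

print(repr(static_text(PLAIN)))
print(repr(static_text(SCRIPTED)))
```

A browser shows the same menu for both pages, but only the first one has anything in the markup for this kind of indexer to pick up.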

Do you syndicate content on your site using RSS?   An extension of the "do other pages have links to these pages" point is that even if you use link-friendly URLs, keep non-HTML/XHTML content out of your response to the browser, etc., well, even if you build it, no one may come to your site.  Especially with Google, that prevents even the best of sites from getting indexed just because they aren't popular - the tyranny of democracy strikes again.  However, Google now offers BlogSearch.  I've written about this before and how the name is a bit of a misnomer, but the short of it is that if you describe your content with RSS files and use a ping service to let Google BlogSearch know when updates are available, then even the most obscure sites get indexed within a few minutes of their updates.   "Sites" here just means some URL under a domain, so it is possible to have content at www.domain.com/path/ and more content at www.domain.com/path/sub1/sub2/ and both will be indexed.  When you query BlogSearch you can pick where in the hierarchy you want the top level of the search to be as a way of focusing the results from the search engine - in addition to keyword, when updated, etc.
 
So, if your form is about restaurants and you know there are a limited number of combinations (say by state) for the results in the form, you could create an RSS feed pointing to each of those GET encoded URLs and each of the resulting pages would be indexed by BlogSearch, allowing you to then find the restaurant when you know something in the menu description (using search) but not the state (which could use the original form).
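One way to read the restaurant idea is a single RSS 2.0 feed with one item per state, each item linking to the GET encoded URL for that state. The sketch below builds such a feed with Python's standard xml.etree.ElementTree. The state list, the "state" field name, the titles and the description are assumptions for illustration, and pinging the search service is not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

FORM_URL = "http://www.domain.com/path/page"  # placeholder from this post
STATES = ["OH", "NY", "CA"]                   # hypothetical, limited set of form values


def build_feed(states):
    """Return RSS 2.0 XML with one <item> per GET encoded form URL."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Restaurants by state"
    ET.SubElement(channel, "link").text = FORM_URL
    ET.SubElement(channel, "description").text = "Results pages of the restaurant form"
    for state in states:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"Restaurants in {state}"
        ET.SubElement(item, "link").text = FORM_URL + "?" + urlencode({"state": state})
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(STATES)
print(feed_xml)
```

Each link in the feed is one of the result pages the form would have produced, so the feed gives an indexer a direct route to pages it would otherwise only reach by filling in the form.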
 
 
Hope this is informative.
 
-  Brian
