Google Search Appliance: Index Database-Driven Web Sites
This guide explains how to get the Search Appliance to index Web pages that a template script serves from a database. The Search Appliance typically finds only pages that are linked from other crawled pages, so any database-driven content that is not linked from anywhere will likely remain unsearchable.
Create an HTML site map that contains links to every page (record) in your database. For example, each content page may have a URL that differs from the next only by an ID number, such as the "1202" in http://myserver.umn.edu/news/Article.php?ArticleID=1202.
Run a query against your database to retrieve a list of unique record IDs (the part that is unique to each page's URL). When creating the site map, iterate over the recordset and write a link to each content page (record). In the HEAD of the site map, write a robots META tag that instructs search engine crawlers not to index the page itself, but only to follow the links on it.
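The site map generation described above could be sketched as follows. This is a minimal illustration, not a required implementation: it assumes Python with an SQLite database, and the table name `articles`, the column name `id`, and the helper names `fetch_record_ids` and `build_site_map` are all hypothetical. The base URL follows the example content page used in this article. The robots META tag with `noindex,follow` is the standard way to ask crawlers to follow a page's links without indexing the page.

```python
import sqlite3
from html import escape
from urllib.parse import urlencode

# Base URL modeled on the article's example content page (adjust for your site).
BASE_URL = "http://myserver.umn.edu/news/Article.php"


def fetch_record_ids(connection, table="articles", id_column="id"):
    """Return the unique record IDs; table and column names are assumptions."""
    rows = connection.execute(
        f"SELECT DISTINCT {id_column} FROM {table} ORDER BY {id_column}"
    )
    return [row[0] for row in rows]


def build_site_map(record_ids, base_url=BASE_URL, param="ArticleID"):
    """Return an HTML site map containing one link per record ID."""
    links = []
    for record_id in record_ids:
        url = f"{base_url}?{urlencode({param: record_id})}"
        links.append(f'<li><a href="{escape(url)}">{escape(url)}</a></li>')
    return "\n".join([
        "<html>",
        "<head>",
        "<title>Site Map</title>",
        # Ask crawlers not to index the site map itself, only to follow its links.
        '<meta name="robots" content="noindex,follow">',
        "</head>",
        "<body>",
        "<ul>",
        *links,
        "</ul>",
        "</body>",
        "</html>",
    ])


if __name__ == "__main__":
    # Demo with an in-memory database standing in for the real one.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO articles (id) VALUES (?)", [(1201,), (1202,)])
    print(build_site_map(fetch_record_ids(conn)))
```

In practice the output would be written to a file on the Web server (or produced by a script on request) so that the site map stays in step with the database.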
Now you need to give the Search Appliance an entry point to your site map so the page and all of its links can be crawled. There are two ways you could do this:
- (preferred) Create a hidden link on a page in your Web site that the Search Appliance has already crawled and indexed. The Search Appliance will follow this link the next time it crawls the page. This method is preferred because it does not require extra Search Appliance configuration. It also allows other search engines to index your database content.
- Submit the URL of the site map to the Search Appliance by contacting email@example.com. It will be added to the crawler's list of starting URLs.
Follow this step only if the URLs of your content pages (the pages linked from the site map) contain a question mark (?). The Search Appliance is currently configured to ignore documents whose URLs contain a "?", because crawling database-driven Web applications has historically led to excessive crawling.
Submit the URL of a content page to firstname.lastname@example.org and indicate which portion of the URL is common to all such pages for your database. We will add an exception to the crawler's list of ignored URL patterns. Using the example from Step 1, you could submit the URL "http://myserver.umn.edu/news/Article.php?ArticleID=1202" to us, noting that the "1202" part is what changes from page to page.
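To work out which part of a content-page URL is common and which part varies before submitting it, a small script along these lines may help. It is only a sketch: the function name `split_common_and_varying` is hypothetical, and it assumes the record ID is passed in a single named query parameter, as in the example URL.

```python
from urllib.parse import parse_qsl, urlsplit


def split_common_and_varying(url, id_param):
    """Split a content-page URL into its shared prefix and its per-page ID.

    Assumes the ID is carried in the query parameter named by id_param.
    """
    query = dict(parse_qsl(urlsplit(url).query))
    record_id = query[id_param]  # raises KeyError if the parameter is absent
    marker = f"{id_param}="
    # Everything up to and including "ArticleID=" is shared by all pages.
    prefix = url[: url.index(marker) + len(marker)]
    return prefix, record_id


if __name__ == "__main__":
    prefix, record_id = split_common_and_varying(
        "http://myserver.umn.edu/news/Article.php?ArticleID=1202", "ArticleID"
    )
    print("Common part: ", prefix)
    print("Varying part:", record_id)
```

For the example URL, the common part is everything through "ArticleID=" and the varying part is "1202".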