Google Search: Searchable Content
Our current Search Appliance license allows us to index 3 million documents. We would quickly reach this limit if we did not exclude problematic Web pages from the index. We also exclude groups of pages that have no search value when we find a significant number of them in the index. Such pages include reply forms for blog entries and "print versions" of articles (when the normal version exists); these pages are only useful when visited directly.
Currently Excluded Content
- Pages leading to infinitely deep URL recursion. This includes highly dynamic pages, such as calendars, that link to slightly different versions of themselves.
- Pages whose URLs contain session data. These are a problem because the Search Appliance perceives a new page for each unique URL it encounters, when in reality the same page has random session data appended to its URL.
- Binary files, such as ZIP archives.
- URLs containing a '?' character. We realize this rule is very broad, but we quickly found that it was not feasible to continually prune the index of problematic pages when most of them contained a '?' in their URLs. Unfortunately, this exclusion is likely to remove some valuable, database-driven Web pages from the index.
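As a rough illustration of how the exclusions above apply to a given URL, here is a minimal Python sketch. It is not the Search Appliance's actual configuration: the function name, the list of binary extensions, the session parameter names, and the repeated-segment threshold are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical values for illustration only; the real exclusion lists may differ.
BINARY_EXTENSIONS = (".zip", ".exe", ".tar", ".gz")
SESSION_PARAM = re.compile(
    r"[;?&](?:jsessionid|phpsessid|sessionid|sid)=", re.IGNORECASE
)
MAX_REPEATED_SEGMENTS = 2  # assumed heuristic for spotting URL recursion


def exclusion_reason(url):
    """Return a short reason if the URL matches an exclusion rule, else None."""
    path = urlsplit(url).path

    # Session data appended to the URL makes one page look like many.
    if SESSION_PARAM.search(url):
        return "session data"

    # Broad rule: any URL containing a '?' is excluded.
    if "?" in url:
        return "query string"

    # Binary files such as ZIP archives.
    if path.lower().endswith(BINARY_EXTENSIONS):
        return "binary file"

    # A path segment repeated many times suggests infinitely deep recursion.
    segments = [s for s in path.split("/") if s]
    if any(segments.count(s) > MAX_REPEATED_SEGMENTS for s in set(segments)):
        return "possible recursion"

    return None
```

For example, `exclusion_reason("http://example.edu/db/view.php?id=7")` reports the query string rule, while an ordinary article URL with no '?' returns `None`. The session check runs first so that a URL with both session data and a '?' is reported under the more specific rule.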
If you cannot find your documents with the Search Appliance, you may submit their URLs to us using our Request Form. We can add them to the document index.
Non-HTML Document Types Included in the Index
- Microsoft Office: Word (.doc), PowerPoint (.ppt), Excel (.xls)
- PDF (.pdf) and PostScript (.ps)
- Plain text
- Shockwave and Flash (.swf)