Google Search Appliance: Search Your Server Logs for URL Recursion
Google is retiring the Google Search Appliance at the end of 2018. In anticipation, the University of Minnesota will adopt Google's Custom Search Engine. Please contact firstname.lastname@example.org with questions about this change.
URL recursion can occur when your Web server lets multiple URLs fetch the same document and one of those URLs makes the document appear deeper in the directory structure. When such a page uses relative links to neighboring documents, the Search Appliance resolves them against the deeper URL and treats the results as new documents in successively deeper directories. This often leads to exponential crawling and growth of our search index, and it places unnecessary load on both the Search Appliance and your Web server.
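To illustrate the mechanism, here is a minimal Python sketch of how a crawler resolves relative links against the URL it last fetched. The host name www.example.edu, the function name crawl_chain, and the two relative links are hypothetical, chosen only for illustration. The sketch assumes the server answers every generated URL with the same pair of pages.

```python
from urllib.parse import urljoin

def crawl_chain(start_url, relative_links, steps):
    """Follow relative links in rotation, resolving each against the previous URL."""
    urls = [start_url]
    for i in range(steps):
        link = relative_links[i % len(relative_links)]
        # A relative link is resolved against the directory of the current URL,
        # so an aliased (deeper) URL yields an even deeper one.
        urls.append(urljoin(urls[-1], link))
    return urls

if __name__ == "__main__":
    # Hypothetical site: /dept/support/dept is an alias for /dept.
    chain = crawl_chain(
        "http://www.example.edu/dept/index.html",
        ["support/index.html", "dept/index.html"],
        3,
    )
    for url in chain:
        print(url)
```

Each step adds one more directory level to the URL even though the server keeps returning the same two documents.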
You can spot URL recursion in your server logs fairly easily. The most obvious sign is a set of requests with repetitive sequences of directories becoming progressively longer. Here is a simplified example:
gsa-crawler GET /dept/index.html
gsa-crawler GET /dept/support/index.html
gsa-crawler GET /dept/support/dept/index.html
gsa-crawler GET /dept/support/dept/support/index.html
. . .
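A log scan for this pattern can be sketched in Python as follows. This is a rough heuristic, not a definitive detector: it flags any crawler request whose path contains the same directory name more than once, which can also match legitimate paths. The function names are hypothetical. The code assumes each log line contains the crawler's user agent string and a "GET" token followed by the request path, as in the simplified example above. Real log formats vary, so adjust the parsing to match your server.

```python
from urllib.parse import urlsplit

def directory_names(path):
    """Return the directory components of a URL path, without the document name."""
    parts = urlsplit(path).path.split("/")
    # The last component is the document name (or empty after a trailing slash).
    return [p for p in parts[:-1] if p]

def looks_recursive(path):
    """Heuristic: True if any directory name occurs more than once in the path."""
    names = directory_names(path)
    return len(names) != len(set(names))

def suspicious_requests(log_lines, agent="gsa-crawler"):
    """Collect request paths from the given agent that look recursive."""
    hits = []
    for line in log_lines:
        if agent not in line:
            continue
        # Strip quotes so that '"GET' in common access-log formats still matches.
        fields = [f.strip('"') for f in line.split()]
        if "GET" not in fields:
            continue
        i = fields.index("GET")
        if i + 1 < len(fields) and looks_recursive(fields[i + 1]):
            hits.append(fields[i + 1])
    return hits

if __name__ == "__main__":
    sample = [
        "gsa-crawler GET /dept/index.html",
        "gsa-crawler GET /dept/support/index.html",
        "gsa-crawler GET /dept/support/dept/index.html",
        "gsa-crawler GET /dept/support/dept/support/index.html",
    ]
    for path in suspicious_requests(sample):
        print(path)
```

To scan a real log, pass an open file object as log_lines. The function reads it line by line.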
To fix this, check for recursive symbolic links in the filesystem of your Web server. If you must use such filesystem links, then fix the links within your Web pages so that they use site-absolute references instead of relative references. For example, use "/dept/support/index.html" instead of "support/index.html". Your server should also return HTTP 404 status codes for pages that do not exist, instead of returning a success code and a page of links (the Search Appliance would interpret the latter as a new page to crawl).
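One way to look for recursive symbolic links is sketched below in Python. It walks a document root without following links and reports every symbolic link to a directory whose resolved target is the link's own parent directory or one of its ancestors. The function name and the document root path are hypothetical. The sketch assumes a POSIX-style filesystem and does not catch loops that pass through several links in different subtrees.

```python
import os

def find_recursive_symlinks(root):
    """Return (link, target) pairs where a directory symlink points at an ancestor."""
    found = []
    # followlinks=False keeps os.walk itself from looping through the links.
    for dirpath, dirnames, _filenames in os.walk(root, followlinks=False):
        real_parent = os.path.realpath(dirpath)
        for name in dirnames:
            link = os.path.join(dirpath, name)
            if not os.path.islink(link):
                continue
            target = os.path.realpath(link)
            # The link is recursive if its target contains the link's own directory.
            if os.path.commonpath([real_parent, target]) == target:
                found.append((link, target))
    return found

if __name__ == "__main__":
    # Hypothetical document root; replace with your server's actual path.
    for link, target in find_recursive_symlinks("/var/www/html"):
        print(f"{link} -> {target}")
```

Any link it reports is a candidate for removal, or a reason to switch the affected pages to site-absolute references.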