Search Spider Capabilities

  1. Which web servers and protocols can the appliance crawl?
  2. Can the appliance authenticate to a proxy server when crawling?
  3. My server has several CNAMEs. Will each be indexed separately?
  4. What file formats can the appliance index?
  5. Are there limits on the file sizes that the Google Search Appliance can crawl?
  6. How are PDF files handled?
  7. Does the appliance crawl through pages containing JavaScript or using JavaScript menus?
  8. Can the appliance extract the internal modification date of non-HTML files?
  9. Can the appliance crawl XML files?
  10. How are unreachable URLs handled?
  11. How does the crawler work with links on my pages that automatically send email?
  12. Why are some files not being crawled on my file servers?

Q: Which web servers and protocols can the appliance crawl?
A: The appliance crawls using standard HTTP and HTTPS requests and can handle a variety of web servers, including: Apache, Netscape Enterprise, Lotus Domino Enterprise, and Microsoft Internet Information Server.
Q: Can the appliance authenticate to a proxy server when crawling?
A: The appliance can authenticate to proxy servers that support Basic Authentication.
Q: My server has several CNAMEs. Will each be indexed separately?
A: Possibly. However, the Google Search Appliance maintains a list of duplicate server names. If the appliance detects the CNAMEs as duplicates, or you submit them, only one record will be returned.
Q: What file formats can the appliance index?
A: The Google Search Appliance can index a wide variety of formats, including HTML, PDF, text, PostScript, Microsoft Office, and many more.
Q: Are there limits on the file sizes that the Google Search Appliance can crawl?
A: Yes, limits apply to both HTML and non-HTML files. For HTML files, the appliance crawls up to 2.5 MB and discards the remainder of the file. For non-HTML files, the appliance crawls files up to 30 MB in size; files larger than 30 MB are discarded and not crawled. Non-HTML files are converted to HTML, and only the first 2.5 MB of the converted HTML is indexed. The remainder is discarded.
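
The limits above can be summarized as a small decision function. This is only a sketch for reasoning about which content reaches the index, not the appliance's own code: the function name and parameters are hypothetical, and it assumes 1 MB means 1024 * 1024 bytes, which the answer above does not specify.

```python
# Illustrative model of the documented size limits (not appliance code).
# Assumption: 1 MB = 1024 * 1024 bytes.
HTML_INDEX_LIMIT = int(2.5 * 1024 * 1024)   # 2.5 MB of HTML is kept
NON_HTML_CRAWL_LIMIT = 30 * 1024 * 1024     # non-HTML files above 30 MB are skipped


def indexed_html_bytes(original_size, is_html, converted_html_size=None):
    """Return how many bytes of HTML would be indexed, or 0 if the file is skipped.

    original_size       -- size of the file as served, in bytes
    is_html             -- True for HTML files, False for PDF, Office, etc.
    converted_html_size -- size of the HTML produced by conversion
                           (required for non-HTML files that are crawled)
    """
    if is_html:
        # HTML is crawled up to the limit; the remainder is discarded.
        return min(original_size, HTML_INDEX_LIMIT)
    if original_size > NON_HTML_CRAWL_LIMIT:
        # Oversized non-HTML files are discarded and not crawled at all.
        return 0
    if converted_html_size is None:
        raise ValueError("converted_html_size is required for non-HTML files")
    # The converted HTML is subject to the same 2.5 MB limit.
    return min(converted_html_size, HTML_INDEX_LIMIT)
```

For example, a 10 MB PDF that converts to 4 MB of HTML would contribute only its first 2.5 MB to the index, while a 31 MB PDF would contribute nothing.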
Q: How are PDF files handled?
A: Crawling large numbers of PDF files can be slow due to the CPU time required to convert PDF files to HTML. A process is in place so that PDF documents are not recrawled unless they have changed.
Q: Does the appliance crawl through pages containing JavaScript or using JavaScript menus?
A: The appliance does not follow URLs contained within JavaScript code. This is especially important to remember for sites that rely heavily on JavaScript-driven menus.

Because of this, we recommend using jump pages or basic HTML site maps. Jump pages are similar to site maps and offer a list of links that lead the user deeper within the site. The crawler will then be able to follow the links contained on these pages.
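
As a rough illustration of the jump page idea, the following sketch generates a plain HTML page of ordinary links that a crawler can follow without executing any script. The function name, page title, and URLs are hypothetical; adapt them to your own site.

```python
from html import escape


def build_jump_page(title, links):
    """Return a basic HTML page listing plain <a href> links.

    links is a list of (url, link_text) pairs. No JavaScript is used,
    so every link is visible to a crawler that only parses HTML.
    """
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(text)}</a></li>'
        for url, text in links
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


# Hypothetical example: the same destinations a JavaScript menu might expose.
page = build_jump_page(
    "Site Map",
    [("/products/index.html", "Products"), ("/support/faq.html", "Support FAQ")],
)
```

Linking to a page like this from the home page gives the crawler a path to sections that are otherwise reachable only through script-driven menus.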

Q: Can the appliance extract the internal modification date of non-HTML files?
A: The appliance cannot extract the internal creation dates of non-HTML files. For non-HTML documents, the appliance is limited to using the last modified date returned by the HTTP server. For HTML documents, the appliance can extract dates from the title, text, or meta tags of a document, or use the last modified date returned by the HTTP server.
Q: Can the appliance crawl XML files?
A: XML is not a supported file type, but you can index XML files provided that you remove any lines that begin with:

<?xml
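
One way to prepare such files is to filter them before they are served to the crawler. The sketch below is a minimal example of that idea; the function name is hypothetical, and it drops every line that starts with `<?xml`, exactly as described above (this also matches lines such as `<?xml-stylesheet`).

```python
def strip_xml_declaration_lines(text):
    """Return text with every line that begins with '<?xml' removed.

    Line endings of the remaining lines are preserved.
    """
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not line.startswith("<?xml")
    ]
    return "".join(kept)


# Hypothetical sample document.
sample = '<?xml version="1.0"?>\n<catalog>\n  <item>Widget</item>\n</catalog>\n'
cleaned = strip_xml_declaration_lines(sample)
```
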
Q: How are unreachable URLs handled?
A: While fetching URLs, the appliance waits two minutes before declaring a URL unreachable. When a URL is declared unreachable, it is immediately put back into the URLs-inflight queue to be tried again. If the first retry fails, the URL is retried occasionally during the crawl. Unreachable URLs remain in the inflight queue until they are crawled or an error such as a 404 is returned.
Q: How does the crawler work with links on my pages that automatically send email?
A: You should prevent the crawler from following hyperlinks that cause a state change, such as links that delete a record in the database, send an email, or print a web page when clicked. To prevent this behavior, modify your "robots.txt" file so that it disallows those URLs.
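
The sketch below shows what such rules might look like and checks them with Python's standard `urllib.robotparser` module. Everything specific here is an assumption for illustration: the paths are hypothetical, and the user-agent name `gsa-crawler` is only a placeholder, since the `User-agent: *` rule applies to any crawler that honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block URLs that trigger side effects.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/send_mail
Disallow: /records/delete
Disallow: /print/
"""


def is_crawlable(url_path, user_agent="gsa-crawler"):
    """Return True if the rules in ROBOTS_TXT allow fetching url_path.

    user_agent is a placeholder name; the wildcard rule matches any agent.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)
```

With these rules, `is_crawlable("/records/delete")` and `is_crawlable("/print/report.html")` return False, while an ordinary content page such as `/products/index.html` remains crawlable.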
Q: Why are some files not being crawled on my file servers?
A: This situation typically occurs when the web interface to the file server, or the file server itself, cannot support the load generated during the crawl. As a result, directory listing requests close early, and files listed after the point where the listing was cut off are never seen by the crawler.

To improve the throughput of your file listings, you can tune your web server installation. Here are some tips for Apache:

  • Turn off FancyIndexing, which cuts down the amount of HTML generated for each directory listing
  • Increase MinSpareServers and MaxSpareServers
  • Increase StartServers
  • Decrease KeepAliveTimeout
