Google Search Appliance - User FAQ

  1. What is a Google Search Appliance?
  2. How is the appliance different from the public Google search?
  3. Is every web page published at Texas A&M University indexed in the search appliance?
  4. How many documents are in the Texas A&M license?
  5. What file formats can the appliance index?
  6. Will the appliance crawler follow the URLs found within non-HTML documents?
  7. How are ranking and relevancy determined?
  8. Does the appliance perform any result filtering?
  9. What's the difference between keywords and queries?
  10. Can you link results to relevant areas inside of large documents?
  11. How does the appliance handle session IDs in the URL?
  12. How does the appliance determine whether a URL has changed since the previous crawl?
  13. Can the appliance extract the internal modification date of non-HTML files?
  14. Where can I get more information on the Google Search Appliance?

Q: What is a Google Search Appliance?
A: The Google Search Appliance brings the power of Google to the Texas A&M University campus. The appliance is a custom-built server containing the Google search software, and it is run and administered locally. This allows us to customize results and make them more relevant than those users would get from the public Google search.
Q: How is the appliance different from the public Google search?
A: Only the Texas A&M web presence is included in the search index. Further, we can customize the results to make them more relevant than those provided by the public search. Most of the features and syntax available in the public Google search are incorporated into the search appliance.
Q: Is every web page published at Texas A&M University indexed in the search appliance?
A: No. There are a number of reasons a web page might not be included in the index. The engine finds pages by "spidering" the web, so a page that isn't linked from another page in the index and hasn't been submitted to the search engine administrators likely will not be in the index. Further, we have purposefully removed pages that violate our license agreement or that create "black holes" (for example, calendars or bulletin boards with links to "next month" that effectively go on forever).
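
The spidering behavior described above can be sketched as a breadth-first walk over links. This is only an illustration, not the appliance's crawler: the page names, the `crawl` helper, and the `max_pages` cap are all assumptions made for the example. It shows why an unlinked page is never discovered and why a "black hole" needs some kind of limit.

```python
from collections import deque

def crawl(get_links, seeds, max_pages=1000):
    """Breadth-first walk; returns every page discovered from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in get_links(page):
            # The page cap is an assumed safeguard against endless link chains.
            if target not in seen and len(seen) < max_pages:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical site: "orphan.html" exists, but no crawled page links to it.
site = {
    "index.html": ["about.html", "news.html"],
    "about.html": ["index.html"],
    "news.html": ["about.html"],
    "orphan.html": ["index.html"],
}
found = crawl(lambda page: site.get(page, []), ["index.html"])

def calendar_links(page):
    """Hypothetical calendar: every month links to the next one, forever."""
    month = int(page.rsplit("=", 1)[1])
    return [f"calendar?month={month + 1}"]

# Without the cap this walk would never end.
calendar_pages = crawl(calendar_links, ["calendar?month=1"], max_pages=50)
```

Here `found` contains the three linked pages but not `orphan.html`, and the calendar walk stops only because of the cap.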
Q: How many documents are in the Texas A&M license?
A: The Division of Marketing & Communications has licensed the search appliance for 2 million documents. However, the hardware on which the appliance runs can be licensed for up to 3 million documents should we need to expand the size of the index.
Q: What file formats can the appliance index?
A: The Google Search Appliance can index a wide variety of formats, including HTML, PDF, text, PostScript, Microsoft Office, and many more.
Q: Will the appliance crawler follow the URLs found within non-HTML documents?
A: Yes, the Google Search Appliance will follow links contained in PDF documents and Flash content. It will not follow links in other formats, such as Microsoft Office documents.
Q: How are ranking and relevancy determined?
A: Two algorithms developed by Google, Hypertext Analysis and PageRank™, determine relevancy. Hypertext Analysis examines more than 100 factors as it indexes each page, while PageRank analyzes how the pages within the sites link to one another.
Q: Does the appliance perform any result filtering?
A: Yes, the results returned for each query are automatically filtered through a Google quality filter. This filter removes duplicate results with identical snippets (descriptions) for the given search terms. Only the most relevant result is returned. The filter can be bypassed on a query-by-query basis by adding "&filter=0" to the query URL. Note that the appliance will also filter out pages with identical text content during indexing. It is currently not possible to disable duplicate page filtering.
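
As a rough sketch of bypassing the filter for a single query, the snippet below builds a search URL with "filter=0" appended. The host name, the `/search` path, and the `q` parameter name are assumptions for illustration; only the `filter=0` setting comes from the answer above. Note that the "&" in "Texas A&M" has to be URL-encoded so it is not read as a parameter separator.

```python
from urllib.parse import urlencode

# Hypothetical appliance address; substitute the real search front end.
SEARCH_BASE = "https://search.example.edu/search"

def build_search_url(query, bypass_filter=False):
    """Build a query URL, optionally adding filter=0 to skip the quality filter."""
    params = {"q": query}  # "q" as the query parameter name is an assumption
    if bypass_filter:
        params["filter"] = "0"
    return f"{SEARCH_BASE}?{urlencode(params)}"

url = build_search_url("Texas A&M", bypass_filter=True)
print(url)  # https://search.example.edu/search?q=Texas+A%26M&filter=0
```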
Q: What's the difference between keywords and queries?
A: Keywords are the individual words that a user types into the search box. A query is the complete search string that the user submits.

For example, if I search for Texas A&M, there are two keywords: Texas and A&M, but that is only one query: Texas A&M.
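
The same distinction can be shown by counting both over a small search log. This is a simplified sketch: the log entries are invented, and splitting on whitespace is an assumed rule that may not match how the appliance tokenizes queries.

```python
def split_keywords(query):
    """Split a query into whitespace-separated keywords (assumed, simplified rule)."""
    return query.split()

# Hypothetical search log: each entry is one query as typed by a user.
search_log = ["Texas A&M", "campus map", "Texas A&M"]

query_count = len(search_log)
keyword_count = sum(len(split_keywords(q)) for q in search_log)

print(split_keywords("Texas A&M"))  # ['Texas', 'A&M']
print(query_count, keyword_count)   # 3 6
```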

Q: Can you link results to relevant areas inside of large documents?
A: This is currently not supported. However, terms are highlighted in the cached version of the pages to allow users to easily see where terms appear in the original document.
Q: How does the appliance handle session IDs in the URL?
A: The appliance strips session IDs that match its built-in patterns before the URL is fetched from the server. A session ID that does not match one of the built-in patterns is left in place.
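
A minimal sketch of this kind of pattern-based stripping follows. The parameter names and the `;jsessionid=` path pattern are assumptions chosen for illustration; the appliance's actual built-in patterns are not listed here and may differ.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed patterns for illustration only, not the appliance's built-in list.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}
PATH_SESSION = re.compile(r";jsessionid=[^/?#]*", re.IGNORECASE)

def strip_session_id(url):
    """Remove session IDs that match the assumed patterns; leave everything else."""
    parts = urlsplit(url)
    path = PATH_SESSION.sub("", parts.path)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    # The remaining query parameters are re-encoded when the URL is rebuilt.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))

print(strip_session_id("http://example.edu/list.php?PHPSESSID=abc&page=2"))
# http://example.edu/list.php?page=2
print(strip_session_id("http://example.edu/page.jsp;jsessionid=ABC123?id=7"))
# http://example.edu/page.jsp?id=7
print(strip_session_id("http://example.edu/list.php?token=abc"))
# http://example.edu/list.php?token=abc  (no pattern matched, so unchanged)
```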
Q: How does the appliance determine whether a URL has changed since the previous crawl?
A: Unless the URL matches a pattern in the "Force Recrawl" field we've set up, the appliance performs a HEAD request to determine the Last-Modified date of the document. If the host claims that the document has not changed in the last 20 days and our indexed version is less than 20 days old, the appliance will assume that we have the most recent version of the document.

Once the document is downloaded, the appliance uses a content checksum to determine whether the document has changed.
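
The two checks above can be sketched as a pair of small functions. This is one literal reading of the rule, not the appliance's implementation: the function names are hypothetical, the dates are passed in rather than fetched with a real HEAD request, and SHA-256 stands in for whatever checksum the appliance actually uses.

```python
import hashlib
from datetime import datetime, timedelta

FRESHNESS_WINDOW = timedelta(days=20)  # the 20-day figure from the answer above

def can_skip_download(last_modified, indexed_at, now, force_recrawl=False):
    """True when the indexed copy is assumed to be the most recent version."""
    if force_recrawl:
        return False  # URLs matching a Force Recrawl pattern are always fetched
    host_unchanged = now - last_modified >= FRESHNESS_WINDOW
    index_fresh = now - indexed_at < FRESHNESS_WINDOW
    return host_unchanged and index_fresh

def content_changed(old_checksum, body):
    """Compare a stored checksum with the checksum of the downloaded bytes."""
    return hashlib.sha256(body).hexdigest() != old_checksum

now = datetime(2024, 3, 1)
# Host reports no change for 45 days; our copy was indexed 5 days ago.
skip = can_skip_download(now - timedelta(days=45), now - timedelta(days=5), now)

stored = hashlib.sha256(b"<html>old</html>").hexdigest()
changed = content_changed(stored, b"<html>new</html>")
```

In this example `skip` is true, so no download happens; when a download does happen, `content_changed` decides whether the index entry needs updating.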

Q: Can the appliance extract the internal modification date of non-HTML files?
A: The appliance cannot extract the internal creation dates of non-HTML files. For non-HTML documents the appliance is limited to using the last modified date returned by the HTTP server.

For HTML documents, the appliance can extract dates from the title, text, or meta tags of a document, or use the last modified date returned by the HTTP server.

Q: Where can I get more information on the Google Search Appliance?
A: Google maintains a public documentation/help site for the Google appliance.
