In my post regarding Google’s Desktop Search Engine, I mentioned several objections I had to the technology. One of them was that it did not search past the first 5,000 words of a document.
Subsequent research has shown that this kind of partial indexing is common among search engines (although most of them go much farther than 5,000 words). How far into a document the indexer will go is a metric known as page depth, and according to searchenginewatch.com, it varies considerably from one engine to the next. Based on the site’s 2003 figures (the latest it has), here is how the major search engines measure up in terms of page depth:
- Google: 101 KB
- MSN: 150 KB
- Yahoo: 500 KB
So, as you can see, a search engine gives you only partial information about what is actually available, and the absence of a result should not be taken to mean that the item does not exist. In fact, the item might even be sitting in the search engine’s document cache; it is simply not in the engine’s index.
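
To make the effect concrete, here is a minimal Python sketch of an indexer that stops reading at a page-depth limit. It is only an illustration, not how any of these engines actually work: the `index_terms` function, the sample document, and the word "needle" are all made up for this example, and I am assuming that the limits are byte counts with 1 KB = 1024 bytes.

```python
import re

# Page-depth limits from the 2003 figures quoted above.
# Assumption: the limits are byte counts and 1 KB = 1024 bytes.
PAGE_DEPTH = {"Google": 101 * 1024, "MSN": 150 * 1024, "Yahoo": 500 * 1024}


def index_terms(document, depth_bytes):
    """Return the set of lowercase words found in the first depth_bytes bytes.

    Hypothetical helper: everything past the cutoff is never read,
    so it can never end up in the index.
    """
    truncated = document.encode("utf-8")[:depth_bytes]
    # A multi-byte character may be cut in half at the boundary; drop it.
    text = truncated.decode("utf-8", errors="ignore")
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# A made-up document of roughly 200 KB with one distinctive word at the very end.
document = "filler " * (200 * 1024 // 7) + "needle"

for engine, depth in PAGE_DEPTH.items():
    found = "needle" in index_terms(document, depth)
    print(f"{engine}: 'needle' indexed = {found}")
```

With these assumed limits, only the 500 KB cutoff reaches the end of the document, so a search for "needle" would come up empty under the two smaller limits even though the word is right there in the file.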