Change In The DNA 3: What Happened To My Site On Google?
- My page no longer comes up tops at Google for a particular search term. Why not?
- Why does/did running the filter test bring my site back into the top results?
- Has Google done this to force people to buy ads?
- Is there REALLY no connection between ads and free listings at Google?
- Does Google have a "dictionary" of "money terms" it uses to decide when to filter out some web sites?
- How can Google be allowed to hurt my business in this way?
- I heard Google's dropping pages that show signs of search engine optimization. Do I need to deoptimize my web pages?
- Does the filter test indicate that I've spammed Google?
- Does this mean Google no longer uses the PageRank algorithm?
- I thought the Google Dance was over, that the massive monthly update of pages had been replaced by a consistent crawl?
- If we remove our shopping cart, could that help us get back on Google, even though we'd be booted off Froogle?
- Can you get on a soapbox about all these Google changes?
- Does Google favor some large sites like Amazon because of partnerships it has?
- Does being in AdSense help sites rank better?
- Can I sue Google for being dropped?
Google made a change to its algorithm at the end of last month. This is obvious to any educated search observer, and Google itself has confirmed it. The change has caused many people to report that some of their pages fell in ranking. These pages no longer please Google's algorithm as much as they did in the past.
If your page has suddenly dropped after being top ranked for a relatively long period of time (at least two or three months), then it's likely that your page is one of those no longer pleasing the new Google algorithm. Running what's called the filter test may help confirm this for you, at least in the short term.
Keep in mind that while many pages dropped in rank, many pages also consequently rose. However, those who dropped are more likely to complain about this in public forums than those who've benefited from the move. That's one reason why you may hear that "everyone" has lost ranking. In reality, for any page that's been dropped, another page has gained. In fact, WebmasterWorld is even featuring a thread with some comments from those who feel the change has helped them.
Why would Google run two different systems? My ideas are covered more in the Speculation On Google Changes article for Search Engine Watch members. The short answer is that I think the new system requires much more processing power than the old one. If so, then Google probably applied it initially to "easy" queries, such as those that didn't involve the exclusion or "subtraction" of terms.
Why are more and more "hard" queries now going through the new system? It could be that Google was testing out the new system on easier queries and then planned to slowly unleash it on everything.
Alternatively, Google may have intended to run two algorithms all along but is being forced to abandon that plan because of the furor as site owners who've lost rankings use the filter test to see what they consider to be "old" Google.
Since it was discovered, the filter test has been used by hundreds, if not thousands of webmasters. These queries are processor intensive. They also have created an embarrassing situation Google has never faced before, where anyone can compare what looks to be "old" versus "new" results to show how the old results are better. Sometimes the new results might be better, of course -- but it's the mistakes in relevancy that get the most attention. They can be used as proof that new Google is worse than old Google.
As a result, Google may have ultimately decided that it needs to bring all queries into the new system -- if only to plug a "hole" it may have never anticipated opening into how it works internally.
Google won't confirm if it has been using two algorithms simultaneously. I can only tell you I've spoken with them at length about the recent changes, and that they've reviewed the article you're reading now.
Whether you choose to believe my speculation or instead the idea that Google has employed some type of "filter" almost makes no difference. The end result is the same. For some queries, there are now dramatic differences from what "old" Google was showing.
Google completely denies charges it's trying to boost ad sales. The company says the algorithm change was done as part of its continual efforts to improve results, and it has always said there is no connection between paying for an ad and getting listed in its "free" results.
There's also plenty of evidence of people who, despite being advertisers, lost their "free" top rankings, as well as people who've never run ads yet continue to rank well. This makes it difficult for anyone to conclusively say that this change was ad driven.
In my view, there are far easier ways Google could boost ad revenue without resorting to sneaky, behind-the-scenes actions -- which is why I tend to believe this is not why the change happened.
For instance, Google could make the first five links on a page -- rather than the first two links -- be paid ads for certain queries. They might also make this happen for terms determined to be commercial in orientation and offer up a defense that they've determined the commercial intent of the query is strong enough to justify this.
Other serious observers of search engines I know also doubt the change was ad driven, though certainly not all. Those who believe Google's denials feel the company would simply risk too much in the long term for any short-term gains it might get.
In terms of listing support, buying ads may be helpful. Some who spend a lot on paid listings at Google have reported success in getting their ad reps to pass along problems about their entirely separate free listings to Google's engineering department for investigation.
To some degree, this is like a backdoor for fast support. Those who aren't spending with Google's AdWords program have no such speedy solution to getting an answer back. Google has continually rejected suggestions that it should offer a "listing support" or paid inclusion program, saying it fears this might be seen as establishing a link between payment and its free results. For a deeper exploration of this, see my article for Search Engine Watch members from last year, Coping With Listing Problems At Google.
For the record, Google flatly denies that those who are advertising get more access. The company says it takes feedback from many sources, and every report is assessed for how it might have an impact on search quality.
Indeed, it's important to note that Google does provide another backdoor that plenty of non-advertisers have made use of. This is the WebmasterWorld.com forum site, where public and private messages to "GoogleGuy," a Google employee monitoring discussions, have been acted upon.
Google also turns up at various search engine conferences, such as the Search Engine Strategies show produced by Search Engine Watch that begins in Chicago on Tuesday. Google provides assistance to those with questions at these types of conferences, as well.
Google also offers a front door in the form of email addresses it publishes. Yes, you should expect a canned response to many queries. However, some people do get more personal investigation, as well.
It's also crucial to make the HUGE distinction between listing support and rank boosting. Investigating why a page may not be listed at all (rather than ranking well) is an appropriate activity for Google or any search engine. Boosting the rank of a particular page in return for payment, and not disclosing this, is not acceptable.
Q. Does Google have a "dictionary" of "money terms" it uses to decide when to filter out some web sites?

This theory has emerged as people have run the filter test and discovered that for some queries, Google will show many more changes than for others. The Scroogle hit list provides a long look at examples like this. It reflects 24 hours worth of queries various people have tried at Scroogle to see if they've declined in the new ranking algorithm. Terms that had many changes are at the top of the list.
For example, earlier this week the Scroogle hit list showed that the top 99 of 100 results in a search for christmas present idea at Google were different under the new algorithm compared to the old. That's not entirely accurate, as explained more in my previous article. But overall, it's close enough. For that query, things have radically changed. The same was true for terms such as diet pill and poker gambling, both of which could be considered highly commercial in nature.
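The kind of comparison Scroogle makes can be sketched in a few lines. This is an illustrative toy, not Scroogle's actual code; the site and result names are made up, and in practice you'd be comparing the top 100 URLs returned for the same query by the "old" and "new" result sets:

```python
# Count how many of the old top results no longer appear in the new top results.
def dropped_count(old_results, new_results):
    """Number of old top-ranked URLs missing from the new top rankings."""
    return len(set(old_results) - set(new_results))

# Hypothetical data: the new top 100 keeps only one URL from the old top 100.
old = [f"site{i}.example.com" for i in range(100)]
new = [f"site{i}.example.com" for i in range(99, 199)]

print(dropped_count(old, new))  # 99 -- a query like "christmas present idea"
print(dropped_count(old, old))  # 0 -- a query the new algorithm left alone
```

A high score on this kind of count is what pushed terms like diet pill to the top of the hit list.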
That's where the idea of "money terms" comes from. Sites aiming to rank well for these terms may be expecting to make money. Some believe Google has thus decided to filter out some of these sites -- particularly the ones showing an intent to optimize their pages for Google and which are not major commercial entities -- and force them into buying ads.
It's a compelling theory. However, there are also commercial terms that showed little change, such as christmas time, books, sharp ringtones and games. The hit list is also compiled by those who are checking their own terms. As you might expect, that means it will be heavily skewed toward commercial queries. If a bunch of librarians entered a mass of non-commercial terms, there might have been some dramatic changes seen for that class of queries, as well.
In fact, a search for 1 2 3 4 5 6 7 8 was on the Scroogle hit list, someone obviously trying to test what happens with non-commercial searches. It came up with a score of 36 dropped pages. That's high enough to make you think that phrase might be in the "money list" dictionary, yet nothing about it appears commercial in nature.
There's no doubt the new algorithm has hit many commercial queries very hard, in terms of the amount of change that's been seen. However, this seems more a consequence of how the new algorithm works than of it coming into play only for certain terms. In other words, new criteria on how much links should count, whether to count particular links, when to count anchor text (the text in a hyperlink) more and even what's considered spam probably have more impact on commercially-oriented queries.
It is possible that Google is also making use of its AdWords data. It wouldn't be difficult to examine which terms attract a lot of earnings and use that data to make a list or even to feed the new algorithm.
For its part, Google won't confirm whether it is using some type of list or not.
In the end, whether there's a predefined list of terms or this is something happening just as a consequence of the new algorithm is moot. The final result is the same -- many sites that did well in the past are no longer ranking so highly, leaving many feeling as if they've been targeted.
Back before we had paid listings, one of my top search engine optimization tips was not to depend solely on search engines. They have always been fickle creatures. Today's cries about Google and lost traffic are certainly the worst I've ever heard. But I can remember similar complaints being made about other major search engines in the past, when algorithm changes have happened. WebmasterWorld.com even has a good thread going where people are sharing past memories of this.
We do have paid listings today, of course. That means you can now depend solely on search engines for traffic -- but only if you are prepared to buy ads.
As for free listings, these are the search engine world's equivalent of PR. No newspaper is forced to run favorable stories constantly about particular businesses. It runs the stories it decides to run, with the angles it determines to be appropriate. Free listings at search engines are the same. The search engines can, will and have in the past ranked sites by whatever criteria they determine to be best. That includes all of the major search engines, not just Google.
To me, the main reason Google's changes are so painful is the huge amount of reach it has. Google provides results to three of the four most popular search sites on the web: Google, AOL and Yahoo. No other search engine has ever had this much range. Go back in time, and if you were dropped by AltaVista, you might still continue to get plenty of free traffic from other major search engines such as Excite or Infoseek. No one player powered so many important other search engines, nor were typical web sites potentially left so vulnerable to losing traffic.
The good news for those who've seen drops on Google is that its reach is about to be curtailed. By the middle of January, it will be Yahoo-owned Inktomi results that are the main "free" listings used by MSN. Sometime early next year, if not sooner, I'd also expect Yahoo to finally stop using Google for its free results and instead switch over to Inktomi listings.
When these changes happen, Google will suddenly be reduced from having about three quarters of the search pie to instead controlling about half. That means a drop on Google won't hurt as much.
Inktomi will have most of the other half of that pie. Perhaps that will be better for some who were recently dropped in ranking at Google. However, it's possible they'll find problems with Inktomi, as well.
In the past, I've heard people complain that paid inclusion content with Inktomi gets boosted or that crawling seems curtailed to force them into paid inclusion programs. Those complaints have diminished primarily because Inktomi's importance has diminished. Indeed, when Inktomi changed its algorithm in October, there were some negative impacts on site owners that surfaced. However, those concerns were hardly a ripple compared to the tidal wave of concern over Google. Once Inktomi's importance returns, so will likely a focus on any perceived injustices by Inktomi.
Q. I heard Google's dropping pages that show signs of search engine optimization. Do I need to deoptimize my web pages?

If you absolutely know you are doing something that's on the edge of spam -- invisible text, hidden links or other things that Google specifically warns about -- yes, I would change these.
Aside from that, I'd be careful about altering stuff that you honestly believe is what Google and other search engines want. In particular, I would continue to do these main things:
- Have a good, descriptive HTML title tag that reflects the two or three key search phrases you want your page to be found for.
- Have good, descriptive body copy that makes use of the phrases you want to be found for in an appropriate manner.
- Seek out links from other web sites that are appropriate to you in content.
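The first two points of that checklist can be sanity-checked mechanically. Here's a small sketch using Python's standard-library HTML parser -- a hypothetical helper, not any official tool, and the sample page and phrases are made up:

```python
from html.parser import HTMLParser

class PageText(HTMLParser):
    """Collect a page's <title> text separately from its other text."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.body_text.append(data)

def check_phrases(html, phrases):
    """For each target phrase, report (appears in title, appears in body copy)."""
    parser = PageText()
    parser.feed(html)
    title = parser.title.lower()
    body = " ".join(parser.body_text).lower()
    return {p.lower(): (p.lower() in title, p.lower() in body) for p in phrases}

page = ("<html><title>Handmade Oak Furniture</title>"
        "<body>We build oak furniture by hand.</body></html>")
print(check_phrases(page, ["oak furniture"]))  # {'oak furniture': (True, True)}
```

A page that returns (False, False) for its key phrase has a basic content problem no algorithm change explains.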
I almost hesitate to write the above. That's because I'm fearful many people will assume that some innocent things they may have done are hurting them on Google. I really don't feel that many people have dropped because Google is suddenly penalizing them. Instead, I think it's more a case that Google has done a major reweighting of the factors it uses, in particular how it analyzes link text. In fact, that's exactly what Google says. Most changes people are seeing are due to new ranking factors, not because someone has suddenly been seen to spam the service, the company tells me.
Should you start asking sites to delink to you, or to drop the terms you want to be found for from the anchor text of those links? Some have suggested this. If these sites have naturally linked to you, I wouldn't bother. Links to you shouldn't hurt. In fact, the biggest reason for a lot of these changes is likely that links are simply being counted in an entirely new way -- and some links just may not count for as much.
Should you not link out to people? Linking out is fine in my view and should only hurt you if you are linking to perhaps "bad" sites such as porn content. Do that, and you could be associated with that content.
It's also a good time for me to repeat my three golden rules of link building:
- Get links from web pages that are read by the audience you want.
- Buy links if visitors that come solely from the links will justify the cost.
- Link to sites because you want your visitors to know about them.
Think about it like a test. Let's say that in this test, people were judged primarily on how they answered a written question, but multiple choice and verbal portions of the test also counted. Now the criteria have changed. The verbal portion counts for more, and you might be weaker in this area. That means someone stronger might do better on the test. You aren't doing worse because of any attempt to "cheat" but simply because the criteria are different.
Unfortunately, some who write about Google have called its entire ranking system PageRank, and Google itself sometimes makes this mistake, as seen on its webmaster information page:
The method by which we find pages and rank them as search results is determined by the PageRank technology developed by our founders, Larry Page and Sergey Brin.
In reality, the page describing Google's technology more accurately puts PageRank at the "heart" of the overall system, rather than giving the system that overall name.
By the way, PageRank has never been the factor that beats all others. It has been and continues to be the case that a page with a low PageRank can outrank a page with a higher one. Search for books, and if you have the PageRank meter switched on in the Google Toolbar, you'll see how the third-ranked Online Books Page, with a PageRank of 8, comes above O'Reilly, even though O'Reilly has a PageRank of 9. That's just one quick example, but I've seen others exactly like this in the past, and you can see plenty first-hand by checking yourself.
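To see why PageRank is one score per page rather than a whole ranking system, here's a toy power-iteration sketch of the published PageRank idea. Assumptions: a damping factor of 0.85 and a three-page made-up web; this is the algorithm in miniature, not Google's production system, which combines this score with many other factors:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small "teleport" share, plus shares from inlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: ignored here for simplicity
            share = damping * rank[p] / len(outlinks)
            for q in outlinks:
                new_rank[q] += share
        rank = new_rank
    return rank

# Tiny hypothetical web: "c" is linked to by both "a" and "b".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))  # "c" ranks highest
```

Even in this toy, the output is just a score per page; turning scores into results for a query involves everything else the algorithm weighs, which is why a PageRank-8 page can sit above a PageRank-9 one.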
Q. I thought the Google Dance was over, that the massive monthly update of pages had been replaced by a consistent crawl?

To some degree, the Google Dance had diminished. Historically, the Google Dance has been the time every month when Google updated its web servers with new web pages. That naturally produced changes in the rankings and so was closely monitored. Sometimes, an algorithm change would also be pushed out. That could produce a much more chaotic dance.
Since June, life has been mercifully quiet on the dance front. Google has been moving to refresh more of its database on a constant basis, rather than once per month. That's resulted in small changes spread out over time.
Google says that continual updates are still happening. The dance came back not because of a return to updating all of its servers at once but rather because of pushing out a new ranking system.
Q. If we remove our shopping cart, could that help us get back on Google, even though we'd be booted off Froogle?

This question coincidentally came in just after I saw Google implement Froogle links in its search results for the first time. Talk about timing!
No, removing your shopping cart really shouldn't have an impact on your regular Google web page rankings. Lots of sites have shopping carts. It's perfectly normal to have them.
As you also note, having an online shopping service means you have data to feed Google's shopping search engine Froogle. And Froogle's now hit Google in a big way. If Froogle has matches to a query, then Froogle links may be shown above web page matches at Google.
It works much like the way you may get news headlines. Search for iraq, and you'll see headlines appear above the regular web listings, next to the word "News." Search for a product, and you may see similar links listing product information from Froogle, next to the words "Product Search."
Google unveiled the new feature late Friday, and it's to be rolled out over this weekend, the company tells me. A formal announcement is planned for next week, and Search Engine Watch will bring you more about this.
In the meantime, anyone who's been dropped by Google in its regular web search results should seize upon Froogle as a potential free solution to getting back in. Froogle accepts product feeds for free -- see its Information For Merchants page for more. And since Froogle listings are now integrated into Google's pages, it means you can perhaps regain visibility this way.
For more about Froogle, see these past articles from Search Engine Watch:
- Online Shopping with Google's Froogle
- Getting Listed In Google's "Froogle" Shopping Search Engine (for Search Engine Watch members)
I truly believe that Google has done us wrong. We worked hard to play by the rules, and Google shot us in the back of the head.
That comment is typical of many you see in the forums. Many people are mystified as to why they are suddenly no longer deemed good enough by Google, especially if they had been doing well for a long period of time and feel they played by the "rules."
Yes, free listings aren't guaranteed. Yes, search engines can do what they want. Yes, it's foolish for anyone to have built a business around getting what are essentially free business phone calls via Google.
None of that helps the people feeling lost about what to do next. Many have been dropped but may see sites similar to theirs still making it in. That suggests there's a hope of being listed, if they only understood what to do. So what should they do? Or what shouldn't they be doing?
My advice is unchanged -- do the basic, simple things that have historically helped with search engines. Have good titles. Have good content. Build good links. Don't try to highly-engineer pages that you think will please a search engine's algorithm. Focus instead on building the best site you can for your visitors, offering content that goes beyond just selling but which also offers information, and I feel you should succeed.
Want some more advice along these lines? Brett Tabke has an excellent short guide of steps to take for ranking better with Google, though I think the tips are valid for any search engine. Note that when GoogleGuy was recently asked in a WebmasterWorld members discussion what people should do to get back in Google's good graces, he pointed people at these tips.
I Did That -- And Look At How It Hasn't Helped!
Unfortunately, some believe they've followed these types of tips already. Indeed, one of the nice things about Google's growth over the past three years is that it has rewarded webmasters who have good content. As they've learned this, we've seen a real shift away from people feeling they need to use what are often dubbed "black hat" techniques such as targeted doorway pages, multiple mirror sites and cloaking.
That's why it's so alarming to see the sudden reversal. Some people who believe they've been "white hat" now feel Google's abandoned them. Perhaps some have not been as white hat as they thought, but plenty are. Many good web sites have lost positions on Google, and now their owners may think they need to turn to aggressive tactics. This thread at WebmasterWorld is only one of several that show comments along these lines.
Maybe the aggressive techniques will work, and maybe not. But my concern is really reserved for the mom-and-pop style operations that often have no real idea what "aggressive" means. To them, aggressive means that they think they need to place H1 tags around everything, or that every ALT tag should be filled with keywords, or that they should use the useless meta revisit tag because somewhere, somehow, they heard this was what you need to do.
More Openness From Google
One thing that would help is for Google to open up more. It has a new ranking system, obviously. It should be trumpeting this fact and outlining generally what some of these new mystery "signals" are that it is using to help determine page quality and context.
Google can provide some additional details about how it is ranking pages in a way that wouldn't give away trade secrets to competitors nor necessarily give some site owners a better ability to manipulate its listings. Doing so would make the company look less secretive. It might also help explain some of the logic about why sites have been dropped. That would help readers like this:
What really concerns me right now is that there doesn't appear to be any rhyme or reason as to why some sites have a good ranking and what we could do to improve our rankings.
Maybe Google has decided that it makes more sense to provide informational pages on certain topics, because otherwise its listings look the same as ads (see the honeymoon case study for an example of this).
If so, that's fine. It can defend this as helping users, ensuring they have a variety of results. But at least the declaration that it is doing so will let site owners understand that they may need to create compelling informational content, not sales literature. They may also realize that they simply are not going to get back free listings, for some terms. With that understanding, they can move on to ads or other non-search promotional efforts.
Searchers Want To Know, Too
Google doesn't just need to explain what's going on to help webmasters and marketers. Most important, some of Google's searchers want to know how it works behind the scenes.
Google has set itself up almost as a Consumer Reports of web pages, effectively evaluating pages on behalf of its searchers. But Consumer Reports publishes its testing criteria, so that readers can be informed about how decisions are made. It's essential that Google -- that any search engine -- be forthcoming in the same manner.
To its credit, Google has given out much information. There's a huge amount published for webmasters, and even more is shared through forums and conferences. But if Google is now doing things beyond on-the-page text analysis and link analysis that it has publicly discussed, it needs to share this so searchers themselves can be more informed about how decisions are reached.
Right now, some of these searchers are reading news reports that a search for miserable failure brings up US president George W. Bush's biography as the top result. They'll want to understand why. Is Google calling Bush a miserable failure? Is this an example of Google's "honest and objective way to find high-quality websites with information relevant to your search," as its technology page describes?
The answer to both questions is no. Google Bombing has made that biography come up first, and those doing the bombing have no "objective" intentions behind it. They think Bush is a failure, and they are using Google as a means to broadcast that view.
Does this mean Google is a miserable failure as a search engine? No. Ideally, Google should have caught such an overt attempt to influence its rankings, and it's notable that this got past even its new ranking system. However, Google is not perfect, nor will it ever be. Fortunately, searchers seeing a listing like that can understand why it came up if they understand a bit about how link analysis works. That helps them better evaluate the information they've received.
Now go search for christmas at Google. I bet plenty of searchers are wondering, as my colleague Gary Price of ResourceShelf did when he reported this to me, why Marylaine Block's web site is ranked sixth for christmas out of 36 million possible web pages.
Block's not sure herself. Links may have something to do with it, but so might some of these new "signals" about page quality and content of which Google cannot speak. Since Google's not talking, we can't understand -- and crucially -- forgive when it makes mistakes.
Marketer Reality Check
Having dumped on Google, it's also important that webmasters and marketers understand that Google is never going to outline exactly how it works. No popular search engine will ever do this, because the volume of successful spam that would result would bring the search engine to its knees.
Marketers also have to recognize that Google and other search engines will continue altering their ranking systems, just as they always have done -- and that listings will change, sometimes dramatically, as a result.
Whether Google and the others discuss openly how they work or not, people eventually discover new ways to be successful with spam. That has to be fought.
More important, the nature of search keeps changing. Links were a useful "signal" to use and one that gave the relevancy of web crawling a new lease on life several years ago. Now linking is different. Blogs link in a way that didn't exist when Google launched. Reciprocal linking and link selling is much more sophisticated and often designed to take Google and search engines into account. These are just two reasons why the methods of analyzing links have to change.
It's also a certain fact that the most popular and lucrative real estate on a search engine is not going to continue to use web crawling as its first source of data. It simply makes more sense to go with specialized data sources when these are available. Web search's destiny is to be backfill for when these other forms of data fail to find matches.
Free traffic from web listings will inevitably decline as search engines make use of specialized data sources through invisible tabs. It won't go away entirely, and there's always going to be a need to understand "search engine PR" to influence free results. But smart marketers will realize that they need to look beyond web search to stay ahead.
If Google dropped you, Froogle just got a promotion as a new way to get back in. So, too, will other opportunities come up. The downside is, unlike Google -- or even Froogle -- they'll likely cost money. Smart businesses will realize they need to budget for this, just as they budget for advertising and to obtain leads in the real world. It's the rare and exceptional company that can get by on PR alone -- even the UK's popular Pizza Express chain had to diversify into advertising.
Change In The DNA 7: Explaining algorithm updates and data refreshes
To answer in more detail, let’s review the definitions. You may want to review this post or re-watch this video (session #8 from my videos). I’ll try to summarize the gist in very few words though:
Algorithm update: Typically yields changes in the search results on the larger end of the spectrum. Algorithms can change at any time, but noticeable changes tend to be less frequent.
Data refresh: When data is refreshed within an existing algorithm. Changes are typically toward the less-impactful end of the spectrum, and are often so small that people don’t even notice. One of the smallest types of data refreshes is an:
Index update: When new indexing data is pushed out to data centers. From the summer of 2000 to the summer of 2003, index updates tended to happen about once a month. The resulting changes were called the Google Dance. The Google Dance occurred over the course of 6-8 days because each data center in turn had to be taken out of rotation and loaded with an entirely new web index, and that took time. In the summer of 2003 (the Google Dance called “Update Fritz”), Google switched to an index that was incrementally updated every day (or faster). Instead of a monolithic monthly event, Google would refresh some of its index pretty much every day, which generated much smaller day-to-day changes that some people called everflux.
Over the years, Google’s indexing has been streamlined, to the point where most regular people don’t even notice the index updating. As a result, the terms “everflux,” “Google Dance,” and “index update” are hardly ever used anymore (or they’re used incorrectly). Instead, most SEOs talk about algorithm updates or data updates/refreshes. Most data refreshes are index updates, although occasionally a data refresh will happen outside of the day-to-day index updates. For example, updated backlinks and PageRanks are made visible every 3-4 months.
Okay, here’s a pop quiz to see if you’ve been paying attention:
Q: True or false: an index update is a type of data refresh.
A: Of course an index update is a type of data refresh! Pay attention, I just said that 2-3 paragraphs ago. Don't get hung up on "update" vs. "refresh" since they're basically the same thing. There are algorithms, and there is the data that the algorithms work on. A large part of changing data is our index being updated.
I know for a fact that there haven’t been any major algorithm updates to our scoring in the last few days, and I believe the only data refreshes have been normal (index updates). So what are the people on WMW talking about? Here’s my best MEGO guess. Go re-watch this video. Listen to the part about “data refreshes on June 27th, July 27th, and August 17th 2006.” Somewhere on the web (can’t remember where, and it’s Christmas weekend and after midnight, so I’m not super-motivated to hunt down where I said it) in the last few months, I said to expect those (roughly monthly) updates to become more of a daily thing. That data refresh became more frequent (roughly daily instead of every 3-4 weeks or so) well over a month ago. My best guess is that any changes people are seeing are because that particular data is being refreshed more frequently.
Change In The DNA 8: Search Engine Size Wars & Google's Supplemental Results
Around this time last year, AllTheWeb kicked off a round of "who's biggest" by claiming the largest index size. Now it's happened again: last month, AllTheWeb said its index had increased to 3.2 billion documents, toppling the leader, Google.
Google took only days to respond, quietly but deliberately notching up the number of web pages listed on its home page that it claims to index. Like the McDonald's signs of old that were gradually increased to show how many customers had been served, Google went from 3.1 billion to 3.3 billion web pages indexed.
Yawn? Actually, not yawn. Instead, I'm filled with Andrew Goodman-style rage (and that's a compliment to Andrew) that the search engine size wars may erupt once again. In terms of documents indexed, Google and AllTheWeb are now essentially tied for biggest -- and hey, so is Inktomi. So what? Knowing this still gives you no idea which is actually better in terms of relevancy.
Size figures have long been used as a surrogate for the missing relevancy figures that the search engine industry as a whole has failed to provide. Size figures are also a bad surrogate, because more pages in no way guarantees better results.
How Big Is Your Haystack?
There's a haystack analogy I often use to explain this, the idea that size doesn't equal relevancy. If you want to find a needle in the haystack, then you need to search through the entire haystack, right? And if the web is a haystack, then a search engine that looks through only part of it may miss the portion with the needle!
That sounds convincing, but the reality is more like this. The web is a haystack, and even if a search engine has every straw, you'll never find the needle if the haystack is dumped over your head. That's what happens when the focus is solely on size, with relevancy ranking a secondary concern. A search engine with good relevancy is like a person equipped with a powerful magnet -- you'll find the needle without digging through the entire haystack because it will be pulled to the surface.
Google's Supplemental Index
I especially hate when the periodic size wars erupt because examining the latest claims takes time away from other more important things to write about. In fact, it was a great relief to have my associate editor Chris Sherman cover this story initially in SearchDay last week (Google to Overture: Mine's Bigger). But I'm returning to it because of a twist in the current game: Google's new "supplemental results."
What are supplemental results? At the same time Google posted new size figures, it also unveiled a new, separate index of pages that it will query if it fails to find good matches within its main web index. For obscure or unusual queries, you may see some results appear from this index. They'll be flagged as "Supplemental Result" next to the URL and date that Google shows for the listing.
Google's How To Interpret Your Search Results page illustrates this, but how about some real-life examples you can try? Here are some provided by Google to show when supplemental results might kick in:
- "St. Andrews United Methodist Church" Homewood, IL
- "nalanda residential junior college" alumni
- "illegal access error" jdk 1.2b4
- supercilious supernovas
Two Web Page Indexes Not Better Than One
Using a supplemental index may be new for Google, but it's old to the search engine industry. Inktomi did the same thing in the past, rolling out what became known as the small "Best Of The Web" and larger "Rest Of The Web" indexes in June 2000.
It was a terrible, terrible system. Horrible. As a search expert, you never seemed to know which of Inktomi's partners was hitting all of its information or only the popular Best Of The Web index. As for consumers, well, forget it -- they had no clue.
It also doesn't sound reassuring to say, "we'll check the good stuff first, then the other stuff only if we need to." What if some good stuff for whatever reason is in the second index? That's a fear some searchers had in the past -- and it will remain with Google's revival of this system.
Why not simply expand the existing Google index, rather than go to a two-tier approach?
"The supplemental is simply a new Google experiment. As you know we're always trying new and different ways to provide high quality search results," said Google spokesperson Nate Tyler.
OK, it's new, it's experimental -- but Google also says there are currently no plans to eventually integrate it into the main index.
Deconstructing The Size Hot Dog
Much as I hate to, yeah, let's talk about what's in the numbers that are quoted. The figures you hear are self-reported, unaudited and don't come with a list of ingredients about what's inside them. Consider the hot dog metaphor. It looks like it's full of meat, but if you analyze it, you might find a lot of water and filler making it appear plump.
Let's deconstruct Google's figure, since it has the biggest self-reported number, at the moment. The Google home page now reports "searching 3,307,998,701 web pages." What's inside that hot dog?
First, "web pages" actually includes some things that aren't web pages, such as Word documents, PDF files and even text documents. It would be more accurate to say "3.3 billion documents indexed" or "3.3 billion text documents indexed," because that's what we're really talking about.
Next, not all of those 3.3 billion documents have actually been indexed. There are some documents that Google has never actually indexed. It may list these in search results based on links it has seen to the documents. The links give Google some very rough idea of what a page may be about.
For example, try a search for pontneddfechan, a little village in South Wales where my mother-in-law lives. You should see in the top results a listing simply titled "www.estateangels.co.uk/place/40900/Pontneddfechan". That's a partially indexed page, as Google calls it. It would be fairer to say it's an unindexed page, since in reality, it hasn't actually been indexed.
What chunk of the 3.3 billion has really been indexed? Google's checking on that for me. They don't always provide an answer to this particular question, however. Last time I got a figure was in June 2002. Then, 75 percent of the 2 billion pages Google listed as "searching" on its home page had actually been indexed. If that percentage holds true today, then the number of documents Google actually has indexed might be closer to 2.5 billion, rather than the 3.3 billion claimed.
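The back-of-envelope math here is simple enough to sketch. This is purely illustrative: it assumes the June 2002 ratio (75 percent of listed pages actually indexed) still holds for the 2003 figure, which Google has not confirmed.

```python
# Rough estimate of how many of Google's claimed pages are fully indexed,
# assuming the June 2002 ratio (75%) still applies. Illustrative only.

claimed_pages = 3_307_998_701  # figure shown on Google's home page
indexed_ratio = 0.75           # share actually indexed, per the June 2002 figure

estimated_indexed = claimed_pages * indexed_ratio
print(f"{estimated_indexed / 1e9:.2f} billion")  # roughly 2.48 billion
```

If the true ratio has improved since 2002, the estimate rises accordingly -- the point is only that the headline number and the fully indexed number are not the same thing.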
But wait! The supplemental index has yet to be counted. Sorry, we can't count it, as Google isn't saying how big it is. Certainly it adds to Google's overall figure, but how much is a mystery.
Let's mix in some more complications. For HTML documents, Google only indexes the first 101K that it reads. Given this, some long documents may not be totally indexed -- so do they count as "whole" documents in the overall figure? FYI, Google says only a small minority of documents are over this size.
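To make the 101K cap concrete, here's a hypothetical sketch of what such a cutoff means for a long document. The 101K figure comes from the article; the function name and everything else are invented for illustration, not Google's actual implementation.

```python
# Illustrative sketch of a 101K indexing cap: anything past the cutoff
# simply never makes it into the index, even though the page "counts"
# as one document in the headline size figure.

INDEX_CAP_BYTES = 101 * 1024  # the 101K cap mentioned in the article

def truncate_for_index(html: str) -> str:
    """Return only the portion of a document that falls under the cap."""
    data = html.encode("utf-8")
    # errors="ignore" drops any multi-byte character split by the cutoff
    return data[:INDEX_CAP_BYTES].decode("utf-8", errors="ignore")

long_page = "x" * 200_000
indexed_part = truncate_for_index(long_page)
print(len(indexed_part.encode("utf-8")))  # 103424 bytes; the rest is invisible to search
```

So a 200K page still counts as one "web page" in the total, even though nearly half of it was never indexed.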
Auditing Sizes
OK, we've raised a lot of questions about what's in Google's size figure. There are even more we could ask -- and the same questions should be directed at the other search engines, as well. AllTheWeb's 3.2 billion figure may include some pages only known by seeing links and might include some duplicates, for example. But instead of asking questions, why not just test or audit the figures ourselves?
That's exactly what Greg Notess of Search Engine Showdown is especially known for. You can expect Greg will probably take a swing at these figures in the near future (his last test was done in December), and we'll certainly report on his findings. His test involves searching for single word queries, then examining each result that appears -- a time-consuming task. But it's a necessary one, since the counts from search engines have often not been trustworthy.
Grow, But Be Relevant, Too
I'm certainly not against index sizes growing. I also find self-reported figures useful, at least as a means of figuring out which services are approximately near each other. Maybe Google is slightly larger than AllTheWeb, or maybe AllTheWeb just squeaks past Google -- the more important point is that both are without a doubt well above a small service like Gigablast, which has only 200 million pages indexed.
However, that's not to say that a little service like Gigablast isn't relevant. It may very well be, for certain queries. Indeed, Google gained many converts back when it launched with a much smaller index than the established major players. It was Google's greater relevancy -- the ability to find the needle in the haystack, rather than bury you in straw -- that was the important factor. And so if the latest size wars should continue, look beyond the numbers listed at the bottom of the various search engine home pages and consider instead the key question: is the search engine finding what you want?
By the way, the baby of the current major search engine line-up, Teoma, did some growing up last month. The service moved from 500 million to 1.5 billion documents indexed.
Paul Gardi, vice president of search for Ask Jeeves, which owns Teoma, wants to grow even more. He adds that Teoma is focused mainly on English language content at the moment -- so the perceived smaller size of Teoma may not be an issue for English speakers. Subtract non-English language pages from Teoma's competitors, and the size differences may be much less.
"Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi said.