[Update – October 5, 2007] Five days after this article was posted (in French), 118 pages of the site are indexed on Google, which wins across the board for exhaustiveness, relevance and speed. No contest!
Yahoo! and Microsoft are still at the same point…and the others are worse: it’s unknown on Ask, and Exalead shows a thumbnail of a parking service for my site, which was parked over a year ago. Hello, relevance (it’s l'exception française)!
* * *
A few days ago I uploaded XBRL.name, a glossary in 7 languages on IFRS terminology.
For one thing, I was surprised to see that the domain name, which had existed on the site Studio92.net for over two years, had retained the PR4 (PageRank 4) of the page it was on, though that wouldn’t last!
At the same time, you can imagine how avidly I’m watching to see when my site will be indexed by the search engines. I check every day on GYM (Google, Yahoo!, Microsoft). The results are edifying! Here is the status as of October 1, eight days after the site was uploaded on September 23.
I should specify that it’s not completed; only 1/7 of the site is finished, a little less than 200 pages out of approximately 1400 expected when the site is complete.
Finally, this post makes no claim to be more than it is: a simple record of one week in the indexing of a new site. Nothing scientific here, just a personal experience. [Top]
* * *
It goes without saying that each of the three indexes comfortably exceeds 20 billion web pages! If you’re nostalgic, click here...
The engines don’t communicate much on the topic, except Microsoft, which makes a point of letting you know it has caught up, quadrupling the size of its index from 5 billion to 20 billion pages. OK!
However, Yahoo! was already declaring more than 19 billion pages in… August 2005 (despite Jean Véronis’s questioning) and Google, 24 billion pages three months later (see here, end of page 5)!
So while I partially agree with Eric Enge when he states that “At some level, the exact index size is not a big issue, unless, your index is simply too small”, I agree less with his idea that increased index size is related to increased relevance (“In short, Microsoft needed to make a move of this type to improve their relevance”).
Relevance is not necessarily dependent on coverage (“What's at issue is coverage... and if you don't have the related sites in the index, you can't return the right result”), since the engine may very well have the relevant site in its index and still keep quiet (not list a result).
And of course, Microsoft presented a demo to illustrate its point of view, specifically on "shelli segal" and the site of a corresponding designer, which appears first on Live Search but makes the grave error of being absent from Google’s index!
Might one suspect Microsoft of cooking up an ad hoc search just to justify its relevance, relevance, relevance?
A good way to find out is to test it with xbrl.name, for which the three search engines start on an equal footing, since it was uploaded eight days ago without being deliberately submitted for indexing; I just put the link on my blog and on several other sites. [Top]
* * *
Until yesterday, Google returned 190 results total and gave the following excerpt for the site:
My SPIP site. Search. Home page. My SPIP site. Follow-up of the site's activity RSS 2.0 | Site Map | Private area | SPIP | template.
That is, it had saved the SPIP installation I tested, before opting for a site in HTML.
But today – sigh of relief – Google returns 300 results and finally sees the new version of the site:
Conclusion: Google took note of the site in 8 days, although the content of the glossary does not yet seem to be indexed. [Top]
* * *
Yahoo! returns 30 results and the following excerpt:
This is the placeholder for domain xbrl.name. If you see this page after uploading site content ... This page has been automatically generated by Plesk.
Plus one page correctly indexed. What about the 200-some others?
So Yahoo! presents a tenth as many results as Google and just one page indexed. [Top]
* * *
On Live Search, just one result! Period. Same excerpt as on Yahoo!
Then there’s that last line that kills me: “Are you satisfied with Live Search? Tell us."
What can I say? That in light of the above, Microsoft definitely deserves its third place. Dead last!
The ranking is confirmed by my blog’s visit stats, as you can see in the table below:
Search engines were the source of 2,826 visits on Adscriptor during September and represented 41.21% of total visits (188 visitors and 242 pages viewed per day, with an average time on site of 1'35'' per visit) (not everyone’s named Otto, fortunately for him ;-).
With 2,575 referrals, Google alone accounts for more than 91% of these visits, versus 5.4% for Yahoo! and a third of Yahoo!’s share for Microsoft. Google is overwhelmingly superior. Why?
Clearly, if Google weren’t there, I would have a presence on the Internet…with zero visibility on search engines! [Top]
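For anyone who wants to check the arithmetic behind these shares, here is a minimal sketch in Python. It uses only the figures quoted above; the variable names are mine, for illustration, and the values for Microsoft, the other engines and total visits are back-calculated estimates, not numbers read from the stats table.

```python
# Figures quoted in the post (Adscriptor, September).
search_visits = 2826             # visits that arrived from search engines
search_share_of_total = 0.4121   # those visits as a share of all visits
google_visits = 2575             # visits referred by Google
yahoo_share = 0.054              # Yahoo!'s share of search visits

# Derived values: estimates only, not figures from the stats table.
google_share = google_visits / search_visits
microsoft_share = yahoo_share / 3   # "three times less than Yahoo!"
other_share = 1 - google_share - yahoo_share - microsoft_share
estimated_total_visits = search_visits / search_share_of_total

print(f"Google:        {google_share:.1%} of search visits")
print(f"Yahoo!:        {yahoo_share:.1%}")
print(f"Microsoft:     {microsoft_share:.1%} (estimate)")
print(f"Other engines: {other_share:.1%} (estimate)")
print(f"All visits:    about {estimated_total_visits:.0f} (estimate)")
```

Google's share comes out just over 91%, which matches the figure given above; everything else is only as good as the rounding in the quoted percentages.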
* * *
Index caching and refreshing
In addition to size and relevance, one last aspect of engine indexes is how often they are refreshed. Google’s cache cycle has shortened considerably of late (I don’t use Yahoo! or Microsoft enough to comment on them). Before, it seemed like the cache stayed around for a while and you could retrieve information several weeks later; now, it’s only a matter of days. For example, I was previously able to retrieve practically all of Alexis Debat’s fake interviews, but as the days go on, fewer and fewer can be found. [Top]
* * *
Concerning the performance Microsoft claims, Eric Enge is right when he says:
Ultimately, the point is, you can't return the right result if the site you should be returning for a given search is not in your index.
That’s clear. But it’s even worse to have the site in your index and not understand that the “right” site is precisely that one! [Top]
P.S. Well, it seems that Yahoo! and Microsoft are not giving up. They must have read my post overnight!
I tried Yahoo! Search again (it was recently improved, other details here); the tool still offers no suggestions:
but it has finally indexed the home page correctly. Everything else is the same: 31 results total and only 2 of the site’s pages.
On Live Search, too, the indexing is now correct for 2 of the site’s pages, which are the only results offered.
Meanwhile, Google has gone from 17 to 47 pages indexed: now several lengths ahead of the competition.
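To put those page counts in perspective, they can be set against the size of the site. This is only a rough sketch: the post says a little less than 200 pages are online, so 200 is my assumed upper bound, which makes each percentage a lower bound. The names `pages_online` and `coverage` are illustrative.

```python
# Assumption: upper bound on pages online ("a little less than 200").
pages_online = 200

# Pages of the site indexed by each engine, as reported in this P.S.
indexed_pages = {
    "Google": 47,
    "Yahoo!": 2,
    "Live Search": 2,
}

def coverage(indexed, total):
    """Fraction of the site's pages that an engine has indexed."""
    return indexed / total

for engine, count in indexed_pages.items():
    # The true total is below 200, so the real coverage is at least this.
    print(f"{engine}: {count} pages, at least {coverage(count, pages_online):.1%} of the site")
```

Even on this generous denominator, the gap between Google and the other two is more than a factor of twenty.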
That said, given the number of web pages on the Internet (???), it’s pretty remarkable to see a new site indexed in eight days on GYM. And it makes sense that the next steps in search by 2010 will be:
- search engine verticalization
- personalization of results
- universal search
GYM, Google, Yahoo, Microsoft, search engine, index, relevance, Internet