It recently came to my attention that Google has a new Search Console where you can see the status of your web site in Google’s search index. I checked out what it says for this blog and I was a bit surprised.
Some things I expected, like the number of pages I’ve blocked in the robots.txt file to prevent crawling (although I didn’t know that a URL blocked there can still appear in search results). Other things were weirder, like this old post being recognized as a soft 404 Not Found response. My web server is properly configured and quite capable of sending correct HTTP response codes, so ignoring standards in that regard is just craziness on Google’s part. But the thing that caught my eye the most was the number of Excluded pages on the Index Coverage pane:
Considering that I have fewer than a thousand published blog posts, this number seemed high. Diving into the details, it turned out that most of the excluded pages were redirects to canonical URLs and Atom feeds for post comments. However, at least 160 URLs were permalink addresses of actual blog posts (there may be more, because the CSV export only contains the first 1000 URLs).
All of these were in the “crawled, not indexed” category. In their usual hand-waving way, Google describes this as:
The page was crawled by Google, but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.
I read this as “we know this page exists, there’s no technical problem, but we don’t consider it useful to show in search results”. The older the blog post, the more likely that it was excluded. Google’s index apparently contains only around 60% of my content from 2006, but 100% of that published in the last couple of years. I’ve tried searching for some of these excluded blog posts and indeed they don’t show in the results.
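A per-year breakdown like this can be approximated from the CSV export with a few lines of Python. This is only a rough sketch: the `URL` column name, the `/posts/YYYY/MM/DD/slug/` permalink layout, the example.com addresses and the per-year post counts are all assumptions made for illustration, not details of the real export or of this blog.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical permalink layout: /posts/YYYY/MM/DD/slug/
# Feeds and other sub-pages under a post do not match.
PERMALINK_RE = re.compile(r"/posts/(\d{4})/\d{2}/\d{2}/[^/]+/?$")


def excluded_posts_per_year(csv_file):
    """Count excluded permalink URLs per publication year.

    Assumes the export has a header row with a column named "URL".
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        match = PERMALINK_RE.search(row["URL"])
        if match:
            counts[int(match.group(1))] += 1
    return counts


def indexed_share(published, excluded):
    """Fraction of published posts per year that were not excluded."""
    return {
        year: (total - excluded.get(year, 0)) / total
        for year, total in published.items()
        if total
    }


# Made-up sample standing in for the real export file.
sample = io.StringIO(
    "URL,Last crawled\n"
    "https://example.com/posts/2006/03/01/first/,2018-01-01\n"
    "https://example.com/posts/2006/05/12/second/,2018-01-02\n"
    "https://example.com/posts/2006/05/12/second/feed/,2018-01-02\n"
)
excluded = excluded_posts_per_year(sample)

# Made-up totals of published posts per year.
for year, share in sorted(indexed_share({2006: 5, 2018: 4}, excluded).items()):
    print(year, f"{share:.0%}")
```

Since the export is capped at the first 1000 URLs, any share computed this way is an upper bound on what is actually indexed.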
I have no intention of complaining about my early writings not being shown to Google’s users. As long as my web site complies with generally accepted technical standards, I’m happy. I write about things that I find personally interesting and that I earnestly believe might be useful information in general. I don’t feel entitled to be shown in Google’s search results, and what they include in their index or not is their own business.
That said, it did make me think. I’m using Google Search almost exclusively to find information on the web. I suspected that they heavily prioritize new over old, but I’ve never seriously considered that Google might be intentionally excluding parts of the web from their index altogether. I often hear the sentiment that the old web is disappearing, that the long tail of small websites is as good as gone. Some old one-person web sites may indeed be gone for good, but as this anecdote shows, some such content might just not be discoverable through Google.
All this made me switch my default search engine in Firefox to DuckDuckGo. Granted, I don’t know what they include or exclude from their search either. I have yet to see how well it works, but maybe it isn’t such a bad idea to go back to the time when trying several search engines for a query was standard practice.