Update May 2017: Facebook has removed this page from their website.
Private email addresses of people who have Facebook accounts are being indexed by the search engines, including Google, Yahoo!, and Microsoft’s Bing.com. Certain pages on Facebook, mainly the Facebook Email Opt Out page, are getting indexed, and that is a problem because these pages contain email addresses that can easily be harvested.
To see for yourself, go on over to Google and search for something like this:
site:facebook.com “Do you want to stop receiving Facebook emails”
Or, you can simply go to Google.com and search for the page itself:
Google is not alone in indexing personal email addresses of Facebook users. In fact, Yahoo! Site Explorer reveals over 5700 email addresses:
Bing.com, on the other hand, has been very slow at indexing these web pages, probably because Microsoft has been very slow at indexing ANY web page on the internet. In fact, some site owners are complaining that Bing has not indexed their entire website. (If you want your pages indexed more quickly by Bing, I recommend getting more links to your web pages, particularly deep links.)
I have yet to figure out exactly how the search engines are indexing these web pages, as they actually contain people’s personal email addresses. If this were a really bad problem, there would be a lot more than 5000 email addresses from Facebook being indexed in the search engines. We already know that there are a lot more than 5000 Facebook users, so it doesn’t seem to be a really big issue. There are, though, a few ways that the search engines could be picking up on this:
– A Facebook application gone awry. Perhaps these users all have a certain common Facebook application that they’re using that is causing this data to be indexed. Perhaps the URLs are being recorded somewhere, on another server, and the search engines have started indexing those URLs.
– There’s some issue with the real-time feed data that is causing Google and Yahoo! to index those URLs. I have seen this happen before with other sites like Twitter. My main Twitter ID, @bhartzer, for a while had a source code (parameters) associated with it. I had a feeling Google was picking it up somehow, but couldn’t narrow down exactly where they were picking up the URL. It’s just recently that a search for “bhartzer” on Google has been showing http://www.twitter.com/bhartzer
So, what can you do to combat issues like this?
One of the best things you can do is use a social media monitoring tool to track mentions of you, your company, and your brand online. If you’ve set up a tracking tool properly, you will be notified when your brand, your company name, your domain name, or even your email address shows up somewhere, as in an issue like this one.
You can also set up a Google alert for something like “billhartzer.com”. Whenever there is a new mention of it, you’ll most likely be notified. Using your domain name without the www part as the alert ‘keyword’ will most likely help notify you if your email address shows up online. Certainly that won’t work for “@gmail.com” or other generic email addresses, but you get my point.
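If you already have a page's text in hand (from an alert or a monitoring tool), the same idea can be sketched in a few lines of Python: look for any email address at your own domain. This is only a rough illustration, not how Google Alerts works. The function name `find_domain_emails` and the sample addresses are made up for the example.

```python
import re
from typing import List


def find_domain_emails(text: str, domain: str) -> List[str]:
    """Return the unique email addresses at `domain` found in `text`, lowercased."""
    pattern = re.compile(
        r"[A-Za-z0-9._%+-]+@" + re.escape(domain)
        # Do not match when the domain continues (e.g. billhartzer.com.example).
        + r"(?![A-Za-z0-9-])(?!\.[A-Za-z0-9])",
        re.IGNORECASE,
    )
    return sorted({match.lower() for match in pattern.findall(text)})


# Hypothetical page text; these addresses are invented for illustration.
sample_text = (
    "Contact Bill@billhartzer.com or sales@billhartzer.com. "
    "Generic addresses like someone@gmail.com are ignored."
)
found = find_domain_emails(sample_text, "billhartzer.com")
print(found)
```

As with the alert, this only helps when the address is on a domain you own; it does nothing for generic addresses.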
Certainly, Facebook has been plagued by all sorts of privacy issues, and the fact that email addresses from an opt out web page on their own website are being indexed is not a good thing. ALL Facebook had to do was to make sure that their “opt out” page, the one that contains email addresses, isn’t allowed to be indexed by the search engines.
A hat tip goes out to Cory Watilo for finding this gem.
Update: June 4, 2010 – Apparently Facebook has taken care of this issue. The way they did it, though, was to add a directive in their robots.txt file to disallow the search engines from spidering the o.php file like this:
Apparently, this was enough, as both Yahoo! and Google have stopped indexing the o.php file.
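The exact lines Facebook added are not reproduced in this post, so the robots.txt content below is an assumption: a plain Disallow rule for /o.php that applies to all crawlers. With that caveat, here is a minimal sketch of how you could check such a rule using Python's standard urllib.robotparser module. The helper name `is_crawlable` and the sample URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: a hypothetical robots.txt, not Facebook's actual file.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /o.php
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs used only to exercise the rule.
opt_out_allowed = is_crawlable(ASSUMED_ROBOTS_TXT, "http://www.facebook.com/o.php?k=abc")
other_allowed = is_crawlable(ASSUMED_ROBOTS_TXT, "http://www.facebook.com/about.php")
print(opt_out_allowed, other_allowed)
```

Under the assumed rule, the o.php URL is off limits to compliant crawlers while other pages stay crawlable. Keep in mind that a robots.txt rule only tells spiders not to fetch the file, which is why I describe it above as the way they chose to handle it.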
We found those pages by crawling normal links on public web pages. It also would have been nice if the author of that blog post had asked us before claiming that the “only way” Google could have possibly found pages was by following links in emails. We could have saved him the trouble of making up a new conspiracy theory.