As part of an experiment in how to use crowdsourcing to generate descriptions of images, he had posted thumbnails of 25,000 pictures into a Google document, and then he invited people to describe the images. The problem was that these thumbnails linked back to original images stored on Amazon’s S3 storage service, and apparently, Google’s servers went slightly bonkers. “Google just very aggressively grabbed the images from Amazon again and again and again,” he says.
After analyzing his traffic logs, he determined that 250 gigabytes of traffic was being sent out every hour because of Google’s Feedfetcher, the mechanism that lets the search engine grab RSS or Atom feeds when users add them to Google Reader or the main Google page.
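The kind of log analysis involved is straightforward to sketch. The snippet below is a minimal, illustrative version in Python: it assumes standard Amazon S3 server access logs, assumes Feedfetcher announces itself with a user-agent string containing "Feedfetcher-Google", and uses a made-up log file name. It simply tallies how many bytes were served to Feedfetcher in each hour.

```python
import shlex
from collections import defaultdict

FEEDFETCHER_UA = "Feedfetcher-Google"  # assumed identifying substring

def feedfetcher_bytes_by_hour(log_path):
    """Sum bytes served to Feedfetcher per hour from S3 access logs."""
    totals = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            try:
                # shlex keeps quoted fields (request line, referrer, user
                # agent) intact; the bracketed timestamp splits into two
                # tokens, which the indices below account for.
                fields = shlex.split(line)
            except ValueError:
                continue  # skip malformed lines
            if len(fields) < 18:
                continue
            hour = fields[2].lstrip("[")[:14]   # e.g. 19/Apr/2012:14
            bytes_sent = fields[12]
            user_agent = fields[17]
            if FEEDFETCHER_UA in user_agent and bytes_sent.isdigit():
                totals[hour] += int(bytes_sent)
    return totals

if __name__ == "__main__":
    for hour, total in sorted(feedfetcher_bytes_by_hour("s3_access.log").items()):
        print(f"{hour}  {total / 1e9:.1f} GB")
```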
After speaking with Google representatives, Ipeirotis believes the company is trying to balance user privacy with a desire to present fresh content. It seems that Google doesn’t want to store the images on its own servers, so it uses Feedfetcher to retrieve them anew every time, generating large amounts of traffic in the process.
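A quick back-of-the-envelope calculation shows how fast that re-fetching adds up. The transfer price below is an assumption for illustration, roughly what Amazon charged for S3 data transfer out at the time, not a figure from the article or from Ipeirotis.

```python
# Rough cost sketch based on the 250 GB/hour figure above.
GB_PER_HOUR = 250
ASSUMED_PRICE_PER_GB = 0.12  # USD; assumed S3 transfer-out price of that era

daily_gb = GB_PER_HOUR * 24                    # 6,000 GB, about 6 TB per day
daily_cost = daily_gb * ASSUMED_PRICE_PER_GB   # about $720 per day
print(f"~{daily_gb / 1000:.0f} TB/day, roughly ${daily_cost:,.0f}/day in transfer charges")
```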
“Google becomes such a powerful weapon due to a series of perfectly legitimate design decisions,” Ipeirotis wrote in a blog post on the issue.