Les datacenters et serveurs de Google

Google domine la recherche sur internet et devient un acteur de plus en plus important dans les autres domaines du web. Tout cela nécessite de grosses infrastructures informatiques, Jeff Dean en présente la nouvelle architecture dans le document ci-dessus. La suite de l’article montre des photos des générations précédentes, du matériel conçu par Google.

photo serveurs google 2007 2008 2009
Les serveurs de Google aujourd’hui

photo serveurs google 2001
Les serveurs de Google en 2001

photo serveurs google 1999
Les serveurs de Google en 1999

photo serveurs google 1997
Les serveurs de Google en 1997

Récapitulatif (source) :

Jeff Dean did a great talk at Google IO this year. Some key points from Steve Garrity (msft pm) and some note from the excellent write-up at Google spotlights data center inner workings:

  • many unreliable servers to fewer high cost servers
  • Single search query touches 700 to up to 1k machines in < 0.25sec
  • 36 data centers containing > 800K servers
    – 40 servers/rack
  • Typical H/W failures: Install 1000 machines and in 1 year you’ll see: 1000+ HD failures, 20 mini switch failures, 5 full switch failures, 1 PDU failure
  • There are more than 200 Google File System clusters
  • The largest BigTable instance manages about 6 petabytes of data spread across thousands of machines
  • MapReduce is increasing used within Google.
    – 29,000 jobs in August 2004 and 2.2 million in September 2007
    – Average time to complete a job has dropped from 634 seconds to 395 seconds
    – Output of MapReduce tasks has risen from 193 terabytes to 14,018 terabytes
  • Typical day will run about 100,000 MapReduce jobs
    – each occupies about 400 servers
    – takes about 5 to 10 minutes to finish

More detail on the typical failures during the first year of a cluster from Jeff:

  • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
  • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
  • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
  • ~1 network rewiring (rolling ~5% of machines down over 2-day span)
  • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
  • ~5 racks go wonky (40-80 machines see 50% packetloss)
  • ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
  • ~12 router reloads (takes out DNS and external vips for a couple minutes)
  • ~3 router failures (have to immediately pull traffic for an hour)
  • ~dozens of minor 30-second blips for dns
  • ~1000 individual machine failures
  • ~thousands of hard drive failures