Web scale data management

Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user-generated data. It's interesting not just because of the day-to-day scale of the task, but also because it offers some lessons learned about scaling up to an ingest of several terabytes a day. When we talk about 'big data' in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don't know.

  • 500 million page views a day
  • 15B+ page views a month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.
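To put those figures in perspective, here is a quick back-of-envelope calculation (my own arithmetic, not from the article): 500 million page views a day works out to roughly 5.8k views per second on average, against the quoted ~40k requests-per-second peak, and 30% month-over-month growth compounds to a tenfold increase in well under a year.

```python
# Back-of-envelope arithmetic on the numbers quoted above.
SECONDS_PER_DAY = 24 * 60 * 60
page_views_per_day = 500_000_000

# Average page views per second implied by 500M/day.
avg_views_per_sec = page_views_per_day / SECONDS_PER_DAY
print(f"average: ~{avg_views_per_sec:,.0f} views/s (vs. a ~40k req/s peak)")

# 30% month-over-month growth compounds quickly: count months to 10x.
growth = 1.30
months = 0
factor = 1.0
while factor < 10:
    factor *= growth
    months += 1
print(f"traffic grows 10x in ~{months} months at 30%/month")
```

At those rates the gap between the average and peak request load (roughly 7x) also hints at why so much of the article is about caching and write scaling.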