Web scale data management

Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user-generated data. It’s notable not just for the day-to-day scale of the task, but for the lessons it offers about scaling up to an ingest of several terabytes a day. When we talk about ‘big data’ in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don’t know.

  • 500 million page views a day
  • 15B+ page views a month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50 GB a day. Follower list updates are about 2.7 TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.
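
For a rough sense of how those headline figures hang together, here is a back-of-envelope sketch (my own arithmetic, not from the original post; the 2 KB-per-view log size is a guess):

```python
# Rough sanity check of the headline Tumblr figures quoted above.
SECONDS_PER_DAY = 86_400

page_views_per_day = 500e6
peak_requests_per_sec = 40e3

# Average view rate implied by 500M views/day, vs. the quoted 40k req/s peak
# (requests and page views are not the same thing, so the ratio is only indicative).
avg_views_per_sec = page_views_per_day / SECONDS_PER_DAY
print(f"average views/sec: {avg_views_per_sec:,.0f}")                              # ~5,800/s
print(f"peak-to-average ratio: {peak_requests_per_sec / avg_views_per_sec:.1f}x")  # ~7x

# At a guessed ~2 KB of log data per view, raw view logs alone are on the
# order of the quoted 1+ TB/day flowing into the Hadoop cluster.
bytes_per_view = 2_048
print(f"view logs: ~{page_views_per_day * bytes_per_view / 1e12:.1f} TB/day")      # ~1.0 TB/day
```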

5 Replies to “Web scale data management”

  1. A modern X-ray beamline at the Diamond Light Source can produce tens of TB of raw data in an hour at full pace. One group recently collected 1 PB in a 16-hour session.

    A “Next Generation Sequencing” experiment may generate 100 TB of raw data over a few days, and the instruments currently being designed will increase this by more than an order of magnitude.
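
    A rough conversion to put those two rates on a common footing (my arithmetic, not the commenter’s; decimal units, and a nominal three-day sequencing run, are assumed):

    ```python
    # 1 PB collected in a 16-hour beamline session (decimal units assumed).
    pb = 1e15
    session_hours = 16
    print(f"{pb / session_hours / 1e12:.1f} TB/hour")                 # ~62.5 TB/hour
    print(f"{pb / (session_hours * 3600) / 1e9:.1f} GB/s sustained")  # ~17.4 GB/s

    # 100 TB from a sequencing experiment spread over a nominal three days.
    ngs_bytes = 100e12
    days = 3
    print(f"{ngs_bytes / (days * 86_400) / 1e9:.2f} GB/s sustained")  # ~0.39 GB/s
    ```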

    Whether people consider their data to be big depends on what they are doing with it: see http://www.citeulike.org/blog/chrishmorris/17743 for a discussion of this.

  2. Joss, hello.

    The (JISC-funded) project “Managing Research Data: Gravitational Waves” was nominally about gravvy wave data, but used that as a route into talking about ‘big science’ data in general. The project URL is , and its report is at .

    Section 1.2 of the report gives some ‘big data’ numbers.

    The scale for this is, I think, set by the ATLAS experiment (one of the two big ones, out of four, at the LHC). That preserves what I now think of as ‘1 LHC’, namely 10 PB/yr. That’s in the region of 20-30 TB/day, in a mixture of bulk data and RDBMS data (I don’t know the mix), though the peak rates will be higher. The current LIGO experiment (gravvy waves) stores about 1 PB/yr when running (it’s having a refit at present).

    The SKA radio observatory will, everyone hopes, be commissioned around 2020, and will require transporting, though not necessarily storing, about 1 Tb/s locally and about 100 Gb/s intercontinentally. That’s 0.5 EB/yr, and 0.05% of the predicted 1 ZB/yr worldwide IP traffic for 2015.
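
    A quick check of those conversions (my arithmetic; a year is taken as ~3.15e7 seconds, decimal units throughout):

    ```python
    SECONDS_PER_YEAR = 3.15e7   # roughly 365 days

    # '1 LHC' = 10 PB/yr of preserved ATLAS data.
    lhc_bytes_per_year = 10e15
    print(f"LHC: {lhc_bytes_per_year / 365 / 1e12:.0f} TB/day")   # ~27 TB/day

    # SKA intercontinental transport at ~100 Gb/s, sustained for a year.
    ska_bytes_per_year = 100e9 / 8 * SECONDS_PER_YEAR
    print(f"SKA: {ska_bytes_per_year / 1e18:.2f} EB/yr")          # ~0.39 EB/yr
    print(f"SKA: {ska_bytes_per_year / 1e21:.2%} of 1 ZB/yr")     # ~0.04% (0.05% after rounding up to 0.5 EB/yr)
    ```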

    So those are respectable data volumes.

    That’s pretty rich data (lots of metadata, though this will be dwarfed by the size of the bulk data). In contrast, astronomical object databases — which are relational, and have substantially more information per byte — come in at around the 1-10TB scale, though these are carefully curated, and highly reduced datasets.

    There are some more details, and discussion of the consequences of all this, in the project report, which might be interesting to read.

    Enjoy!

    Norman

    1. It occurs to me to add that this volume of data is typically _not_ stored at an institution.

      * CERN is the single Tier-0 site, with copies of all the data on both spinning disks and tapes.
      * There are 11 Tier-1 sites around the world, all of which (I think) hold a copy of all of the data. The Rutherford Lab is the Tier-1 for the UK.
      * There are multiple Tier-2 sites associated with each Tier-1, which hold shifting fractions of the data. Glasgow Uni is the Tier-2 for Scotland.
      * Individual institutions and departments are ‘Tier-3’.

      See http://lcg.web.cern.ch/lcg/public/tiers.htm

      The data management was designed and is run by the ‘LCG’ — the LHC Computing Grid: http://lcg.web.cern.ch/ — as a development group with roughly equal status to the detector and accelerator engineering groups.

      Norman

  3. At UH, big = Physics, Astronomy and Maths research. They have a 90-core HPC cluster with 200 TB of storage, which is nearly full. Nothing compared to Tumblr, but enough to never let me sleep if I were responsible for it.
