Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user-generated data. It’s interesting not just for the day-to-day scale of the task, but also for the lessons it offers about scaling up to manage an ingest of several terabytes a day. When we talk about ‘big data’ in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don’t know.
- 500 million page views a day
- 15B+ page views a month
- ~20 engineers
- Peak rate of ~40k requests per second
- 1+ TB/day into Hadoop cluster
- Many TB/day into MySQL/HBase/Redis/Memcache
- Growing at 30% a month
- ~1000 hardware nodes in production
- Billions of page visits per month per engineer
- Posts are about 50GB a day. Follower list updates are about 2.7TB a day.
- The Dashboard runs at a million writes a second and 50K reads a second, and it is still growing.
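
For a sense of how those headline numbers hang together, here is a quick back-of-envelope check (a rough sketch; it assumes a 30-day month and uniform traffic, neither of which the post states):

```python
# Back-of-envelope check on the quoted traffic figures.
# Assumptions (mine, not the post's): a 30-day month, uniform traffic.

views_per_day = 500_000_000   # "500 million page views a day"
peak_rps = 40_000             # "peak rate of ~40k requests per second"

seconds_per_day = 24 * 60 * 60
avg_rps = views_per_day / seconds_per_day   # average page views per second
views_per_month = views_per_day * 30        # ~15B, matching "15B+ a month"

print(f"average page views/sec: {avg_rps:,.0f}")            # ~5,787
print(f"peak-to-average ratio:  {peak_rps / avg_rps:.1f}x")  # ~6.9x
print(f"page views/month:       {views_per_month / 1e9:.0f}B")
```

So peak traffic runs at roughly seven times the daily average, on these assumptions.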
via John Naughton.