Web scale data management

Not research data as such (although it could be the subject of research), but a long and interesting blog post about how Tumblr manages huge amounts of user-generated data. It’s notable not just for the day-to-day scale of the task, but for the lessons it offers about scaling up to an ingest of several terabytes a day. When we talk about ‘big data’ in the sciences, is it this big? Bigger? How is big science actually managing data on this scale? I really don’t know.

  • 500 million page views a day
  • 15B+ page views a month
  • ~20 engineers
  • Peak rate of ~40k requests per second
  • 1+ TB/day into Hadoop cluster
  • Many TB/day into MySQL/HBase/Redis/Memcache
  • Growing at 30% a month
  • ~1000 hardware nodes in production
  • Billions of page visits per month per engineer
  • Posts are about 50 GB a day. Follower list updates are about 2.7 TB a day.
  • Dashboard runs at a million writes a second, 50K reads a second, and it is growing.
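
For a rough sense of how those headline figures hang together, here is a back-of-envelope sketch (my own arithmetic, not from the original post; the 2 KB-per-view log size is a guess):

```python
# Rough sanity check of the headline Tumblr figures quoted above.
SECONDS_PER_DAY = 86_400

page_views_per_day = 500e6
peak_requests_per_sec = 40e3

# Average view rate implied by 500M views/day, vs. the quoted 40k req/s peak
# (requests and page views are not the same thing, so the ratio is only indicative).
avg_views_per_sec = page_views_per_day / SECONDS_PER_DAY
print(f"average views/sec: {avg_views_per_sec:,.0f}")                              # ~5,800/s
print(f"peak-to-average ratio: {peak_requests_per_sec / avg_views_per_sec:.1f}x")  # ~7x

# At a guessed ~2 KB of log data per view, raw view logs alone are on the
# order of the quoted 1+ TB/day flowing into the Hadoop cluster.
bytes_per_view = 2_048
print(f"view logs: ~{page_views_per_day * bytes_per_view / 1e12:.1f} TB/day")      # ~1.0 TB/day
```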

5 Replies to “Web scale data management”

  1. A modern X-ray beamline at the Diamond Light Source can produce tens of TB of raw data in an hour at full pace. One group recently collected 1 PB in a 16-hour session.

    A “Next Generation Sequencing” experiment may generate 100 TB of raw data over a few days, and the instruments currently being designed will increase this by more than an order of magnitude.
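
    A rough conversion to put those two rates on a common footing (my arithmetic, not the commenter’s; decimal units, and a nominal three-day sequencing run, are assumed):

    ```python
    # 1 PB collected in a 16-hour beamline session (decimal units assumed).
    pb = 1e15
    session_hours = 16
    print(f"{pb / session_hours / 1e12:.1f} TB/hour")                 # ~62.5 TB/hour
    print(f"{pb / (session_hours * 3600) / 1e9:.1f} GB/s sustained")  # ~17.4 GB/s

    # 100 TB from a sequencing experiment spread over a nominal three days.
    ngs_bytes = 100e12
    days = 3
    print(f"{ngs_bytes / (days * 86_400) / 1e9:.2f} GB/s sustained")  # ~0.39 GB/s
    ```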

    Whether people consider their data to be big depends on what they are doing with it: see http://www.citeulike.org/blog/chrishmorris/17743 for a discussion of this.

  2. Joss, hello.

    The (JISC-funded) project “Managing Research Data: Gravitational Waves” was nominally about gravvy wave data, but used that as a route into talking about ‘big science’ data in general. The project URL is , and its report is at .

    Section 1.2 of the report gives some ‘big data’ numbers.

    The scale for this is, I think, set by the ATLAS experiment (one of the two big ones, out of four, at the LHC). That preserves what I now think of as ‘1 LHC’, namely 10 PB/yr. That’s in the region of 20-30 TB/day, in a mixture of bulk data and RDBMS data (I don’t know the mix), though the peak rates will be higher. The current LIGO experiment (gravvy waves) stores about 1 PB/yr when running (it’s having a refit at present).

    The SKA radio observatory will, everyone hopes, be commissioned around 2020, and will require transporting, though not necessarily storing, about 1 Tb/s locally and about 100 Gb/s intercontinentally. That’s 0.5 EB/yr, and 0.05% of the predicted 1 ZB/yr worldwide IP traffic for 2015.
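
    A quick check of those conversions (my arithmetic; a year is taken as ~3.15e7 seconds, decimal units throughout):

    ```python
    SECONDS_PER_YEAR = 3.15e7   # roughly 365 days

    # '1 LHC' = 10 PB/yr of preserved ATLAS data.
    lhc_bytes_per_year = 10e15
    print(f"LHC: {lhc_bytes_per_year / 365 / 1e12:.0f} TB/day")   # ~27 TB/day

    # SKA intercontinental transport at ~100 Gb/s, sustained for a year.
    ska_bytes_per_year = 100e9 / 8 * SECONDS_PER_YEAR
    print(f"SKA: {ska_bytes_per_year / 1e18:.2f} EB/yr")          # ~0.39 EB/yr
    print(f"SKA: {ska_bytes_per_year / 1e21:.2%} of 1 ZB/yr")     # ~0.04% (0.05% after rounding up to 0.5 EB/yr)
    ```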

    So those are respectable data volumes.

    That’s pretty rich data (lots of metadata, though this will be dwarfed by the size of the bulk data). In contrast, astronomical object databases — which are relational, and have substantially more information per byte — come in at around the 1-10TB scale, though these are carefully curated, and highly reduced datasets.

    There are some more details, and discussion of the consequences of all this, in the project report, which might be interesting to read.

    Enjoy!

    Norman

    1. It occurs to me to add that this volume of data is typically _not_ stored at an institution.

      * CERN is the single Tier-0 site, with copies of all the data on both spinning disks and tapes.
      * There are 11 Tier-1 sites around the world, all of which (I think) hold a copy of all of the data. The Rutherford Lab is the Tier-1 for the UK.
      * There are multiple Tier-2 sites associated with each Tier-1, which hold shifting fractions of the data. Glasgow Uni is the Tier-2 for Scotland.
      * Individual institutions and departments are ‘Tier-3’.

      See http://lcg.web.cern.ch/lcg/public/tiers.htm

      The data management was designed and is run by the ‘LCG’ — the LHC Computing Grid: http://lcg.web.cern.ch/ — as a development group with roughly equal status to the detector and accelerator engineering groups.

      Norman

  3. At UH, big = Physics, Astronomy and Maths research. They have a 90-core HPC cluster with 200 TB of storage, which is nearly full. Nothing compared to Tumblr, but enough to never let me sleep if I were responsible for it.
