
time series data


Many sites provide statistics for their users' content, and many of them show a graph of how many plays/views/visits a piece of content got over time. This requires a quick aggregation of counts over a given time window.

One way to provide this information within the short lifespan of a web request is to shift the aggregation from "on read" to "on write". Instead of summing the raw events on every read, the aggregates are kept up to date as events arrive. Therefore a single event

event at 2011-03-14

causes multiple increment operations.

year:  2011 += 1
month: 2011-03 += 1
day:   2011-03-14 += 1
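
As a concrete sketch, here is what the write path could look like with Redis as the counter store - the redis gem and the key naming are just one possible choice:

require 'date'
require 'redis'

# bump one counter per time unit for a single event
def record(redis, date)
  redis.incr(date.strftime('%Y'))        # year:  2011
  redis.incr(date.strftime('%Y-%m'))     # month: 2011-03
  redis.incr(date.strftime('%Y-%m-%d'))  # day:   2011-03-14
end

record(Redis.new, Date.new(2011, 3, 14))

Redis INCR is atomic, so concurrent writers do not lose updates.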

For a non-trivial analytics application this can cause considerable write load.

By creating counters for different time units, the number of operations can be kept to a minimum while still providing fast aggregation over large time windows. Calculating an aggregate can become as simple as adding up a few counters: it is the sum of the largest aggregates that fit into the given time interval.

count(30.11.2008 .. 03.02.2011) =
  day(30.11.2008) +
  month(12.2008) +
  year(2009) +
  year(2010) +
  month(01.2011) +
  day(01.02.2011) +
  day(02.02.2011) +
  day(03.02.2011)
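
The covering itself can be computed greedily: walk from the start of the interval and take a whole year where one fits, then a whole month, otherwise a single day. A sketch, where counter(key) stands in for whatever store holds the aggregates:

require 'date'

def count(from, to) # from and to are Dates, the interval is inclusive
  sum = 0
  day = from
  while day <= to
    if day.month == 1 && day.day == 1 && Date.new(day.year, 12, 31) <= to
      sum += counter(day.year.to_s)             # a whole year fits
      day = Date.new(day.year + 1, 1, 1)
    elsif day.day == 1 && Date.new(day.year, day.month, -1) <= to
      sum += counter(day.strftime('%Y-%m'))     # a whole month fits
      day = day.next_month
    else
      sum += counter(day.strftime('%Y-%m-%d'))  # a single day
      day = day.next_day
    end
  end
  sum
end

Running this over 30.11.2008 .. 03.02.2011 produces exactly the eight lookups listed above.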

Choosing the time units defines both the read pattern and the number of counters, so it is worth keeping an eye on the upper bounds.

max(reads) =
  1 year +
  12 months +
  31 days
  = 44
max(counters) =
  1 + 12 + 12 * 31 = 385

max(reads) =
  1 year +
  52 weeks +
  7 week days
  = 60
max(counters) =
  1 + 52 + 52 * 7 = 417

For an informed decision, checking against real user requests is the recommended approach. The icing on the cake is keeping the model abstracted enough that you can switch between the variants without much effort.

Generating these counters from log files can be implemented as a simple MapReduce job. All the work can be done in the reduce phase, where the data arrives ordered by time.

def map(line)
  emit date(line)
end

def reduce(date)
  year, month, day = year(date), month(date), day(date)

  if year != @prev_year
    write "#{@prev_year}", @year_counter if @prev_year
    @year_counter = 0
  end
  @year_counter += 1

  # compare month and year: the same month number reappears in later years
  if [year, month] != [@prev_year, @prev_month]
    write "#{@prev_year}-#{@prev_month}", @month_counter if @prev_month
    @month_counter = 0
  end
  @month_counter += 1

  # same for days: compare the full date, not just the day of month
  if [year, month, day] != [@prev_year, @prev_month, @prev_day]
    write "#{@prev_year}-#{@prev_month}-#{@prev_day}", @day_counter if @prev_day
    @day_counter = 0
  end
  @day_counter += 1

  @prev_year, @prev_month, @prev_day = year, month, day
end

Be sure to flush the still-open counters once the reducer has processed its last record, otherwise the final year, month and day are lost. The approach also requires a custom partitioner that splits the data by the largest aggregate unit - the year in the case above - so that all records of one year reach the same reducer.
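
In the same pseudocode style, such a partitioner only has to route by year (num_reducers is whatever the job is configured with):

def partition(date, num_reducers)
  year(date).hash % num_reducers
end

All records of one year then meet in a single reducer, which is what the counter reset logic above relies on.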

With all the counters in a fast store you can then serve your web requests in a timely manner.
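
Continuing the Redis sketch from above, such a read boils down to a single multi-get over the covering keys - covering_keys is assumed to return the keys computed by the greedy covering instead of summing directly:

def count(redis, from, to)
  keys = covering_keys(from, to)   # e.g. ["2008-11-30", "2008-12", "2009", ...]
  redis.mget(*keys).map(&:to_i).sum
end

mget returns nil for keys that were never incremented, which to_i maps to zero.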