# time series data

mapreduce

Many sites provide statistics for their users contents. And many of them provide a graph of how many plays/views/visits the content got over time. This requires a quick aggregate of counts over a certain time window.

One way to provide this information in the short lifespan of a web request is to offload the aggregation from "on read" to "on write". Instead of summing the number of raw events "on read", the aggregates are kept up-to-date. Therefor a single event

``````event at 2011-03-14
``````

causes multiple increment operations.

``````year:  2011 += 1
month: 2011-03 += 1
day:   2011-03-14 += 1
``````

For a non-trivial analytics application this can cause quite some write load.

By creating counters per different time units, the number of operations can be kept at a minimum and still provide fast aggregation over large time windows. Calculating the aggregate can become as simple as adding a couple of counters. It's the sum of the most covering aggregates that fit into the given time interval.

``````count(31.11.2008 .. 03.02.2011) =
day(31.11.2008) +
month(12.2008) +
year(2009) +
year(2010) +
month(01.2011) +
day(01.02.2011) +
day(02.02.2011) +
day(03.02.2011)
``````

Choosing the time units here defines the read pattern and the number of counters. It's worth keeping a look on the upper bounds.

``````max(reads) =
1 year +
12 months +
31 days
= 54
max(counters) =
1 + 12 + 12 * 31 = 385

1 year +
52 weeks +
7 week days
= 60
max(counters) =
1 + 52 + 52 * 7 = 417
``````

For a informed decision checking against real user request is the recommended approach. The icing on the cake is when you can even keep this abstracted enough and you can switch between models without much effort.

Generating these counters from log files can be implemented with a simple map reduce job. All the work can be done in the reduce phase where the data comes in ordered by time.

``````def map
emit date

def reduce
year = year(date)
if year != prev_year
write "#{prev_year}", year_counter
year_counter = 0
end
year_counter += 1

month = month(date)
if month != prev_month
write "#{prev_year}-#{prev_month}", month_counter
month_counter = 0
end
month_counter += 1

day = day(date)
if day != prev_day
write "#{prev_year}-#{prev_month}-#{prev_day}", day_counter
day_counter = 0
end
day_counter += 1
``````

Be sure to flush the counters after the reducer finishes. The above approach also requires a custom partitioner that splits the data by the biggest aggregate - which is a year in the case above.

With all the counters in a fast store you can then serve your web requests in a timely manner.