Sunday, April 13, 2014

Time series database survey for IoT and m2m devices

This is a survey of the time series databases available for use, covering both cloud offerings and "install on our own machines" solutions. Our requirements are:


  • Store high-velocity time series data (frequent data arriving from one node)
  • Store data from a lot of nodes
  • Compute aggregates (sum over a day's worth of data)
  • Grouping functions (average, STDEV)
  • Analyze the data for patterns etc.

No one is paying me to write this, so I will stay clear of jargon like slice and dice, cubes and all that b.s. In plain, simple terms: we are receiving data from a lot of devices very frequently, so the first problem is simply storing a lot of data. MySQL and other RDBMSs are not optimized for storing such time series data. That is problem #1.

Another problem is that it may not be prudent to fetch all the raw data points for certain queries later on. Let's say you want to watch the trend over a month; fetching all the raw data points would be overkill. What you would like instead is to fetch just 30 data points, each an average over a day's worth of data. Creating such buckets (rollups) on demand would be an expensive operation, so we need to push data into the buckets as and when it arrives. That is problem #2: good, solid support for whatever rollups I would like to create. For data arriving at millisecond intervals, the rollup interval can be as short as one minute!
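To make "push data into buckets as it arrives" concrete, here is a minimal sketch in plain Python. It is not taken from any of the databases surveyed here; the class name `StreamingRollup` and the sample values are made up for illustration. It assumes timestamps are Unix seconds and keeps only a running count and sum per bucket, so the raw points never need to be re-read.

```python
from collections import defaultdict


class StreamingRollup:
    """Hypothetical helper: running (count, sum) per fixed-width time bucket."""

    def __init__(self, bucket_seconds):
        self.bucket_seconds = bucket_seconds
        self.counts = defaultdict(int)
        self.sums = defaultdict(float)

    def add(self, timestamp, value):
        # Align the timestamp to the start of the bucket it falls into.
        start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        self.counts[start] += 1
        self.sums[start] += value

    def averages(self):
        # One (bucket_start, mean) pair per bucket, oldest first.
        return [(s, self.sums[s] / self.counts[s]) for s in sorted(self.counts)]


DAY = 86400
per_day = StreamingRollup(DAY)
for ts, v in [(0, 10.0), (3600, 20.0), (DAY + 60, 30.0)]:
    per_day.add(ts, v)

print(per_day.averages())
```

The month-long trend query then reads about 30 precomputed averages instead of every raw point.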

There is actually a rollup hierarchy. Say data is arriving at 5-minute intervals and you make a rollup of an hour (an average over 12 data points). Further, you would like to make a rollup of a day (an average over the 24 data points of the hourly bucket), and so on.
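The hierarchy can be sketched as the same averaging step applied twice. This is only an illustration: the function `rollup` and the synthetic data are assumptions, not anything a particular database provides. Note that averaging the hourly averages gives the true daily average only because every hour holds the same number of points; with gaps you would carry counts along as well.

```python
def rollup(values, group_size):
    """Average consecutive groups of `group_size` values.

    A trailing partial group is dropped (an assumption made for simplicity).
    """
    full = len(values) - len(values) % group_size
    return [
        sum(values[i:i + group_size]) / group_size
        for i in range(0, full, group_size)
    ]


# One day of synthetic 5-minute samples: 12 per hour, 288 per day.
five_minute = [float(i % 12) for i in range(288)]

hourly = rollup(five_minute, 12)   # 24 hourly averages
day_level = rollup(hourly, 24)     # 1 daily average

print(len(hourly), day_level)
```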

Then we also need aggregates. We would like to sum over data points for a particular interval for reporting (say, rainfall over a day).
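For the rainfall case, a sum per calendar day might look like the sketch below. The function name `daily_totals` and the readings are hypothetical, and it assumes Unix timestamps grouped by UTC day; a real deployment would probably want the device's local time zone instead.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def daily_totals(readings):
    """Sum (unix_timestamp, value) readings per UTC calendar day."""
    totals = defaultdict(float)
    for ts, value in readings:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        totals[day] += value
    return dict(totals)


# Hypothetical rainfall readings in millimetres.
rain = [(0, 1.5), (3600, 2.0), (86400, 4.0)]
totals = daily_totals(rain)
for day in sorted(totals):
    print(day, totals[day])
```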

For IoT/m2m kinds of use cases, you also need to detect patterns in real time (this is apart from threshold alerts). Then we would like to analyze the data and perform statistical operations on it.
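As one simple stand-in for "detecting patterns" (a rolling z-score, chosen here purely for illustration and not something the tools below are known to ship), the sketch flags a reading that sits far from the recent window's mean. Unlike a fixed threshold alert, the limit moves with the data. The class name, window size and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Hypothetical detector: flags values far from the recent mean."""

    def __init__(self, window=20, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def is_unusual(self, value):
        unusual = False
        # stdev needs at least two earlier points to be defined.
        if len(self.window) >= 2:
            mu = mean(self.window)
            sigma = stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                unusual = True
        # The value joins the window either way, so one spike widens
        # the band for the readings that follow it.
        self.window.append(value)
        return unusual


detector = RollingZScore(window=10, threshold=3.0)
stream = [20.0, 20.5, 19.5, 20.2, 19.8, 20.1, 35.0, 20.0]
flags = [detector.is_unusual(v) for v in stream]
print(flags)
```

Only the jump to 35.0 is flagged in this stream; the ordinary readings before and after it pass.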

RRDTOOL

Nice circular buffer
Expects data at required intervals
Language bindings available
Good fit for a small number of metrics
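The circular buffer idea can be sketched in a few lines. To be clear, this is not RRDtool's API or its actual consolidation logic; `RoundRobinArchive`, `update` and `fetch` are names invented here. It only shows the two properties listed above: storage is fixed in size, and each slot corresponds to one expected interval, so a missed interval shows up as a gap.

```python
class RoundRobinArchive:
    """Hypothetical fixed-size archive with one slot per `step` seconds."""

    def __init__(self, step, slots):
        self.step = step
        self.slots = slots
        self.values = [None] * slots
        self.stamps = [None] * slots  # which interval each slot currently holds

    def update(self, timestamp, value):
        slot_time = int(timestamp // self.step) * self.step
        index = (slot_time // self.step) % self.slots
        # Overwrites whatever was stored one full cycle earlier.
        self.values[index] = value
        self.stamps[index] = slot_time

    def fetch(self, now):
        """Return the last `slots` intervals ending at `now`, oldest first.

        An interval that received no data is reported as None.
        """
        end = int(now // self.step) * self.step
        out = []
        for k in range(self.slots - 1, -1, -1):
            t = end - k * self.step
            i = (t // self.step) % self.slots
            out.append((t, self.values[i] if self.stamps[i] == t else None))
        return out


archive = RoundRobinArchive(step=300, slots=3)
archive.update(0, 1.0)
archive.update(300, 2.0)
archive.update(900, 4.0)   # the 600 interval never arrived

print(archive.fetch(900))
```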

KairosDB

Forked from OpenTSDB
Stores metrics in HBase/Cassandra
Good storage facility, allows tagging of data
However, the data model is very limited
Aggregates are calculated at query time, which can be a performance drag
No support for automatic rollup


OpenTSDB 

Looks very married to graphs
Good for computer metrics use cases
Does not look like a good fit for the device case (where the data dictionary is device dependent)

Graphite

Cloud offerings

Xively a.k.a. Pachube a.k.a. whatever-it-was

Good PR buzz
Good ecosystem
Support is a black hole if you are in Asia
Rollup supported (in their own way)
Good provisioning and device activation support
Device side things are unnecessarily complicated
Support for the average function only (haven't found others yet)


Librato

Digi m2m cloud

Tempo-DB


I think any cloud-based offering would run into limitations for serious applications. Also, there is no way others can do your analytics for you. For the moment, my strategy is to prototype on Xively and then switch to InfluxDB (or maybe another on-my-machine solution). For realtime analytics, look at Amazon Kinesis or NumPy with HDF5. The debate is far from settled.


© Life of a third world developer