The big data ecosystem is currently in its expansion stage: many technologies are popping up, but too little consolidation is happening, so it is hard to keep track of the big picture. Today at the Strata Conference 2013 I attended some talks and took part in some discussions that helped me fit together some pieces of the big data technology puzzle:
HBase
- High write performance
- Poor query performance
- Ideal partners for data logistics: Flume, Storm, Samza
- Supports data updates / deletes but no SQL
- Best used for data streams (a flow of single-entry inserts) and to store the most recent data
Hive
- Can access data stored in both HBase and HDFS
- Supports a subset of SQL
- Best used for big-in / big-out queries e.g. large joins, data enrichment
- Best used for batch processing (low CPU usage)
Impala
- Can access data stored in both HBase and HDFS, and shares metadata with Hive. It can be used alongside Hive to complement it without replicating data between the two.
- Supports a subset of SQL and is compatible with the Hive API (but is not a true drop-in replacement).
- Not as mature as Hive, but some success stories exist
- Commercial MPPs like Vertica and Teradata are faster and more mature, but Impala is more tightly integrated into the Hadoop ecosystem and is therefore more flexible. The most important consequence: the data does not have to be replicated into Impala the way it has to be for Vertica et al. Impala can access HDFS/HBase data directly.
- Best used for big-in / small-out queries e.g. aggregations, groupings
- Best used for realtime queries (sec-to-min)
- Oozie: More mature and flexible. Larger set of features.
- Azkaban: Nice and usable UI. Simpler to set up and use.
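To make the two query shapes mentioned above more concrete (big-in / big-out for joins and data enrichment, big-in / small-out for aggregations and groupings), here is a minimal sketch in plain Python. The `clicks` and `users` data and both function names are hypothetical and only stand in for large tables; nothing here uses an actual Hive or Impala API.

```python
from collections import defaultdict

# Hypothetical toy data standing in for large tables.
clicks = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/cart"},
    {"user_id": 3, "url": "/home"},
]
users = {1: "DE", 2: "US", 3: "US"}  # user_id -> country


def enrich(clicks, users):
    """Big-in / big-out: a join that emits one output row per input row."""
    return [{**c, "country": users[c["user_id"]]} for c in clicks]


def clicks_per_country(clicks, users):
    """Big-in / small-out: a grouping that collapses rows into a few totals."""
    totals = defaultdict(int)
    for c in clicks:
        totals[users[c["user_id"]]] += 1
    return dict(totals)


enriched = enrich(clicks, users)
summary = clicks_per_country(clicks, users)
print(len(clicks), "rows in ->", len(enriched), "rows out (join)")
print(len(clicks), "rows in ->", len(summary), "rows out (aggregation)")
```

The join returns as many rows as it reads, so its cost is dominated by moving data, which suits batch processing. The aggregation returns only a handful of rows, which is the shape where a fast interactive engine pays off.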
A possible outlook:
Storage & access layer
- HDFS is and will remain the dominant virtual file system for big data.
- The large number of (columnar) file formats (Parquet, HFile, RCFile, ...) will be consolidated. The beauty contest has already begun.
- HBase will be the storage layer above HDFS for row-based access and data streams.
- There will be one major open source SQL-on-HBase/HDFS MPP database combining the best of Impala, Hive, Shark, ...
- The choreography tools will be extended with intelligent cost-based scheduling capabilities.