, , , , , , , ,

Apache Tajo™: A big data warehouse system on Hadoop

A robust big data relational and distributed data warehouse system for Apache Hadoop.

Designed for

     low-latency and scalable ad-hoc queries

     online aggregation

     ETL (extract-transform-load process)

on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.


Tajo Vs Impala Vs Hive

The key differences between Tajo and Impala is the design goal. To increase the performance of query processing, Impala adopts an approach which the main memory is utilized as much as possible and intermediate data are transfered via streaming. If a query requires too much memory, Impala cannot process the query. Thus, Impala says that it is not an alternate of Hive.

However, Tajo uses a query optimization which considers user queries, characteristics of data, the status of cluster, and so on. Thus, Tajo can process a query with Impala’s algorithm, Hive’s algorithm or any other algorithms. For an example, Tajo can process a join query using the repartition join, or the merge join. Intermediate results can be materialized to disks or maintained in memory. Since Tajo builds a query plan considering above mentioned various factors, it can always process user queries. So, we can say that Tajo can be an alternate of Hive.

Tajo can perform well over Hive for most of queries. The key reason is that Tajo uses the own query engine while Hive uses MapReduce. This limits that Hive can uses only MapReduce-based algorithms. However, Tajo can uses a more optimized algorithm.

A sort query is a good example. Hive supports only the hash partitioning. Thus, each node sort data locally in the map phase and *ONE NODE* should perform global sort in the reduce phase.

However, Tajo supports a sort algorithm using the range partitioning. In the first phase, each node sort data locally as in Hive, but the intermediate data are partitioned by the range of the sort key. In the second phase, each node performs local sort to get the final results. Since intermediate data are partitioned by the range of sort key, final results are correct.


Gartner Blog on Apache Tajo


Self evaluation on a local PC installed with Hive on (Hadoop 1.x) and Tajo on (Hadoop 2.x)

Case 1:

For NYSE data with 2880648 records, filtering based on Stock Close Price, returns 8601 records. Time taken by

     a) Tajo – 4.239 sec

     b) Hive – 20.967 seconds