Apache Tajo – The SQL-on-Hadoop



Apache Tajo™: A big data warehouse system on Hadoop

A robust big data relational and distributed data warehouse system for Apache Hadoop.

Designed for

     low-latency and scalable ad-hoc queries

     online aggregation

     ETL (extract-transform-load process)

on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.


Tajo vs. Impala vs. Hive

The key difference between Tajo and Impala is the design goal. To increase query-processing performance, Impala adopts an approach in which main memory is utilized as much as possible and intermediate data are transferred via streaming. If a query requires too much memory, Impala cannot process it. For this reason, the Impala project itself states that it is not an alternative to Hive.

Tajo, by contrast, uses query optimization that considers the user query, the characteristics of the data, the status of the cluster, and so on. Tajo can therefore process a query with Impala's algorithm, Hive's algorithm, or any other algorithm. For example, Tajo can process a join query using either a repartition join or a merge join, and intermediate results can be materialized to disk or kept in memory. Because Tajo builds a query plan considering all of these factors, it can always process user queries. So we can say that Tajo can be an alternative to Hive.

Tajo can outperform Hive on most queries. The key reason is that Tajo uses its own query engine while Hive runs on MapReduce, which restricts Hive to MapReduce-based algorithms. Tajo, however, can use more optimized algorithms.

A sort query is a good example. Hive supports only hash partitioning, so each node sorts its data locally in the map phase and *ONE NODE* must perform the global sort in the reduce phase.

Tajo, however, supports a sort algorithm based on range partitioning. In the first phase, each node sorts its data locally as in Hive, but the intermediate data are partitioned by ranges of the sort key. In the second phase, each node performs a local sort on its assigned range to produce part of the final result. Because the intermediate data are partitioned by sort-key range, concatenating the per-range outputs in order yields a correct global sort.
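The two-phase sort described above can be sketched in plain Java as a single-process simulation (this illustrates the idea only, not Tajo's actual implementation; in practice the range split points would be chosen by sampling the data):

```java
import java.util.*;

public class RangePartitionSort {
    // Phase 1: each "node" sorts its local data, then routes every value to
    // a partition determined by the range of the sort key.
    // Phase 2: each range partition is sorted independently; because the
    // ranges are disjoint and ordered, concatenating them in order yields a
    // globally sorted result with no single-node global sort.
    static List<Integer> distributedSort(List<List<Integer>> nodeData, int[] splitPoints) {
        List<List<Integer>> ranges = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) ranges.add(new ArrayList<>());

        for (List<Integer> local : nodeData) {
            Collections.sort(local);                      // local sort on each node
            for (int value : local) {
                int p = 0;
                while (p < splitPoints.length && value >= splitPoints[p]) p++;
                ranges.get(p).add(value);                 // range partitioning
            }
        }

        List<Integer> result = new ArrayList<>();
        for (List<Integer> range : ranges) {
            Collections.sort(range);                      // per-partition sort
            result.addAll(range);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> nodes = List.of(
                new ArrayList<>(List.of(9, 1, 5)),
                new ArrayList<>(List.of(3, 8, 2)));
        System.out.println(distributedSort(nodes, new int[]{4, 7})); // [1, 2, 3, 5, 8, 9]
    }
}
```

With hash partitioning, by contrast, the partitions are not ordered relative to each other, which is why one node must merge everything at the end.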


Gartner Blog on Apache Tajo


Self-evaluation on a local PC with Hive (on Hadoop 1.x) and Tajo (on Hadoop 2.x)

Case 1:

For NYSE data with 2,880,648 records, a filter on the stock close price returns 8,601 records. Time taken by

     a) Tajo – 4.239 seconds

     b) Hive – 20.967 seconds


Discharge Summary Analysis using Hadoop MapReduce



Here is a simple Hadoop based project for learning.

This project extracts discharge summary data and, using International Classification of Diseases (ICD) codes, performs a quick search for a specific disease, returning the other diagnoses most frequently recorded alongside it.

This project is designed as a next step in learning Hadoop MapReduce. I will update it periodically.
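The search described above is essentially a filter followed by a grouped count. A simplified in-memory sketch in plain Java (without the Hadoop APIs; the ICD codes and record layout here are illustrative, not from the actual dataset):

```java
import java.util.*;
import java.util.stream.*;

public class IcdCoOccurrence {
    // Map phase (conceptually): for each discharge summary containing the
    // searched ICD code, emit every OTHER diagnosis code on that summary.
    // Reduce phase: count emissions per code and rank them descending.
    static LinkedHashMap<String, Long> relatedDiagnoses(List<List<String>> summaries,
                                                        String searchCode) {
        return summaries.stream()
                .filter(codes -> codes.contains(searchCode))   // summaries with the disease
                .flatMap(List::stream)
                .filter(code -> !code.equals(searchCode))      // emit co-occurring codes
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Each inner list is the set of ICD codes on one discharge summary.
        List<List<String>> summaries = List.of(
                List.of("E11", "I10", "N18"),   // diabetes, hypertension, CKD
                List.of("E11", "I10"),
                List.of("I10", "J45"));
        System.out.println(relatedDiagnoses(summaries, "E11")); // {I10=2, N18=1}
    }
}
```

In the real MapReduce job, the filter and emit steps belong in the Mapper and the counting in the Reducer, with the framework handling the grouping in between.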


Tools & Technologies used in current project



About project:
The project aims to estimate the probability of a tornado occurring over a given period, using Storm Data (http://www.spc.noaa.gov/wcm/#data) and Sea Surface Temperature Data from NOAA (http://www.cpc.ncep.noaa.gov/data/indices/).

Tools & Technologies:
Code repository: git
Code: Java (Core Java, JPA)
Database: MySQL, HDFS
Database Workbench Tool: MySQL Workbench, HeidiSQL
Analytics: R, RMR (R in MapReduce)
Workflow: Azkaban
Build & Continuous Integration: Jenkins

Extracting Salesforce data



I found a few ways of extracting data from Salesforce into HDFS, HBase, MongoDB, or other destinations.

From the options below, the best choices were Apache Camel as the integration framework and Talend Open Studio for Big Data as the integration tool.

Kindly let me know if you find any changes to be done.

Tools & Frameworks Analysed/Evaluated for the need.

Sl. No  Open Source Framework   Open Source Tool
1       Apache Camel            Talend Open Studio for Big Data
2       Spring Integration      Mule Studio Open Source
3       Chukwa                  WSO2 Developer Studio
4       Cloudera Morphline      –
5       Hortonworks Falcon      –

1.     Chukwa

Chukwa is an open source data collection system for managing large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


2.     Hortonworks – Falcon

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

Falcon’s goal is to simplify data management on Hadoop and achieves this by providing important data lifecycle management services that any Hadoop application can rely on. Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a proven, well-tested and extremely scalable data management system built specifically for the unique capabilities that Hadoop offers.

For example, consider the challenge of preparing raw data so that it can be consumed by business intelligence applications. In addition to this routine use case, suppose you also want to replicate data to a failover cluster that is smaller than the primary cluster. In this case you probably want to replicate only the staged data and the data presented to BI applications, relying on the primary cluster to be the sole source of intermediate data.


3.     Mule ESB

Mule ESB is a lightweight Java-based enterprise service bus (ESB) and integration platform that allows developers to connect applications together quickly and easily, enabling them to exchange data. Mule ESB enables easy integration of existing systems, regardless of the different technologies that the applications use, including JMS, Web Services, JDBC, HTTP, and more.

The key advantage of an ESB is that it allows different applications to communicate with each other by acting as a transit system for carrying data between applications within your enterprise or across the Internet. Mule ESB includes powerful capabilities that include:

  • Service creation and hosting — expose and host reusable services, using Mule ESB as a lightweight service container
  • Service mediation — shield services from message formats and protocols, separate business logic from messaging, and enable location-independent service calls
  • Message routing — route, filter, aggregate, and re-sequence messages based on content and rules
  • Data transformation — exchange data across varying formats and transport protocols
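The message-routing capability above can be illustrated with a minimal, framework-free sketch of the content-based router pattern (plain Java; this shows the idea, not Mule's actual API):

```java
import java.util.*;
import java.util.function.Predicate;

public class ContentBasedRouter {
    // Content-based routing: each rule pairs a predicate with a destination,
    // and a message is delivered to the first destination whose rule matches.
    private final List<Predicate<String>> rules = new ArrayList<>();
    private final List<List<String>> destinations = new ArrayList<>();

    void addRoute(Predicate<String> rule, List<String> destination) {
        rules.add(rule);
        destinations.add(destination);
    }

    void dispatch(String message) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).test(message)) {
                destinations.get(i).add(message);
                return;                                    // first match wins
            }
        }
    }

    public static void main(String[] args) {
        List<String> orders = new ArrayList<>();
        List<String> deadLetter = new ArrayList<>();
        ContentBasedRouter router = new ContentBasedRouter();
        router.addRoute(m -> m.startsWith("ORDER:"), orders);
        router.addRoute(m -> true, deadLetter);            // default route
        router.dispatch("ORDER:1234");
        router.dispatch("PING");
        System.out.println(orders + " " + deadLetter);     // [ORDER:1234] [PING]
    }
}
```

An ESB adds the transports, transformations, and management around this core idea, but the routing decision itself is exactly this kind of rule-over-content test.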



4.     Apache Camel

Apache Camel ™ is a versatile open-source integration framework based on known Enterprise Integration Patterns.

Camel empowers you to define routing and mediation rules in a variety of domain-specific languages, including a Java-based fluent API, Spring or Blueprint XML configuration files, and a Scala DSL. This means you get smart completion of routing rules in your IDE, whether in a Java, Scala or XML editor.

Apache Camel uses URIs to work directly with any kind of transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF, as well as pluggable Components and Data Format options. Apache Camel is a small library with minimal dependencies for easy embedding in any Java application. Apache Camel lets you work with the same API regardless of which kind of transport is used – so learn the API once and you can interact with all the Components provided out of the box.

Apache Camel provides support for Bean Binding and seamless integration with popular frameworks such as Spring, Blueprint and Guice. Camel also has extensive support for unit testing your routes.
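To make the Java DSL concrete, here is a minimal route sketch (assuming camel-core is on the classpath; the directory names and the XPath filter are illustrative, not from the Salesforce extraction itself):

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FileMoveRoute {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Poll an input directory, keep only XML files whose root
                // element is <order>, and write them to an output directory.
                from("file:data/inbox?noop=true")
                    .filter().xpath("/order")
                    .to("file:data/outbox");
            }
        });
        context.start();
        Thread.sleep(5000);   // let the route run briefly, then shut down
        context.stop();
    }
}
```

The same route could equally be expressed in Spring XML or the Scala DSL, which is the point of Camel's multiple DSLs over one routing engine.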

The following projects can leverage Apache Camel as a routing and mediation engine:

  • Apache ServiceMix – a popular distributed open source ESB and JBI container
  • Apache ActiveMQ – a mature, widely used open source message broker
  • Apache CXF – a smart web services suite (JAX-WS and JAX-RS)
  • Apache Karaf – a small OSGi based runtime in which applications can be deployed
  • Apache MINA – a high-performance NIO-driven networking framework



5.     WSO2 ESB

The WSO2 Enterprise Service Bus (ESB) is a simple, lightweight and high-performance enterprise service bus, based on the Apache Synapse enterprise service bus, providing enhanced management, development/configuration support and SOA governance capabilities. eBay uses the WSO2 Enterprise Service Bus as one of the key elements in its transaction software, which continuously executes $2,000 worth of transactions per second.

The WSO2 Enterprise Service Bus supports the creation of Proxy Services graphically, which allows users to easily create virtual services on the ESB layer to front existing services. Existing services are SOAP, Plain Old XML (POX)/REST services over HTTP/S, as well as SOAP or legacy services over JMS, Apache VFS file systems (e.g. s/ftp, file, zip/tar/gz, webdav, and cifs), Mail systems (such as Post Office Protocol (POP3), Internet Message Access Protocol (IMAP), and Simple Mail Transfer Protocol (SMTP)), Financial Information eXchange (FIX), Hessian, and Advanced Message Queuing Protocol (AMQP).




6.     Talend Big Data

Talend’s open source approach and flexible integration platform for big data enables users to easily connect and analyze data from disparate systems to help drive and improve business performance. Talend’s big data capabilities integrate with today’s big data market leaders such as Cassandra, Cloudera, Hortonworks, Google, Greenplum, MapR, MongoDB, Teradata and Vertica, positioning Talend as a leader in the management of big data.

Big Data Integration

Landing big data (large volumes of log files, data from operational systems, social media, sensors, or other sources) into a big data platform such as Apache Hadoop, Google Cloud Platform, Netezza, Teradata or Vertica is a cinch with the breadth of big data components provided by Talend. A full set of Talend data integration components (application, database, service and even a master data hub) is available, so that data movement can be orchestrated from any source or into almost any target.

NoSQL connectivity to MongoDB and Cassandra is simplified through pre-built graphical connector components.

Big Data Quality

Talend provides data quality functions that take advantage of the massively parallel environment of Hadoop, allowing you to understand the completeness, accuracy and integrity of data as well as to remove duplicates. Hadoop data profiling allows you to collect information and statistics about big data to assess data quality, repurposing and metadata. Additional functions include standardization, parsing, enrichment, matching, survivorship and monitoring of ongoing data quality.

Big Data Manipulation

Talend supports Apache Pig and HBase so you can perform basic transformations and analysis on massive amounts of data in little time. With these you can compare, filter, evaluate and group data within an HDFS cluster. Google BigQuery support allows you to interactively analyze very large datasets. Talend speeds development and collaboration by providing a set of components that allow these scripts to be defined in a graphical environment and as part of a data flow.

Big Data Project Governance and Administration

Governance of a big data project is very similar to any integration project; however, big data projects sometimes lack necessary management functions. Talend presents a simple, intuitive environment to implement and deploy a big data program with the ability to schedule, monitor and deploy any big data job. Also included is a common repository so developers can collaborate and share project metadata and artifacts.


7.     Spring Integration

Extends the Spring programming model to support the well-known Enterprise Integration Patterns. Spring Integration enables lightweight messaging within Spring-based applications and supports integration with external systems via declarative adapters.

Using the Spring Framework encourages developers to code using interfaces and use dependency injection (DI) to provide a Plain Old Java Object (POJO) with the dependencies it needs to perform its tasks. Spring Integration takes this concept one step further, where POJOs are wired together using a messaging paradigm and individual components may not be aware of other components in the application. Such an application is built by assembling fine-grained reusable components to form a higher level of functionality. With careful design, these flows can be modularized and also reused at an even higher level.
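The idea of wiring POJOs through channels rather than direct references can be sketched without the framework (plain Java; this illustrates the paradigm, not Spring Integration's actual API):

```java
import java.util.concurrent.*;
import java.util.function.Function;

public class PojoMessaging {
    // A transformer is wired between two channels; neither endpoint knows
    // about the other, only about its channel. Queues stand in for Spring
    // Integration's message channels in this simplified sketch.
    static void wire(BlockingQueue<String> in, BlockingQueue<String> out,
                     Function<String, String> transformer) {
        Thread endpoint = new Thread(() -> {
            try {
                while (true) out.put(transformer.apply(in.take()));
            } catch (InterruptedException e) {
                // channel shut down
            }
        });
        endpoint.setDaemon(true);
        endpoint.start();
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> input = new LinkedBlockingQueue<>();
        BlockingQueue<String> output = new LinkedBlockingQueue<>();
        wire(input, output, String::toUpperCase);   // assemble the flow

        input.put("order received");                // producer side
        System.out.println(output.take());          // consumer side: ORDER RECEIVED
    }
}
```

Because the transformer only sees its input and output channels, it can be swapped or reused in another flow without touching the components around it, which is the modularity point made above.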

In addition to wiring together fine-grained components, Spring Integration provides a wide selection of channel adapters and gateways to communicate with external systems. Channel Adapters are used for one-way integration (send or receive); gateways are used for request/reply scenarios (inbound or outbound). For a full list of adapters and gateways, refer to the reference documentation.

The Spring XD project builds on Spring Integration, where Spring Integration modules are assembled into an XD Stream.



8.     Morphline

Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change Search indexing applications. A morphline is a rich configuration file that makes it easy to define an ETL transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into Cloudera Search. Executing in a small embeddable Java runtime system, morphlines can be used for Near Real Time applications as well as Batch processing applications.

Morphlines are easy to use, configurable and extensible, efficient and powerful. They can be seen as an evolution of Unix pipelines, generalised to work with streams of generic records and to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive and Sqoop.
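As an illustration, a minimal morphline configuration might look like the following (HOCON syntax; readLine, grok and logInfo are standard Kite/Cloudera Morphlines commands, but the id, dictionary path and grok pattern here are illustrative):

```
morphlines : [
  {
    id : parseLogLines
    importCommands : ["org.kitesdk.**"]
    commands : [
      # read each input line into a record
      { readLine { charset : UTF-8 } }
      # split the line into a timestamp and a message body
      {
        grok {
          dictionaryFiles : [grok-dictionaries]
          expressions : {
            message : """%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:body}"""
          }
        }
      }
      # emit the parsed record to the log (a loader command would go here)
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]
```

In a real deployment the final command would typically load the record into Solr via Cloudera Search rather than just logging it.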



Apache Projects








Big Data and Hadoop Information Repository





This blog aims to make it easy for developers to quickly navigate, in one place, most of the links and documents related to Hadoop and its ecosystem projects. I shall keep updating it with my recent searches and study.

Please let me know if you find any issues with this blog.

Introduction to Big Data and Hadoop

Hadoop Open Source Projects

Monte Carlo Technique

Recommendation Engine

Hadoop and Enterprise Data Warehouse

Big Data – Case Studies

Big Data Analytics in Public Sector

Apache Camel – Tutorial

Apache Camel – Getting Started (Abhishek Jain)

Apache Camel – Getting Started (Kai Waehner)

Machine Learning Solution Project List

Big Data Certifications

R Integration with Tableau