Tags
apache, bigdata, camel, chukwa, Falcon, Hadoop, hdfs, integration framework, mongodb, Morphline, Mule, salesforce, spring, talend, WSO2
We evaluated several ways of extracting data from Salesforce to HDFS/HBase/MongoDB and other destinations.
From the options below, the best choices were Apache Camel as the integration framework and Talend Open Studio for Big Data as the integration tool.
Kindly let me know if you find any changes that should be made.
Tools and frameworks analysed/evaluated for this need:
Sl. No | Open Source Framework | Open Source Tool |
1 | Apache Camel | Talend Open Studio for Big Data |
2 | Spring Integration | Mule Studio Open Source |
3 | Chukwa | WSO2 Developer Studio |
4 | Cloudera Morphline | |
5 | Hortonworks Falcon | |
1. Chukwa
Chukwa is an open source data collection system for managing large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
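As a small illustration, a Chukwa agent is typically told what to collect through adaptor definitions in its `initial_adaptors` file. The line below is a hedged sketch tailing a hypothetical export log (the adaptor class follows Chukwa's file-tailer naming; the data type and path are illustrative assumptions):

```
add filetailer.CharFileTailingAdaptorUTF8 SalesforceExportLog /var/log/exports/salesforce.log 0
```

The trailing `0` is the starting byte offset; the agent streams new lines from the file into HDFS via a collector.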
2. Hortonworks – Falcon
Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.
Falcon’s goal is to simplify data management on Hadoop and achieves this by providing important data lifecycle management services that any Hadoop application can rely on. Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a proven, well-tested and extremely scalable data management system built specifically for the unique capabilities that Hadoop offers.
For example, consider the challenge of preparing raw data so that it can be consumed by business intelligence applications. In addition to this routine use case, suppose you also want to replicate data to a failover cluster that is smaller than the primary cluster. In that case you probably only want to replicate the staged data and the data presented to BI applications, relying on the primary cluster as the sole source of intermediate data.
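Falcon expresses such policies as declarative entity definitions. A hedged sketch of a feed entity with a retention rule on the primary cluster and replication to a smaller failover cluster might look like the following (cluster names, paths, dates and frequencies are illustrative assumptions, not a tested configuration):

```xml
<feed name="staged-data" xmlns="uri:falcon:feed:0.1">
  <!-- how often new instances of the feed arrive (assumed) -->
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- retention policy: keep 90 days of data on the primary, then delete -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="failover" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- smaller failover cluster: keep only 7 days of replicated data -->
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/staged/${YEAR}-${MONTH}-${DAY}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
</feed>
```

Falcon turns such definitions into scheduled replication and eviction workflows, so the application itself never hard-codes the lifecycle logic.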
3. Mule ESB
Mule ESB is a lightweight Java-based enterprise service bus (ESB) and integration platform that allows developers to connect applications together quickly and easily, enabling them to exchange data. Mule ESB enables easy integration of existing systems, regardless of the different technologies that the applications use, including JMS, Web Services, JDBC, HTTP, and more.
The key advantage of an ESB is that it allows different applications to communicate with each other by acting as a transit system for carrying data between applications within your enterprise or across the Internet. Mule ESB includes powerful capabilities that include:
- Service creation and hosting — expose and host reusable services, using Mule ESB as a lightweight service container
- Service mediation — shield services from message formats and protocols, separate business logic from messaging, and enable location-independent service calls
- Message routing — route, filter, aggregate, and re-sequence messages based on content and rules
- Data transformation — exchange data across varying formats and transport protocols
http://www.mulesoft.org/what-mule-esb
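The capabilities listed above are wired together declaratively in Mule's XML configuration. A minimal, hedged sketch of a flow that receives a message over HTTP, transforms it and routes it onward might look like this (endpoint addresses and element details are illustrative assumptions in Mule 3-style configuration):

```xml
<mule xmlns="http://www.mulesoft.org/schema/mule/core"
      xmlns:http="http://www.mulesoft.org/schema/mule/http">
  <flow name="ingestFlow">
    <!-- service creation/hosting: expose an HTTP entry point (address assumed) -->
    <http:inbound-endpoint address="http://localhost:8081/ingest"
                           exchange-pattern="one-way"/>
    <!-- data transformation: convert the incoming payload to a string -->
    <object-to-string-transformer/>
    <!-- message routing: hand the result to the next endpoint (URI assumed) -->
    <outbound-endpoint address="vm://toDestination"/>
  </flow>
</mule>
```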
4. Apache Camel
Apache Camel ™ is a versatile open-source integration framework based on known Enterprise Integration Patterns.
Camel empowers you to define routing and mediation rules in a variety of domain-specific languages, including a Java-based Fluent API, Spring or Blueprint XML Configuration files, and a Scala DSL. This means you get smart completion of routing rules in your IDE, whether in a Java, Scala or XML editor.
Apache Camel uses URIs to work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF, as well as pluggable Components and Data Format options. Apache Camel is a small library with minimal dependencies for easy embedding in any Java application. Apache Camel lets you work with the same API regardless of which kind of Transport is used – so learn the API once and you can interact with all the Components provided out-of-box.
Apache Camel provides support for Bean Binding and seamless integration with popular frameworks such as Spring, Blueprint and Guice. Camel also has extensive support for unit testing your routes.
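As a concrete sketch, the Salesforce-to-HDFS movement at the heart of this evaluation could be expressed as a single Camel route in Spring XML. The component URIs below (the camel-salesforce query options and the HDFS endpoint's host, port and path) are hedged assumptions rather than a tested configuration:

```xml
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <route id="salesforce-to-hdfs">
    <!-- poll Salesforce with a SOQL query (URI options assumed) -->
    <from uri="salesforce:query?sObjectQuery=SELECT Id, Name FROM Account"/>
    <!-- optional mediation step: log what passed through -->
    <log message="Fetched Salesforce record: ${body}"/>
    <!-- write the result into HDFS (namenode host, port and path assumed) -->
    <to uri="hdfs2://namenode:8020/data/salesforce/accounts"/>
  </route>
</camelContext>
```

The same route could equally be written in the Java Fluent API (`from(...).log(...).to(...)`), which is what gives the IDE completion mentioned above.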
The following projects can leverage Apache Camel as a routing and mediation engine:
- Apache ServiceMix – a popular distributed open source ESB and JBI container
- Apache ActiveMQ – a mature, widely used open source message broker
- Apache CXF – a smart web services suite (JAX-WS and JAX-RS)
- Apache Karaf – a small OSGi based runtime in which applications can be deployed
- Apache MINA – a high-performance NIO-driven networking framework
http://en.wikipedia.org/wiki/Apache_Camel
5. WSO2 ESB
The WSO2 Enterprise Service Bus (ESB) is a simple, lightweight and high-performance enterprise service bus based on the Apache Synapse enterprise service bus, providing enhanced management, development/configuration support and SOA governance capabilities. eBay uses the WSO2 Enterprise Service Bus as one of the key elements in its transaction software, which continuously executes $2,000 worth of transactions per second.
The WSO2 Enterprise Service Bus supports the creation of Proxy Services graphically, which allows users to easily create virtual services on the ESB layer to front existing services. Existing services are SOAP, Plain Old XML (POX)/REST services over HTTP/S, as well as SOAP or legacy services over JMS, Apache VFS file systems (e.g. s/ftp, file, zip/tar/gz, webdav, and cifs), Mail systems (such as Post Office Protocol (POP3), Internet Message Access Protocol (IMAP), and Simple Mail Transfer Protocol (SMTP)), Financial Information eXchange (FIX), Hessian, and Advanced Message Queuing Protocol (AMQP).
http://en.wikipedia.org/wiki/WSO2
http://wso2.com/products/enterprise-service-bus/
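Under the graphical tooling, a WSO2 ESB proxy service is ultimately a Synapse XML configuration. A hedged sketch of a virtual service fronting an existing SOAP backend might look like this (the proxy name and backend address are illustrative assumptions):

```xml
<proxy xmlns="http://ws.apache.org/ns/synapse" name="AccountProxy"
       transports="http https">
  <target>
    <!-- the existing backend service this virtual service fronts (address assumed) -->
    <endpoint>
      <address uri="http://backend.example.com/services/AccountService"/>
    </endpoint>
    <inSequence>
      <!-- mediation before forwarding: log the incoming request -->
      <log level="full"/>
    </inSequence>
    <outSequence>
      <!-- return the backend response to the caller -->
      <send/>
    </outSequence>
  </target>
</proxy>
```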
6. Talend Big Data
Talend’s open source approach and flexible integration platform for big data enables users to easily connect and analyze data from disparate systems to help drive and improve business performance. Talend’s big data capabilities integrate with today’s big data market leaders such as Cassandra, Cloudera, Hortonworks, Google, Greenplum, Mapr, MongoDB, Teradata and Vertica, positioning Talend as a leader in the management of big data.
Big Data Integration
Landing big data (large volumes of log files, data from operational systems, social media, sensors, or other sources) into a big data platform such as Apache Hadoop, Google Cloud Platform, Netezza, Teradata or Vertica is a cinch with the breadth of big data components provided by Talend. A full set of Talend data integration components (application, database, service and even a master data hub) is available, so that data movement can be orchestrated from any source or into almost any target.
NoSQL connectivity to MongoDB and Cassandra is simplified through pre-built graphical connector components.
Big Data Quality
Talend provides data quality functions that take advantage of the massively parallel environment of Hadoop, allowing you to understand the completeness, accuracy and integrity of data as well as to remove duplicates. Hadoop data profiling allows you do collect information and statistics about big data to assess data quality, repurposing and metadata. Additional functions include standardization, parsing, enrichment, matching, survivorship and monitoring of ongoing data quality.
Big Data Manipulation
Talend supports Apache Pig and HBase so you can perform basic transformations and analysis on massive amounts of data in little time. With these components you can compare, filter, evaluate and group data within an HDFS cluster. Google BigQuery support allows you to interactively analyze very large datasets. Talend speeds development and collaboration by providing a set of components that allow these scripts to be defined in a graphical environment and as part of a data flow.
Big Data Project Governance and Administration
Governance of a big data project is very similar to any integration project; however, big data projects sometimes lack necessary management functions. Talend presents a simple, intuitive environment to implement and deploy a big data program with the ability to schedule, monitor and deploy any big data job. Also included is a common repository so developers can collaborate and share project metadata and artifacts.
http://www.talend.com/solutions/big-data
7. Spring Integration
Extends the Spring programming model to support the well-known Enterprise Integration Patterns. Spring Integration enables lightweight messaging within Spring-based applications and supports integration with external systems via declarative adapters.
Using the Spring Framework encourages developers to code using interfaces and use dependency injection (DI) to provide a Plain Old Java Object (POJO) with the dependencies it needs to perform its tasks. Spring Integration takes this concept one step further, where POJOs are wired together using a messaging paradigm and individual components may not be aware of other components in the application. Such an application is built by assembling fine-grained reusable components to form a higher level of functionality. With careful design, these flows can be modularized and also reused at an even higher level.
In addition to wiring together fine-grained components, Spring Integration provides a wide selection of channel adapters and gateways to communicate with external systems. Channel Adapters are used for one-way integration (send or receive); gateways are used for request/reply scenarios (inbound or outbound). For a full list of adapters and gateways, refer to the reference documentation.
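For example, a one-way channel adapter is declared in Spring XML alongside the channel that decouples it from its consumer. The sketch below polls a directory for files and hands each one to a POJO; the directory path, polling rate and bean names are illustrative assumptions:

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:int="http://www.springframework.org/schema/integration"
       xmlns:int-file="http://www.springframework.org/schema/integration/file">
  <!-- the channel that decouples producer from consumer -->
  <int:channel id="incomingFiles"/>
  <!-- one-way channel adapter: poll a directory and publish each file
       as a message (path and rate are assumptions) -->
  <int-file:inbound-channel-adapter channel="incomingFiles" directory="/data/inbox">
    <int:poller fixed-rate="5000"/>
  </int-file:inbound-channel-adapter>
  <!-- a POJO endpoint consuming from the channel (bean details omitted) -->
  <int:service-activator input-channel="incomingFiles"
                         ref="fileHandler" method="handle"/>
</beans>
```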
The Spring XD project builds on Spring Integration, where Spring Integration modules are assembled into an XD Stream.
http://projects.spring.io/spring-integration/
http://projects.spring.io/spring-xd/
8. Morphline
Cloudera Morphlines is an open source framework that reduces the time and skills necessary to build or change Search indexing applications. A morphline is a rich configuration file that makes it easy to define an ETL transformation chain that consumes any kind of data from any kind of data source, processes the data and loads the results into Cloudera Search. Executing in a small embeddable Java runtime system, morphlines can be used for Near Real Time applications as well as Batch processing applications.
Morphlines are easy to use, configurable and extensible, efficient and powerful. They can be seen as an evolution of Unix pipelines, generalised to work with streams of generic records and to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive and Sqoop.
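A morphline itself is a HOCON configuration file listing a chain of commands, each consuming and emitting records like a stage in a Unix pipeline. A hedged sketch that reads lines, parses them and loads them into Solr might look like the following (command names follow the Kite/CDK morphline conventions; the grok pattern and the `SOLR_LOCATOR` reference are illustrative assumptions):

```
morphlines : [
  {
    id : morphline1
    # make the standard command library available (package pattern assumed)
    importCommands : ["org.kitesdk.**"]
    commands : [
      # read the input stream line by line
      { readLine { charset : UTF-8 } }
      # extract fields from each line with a grok pattern (pattern assumed)
      { grok { expressions : { message : "%{COMBINEDAPACHELOG}" } } }
      # load the resulting record into Cloudera Search / Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```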