Hi-Speed Data Integration using Open Source

– Siva Akkiraju

Getting meaningful data insights from disparate and siloed systems is a dream that many organizations seldom realize. Resource and budgetary constraints, coupled with prohibitively priced commercial products, limit the ability of CIOs and Data Processing Managers to successfully complete data integration and get the relevant dashboards that their programs need.

A few years ago, open source meant settling for a lower quality solution that barely got the job done. But now, open source solutions for medium enterprises are as robust as commercial-grade solutions, with an on-par user experience. We intend to present a set of technical blogs in our Open Source series, starting with setting up a high-volume, high-speed data integration infrastructure using an open source platform.

This setup uses Ubuntu Linux, an Apache Hadoop cluster for the data store, and an Apache NiFi cluster for ETL and data integration. All the software used is freely available for internal use, and you can set up a Hi-Volume and Hi-End Data Integration solution using these tools. You can use virtual servers for the setup or re-purpose your old servers.

Hadoop - Hi Volume Data Integration

An HDFS cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients. A Secondary Name Node supplements the master Name Node by periodically checkpointing its metadata. The other nodes of the HDFS cluster are Data Nodes (slaves) that store and process large volumes of data. The Hadoop file system breaks each file into multiple blocks and maintains copies of those blocks across Data Nodes. The Name Node maintains the address and location of each block across the nodes. A Hadoop client interacts directly with the Data Nodes to push and pull the data.
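
To make the division of labor between the Name Node and the Data Nodes concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not how Hadoop itself is implemented: the class ToyNameNode, the node names dn1 to dn4, the file path, and the round-robin placement are all hypothetical, and real HDFS uses a rack-aware placement policy and a 128 MB default block size.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 8    # bytes; tiny for illustration (real HDFS blocks default to 128 MB)
REPLICATION = 3   # copies kept of each block (also the HDFS default)


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for a Name Node: holds metadata only, never file data."""
    data_nodes: list
    # (path, block index) -> names of the Data Nodes holding a copy of that block
    block_map: dict = field(default_factory=dict)

    def place_file(self, path, data):
        # Split the file into fixed-size blocks; the last block may be shorter.
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        copies = min(REPLICATION, len(self.data_nodes))
        for index in range(len(blocks)):
            # Assumption: simple round-robin placement, so each replica of a
            # block lands on a different Data Node.
            self.block_map[(path, index)] = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(copies)
            ]
        return blocks

    def locate(self, path, index):
        # A client asks the Name Node where a block lives, then talks to
        # those Data Nodes directly to read or write the bytes.
        return self.block_map[(path, index)]


payload = b"id,name\n1,alpha\n2,beta\n"
name_node = ToyNameNode(data_nodes=["dn1", "dn2", "dn3", "dn4"])
blocks = name_node.place_file("/agency/records.csv", payload)

for index in range(len(blocks)):
    print(index, name_node.locate("/agency/records.csv", index))
```

Because every block has copies on several Data Nodes, losing one node in this model still leaves each block readable from another, which is the property the replication described above is meant to provide.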

The HDFS high-availability capability permits the main metadata server (the Name Node) to fail over to a backup in the event of failure.

Apache NiFi provides powerful, highly configurable, and scalable directed graphs of data routing, transformation, and system mediation logic. It has a web-based user interface that provides a seamless experience between design, control, feedback, and monitoring.
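
The idea of a directed graph of routing and transformation steps can be sketched in a few lines of Python. This is a toy illustration only and does not use the NiFi API (in NiFi, flows are normally designed in the web interface): the processor names, the relationship names, and the run_flow helper are all hypothetical.

```python
from collections import deque


def route_on_size(record):
    # Routing step: choose an outgoing relationship based on the record.
    return "large" if len(record["payload"]) > 10 else "small"


def to_upper(record):
    # Transformation step: modify the record, then report success.
    record["payload"] = record["payload"].upper()
    return "success"


def collect(sink):
    # Terminal step: store the record and return no relationship.
    def processor(record):
        sink.append(record)
        return None
    return processor


def run_flow(processors, connections, start, records):
    """Push each record through the graph, following the named edges."""
    queue = deque((start, dict(r)) for r in records)
    while queue:
        name, record = queue.popleft()
        relationship = processors[name](record)
        target = connections.get((name, relationship))
        if target is not None:
            queue.append((target, record))


large_sink, small_sink = [], []
processors = {
    "route": route_on_size,
    "upper": to_upper,
    "store_large": collect(large_sink),
    "store_small": collect(small_sink),
}
# Edges of the directed graph: (processor, relationship) -> next processor
connections = {
    ("route", "large"): "store_large",
    ("route", "small"): "upper",
    ("upper", "success"): "store_small",
}

run_flow(processors, connections, "route",
         [{"payload": "hi"}, {"payload": "a much longer payload"}])
print(small_sink, large_sink)
```

Keeping the edges in a separate table from the processing steps means the flow can be rewired without touching the steps themselves, which is roughly the separation a graph-based tool exposes through its design canvas.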

WATI has implemented this setup to integrate multiple agencies across counties. Please reach out to info@wati.com for in-depth information on using open source solutions in your organization.
