Big Data – Extract-Transform-Load (ETL) 002
My last blog (Extract-Transform-Load (ETL) Technologies – Part 1) discussed the purpose, function, and some of the inherent benefits of ETL technologies for moving data from source applications into target reporting and analytic environments. Hopefully that discussion gave you a general understanding of ETL: what it is, how it can be used, and why we would want to use it. Now let’s dive a bit deeper and discover who some of the key vendors are in the ETL Tools marketplace.
ETL tools have been around for some time, have evolved, matured, and now present us with productive environments for Big Data, Data Warehouse, Business Intelligence, and analytics processing. The ETL data processing cycle itself can be simple or highly complex. Most vendor offerings address the diverse requirements well enough although some do better than others. The challenges with ETL tools come from the other important considerations, like:
- Parallel Processing
- Data Virtualization
- Handling Keys (Primary/Foreign)
- Attribute Mapping
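To make the key-handling and attribute-mapping items on that list concrete, here is a minimal Python sketch (all table, column, and function names are hypothetical) of two chores nearly every ETL job performs: renaming source attributes onto target columns, and assigning a surrogate key keyed off the business (natural) key:

```python
# Minimal sketch of two common ETL chores: attribute mapping and
# surrogate-key handling. All names here are hypothetical.

ATTRIBUTE_MAP = {            # source column -> target column
    "cust_nm": "customer_name",
    "cust_em": "email",
}

def transform_row(row, key_cache, next_key):
    """Map source attributes and assign a surrogate primary key."""
    target = {ATTRIBUTE_MAP[col]: val for col, val in row.items()
              if col in ATTRIBUTE_MAP}
    natural_key = row["cust_em"]          # business (natural) key
    if natural_key not in key_cache:      # new entity -> new surrogate key
        key_cache[natural_key] = next_key[0]
        next_key[0] += 1
    target["customer_sk"] = key_cache[natural_key]
    return target

cache, counter = {}, [1]
rows = [{"cust_nm": "Ada", "cust_em": "ada@example.com"},
        {"cust_nm": "Ada L.", "cust_em": "ada@example.com"}]
out = [transform_row(r, cache, counter) for r in rows]
# Both rows receive the same surrogate key because they share a natural key.
```

Every ETL tool below wraps some version of this logic in a graphical component; the differences lie in how much of it you must wire up yourself.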
While everything on that list is essential, personally I think programmability could easily sit at the top of it. Obviously my developer bias is showing! d;) Since all of the vendors below offer a graphical user interface for creating ETL code, they pass my first test for programmability. It is very nice to finally paint the code with pictures rather than typing out syntax. Having a GUI design tool for ETL programming beats PHP, Perl, or SQL scripting any day of the week and twice on Sunday.
There are, however, some important vendor differences under the hood that one must consider. Some ETL tools generate actual compiled code (such as Java) while others generate intermediate code that must be processed further. This intermediate code is proprietary and requires interpretation by the ETL engine, which impacts execution performance to some degree. While this can be acceptable, I prefer to work with tools that generate code any skilled programmer can read and understand.
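For illustration, the kind of plainly readable generated code I mean boils down to a loop like the following Python sketch (sources and targets here are stand-in lists, not a real engine). Anyone can audit what it extracts, casts, filters, and loads; proprietary intermediate code offers no such transparency:

```python
# A deliberately plain extract-transform-load loop: the kind of generated
# code any skilled programmer can read. Source and target are stand-ins.

source_rows = [
    {"id": 1, "amount": "19.99", "region": "EMEA"},
    {"id": 2, "amount": "bad",   "region": "APAC"},
    {"id": 3, "amount": "5.00",  "region": "EMEA"},
]
target_rows, reject_rows = [], []

for row in source_rows:                       # Extract
    try:
        row["amount"] = float(row["amount"])  # Transform: type cast
    except ValueError:
        reject_rows.append(row)               # route unparsable rows aside
        continue
    if row["region"] == "EMEA":               # Transform: filter
        target_rows.append(row)               # Load
```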
Informatica offers a comprehensive integration platform that promotes code standardization, unifies collaboration between business and IT roles, and provides capabilities to handle the high volume and wide variety of today’s business data. The Informatica Platform comprises several distinct technologies designed to make it a true industrial-strength ETL solution. These include:
- Complex Event Processing
- B2B Data Exchange
- Cloud Data Integration
- Enterprise Data Integration
- Information Lifecycle Management
- Data Quality
- Master Data Management
Informatica PowerCenter is the ETL tool that empowers an IT organization to implement highly scalable, high-performance data processing maps using a graphical interface that generates proprietary intermediate code. This mapping code, when coupled with a defined data workflow plan can then be scheduled for a variety of execution strategies.
Widely accepted as an industry front runner, Informatica boasts high productivity and low cost. In truth, it may be just the opposite: high productivity *at* a cost is perhaps more accurate, in terms of experienced developers, administrators, and license fees. Companies that choose Informatica usually have very large IT teams and budgets.
The Informatica Cloud Integration version, however, does offer easy-to-use ETL programmability without the need for an extensive team or for hosting an IT infrastructure. While very cost effective, its features are a subset of PowerCenter’s and may not provide all the required functionality. Yet, basic as it is, it may be a very suitable alternative for many projects at a much lower cost.
SQL Server Integration Services, aka SSIS, is an enterprise-level data integration and transformation solution from Microsoft. Widely used by customers to solve a variety of complex business problems, this ETL tool does it all. As part of the SQL Server evolution, SSIS has its roots in BCP (Bulk Copy Program) and DTS (Data Transformation Services), which were part of MS SQL Server for many years. SSIS works in direct correlation with SSAS (SQL Server Analysis Services) and SSRS (SQL Server Reporting Services) but can also be used for standalone ETL processing.
SQL Server Data Tools (formerly the Business Intelligence Development Studio) provides the graphical interface for creating ETL ‘packages’ against MS SQL Server or other data sources. These packages can be executed manually or on a schedule. Fundamentally, under the hood, SSIS couples SQL queries with the process logic defined in the package. With the many robust components that can be dropped into a package, the developer can easily craft sophisticated ETL processes.
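At its core, what a package like that does can be sketched in a few lines: issue a SQL extract, apply the package’s process logic, and load a target table. The following Python sketch uses the stdlib `sqlite3` module as a stand-in for SQL Server, with hypothetical table names, purely to illustrate the extract/logic/load shape:

```python
# Sketch of what an SSIS-style package does under the hood: run a SQL
# extract, apply package logic, and load a target table. sqlite3 stands
# in for SQL Server; all table names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders(id INTEGER, status TEXT)")
con.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "shipped"), (2, "pending"), (3, "shipped")])
con.execute("CREATE TABLE tgt_orders(id INTEGER, status TEXT)")

# Extract step: the package issues a plain SQL query...
rows = con.execute("SELECT id, status FROM src_orders").fetchall()
# ...then the package's process logic decides what gets loaded.
loaded = [(i, s) for (i, s) in rows if s == "shipped"]
con.executemany("INSERT INTO tgt_orders VALUES (?, ?)", loaded)
con.commit()

count = con.execute("SELECT COUNT(*) FROM tgt_orders").fetchone()[0]
# Only the two shipped orders reach the target table.
```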
Microsoft’s overall Business Intelligence stack places SSIS on the front line for data access and processing. As discussed in my previous blog moving data from a source to a target can quickly become highly complex. Understanding the goal of that data is critical to how the ETL process should be constructed. Mastering this stack is something else altogether; perhaps another blog?
Talend has recently surged as an open source ETL tool offering a wide variety of data integration components and a streamlined graphical user interface. Talend Open Studio comes in freely downloadable flavors for Big Data, Data Integration, and ESB; these fully functional versions allow for the design, development, and deployment of real data processing jobs. The subscription-based Talend Integration Suite is the enhanced version, providing the additional enterprise functionality and features you might expect for more comprehensive projects. These include:
- Collaboration (with integrated SVN)
- Automated Deployment (scheduler)
- Load Balancing (multiple job processing servers)
- Monitoring (administration console)
- Data Quality Profiling
- Master Data Management
- Big Data connectors (e.g., Hadoop, Cassandra, MongoDB)
- Context Variable Management (Global/Local)
With its latest release, Talend boasts of being the first and only ETL vendor to support 100% MapReduce code generation. Additional support for Hadoop, Cassandra, MongoDB, Hive, and even Pig is available. Expect more to come!
Talend’s architecture demonstrates a practical approach to the complexities involved in deploying and executing the data integration process. Highly scalable, this architecture is not only easy to understand but, through the Talend Administration Center, is straightforward to configure and maintain.
As an Eclipse-based editor, Talend Integration Studio provides a unified viewpoint for application and data integration when crafting ETL processes, unlike other tools, which separate these two tightly interwoven functionalities. Talend’s code generation options for either Perl or Java eliminate proprietary intermediate code and streamline job execution. Talend also boasts more than 400 connectors, providing virtually unlimited connectivity to enterprise systems, Big Data platforms, databases, software packages, mainframes, files, web services, and the like. The tMap component delivers highly complex transformation capabilities, allowing for multiple data source integration, lookups, error handling, and a wide variety of data processing options. I have been using Talend for over three years now and admittedly find it the most productive ETL tool, at a more competitive price.
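The essential behavior of a tMap-style component (not Talend’s actual generated code, just a sketch with illustrative data) is a lookup join over the main flow, with unmatched rows routed to a reject flow:

```python
# Sketch of tMap-style behavior: join a main flow against a lookup input
# and send unmatched rows to a reject flow. Data is illustrative.

lookup = {"US": "United States", "DE": "Germany"}   # lookup input
main_flow = [{"order": 100, "cc": "US"},
             {"order": 101, "cc": "FR"},            # no lookup match
             {"order": 102, "cc": "DE"}]

matched, rejects = [], []
for row in main_flow:
    country = lookup.get(row["cc"])
    if country is None:
        rejects.append(row)                         # inner-join reject flow
    else:
        matched.append({**row, "country": country}) # enriched output flow
```

In the real component you wire these flows graphically; the generated Java does the equivalent of the loop above.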
Actual product offerings from Talend are packaged in segments designed to focus on project needs. These include:
- Big Data
- Data Integration
- Business Process Management
- Data Quality
- Enterprise Service Bus
- Master Data Management
These open source products are available as free use, non-supported downloads (https://www.talend.com/products/talend-open-studio) and offer considerable features and functionality for developing working ETL processes.
Talend ‘On Demand’ makes Talend one of the first ETL vendors to offer an Open Source data integration solution in the cloud (the Informatica Cloud offering is not Open Source). This SaaS model is free for up to two users and presents a great starting point for smaller projects or prototyping.
Coupling data integration with embedded analytics, Pentaho sets itself apart from most of its competitors. The platform offers business users an interactive, visual approach that can be deployed in the cloud or on-premises. Focused on the business analytics aspects of data integration, Pentaho can manage very large volumes and a wide variety of data. Visualization, data mining, reporting, and interactive dashboards simplify the development process and provide a rapid deployment platform for a broad set of users.
Pentaho Data Integration (PDI) can extract data from complex and diverse data sources offering high performance ETL capabilities, including:
- Rich graphical visual design environment
- Broad connectivity to any type or source of data
- Enterprise scalability and performance
- High performance multi-threaded processing
- Data Profiling and Quality
- Open and Standards-based Architecture
- Big Data integration for Hadoop, NoSQL, and Analytic databases
- Integrated reporting, data visualization/analysis, and predictive analytics
- Salesforce and SugarCRM plug-ins
ETL developers will appreciate the PDI designer, which offers an intuitive drag-and-drop GUI, a robust library of pre-built components, data transformation mappings, content versioning and locking, and an integrated debugger. With centralized monitoring, scheduling, and security, PDI delivers a reliable solution for data migration and synchronization, application consolidation, business intelligence, and data warehousing.
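PDI’s multi-threaded processing model, where each step runs concurrently and rows stream between steps through bounded buffers, can be sketched in Python with stdlib threads and queues (this is my illustration of the pattern, not Kettle’s actual implementation):

```python
# Sketch of PDI-style multi-threaded processing: each "step" runs in its
# own thread, and rows flow between steps through bounded queues.
import queue, threading

SENTINEL = object()                 # marks end-of-stream between steps
q_in = queue.Queue(maxsize=100)
q_out = queue.Queue(maxsize=100)

def reader():                       # input step
    for n in range(5):
        q_in.put({"n": n})
    q_in.put(SENTINEL)

def transformer():                  # transform step: square each value
    while True:
        row = q_in.get()
        if row is SENTINEL:
            q_out.put(SENTINEL)
            break
        q_out.put({"n": row["n"] ** 2})

threads = [threading.Thread(target=reader),
           threading.Thread(target=transformer)]
for t in threads:
    t.start()

results = []
while True:                         # output step runs on the main thread
    row = q_out.get()
    if row is SENTINEL:
        break
    results.append(row["n"])
for t in threads:
    t.join()
```

The bounded queues are what give this design its back-pressure: a slow downstream step throttles a fast upstream one instead of exhausting memory.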
The PDI Community Edition (PDI CE), also known as ‘Kettle’, is available as a free download (http://community.pentaho.com/projects/data-integration/). This self-supported open source software provides many of the key features and functionality found in the Enterprise Edition (PDI EE). Along with Kettle you can find ‘Mondrian’, the Pentaho Analysis Services Community Edition, which, as an OLAP server, claims to enable business users to analyze large amounts of data in real time. I can neither confirm nor dispute this claim; however, in my experience, achieving even near-real-time is difficult, which makes it (IMHO) somewhat dubious.
To wrap up, it is fair to point out that there are several other ETL contenders in today’s marketplace. For practical reasons this blog does not dive into them further, yet let’s at least give some of them a mention. They include:
A relative newcomer to the crowded ETL marketplace, Astera’s Centerprise Data Integrator is focused on being an easy-to-use, easy-to-manage, enterprise-level integration tool for everyone. As a 100% managed .NET code product, Astera provides a reliable, scalable, and secure platform for flexible, reusable, high-performance processing; at a cost. It is, however, one of the few ETL tools that support EDI transactions.
Clover’s slick graphical designer helps visualize data flows while offering a wide variety of flexible components for complex data integration requirements. As open-source software, its Community Edition gets you started right away, yet its proprietary intermediate process layer may not provide the most robust development environment.
Offering a wizard-driven, “configure, not code” ETL approach, Adeptia eliminates complexity, reducing data integration costs while improving deployment and ROI for business users. Adeptia’s low-cost commercial product provides many features; however, it is not open-source software, and thus only trial-period evaluations are available.
So there you have it: more than you might ever want to know about ETL and Data Integration vendors. Yet I think ETL data processing and all its complexities cannot be ignored. My hope is that this information, incomplete at best and perhaps slightly inaccurate in its technical details at worst, still edifies you, gentle reader, in a constructive way.
I look forward to any comment, question, or debate… ETL ON ~ Dudes!