Big Data – Extract-Transform-Load (ETL) 001
*~:~* Happy New Year *~:~*
My last blog (Column Oriented Database Technologies) discussed the differences between Row and Column oriented databases and some key players in this space. Concepts and technologies on Big Data have been discussed in previous blogs (Big Data & NoSQL Technologies & NoSQL .vs. Row .vs. Column). From these blogs one should surmise that deciding upon the best database technology (or DBMS vendor) really depends on schema complexities, how you intend to retrieve your data, and how to get it there in the first place. We’re going to dive into this next, but before we do it is imperative that we briefly examine the differences between OLTP and OLAP database designs. Then let’s leave OLTP details for a future blog as I expect most readers already know plenty about transactional database systems. Instead we’ll focus here on OLAP details and how we process Big Data using ETL technologies for data warehouse applications.
Generally the differences between OLTP and OLAP database applications center upon how frequently data must be stored and retrieved, the integrity of that data, and how much of it there is and how fast it grows. OLTP database schemas are optimized for processing transactions by an increasing number of users while OLAP database schemas are optimized for aggregations against an increasing amount of data and exponential query permutations. Design considerations involve normalization, indexing, datatypes, user load, storage requirements, performance, and scalability. We will need to defer the many interesting details on these considerations to a future blog.
On-Line Transactional Processing (OLTP) applications, like Customer Relationship Management (CRM), Enterprise Resource Planning (ERP), Corporate Financials (AR/AP/GL), e-Commerce, or other enterprise systems, use traditional SQL queries that are either embedded within application code (not a great practice in my humble opinion) or within stored procedures (a much better practice) in order to store and retrieve data. Where data integrity and performance are critical, an OLTP database is the most appropriate. OLTP schema designs are generally highly normalized, comprehensive structures that are optimized for fast transactional data processing. Typically they represent the source data that feed a data warehouse.
On-Line Analytical Processing (OLAP) applications, however, focus on Data Warehouse and Business Intelligence systems where the volume of data can grow quite large. Performance and scalability become the driving factors, and data integrity is generally inherited from the source system, reducing its importance. Typically OLAP schema designs follow well known modeling practices like Ralph Kimball’s Star Schema or Dan Linstedt’s Data Vault (read my blog Data Vault – What is it?). Perhaps the best way to think of an OLAP system is to think of the wizard behind the curtain who has all the answers, for everyone, all the time. Boy, if it were only true! Instead, OLAP schema designs are generally refactored data structures based upon source OLTP systems that are optimized for fast query processing.
And so we arrive at the crux of this blog. How do we get data from the source OLTP system into the target OLAP data warehouse in a practical, efficient way? And what do we need to do to that data to conform and resolve the clearly different schema designs? So along comes ETL, or Extract-Transform-Load; fundamentally understandable, but potentially very hard to do once you dive into all the complexities involved; like peeling an onion or melting the wicked witch! Let’s examine what is really involved.
Extracting data from a source dataset, transforming that data in potentially many particular ways, and loading the resulting data into a target dataset is the essence of ETL. Understanding that is simple. Yet consider that the source data may originate from several different files, tables, views, or databases; furthermore these are potentially varied in structure, location, and host systems (e.g. Oracle, MS SQL Server, MySQL, etc.). Also consider that transformations can include a myriad of different requirements from normalization to de-normalization of source data, lookups from other datasets, merging, sorting, truncating, datatype conversion, inner joins, outer joins, matched and/or unmatched records, etc. These requirements range anywhere from simple to daunting. Consider the data load; perhaps there is one target, perhaps many; data might reside in a data warehouse, and/or data marts; maybe it’s a Star Schema, maybe not! Finally the entire ETL data process may handle a full data set, or an incremental one. OMG! ETL data processing permutations are endless and one thing is certain, getting it right is critical!
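To make the three steps concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything in it is an assumption for illustration only: the table names (customers, orders, fact_orders), the columns, and the sample rows are hypothetical, and in-memory SQLite databases simply stand in for whatever source and target systems you really have. It shows a lookup from another dataset, a datatype conversion, and the dropping of unmatched records.

```python
import sqlite3


def extract(source):
    """Extract: read raw order rows from the (hypothetical) source table."""
    return source.execute(
        "SELECT order_id, customer_id, amount, order_date FROM orders"
    ).fetchall()


def transform(rows, customer_names):
    """Transform: look up customer names, convert datatypes, skip unmatched rows."""
    result = []
    for order_id, customer_id, amount, order_date in rows:
        name = customer_names.get(customer_id)
        if name is None:
            continue  # unmatched record: the order has no known customer
        # amount arrives as text and becomes a float; the date is cut to YYYY-MM
        result.append((order_id, name, round(float(amount), 2), order_date[:7]))
    return result


def load(target, rows):
    """Load: write the transformed rows into the (hypothetical) target table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer_name TEXT, "
        "amount REAL, order_month TEXT)"
    )
    # INSERT OR REPLACE keeps a re-run from duplicating rows
    target.executemany(
        "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?, ?)", rows
    )
    target.commit()


def run_etl():
    """Build sample source data, run extract-transform-load, return the target rows."""
    source = sqlite3.connect(":memory:")
    source.executescript("""
        CREATE TABLE customers (customer_id INTEGER, name TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                             amount TEXT, order_date TEXT);
        INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO orders VALUES
            (100, 1, '19.990', '2013-01-05'),
            (101, 2, '5',      '2013-01-17'),
            (102, 9, '7.5',    '2013-02-01');
    """)
    target = sqlite3.connect(":memory:")

    customer_names = dict(
        source.execute("SELECT customer_id, name FROM customers").fetchall()
    )
    load(target, transform(extract(source), customer_names))
    return target.execute(
        "SELECT order_id, customer_name, amount, order_month "
        "FROM fact_orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Order 102 refers to a customer that does not exist, so it never reaches the target. Whether unmatched records should be dropped, logged, or routed somewhere else is exactly the kind of requirement that makes real ETL work harder than this sketch suggests.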
What should we do then to deal with these comprehensive and complex data processes?
We use tools of course!
Some may build their ETL data processing with SQL scripts, or maybe embedded SQL in another scripting language like PHP or Python. Others may actually craft programs using high level languages like C#, Java, or Visual Basic. These solutions are fine, of course, but they present a cumulative burden of development, maintenance, and cost that can easily exceed the alternative of using tools designed specifically for ETL data processing. Not that there is no burden if you use ETL tools, but a greatly reduced burden, in my humble opinion. There are several ETL tool vendors out there all providing the various functionalities needed. I found this survey result, which is about as I would expect.
Keep in mind however that there are several other key features that should be considered when looking at ETL tools, including:
- Task Scheduler & Triggers
- Environment Control (e.g. DB Connections, Global Variables)
- Load Balancing
- Admin Console
- Project, User, Task Management
- Integrated Source Control
- Distributed Processing
- Task Execution Analysis
- Monitoring & Logging
- Real-Time execution logging
- File and Database error capture
Using ETL tools with the appropriate features and usability is very important, yet there is one more aspect I submit must be considered before crafting any ETL process. Data Warehouse Schema design! All the objects involved within a Data Warehouse are really driven by the overarching architecture. This aspect is so easily overlooked, or under-valued. Many just slap something together and hope for the best. Maybe this contributes to the almost 80% failure rates of first DW/BI project efforts. Let’s take a step back then and look at this from 35,000 feet for a moment.
Ok, we all know that an OLAP Data Warehouse commonly supports analytics and reporting, often referred to as Business Intelligence. We typically understand what our data sources are and that we will use ETL data processes to move data through a system that in turn must be a sustainable, pliable, expanding DW/BI solution. Yet what is the architecture of that system? Do you notice we started from a comfortable position of factual information of what is known and quickly culminated in a place of genuine uncertainty about what will be constructed? The good news is that we also do know what the results should be.
Consider carefully, then, the goals and requirements of the business and the metrics involved, and then decide what the end data stores should be. From there you can fill in the gaps; ask yourself these many questions:
- What are the business metrics and how should they be defined?
- What permutations are involved? (e.g. reporting periods, frequency, filters, etc.)
- Should the target data stores be column, or row based?
- Should Star Schemas or Data Vaults be used? and when?
- What, if any, Data Marts would be employed?
- Do they inter-relate or are they stand-alone?
- How much source data exists already?
- How big will the target data grow? and how fast?
- What are the expected complexities of the transformations involved?
- How about hardware and software requirements?
- Optimizations? Scalability? Functionality? ~~ you get the picture …
I surmise that for anyone who stops to think about the architecture involved, many more questions will easily come to mind. The more questions you can consider and answer regarding any DW/BI system architecture the better; and sooner is better than later. One principle I always try to employ in any design I craft is that change is inevitable, so DW/BI system architectures must be pliable and extensible. Achieve this in your solution and you’re at least half way there. Just remember to click your heels together three times first, close your eyes, and repeat: there’s no place like home!
One final good practice common to many DW/BI system architectures aims at staging data first, before it goes into the data warehouse. Often called an ODS, or Operational Data Store, having a pre-emptive place to prepare, or stage, data before it moves into the data warehouse can be a tremendous advantage to any DW/BI system architecture. It does introduce an additional ETL step, but it is usually well worth it. A decent ODS will contain a reduced dataset from the source with few to no transformations, thus eliminating data you know you don’t need. Then the next ETL step does the real work of transforming the data and eventually loading it into the data warehouse.
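The two-step flow described above might be sketched like this in Python with the standard sqlite3 module. All of the names here (src_orders, ods_orders, dw_monthly_sales, the status column, the sample rows) are hypothetical, and a single in-memory SQLite database stands in for what would normally be separate source, ODS, and warehouse systems. Step one copies a reduced dataset with no transformations; step two does the real transformation work.

```python
import sqlite3


def stage_to_ods(db):
    """Step 1: copy only the needed rows and columns into the ODS, untransformed."""
    db.execute("DELETE FROM ods_orders")  # full refresh of the staging table
    db.execute("""
        INSERT INTO ods_orders (order_id, amount, order_date)
        SELECT order_id, amount, order_date
        FROM src_orders
        WHERE status = 'COMPLETE'
    """)


def load_warehouse(db):
    """Step 2: transform the staged rows (monthly aggregation) and load them."""
    db.execute("DELETE FROM dw_monthly_sales")
    db.execute("""
        INSERT INTO dw_monthly_sales (order_month, order_count, total_amount)
        SELECT substr(order_date, 1, 7), COUNT(*), SUM(amount)
        FROM ods_orders
        GROUP BY substr(order_date, 1, 7)
    """)


def run_pipeline():
    """Create sample tables, run both ETL steps, return the warehouse rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE src_orders (order_id INTEGER, customer_note TEXT,
                                 status TEXT, amount REAL, order_date TEXT);
        CREATE TABLE ods_orders (order_id INTEGER, amount REAL, order_date TEXT);
        CREATE TABLE dw_monthly_sales (order_month TEXT, order_count INTEGER,
                                       total_amount REAL);
        INSERT INTO src_orders VALUES
            (1, 'gift wrap', 'COMPLETE',  10.0, '2013-01-03'),
            (2, '',          'CANCELLED', 99.0, '2013-01-09'),
            (3, 'rush',      'COMPLETE',  15.5, '2013-01-20'),
            (4, '',          'COMPLETE',   4.0, '2013-02-02');
    """)
    stage_to_ods(db)
    load_warehouse(db)
    db.commit()
    return db.execute(
        "SELECT order_month, order_count, total_amount "
        "FROM dw_monthly_sales ORDER BY order_month"
    ).fetchall()


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

The cancelled order and the customer_note column never leave the source, so the second step works only with data it actually needs. Both steps here do a full refresh for simplicity; an incremental load would need some way to identify new or changed rows, which this sketch does not attempt.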
ETL data processing provides an essential methodology for moving data from point A to point B. Crafting these processes can range from straightforward to highly complex. I often wonder how we all got by with SQL scripting before ETL tools came along; today when faced with a data migration and/or data warehouse population task, I break out these tools and get to work. Best advice: Follow the yellow brick road…
As this topic is quite extensive, my next blog, “Extract-Transform-Load (ETL) Technologies – Part 2”, will focus on ETL vendors so we can examine who the key players are and what they offer.
As always, don’t be afraid to comment, question, or debate… I learn new things every day!