DB Best Chronicles
 
Talks on Big Data, Mobile Apps, Web and Software Development

Extract-Transform-Load (ETL) Technologies – Part 2

Posted on June 10th, 2013 | In Big Data, Database Migration
 

Download: Big Data – Extract-Transform-Load (ETL) 002

My last blog (Extract-Transform-Load (ETL) Technologies – Part 1) discussed the purpose, function, and some of the inherent benefits of ETL technologies for moving data from source applications into target reporting and analytic environments. Hopefully that discussion gave you a general understanding of ETL: what it is, how it can be used, and why you would want to use it. Now let’s dive a bit deeper and discover who some of the key vendors are in the ETL tools marketplace.

ETL tools have been around for some time; they have evolved, matured, and now offer productive environments for Big Data, Data Warehouse, Business Intelligence, and analytics processing. The ETL data processing cycle itself can be simple or highly complex. Most vendor offerings address these diverse requirements well enough, although some do better than others. The real challenges with ETL tools come from the other important considerations beyond the processing cycle itself.
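To make the extract-transform-load cycle concrete, here is a minimal sketch in Python using an in-memory SQLite database for both ends. The source table, target table, and business rules are invented for illustration; a production pipeline would add scheduling, logging, error handling, and incremental loads.

```python
import sqlite3

# Hypothetical source (transactional orders) and target (reporting fact table).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT);
    INSERT INTO orders VALUES
        (1, ' Acme Corp ', 125000, 'shipped'),
        (2, 'Globex',        9900, 'cancelled'),
        (3, 'acme corp',    47500, 'shipped');
""")
target.execute("CREATE TABLE fact_sales (order_id INTEGER, customer TEXT, amount_usd REAL)")

def extract(conn):
    """Pull raw rows from the source system."""
    return conn.execute("SELECT order_id, customer, amount_cents, status FROM orders")

def transform(rows):
    """Apply business rules: keep shipped orders, clean names, convert cents to dollars."""
    for order_id, customer, amount_cents, status in rows:
        if status == "shipped":
            yield (order_id, customer.strip().title(), amount_cents / 100.0)

def load(conn, rows):
    """Write the cleaned rows into the target fact table."""
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM fact_sales").fetchall())
# [(1, 'Acme Corp', 1250.0), (3, 'Acme Corp', 475.0)]
```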


Extract-Transform-Load (ETL) Technologies – Part 1

Posted on December 31st, 2012 | In Big Data, Database Migration
 

Download: Big Data – Extract-Transform-Load (ETL) 001

*~:~* Happy New Year *~:~*

My last blog (Column Oriented Database Technologies) discussed the differences between row- and column-oriented databases and some key players in this space. Concepts and technologies around Big Data have been discussed in previous blogs (Big Data & NoSQL Technologies and NoSQL .vs. Row .vs. Column). From these blogs one should surmise that deciding upon the best database technology (or DBMS vendor) really depends on schema complexities, how you intend to retrieve your data, and how to get it there in the first place. We’re going to dive into this next, but before we do it is imperative that we briefly examine the differences between OLTP and OLAP database designs. We’ll leave the OLTP details for a future blog, as I expect most readers already know plenty about transactional database systems. Instead we’ll focus here on OLAP and how we process Big Data using ETL technologies for data warehouse applications.

Generally, the differences between OLTP and OLAP database applications center upon how frequently data must be stored and retrieved, the integrity of that data, how much of it there is, and how fast it grows. OLTP database schemas are optimized for processing transactions by an increasing number of users, while OLAP database schemas are optimized for aggregations against an increasing amount of data and exponential query permutations. Design considerations involve normalization, indexing, datatypes, user load, storage requirements, performance, and scalability. We will need to defer the many interesting details on these considerations for a future blog.

[Figure: OLTP-DW]
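As a rough illustration of that design split, the sketch below contrasts a normalized OLTP-style pair of tables with a simple star-schema fact/dimension pair built for aggregation. All table and column names are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# OLTP-style: normalized tables and indexes, optimized for many small keyed reads/writes.
db.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                           customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
                           order_date  TEXT NOT NULL,
                           amount      REAL NOT NULL);
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
db.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
db.execute("INSERT INTO orders VALUES (10, 1, '2012-12-31', 99.50)")   # one small transaction

# OLAP-style: a denormalized star schema, optimized for scans and aggregations.
db.executescript("""
    CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date(date_key),
                             customer_name TEXT,
                             amount REAL);
""")
db.execute("INSERT INTO dim_date VALUES (20121231, 2012, 12)")
db.execute("INSERT INTO fact_sales VALUES (20121231, 'Acme Corp', 99.50)")

# A typical analytic query sweeps the whole fact table and aggregates.
print(db.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.year
""").fetchall())   # [(2012, 99.5)]
```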


Column Oriented Database Technologies

Posted on July 24th, 2012 | In Big Data, Web & Software Development
 

Download: Column Oriented Database Technologies

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors.  Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).

You are probably all familiar with row-oriented databases: tables where rows of fields (also called columns) define the physical storage, and SQL queries select, insert, update, and delete that data. Most database vendors, like Oracle, Microsoft, Sybase, Informix, and many others, base their technology on this ANSI standard. Column-oriented databases are indeed what you might surmise: tables where columns of data values define the physical storage. What you might not expect is that, on the surface, many column-oriented databases look and feel like row-oriented databases, using SQL queries in much the same way. Creating tables, storing data, and querying them are all pretty much identical. They may appear similar, but the significant differences are under the hood, in particular in physical storage and query optimization.
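As a rough sketch of that “under the hood” difference, the example below holds the same three records row-wise and column-wise in plain Python structures. The field names and values are invented; real engines add compression, encoding, and vectorized execution on top of this basic idea.

```python
# Row-oriented layout: each record is stored contiguously, which is ideal
# for fetching or updating a whole record by key.
rows = [
    {"id": 1, "region": "East", "sales": 1250.0},
    {"id": 2, "region": "West", "sales": 475.0},
    {"id": 3, "region": "East", "sales": 300.0},
]
record = next(r for r in rows if r["id"] == 2)   # read one complete record

# Column-oriented layout: each column is stored contiguously, which is ideal
# for scanning and aggregating one attribute across many records.
columns = {
    "id":     [1, 2, 3],
    "region": ["East", "West", "East"],
    "sales":  [1250.0, 475.0, 300.0],
}
total_sales = sum(columns["sales"])              # touch only the column you need

print(record)        # {'id': 2, 'region': 'West', 'sales': 475.0}
print(total_sales)   # 2025.0
```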

As noted in my previous blogs on NoSQL, there is also a column-store technology out there.  Let’s not confuse that with column oriented databases.  They are different.  Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.

First, let’s look at some of the key benefits of column-oriented databases.


The Data Vault – What is it? – Why do we need it?

Posted on June 25th, 2012 | In Big Data
 

Download: The Data Vault – What is it? – Why do we need it?

As mentioned in my last blog (Big Data & NoSQL Technologies), understanding and using the right tools for the job is essential. Database systems and applications are no exception. For Big Data or small data, knowing and using the right data modeling methodology is also critical. Today, business uses the Internet to make money, yet harvesting that data in an efficient way often develops into a considerable IT challenge. The ongoing debate behind the many issues involved is not the focus of this blog; the concepts presented here are.

Data Warehouse and Business Intelligence (DW/BI) systems have become a solution of choice for industries needing to address the many critical business questions ingrained in this vital data. While providing an essential data integration process that supports a variety of reporting requirements, DW/BI systems often involve significant construction effort, ongoing maintenance, and reengineering when data structures and business rules change. And they do change, presenting us with problem #1.

Building effective and efficient DW/BI systems can quickly become a daunting and difficult technical ordeal, even for veteran engineering teams. Several integrated technologies are required, from database systems, data transformation (ETL) tools, various programming languages, and administration and reporting software to high-performance networks and computers with very large storage capacities. In many cases even the sheer volume of data can be overwhelming. Yup, problem #2.

Once reporting is delivered to the diverse user communities that consume the data, simplified usability is expected, yet today’s technological realities often require highly trained end users. The design, creation, delivery, and support of a robust, effortless DW/BI system are, sadly, complex. You guessed it: problem #3.

Furthermore, as upstream systems change, as DW/BI technology plows ahead, and as the dynamic complexities involved prevail, we also find that every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for, and the integration impact can be enormous, sometimes requiring complete regeneration of the aggregated data; hence, problem #4.


Big Data and NoSQL Technologies

Posted on June 15th, 2012 | In Big Data
 

Download: Big Data & NoSQL Technologies

As a follow-up to my last blog (NoSQL .vs. Row .vs. Column), let’s take a closer look at Big Data and the emerging ‘NoSQL’ technologies in today’s marketplace. As a refresher, two points to reiterate first:

  • NoSQL means ‘Not Only SQL’, as opposed to ‘Not SQL’ as many perceive it to be
  • There are three main variations of NoSQL out there (a rough sketch of each follows this list):
    • Key Value
    • Document Store
    • Column Store
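As a rough illustration, here is how the same hypothetical customer record might be shaped under each variant. The keys, fields, and layout are invented for the example; each product family has its own, richer data model.

```python
# One hypothetical customer record, shaped for each NoSQL variant.

# Key-value store: an opaque value looked up by a single key; the database
# knows nothing about the value's structure.
key_value = {
    "customer:42": '{"name": "Acme Corp", "city": "Seattle", "orders": 17}',
}

# Document store: the value is a structured document (often JSON), so nested
# fields can be indexed and queried.
document = {
    "_id": 42,
    "name": "Acme Corp",
    "address": {"city": "Seattle", "state": "WA"},
    "orders": [{"id": 101, "amount": 1250.0}, {"id": 102, "amount": 475.0}],
}

# Column store (BigTable/Cassandra style): cells addressed by row key,
# column family, and column qualifier.
column_store = {
    "42": {                                                     # row key
        "profile": {"name": "Acme Corp", "city": "Seattle"},    # column family
        "metrics": {"order_count": "17", "lifetime_value": "1725.0"},
    },
}

print(document["address"]["city"])                    # Seattle
print(column_store["42"]["metrics"]["order_count"])   # 17
```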

It is these variants that we can now examine in more detail, plus have a first look at some of the vendors playing in this field. As organizations increasingly capture large amounts of both structured and unstructured data, the NoSQL phenomenon has come at a perfect time. Database architects today are wise to consider several factors in choosing the right tool for the job, as Big Data is found everywhere:

  • Daily weather
  • Energy geophysical
  • Pharmaceutical drug testing
  • Telecom traffic
  • Social media messaging
  • Online gaming
  • Stock market
  • Court documents
  • Email/text messaging

So what is Big Data?  This new industry catchphrase currently has about as many definitions as there are people talking about it.  Perhaps the best definition is:

Massive datasets that are organized, manipulated, and managed by tools, processes, procedures, and storage facilities

Realistically, today’s Big Data will be tomorrow’s little data, as it is growing at an ever-increasing pace. As a transformative technology, Big Data offers a competitive advantage and a formidable opportunity, yet it also presents a daunting challenge to IT experts. Making Big Data even more complex are the many commonly used document and data storage formats. Read about them in another blog (Episode 1 – Introduction and definitions) posted by my associate, Julius Gabby.


NoSQL .vs. Row .vs. Column

Posted on May 30th, 2012 | In Big Data
 

The hype and disinformation that grudgingly prevail in the data warehouse world today bring me to raise the debate to a rational level. Let’s set aside 3NF and STAR schemas for a moment, along with the many flavors of analytics and all their technologies. Let’s temporarily ignore e-commerce, database migrations, business intelligence, and data collection and processing systems. Instead, let’s look at three different data storage methodologies. These are:

  • NoSQL – very new, lots of hype, and really meaning ‘Not Only SQL’
  • ROW – your traditional record database, well known and loved
  • COLUMN – still relatively new, widely misunderstood, yet it still feels like normal SQL

To look at these three together I think we must first look at them separately.  So here goes…

The ROW-based database storage methodology is the one most of us are already familiar with. Depending upon your vendor of choice (like Oracle, Microsoft, MySQL, DB2, etc.), DDL and DML syntax creates tables that store and retrieve records, largely organized around some form of key, be it natural or surrogate (let’s debate the many issues of schema design another time). The relational data model thrives upon the ROW-based database and is widely used for many OLTP and OLAP systems and applications. Highly efficient for complex schema designs and SQL queries, ROW-based database engines offer a tried and true way to build solid solutions. We should not throw this away; I won’t!

The COLUMN-based database storage methodology has been around for a while as an alternative to ROW-based databases, offered by various newer vendors (like InfoBright, Vertica, Sybase IQ, etc.). Generally the DDL and DML syntax is similar to that of ROW-based databases, yet under the hood things are usually radically different and much more efficient for processing aggregations; this is the main thing that sets it apart from ROW-based engines. Some of these column-based technologies also provide high data storage compression, allowing a much smaller disk footprint, in some cases as much as 10:1 over their row-based counterparts. We should adopt this where appropriate; I am!
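As a rough sketch of why column-oriented engines compress so well and aggregate so cheaply, the example below run-length encodes a sorted, low-cardinality column and answers a count query straight from the compact form. The data is invented, and real engines use far more sophisticated encodings plus vectorized execution.

```python
from itertools import groupby

# Two columns as a column store might hold them; values are invented.
status_column = ["closed"] * 6 + ["open"] * 3 + ["pending"]
amount_column = [100.0, 100.0, 100.0, 250.0, 250.0, 250.0, 80.0, 80.0, 80.0, 40.0]

# Run-length encoding: a sorted, low-cardinality column collapses to a few
# (value, run_length) pairs -- the source of those big compression ratios.
rle = [(value, len(list(run))) for value, run in groupby(status_column)]
print(rle)   # [('closed', 6), ('open', 3), ('pending', 1)] -- 3 entries for 10 values

# Aggregations can be answered from the compact form without decompressing...
counts = dict(rle)
print(counts["closed"])     # 6

# ...and a per-column sum touches only the single column it needs.
print(sum(amount_column))   # 1330.0
```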