My last blog (Extract-Transform-Load (ETL) Technologies – Part 1) discussed the purpose, function, and some of the inherent benefits of ETL technologies for moving data from source applications into target reporting and analytic environments. Hopefully that discussion gave you a general understanding of ETL: what it is, how it can be used, and why you would want to use it. Now let’s dive a bit deeper and discover who some of the key vendors are in the ETL tools marketplace.
ETL tools have been around for some time; they have evolved, matured, and now present us with productive environments for Big Data, Data Warehouse, Business Intelligence, and analytics processing. The ETL data processing cycle itself can be simple or highly complex. Most vendor offerings address these diverse requirements well enough, although some do better than others. The challenges with ETL tools come from the other important considerations, like:
In 1999, when I was in my first year of university, if I wanted to check my e-mail I had to go to a computer lab. I would walk in, open the browser, type the address, and then go out to grab some coffee. Usually when I came back I was really happy to see that at least half of the page had loaded.
Today people are not so patient. They expect a response from a website within a few seconds at most, and if your application is not that fast, you have a problem. In fact, “a few seconds” is a very optimistic figure: nowadays we measure response time in milliseconds.
If you want to create something really popular on the Web (and your application handles a large amount of data), sooner or later you will face the need to implement a full-text search engine for your system.
In this post I will try to describe and compare two technologies: Microsoft SQL Server Full-Text Search and Apache Lucene. The first is an embedded SQL Server feature; the second is a third-party software library, originally developed in Java and ported to many other platforms, including .NET.
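To give a flavor of the SQL Server side before the comparison, here is a minimal full-text sketch. The Articles table, its columns, and the index name are hypothetical, and it assumes the database already has a default full-text catalog:

```sql
-- Build a full-text index on a hypothetical Articles.Body column,
-- keyed on the table's existing unique index PK_Articles.
CREATE FULLTEXT INDEX ON dbo.Articles (Body)
    KEY INDEX PK_Articles;

-- Match inflectional forms of "search" together with any word
-- starting with "engine" (prefix terms must be double-quoted).
SELECT ArticleId, Title
FROM dbo.Articles
WHERE CONTAINS(Body, 'FORMSOF(INFLECTIONAL, "search") AND "engine*"');
```

Unlike a plain `LIKE '%term%'` scan, the `CONTAINS` predicate resolves matches through the full-text index, which is what makes millisecond-scale search over large text columns feasible.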
How to build a super fast search for your web application.
My last blog (Column Oriented Database Technologies) discussed the differences between Row and Column oriented databases and some key players in this space. Concepts and technologies on Big Data have been discussed in previous blogs (Big Data & NoSQL Technologies & NoSQL .vs. Row .vs. Column). From these blogs one should surmise that deciding upon the best database technology (or DBMS vendor) really depends on schema complexities, how you intend to retrieve your data, and how to get it there in the first place. We’re going to dive into this next, but before we do it is imperative that we briefly examine the differences between OLTP and OLAP database designs. Then let’s leave OLTP details for a future blog as I expect most readers already know plenty about transactional database systems. Instead we’ll focus here on OLAP details and how we process Big Data using ETL technologies for data warehouse applications.
Generally, the differences between OLTP and OLAP database applications center on how frequently data must be stored and retrieved, the integrity of that data, how much of it there is, and how fast it grows. OLTP database schemas are optimized for processing transactions by an increasing number of users, while OLAP database schemas are optimized for aggregations against an increasing amount of data and exponential query permutations. Design considerations involve normalization, indexing, datatypes, user load, storage requirements, performance, and scalability. We will need to defer the many interesting details on these considerations to a future blog.
In 2007 I worked for a company that provided data warehousing solutions. Our biggest client database was about 2–3 TB, which was not that much even five years ago, but because of an extremely complicated ETL process and complex reports we had a lot of trouble.
We spent hundreds of hours monthly on code optimization, but every 30 minutes saved on ETL and processing was offset by monthly data growth, and we had to start over and over.
Finally the time came when we couldn’t improve anything else in the code, and the company decided to drastically upgrade the hardware. But extremely expensive servers and disk storage gave us only a few percent of performance improvement. We thought about migrating to another platform, but preliminary estimates showed that every traditional RDBMS has the same bottleneck: disk storage. If it were happening today, we would probably turn to fundamentally new technologies.
The whole software development industry is highly innovation-oriented, but RDBMSs are always perceived as something very conservative. I still list ANSI-92 SQL knowledge in my CV, and almost all the fundamental works from the 1970s are still up to date; in spite of that, the data management software industry is developing very quickly, and great new products come to market each year.
When migrating Oracle tables to SQL Server, sequence conversion is one of the most common problems. SQL Server 2008 does not have any object even distantly resembling an Oracle sequence, and some tradeoffs are inevitable when you decide how to implement its behavior in the target system. In SQL Server 2012 Microsoft introduced sequences, but several incompatibilities with the Oracle implementation still exist.
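As a rough illustration of the 2012 feature (the sequence name here is hypothetical), the two sides line up like this:

```sql
-- Oracle:      CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;
-- SQL Server 2012+ near-equivalent:
CREATE SEQUENCE dbo.OrderSeq
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;

-- Oracle:      SELECT order_seq.NEXTVAL FROM dual;
-- SQL Server:
SELECT NEXT VALUE FOR dbo.OrderSeq;
```

One example of the remaining incompatibilities: Oracle’s session-scoped `CURRVAL` has no direct T-SQL counterpart, so code that reads the current value after a `NEXTVAL` call has to be restructured (for instance, by capturing `NEXT VALUE FOR` into a variable).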
If you ask any experienced SQL Server developer, “Can I build a view on a temporary table?” the answer will be, “Of course not. And why on Earth would you need that?” Well, Oracle developers can tell you why. Temporary tables are very useful when you have to separate data between users/sessions, and creating a view on them is just a convenient way to encapsulate some logic. SQL Server developers are used to doing without this, but when they try to move a database from Oracle they will have to find a workaround, because it’s a pretty common pattern in Oracle.
Let’s take a look at how it works.
The first thing that comes to mind is to use a regular table instead of a temporary one. To emulate data separation between sessions we can add a “spid” column with a default value of @@SPID (a system function that returns the session ID of the current user process). Then we need to create a view named exactly like the original temporary table (of course, we have to name our regular table differently) and add the WHERE condition “spid = @@SPID”. A select from the view will then return only the data that was populated in the current connection. The next challenge is that in Oracle, data in a temporary table is automatically deleted on rollback/commit.
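Put together, the first part of the workaround might look like this (the table, view, and column names are illustrative):

```sql
-- Regular table standing in for an Oracle global temporary table;
-- the spid column tags each row with the session that wrote it.
CREATE TABLE dbo.TempOrders_Storage
(
    OrderId INT   NOT NULL,
    Amount  MONEY NOT NULL,
    spid    INT   NOT NULL DEFAULT (@@SPID)
);
GO

-- View named exactly like the original Oracle temporary table:
-- each session sees only the rows it inserted itself.
CREATE VIEW dbo.TempOrders
AS
SELECT OrderId, Amount
FROM dbo.TempOrders_Storage
WHERE spid = @@SPID;
```

With this in place, existing `SELECT ... FROM TempOrders` statements ported from Oracle keep working unchanged; the automatic cleanup on commit/rollback still has to be handled separately.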
You have probably heard about autonomous transactions in Oracle. In a few words, using just one pragma directive in a PL/SQL block, you can isolate it from the caller’s context so that it becomes an independent transaction. This is quite useful if you want to organize data auditing, and it is pretty widely used in production databases. As you might guess, autonomous transactions have no direct equivalent in Microsoft SQL Server. The only way to isolate a Transact-SQL block from the transaction context is to open a new connection. There are several options here, so let’s start with the SSMA approach.
Assume that we have some logging logic in a stored procedure named “dbo.LogError” and we want its work committed even when the main transaction is rolled back.
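One common way to get that second connection from inside T-SQL (not necessarily byte-for-byte what SSMA generates) is a loopback linked server with transaction promotion disabled, so the remote call runs in its own transaction. The server name, database name, and LogError parameter below are assumptions for the sketch:

```sql
-- One-time setup: a linked server pointing back at this instance.
EXEC sp_addlinkedserver
     @server = N'LOOPBACK', @srvproduct = N'',
     @provider = N'SQLNCLI', @datasrc = @@SERVERNAME;
EXEC sp_serveroption N'LOOPBACK', N'rpc out', N'true';
-- Keep the remote call OUT of the caller's distributed transaction.
EXEC sp_serveroption N'LOOPBACK', N'remote proc transaction promotion', N'false';
GO

BEGIN TRANSACTION;
BEGIN TRY
    -- ... main work ...
    RAISERROR(N'Something went wrong', 16, 1);
END TRY
BEGIN CATCH
    -- Runs on a separate connection in its own transaction,
    -- so the log row survives the rollback below.
    EXEC LOOPBACK.MyDb.dbo.LogError @message = N'Main transaction failed';
    ROLLBACK TRANSACTION;
END CATCH;
```

The key setting is `remote proc transaction promotion = false`; without it, the loopback call would be promoted into the caller’s distributed transaction and rolled back along with it.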
My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors. Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).
You are probably all familiar with row-oriented databases: tables of data where rows of fields (also called columns) represent the structural storage, with corresponding SQL queries that select, insert, update, and delete that data. Most database vendors, like Oracle, Microsoft, Sybase, Informix, and many others, base their technology on this model. Column-oriented databases are indeed what you might surmise: tables of data where columns of data values represent the structural storage. What you might not expect is that on the surface many column-oriented databases look and feel like row-oriented databases, using SQL queries in much the same way. Creating tables, storing data, and querying them are all pretty much identical. They may appear similar, but the principal thing to understand is that there are significant differences under the hood, in particular in physical storage and query optimization.
As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Let’s not confuse that with column oriented databases. They are different. Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.
First, some key benefits of column-oriented databases:
As mentioned in my last blog (Big Data & NoSQL Technologies), understanding and using the right tools for the job is essential. Database systems and applications are no exception. For Big Data or small data, knowing and using the right data modeling methodology is also critical. Today, business uses the Internet to make money, yet harvesting the data in an efficient way often develops into a considerable IT challenge. The ongoing debates behind the many issues involved are not the focus of this blog; however, the concepts presented here are.
Data Warehouse and Business Intelligence (DW/BI) systems have become a solution of choice for industries looking to address the many critical business questions ingrained in this vital data. While providing an essential data integration process that supports a variety of reporting requirements, DW/BI systems often involve significant construction effort, ongoing maintenance, and reengineering when data structures and business rules change. And they do change, presenting us with problem #1.
Building effective and efficient DW/BI systems can quickly become a daunting and difficult technical ordeal even for veteran engineering teams. Several integrated technologies are required, from database systems, data transformation (ETL) tools, various programming languages, and administration and reporting software to high-performance networks and computers with very large storage capacities. In many cases even the sheer volume of data can be overwhelming; yup, problem #2.
Once reporting is delivered to the diverse user communities that consume the data, simplified usability is expected, yet today’s technological realities often require highly trained end users. The design, creation, delivery, and support of a robust, effortless DW/BI system for intelligent use are, sadly, complex. You guessed it: problem #3.
Furthermore, as upstream systems change, as DW/BI technology plows ahead, and as the dynamic complexities involved prevail, we also find that every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for. The integration impact can be enormous, often requiring complete regeneration of the aggregated data; hence, problem #4.
On the SQL Server YouTube channel, Dmitry Balin (@dbalin), CEO and co-founder of DB Best Technologies, guides you through the new Microsoft SQL Server 2012 Migration Assistant (SSMA), which supports migration to SQL Server 2012. Twenty-two minutes may seem long for a video tour, but we’re in the big data space, so it’s well worth your time.