 
DB Best Chronicles
 
Talks on Managing Data and Applications Anywhere

Using Amazon Redshift performance tuning tips to reduce query times by 800%

Posted on December 14th, 2017 | In AWS, AWS SCT, Data Warehouse
 

Database administrators often see query performance in Amazon Redshift degrade as the size of the data increases. The problem becomes even more significant for regularly executed queries. Design-time decisions that made sense given the statistics available at the start of the project may no longer fit new data volumes and query patterns. In other words, the Amazon Redshift cluster requires some optimization. See how these Amazon Redshift performance tuning tips can improve query execution times when users start noticing performance issues.

Amazon Redshift performance tuning tips

Amazon introduced a feature called Redshift Optimization for the AWS Schema Conversion Tool (SCT) with the November 17, 2016 release. You can now use the Schema Conversion Tool to optimize existing Amazon Redshift deployments! This dramatically reduces the manual effort of addressing performance issues with your Redshift deployments.

Here’s why you should consider taking advantage of the Redshift Optimization option in the AWS Schema Conversion Tool.

Preparing the environment

We decided to use HammerDB to run the TPC-H data warehouse test to see how well the Redshift Optimization option could improve query performance.

NOTE: We are only using HammerDB for comparison purposes, to illustrate how changes in the configuration of Redshift impact performance. This cannot be considered an official TPC-H benchmark run that you can use to compare with other database platforms.

To run the test, we installed HammerDB on Windows Server 2012 R2 running on an m4.xlarge instance in Amazon EC2 to generate the data and run the queries. We created the Redshift database on a ds2.xlarge node.

In HammerDB, we selected the TPC-H configuration which creates 8 tables for a modified run against Amazon Redshift with 13 queries out of the standard 22 queries used for an official TPC-H run.

To build the schema in HammerDB as shown below, we initially created all tables with the 'even' distribution style and without sort keys.

HammerDB create schemas
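For reference, here is a sketch of what the initial, unoptimized definition looked like for one of the TPC-H tables, using the NATION table as an example: the distribution style is explicitly 'even' and there are no sort keys.

```sql
-- Initial schema sketch: EVEN distribution, no sort keys
CREATE TABLE IF NOT EXISTS public.nation(
    n_nationkey numeric(18,0) NOT NULL,
    n_name      character(25),
    n_regionkey numeric(18,0),
    n_comment   character varying(152)
)
DISTSTYLE EVEN;
```

With EVEN distribution, rows land on slices round-robin, so every join against this table has to redistribute data across the network at query time.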

After creating the empty database schema, we used the application's Datagen option to generate the data files with a scale factor of 1000. The result was a set of CSV files for each table, totaling approximately 1.1 terabytes.

Next, we packed the generated data files into gzip archives using the following command:


gzip.exe -r D:\HammerDB\*.tbl

This reduced the size of the initial data set to around 320 GB.

We used S3 Browser to copy these archives to an Amazon S3 bucket located in the same region as the Redshift cluster to minimize network latency. We then created scripts to upload the data to Redshift using the COPY command. It was nice that Redshift applies LZO compression to the loaded data, which resulted in a data set of just 200 GB.
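The upload scripts were built around Redshift's COPY command. A minimal sketch for one table might look like the following; the bucket name and IAM role ARN are placeholders, not the values we actually used.

```sql
-- Load the gzip-compressed, pipe-delimited files for one table
-- directly from S3. Bucket and role ARN are placeholders.
COPY public.nation
FROM 's3://your-bucket/tpch/nation.tbl.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
GZIP
DELIMITER '|'
REGION 'us-east-1';
```

COPY loads the files in parallel across the cluster's slices, which is why it is strongly preferred over row-by-row INSERTs for bulk loads.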

Now everything was set to start the HammerDB benchmark.

What happens if your initial optimization strategy results in poor performance

In the first run, we configured the TPC-H driver to run with just one user and one query set for the 13 queries as shown below.

HammerDB Settings

The first run took a little over 27 hours to complete! Here are the results for each of the queries in the first run. Note that there was no warm-up run prior to the benchmark.

Hammerdb Log @ Sun May 28 18:15:41 -0000 2017
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Hammerdb Log @ Sun May 28 18:16:23 -0000 2017
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
Vuser 1:Executing Query (1 of 13)
Vuser 1:10 rows returned in 355.67 seconds
Vuser 1:Executing Query (2 of 13)
Vuser 1:1 rows returned in 229.854 seconds
Vuser 1:Executing Query (3 of 13)
Vuser 1:1 rows returned in 253.078 seconds
Vuser 1:Executing Query (4 of 13)
Vuser 1:1 rows returned in 4925.949 seconds
Vuser 1:Executing Query (5 of 13)
Vuser 1:1 rows returned in 29.741 seconds
Vuser 1:Executing Query (6 of 13)
Vuser 1:100 rows returned in 1104.394 seconds
Vuser 1:Executing Query (7 of 13)
Vuser 1:100 rows returned in 31989.634 seconds
Vuser 1:Executing Query (8 of 13)
Vuser 1:0 rows returned in 0.829 seconds
Vuser 1:Executing Query (9 of 13)
Vuser 1:0 rows returned in 1.075 seconds
Vuser 1:Executing Query (10 of 13)
Vuser 1:0 rows returned in 26868.327 seconds
Vuser 1:Executing Query (11 of 13)
Vuser 1:0 rows returned in 26372.198 seconds
Vuser 1:Executing Query (12 of 13)
Vuser 1:100 rows returned in 6070.607 seconds
Vuser 1:Executing Query (13 of 13)
Vuser 1:3480 rows returned in 343.742 seconds
Vuser 1:Completed query set in 98545 seconds
Vuser 1:Geometric mean of query times returning rows is 846.62011

In our experience, this benchmark usually runs much faster on hardware of this class once the appropriate distribution styles and sort keys are chosen. To see what Amazon Redshift's optimizer flagged for each of the problematic queries, you can run the following query against the stl_alert_event_log system table:

SELECT DISTINCT
    query,
    SUBSTRING(event, 0, 200) AS event,
    SUBSTRING(solution, 0, 200) AS solution
FROM stl_alert_event_log
WHERE query IN
    (3990653, 3990668, 3990683, 3990699, 3990907, 3990909, 3990961,
     3992326, 3992327, 3992328, 3993487, 3994626, 3994986)
ORDER BY query;

Here’s what we discovered:

Redshift query plan

The Amazon Redshift optimizer looks for ways to minimize network traffic between compute nodes and to minimize file I/O latency when reading data.

The disk storage for a compute node in Amazon Redshift is divided into a number of slices; the number of slices per node depends on the node size of the cluster. When you load data into a table, Amazon Redshift distributes the rows of the table across the node slices according to the table's distribution style, which can be 'even', 'key', or 'all'.
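You can check how many slices your own cluster has with a quick query against the stv_slices system view; this is a sketch, and it requires access to Redshift's system tables.

```sql
-- Count the data slices on each compute node
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;
```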

With the 'even' distribution style, rows are distributed evenly across slices regardless of the values in any particular column. With 'all', a copy of the entire table is distributed to every node. With 'key', rows are distributed according to the values in one specified column. The idea is to pick a distribution key column that groups commonly accessed data used in WHERE clauses, keeping related rows together within node slices.

When you create a table, you can define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order.
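To review the distribution style and sort key currently assigned to each of your tables, you can query the svv_table_info system view. A sketch:

```sql
-- Inspect distribution style, first sort key, and row skew per table;
-- high skew_rows suggests a poorly chosen distribution key
SELECT "table", diststyle, sortkey1, skew_rows, unsorted
FROM svv_table_info
ORDER BY skew_rows DESC;
```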

Amazon Redshift performance tuning tips to optimize cluster storage

Applying these Amazon Redshift performance tuning tips with Redshift Optimization requires several steps to optimize the sort keys and distribution strategy for your cluster storage.

Duplicating the original cluster. We created a snapshot of the original Redshift cluster in the AWS console, then created a new Redshift cluster and used the console's default option to restore it from the snapshot. This gave us the original and the newly created clusters as the source and target, respectively, for the AWS SCT project. We also didn't need to migrate the data, because it was already restored from the snapshot.

Collecting statistics from the original database. We used offline statistics collection and then uploaded the files to AWS SCT. Alternatively, SCT provides an option to collect statistics online, directly from the tool.

Running the optimization. Amazon recommends using different clusters as the source and target for the optimization project. SCT lets you select the migration strategy when you launch the optimization: 'Migration to a copy' or 'Migration to a clean slate'. In our case, we selected the first option because we had already created a copy of the original cluster.

To launch the optimization, we selected the ‘Run optimization’ option in the pop-up menu of the SCT. We used the default optimization strategy.

Redshift Optimization Project Settings

According to Amazon, the default optimization parameters are based on best practices, so we checked the optimization results using the default values.

Redshift Optimization 2

We set speeding up the 'expensive' or 'heavy' queries as the main goal of the optimization procedure.

The results of optimization included SQL scripts to recreate tables. Here’s an example of the script for one of the tables.

CREATE TABLE IF NOT EXISTS public.nation(
    n_nationkey numeric(18,0) NOT NULL,
    n_name character(25),
    n_regionkey numeric(18,0),
    n_comment character varying(152)
)
DISTSTYLE KEY
DISTKEY
(
    n_nationkey
)
SORTKEY
(
    n_name
);

As you can see, the generated SQL code now includes the sort keys and the distribution keys.

The results showed more than an 8 times performance increase!

In the AWS Schema Conversion Tool, you can save the SQL script for the new tables or apply it directly to the target cluster. We chose the second option and applied the code to the target. This operation took around 2 hours, after which we could run the HammerDB benchmark again, using the same parameters as in the first run against the non-optimized database.

Here are the results of the HammerDB benchmark compared to the first run.

optimization results

As you can see, SCT dramatically improved the performance of the 'expensive' or 'heavy' queries in the new cluster, where the tables included sort and distribution keys. The HammerDB benchmark completed in about 3 hours. This was an 8.2 times performance gain!

It was great to see that SCT's new Redshift Optimization feature made good decisions on distribution styles and sort keys for the queries that performed poorly. The fact that three queries ran somewhat slower than before was expected.

NOTE: You can't always expect an 8 times performance increase using these Amazon Redshift performance tuning tips with Redshift Optimization. These results are based on a specific benchmark test and won't reflect your actual database design, size, and queries. However, Redshift Optimization eliminated just about all of the labor-intensive manual steps outlined in the Amazon Redshift Tuning Table Design tutorial.

Summary

As you can see, using these Amazon Redshift performance tuning tips with the AWS Schema Conversion Tool Redshift Optimization feature greatly reduces the time needed to tune query performance, with good results. In our case, the Redshift Optimization feature improved the performance of queries by 8.2 times by providing recommendations for distribution styles and sort keys based on historical query execution information.

It’s well worth the minimal effort to run the Redshift Optimization against your existing Amazon Redshift databases. If you do, please share your results with us!

If you have any questions about this great feature in AWS SCT, feel free to contact DB Best at info@dbbest.com. DB Best has migrated more databases using SCT than any other partner in the world. So, if you want to learn more about how DB Best can help you with migration projects, check out our Jumpstart program at https://landing.dbbest.com/aws-migration.
