Blog: Validating AWS EC2 SQL Server deployments using benchmark tools

As cloud computing becomes more and more popular and demanded among our customers, quite often we at DB Best need to answer the simple question: “How can I get the best value for running my SQL Server workloads on AWS EC2 instances?”. Issues like mistakenly using older generation EC2 instances or not optimizing your EBS volumes can prove to be costly. Likewise, not all hardware supporting the AWS infrastructure is created equal, despite the fact that they try to keep everything running at peak performance levels. Given the variability that can take place, we are extending AWS database migration service to include an option of validating AWS EC2 SQL Server deployments using HammerDB.

In this blog post, we will discuss the process we use for running the HammerDB TPC-C benchmark for validating SQL Server 2017 instances on AWS EC2 instances. We’ll show how we evaluate results against our baseline database for giving the deployment a go or no-go rating before releasing the deployment to our customer. We use a binary rating since our deployments are typically done using tools like AWS CloudFormation Template. Rather than wasting time trying to debug a known deployment, if the system performs significantly less than the baseline, we terminate the deployment and create a new one to reevaluate. This process ensures that our customers are getting the most value from their deployment of SQL Server on EC2 instances.

Overcoming challenges with existing cloud benchmark approaches

The use of benchmarking of AWS infrastructure is not a new idea. We first got the idea after hearing about the way Netflix tests their EC2 instances for production use. Netflix has been a pioneer in making their services resilient and performance with tools open-source tools like Chaos Monkey. Chaos Monkey literally terminates instances and services in production to make sure systems fail over or degrade gracefully. To learn more on how Netflix benchmarks EC2 instances, check out Brendon Gregg’s presentation from re:Invent 2017 – How Netflix Tunes EC2 Instances for Performance.

State of the art benchmarking approaches

In addition, academic papers like Cloud Benchmarking for Performance by Blesson Varghese, et al. and how they use bonnie++, lmbench, and sysbench tools together to evaluate EC2 instance performance. The challenge with sysbench is that it only works for MySQL and PostgreSQL. It also is constrained to on-line transaction processing (OLTP) like transactions. However, the approach presented in the paper is one that we based our approach on for benchmarking with HammerDB.

Using HammerDB benchmark for SQL Server deployments

What we like about HammerDB is that it works with all the popular databases including SQL Server. The open source community has also optimized HammerDB to take advantage of In-Memory OLTP and In-Memory Clustered Columnstore Indexes for new releases of SQL Server. HammerDB also works for both OLTP and data warehouse (DW) benchmarks based on the TPC.org TPC-C and the original TPC-H specifications.

We would like to make it very clear that HammerDB is not an official benchmarking tool that can be used to publicly compare benchmark results between database products or platforms.

However, we have found that HammerDB is a great tool that provides consistent results on a target platform that we want to deploy SQL Server solutions to for our customers. This way we can make sure they are getting the expected performance for their production deployments. If there is a weakness in the platform infrastructure or a miss-configured SQL Server, HammerDB will catch the problem system.

We aren’t the only ones using HammerDB to evaluate how SQL Server runs on EC2 instances. While we were in the process of establishing our baselines for EC2 instances, AWS published a whitepaper on November 1, 2018 titled SQL Server Performance on AWS. The goal of their paper was to provide guidance to SQL Server customers what instance types would be suitable for their workloads based on transaction throughput requirements.

The approach that we will show in this paper is using HammerDB with SQL Server 2017 for TPC-C using disk-based tables. We do this because most of our customers simply want to do a lift and shift of their existing databases to AWS before trying out other optimizations.

Configuring the benchmark for SQL Server 2017 on EC2

In this section, we’ll describe the configurations and settings that we use for establishing baselines for a single EC2 instance and running the tests.

AWS EC2 Architecture

At a high level, we use one EC2 instance that has our preconfigured HammerDB and another preconfigured SQL Server instance that we create based on the desired instance type for a location and availability zone as shown below.

AWS EC2 Architecture

This example shows a deployment of placing both EC2 instances in a cluster placement group that can be later used for a deployment of SQL Server Availability Group replicas. We perform our deployments using AWS CloudFormation Templates with the desired parameters. The advantage here is that it is easy to terminate the environment using the template for our baseline testing and for a “no-go” environment.

EC2 instances and storage used for this blog post

In this blog post, we are sharing our results for the US-West-2 (Oregon) using On-Demand instances. Here is a summary of the EC2 systems tested for the Windows Server 2016 Datacenter Edition operating system. We aren’t including the price of SQL Server licenses in our cost analysis because customers can transfer licenses from on-premises to AWS using their negotiated pricing from Microsoft. When we benchmark, we use a custom AMI using SQL Server 2017 Developer Edition. The pricing information provided is based on rates as of November 13, 2018.

Instance ClassInstance NamevCPUECUMemory (Gib)VM w/Windows $/moMax IOPS / MBpsNetwork Perf (Gbps)
Smallm4.xlarge41316$280.326,000 / 93.75High
Small M5m5.xlarge41616$274.4818,750 / 437.5Up to 10
Mediumr4.4xlarge1653122$1,314.0018,750 / 437.5Up to 10
Larger4.16xlarge64195488$5,256.0075,000 / 1,75025

With EC2 instances, how you configure your EBS drives can make a big difference as we will demonstrate later in the post. As such, we’ll show results for six different drive configurations using gp2 storage.

Number of DisksBase Disk SizeDrive Unit Cost Per MonthIOPS per diskMB/secTotal storagePrice per month
5×2 and 7256 GB$25.6010,0001601.75 TB$179.20
5400 GB$40.0010,0001602,000 GB$200.00
10200 GB$20.0010,0001602,000 GB$200.00
12 TB$204.8010,0001602 TB$204.80
21 TB$102.4010,0001602 TB$204.80

When striping disks, we baseline test striping volumes with Windows Disk Manager and Windows Storage Spaces.

Configuring SQL Server for benchmark

In configuring SQL Server for the benchmark, we aren’t necessarily trying to optimize SQL Server to run the benchmark best. We started with the guidelines that AWS has posted in their white paper Best Practices for Deploying
Microsoft SQL Server on AWS
. We also considered Microsoft and AWS offerings that we used for our customers. We covered them in our blog post Best Practices For Moving SQL Server To AWS: Amazon EC2 and RDS.

Rather than use AWS Marketplace templates for SQL Server 2017, we create our own image based on a snapshot of an installed version of SQL Server 2017 with the latest CU. In our case, CU12 build 14.0.3045.24.

Disk drive configurations for data and log files

We use various drive configurations that cover the 1.75 TB to 2 TB in size to allow for plenty of growth. In addition, we can use the baseline results to check to see if they were correctly configured for the specific VM type to check for errors. Here are the different configurations.

  • 5×2 – Five gp2 256 GB drives striped for the data volume and two gp2 256 GB drives striped for the log volume
  • 7 – Seven gp2 256 GB drives striped for both data and log volumes
  • 10 – Ten gp2 200 GB drives striped for both data and log volumes
  • 1 – One gp2 2 TB drive for both data and log volume
  • 2 – Two gp2 1 TB drives striped for both data and log volumes

Since AWS by default does not offer ephemeral drives by default, we placed the TempDB files on the data drive.

Server Configuration

For SQL Server, we use the default values for a SQL Server installation. Specifically, we don’t change the value for Maximum Server Memory. There is guidance from Brent Ozar, Microsoft Data Platform MVP, that suggests that you should set aside 4 GB of memory for the OS or 10%, whichever is greater for VMware deployments dated November 27, 2012. However, hypervisor technologies and SQL Server have improved over time. We encourage checking out Microsoft’s help topic Performance guidelines for SQL Server in Azure Virtual Machines. We have found that the majority of the guidelines can be applied to AWS EC2 deployments of SQL Server.

Database configuration

For the TPC-C benchmark, we use a TPC-C database size of 1000 warehouses generated by HammerDB. The result is a database approximately 80 GB in size. The distribution of records is as follows:

TPC-C benchmark results

To avoid any file growth issues, we oversize the database using eight 20 GB data files within the primary filegroup along with one 20 GB file for the transaction log.

Other configurations include:

  • Creating TEMPDB on the data volume using default sizes and the recommended files number of files based on the size of the VM.
  • Setting MAXDOP=1. We found that not doing this can cause deadlocks at the higher user counts.
  • Setting Recovery Mode on the database to Full. Although we are not taking transaction log file backups during the benchmark, this is likely the way we would configure databases for our customers with OLTP workloads.

Automation approach for validating AWS EC2 SQL Server deployments

Our goal in this process is to make it easy to deploy and validate the EC2 instances for our baseline values and database servers. Here is the overall flow for the automation of the benchmark process.

  • Use an AWS CloudFormation template to configure the HammerDB driver system and SQL Server test VM for a specific Account, Region, Availability Zone, VPN, Subnet, Instance Type, etc.
  • Run automation script for the SQL Server test VM which completes the SysPrep installation of SQL Server and restores the TPC-C database from a backup file with MAXDOP=1 and Recovery Model=FULL
  • Wait for the SQL Server test VM to acknowledge that it is ready.
  • Execute the script that executes HammerDB using its command line interface (CLI) with an autopilot run with 3, 5, 8, 13, 55, 89, 144, 233 virtual users. Each test sequence takes 10 mins with a 2-minute ramp-up time and 3 minutes of executing time. The remaining 5 minutes are to let the system stabilize for the next run of virtual users.
    NOTE: We use Fibonacci series for virtual users ratios that it provides between increasing user counts. It provides greater weighting of results at the lower end to represent the variable workloads that take place over a month. There always seems to be some type of burst at the end of a month represented by 144 and 233 user runs. The rest of the month transaction processing is covered by the lower end values of 3, 5, 8, and 13.
  • Copy the HammerDB log files with the results to S3 storage and then processed to extract the TPM metric for each of the virtual user runs
  • Compare the results to our benchmarked results to determine a thumbs up or terminate.
  • If the benchmark result is significantly better than our recorded value, we decide whether to use it as the new baseline. AWS is constantly working to improve the performance of their systems, so we need a way to easily update our baseline.
  • For the thumbs up system, we drop the test TPC-C database and then release the VM to our customer for the actual database deployment.

HammerDB configuration specifics

The following are the specific command line parameters we use to run the TPC-C runs.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
dbset db mssqls
dbset bm TPC-C
diset connection mssqls_server
diset connection mssqls_tcp TRUE
diset connection mssqls_authentication SQL
diset connection mssqls_pass
diset tpcc mssqls_checkpoint TRUE
diset tpcc mssqls_driver timed
diset tpcc mssqls_rampup 2
diset tpcc mssqls_duration 3
diset tpcc mssqls_allwarehouse TRUE
diset tpcc mssqls_timeprofile TRUE
vuset logtotemp 1
vuset UNIQUE 1
vuset nobuff 1
vuset timestamps 1

Evaluating baseline results

During our baseline testing, we discovered several configurations changes than can represent a significant savings in operating costs for running SQL Server 2017 on Windows Server 2016. In this section, we’ll cover that our baseline testing showed for various scenarios.

Yes, newer generation EC2 instance types can lower your costs

In this example, we are evaluating the older m4.xlarge instance with the newer m5.xlarge Nitro instance. To understand how we determine the operational cost savings, we will walk through the data from the performance runs to come up with a value that we call Cost Per One Billion Transactions Per Month (CPOBTPM). To compute CPOBTPM, we do the following:

  • We compute the total EC2 On-Demand instance cost per month (CPM) by adding together the price per month for the instance type and gp2 storage for the specific AWS Region
  • For each of the virtual user TPC-C Transactions Per Minute (TPM) runs, we multiply the recorded TPM value by 525,600 minutes per year divided by 12 months per year divided by 1 billion. This gives us a value of Transactions Per Month in Billions (TPMB)
  • For each of the virtual user runs, we compute CPOBTPM by dividing CPM by TPMB
  • Finally, we average the 10 runs for the CPOPTPM to determine the overall CPOPTPM value for the result

Let’s see how these two instances types compare for our Small instance class representing a system with 4 hyper-threaded cores and 16 GB of RAM.

In this example, we are comparing the results with the following configurations:

  • 5×2 and 7 disk drive configurations using Windows Storage Spaces for disk striping
  • HammerDB client and the SQL Server instance in a cluster placement group within availability zone 2B

Here are the baseline values that we collected for the different virtual server runs.

Comparing m4.xlarge versus m5.xlarge TPM(K)

It looks obvious that the newer m5.xlarge performs better, so let’s walk through the next step to compute TPMB.

Comparing m4.xlarge versus m5.xlarge TPMB

Now we compute CPOBTPM for each of the virtual users. The CPM for the m4.xlarge instance with 1,792 GB of gp2 storage is $453.68. The CPM for the m5.xlarge costs less per month with a value of $459.52.

Comparing m4.xlarge versus m5.xlarge CPOBTPM

Here is what the average CPOBTPM looks like for the two instance types.

Comparing m4.xlarge versus m5.xlarge Average CPOBTPM

Conclusion – Nitro EC2 instances are a great value

Overall the newer generation m5.xlarge costs 24% less than the m4.xlarge system. The conclusion here is obvious. If the AWS region you are currently running in supports the new Nitro instances, you should consider migrating your existing instances to the newer and less expensive Nitro instance.

Based on this data, we exclude non-Nitro instances for the specific AWS region where an equivalent Nitro instance exists in our CloudFormation templates.

Disk striping matters, with limits

In this next example, we will look at how striping different disk configurations impact operational costs for the Small, Small M5, Medium, and Large instance classes.

In these scenarios, we are looking results baselined with the following criteria:

  • Disks striped with Windows Storage Spaces.
  • HammerDB client and the SQL Server instance in a cluster placement group within availability zone 2B

Here are the TPM(K) results for each of the instance classes.

AWS Small Striping drives TPM(K)

AWS Small M5 Striping drives TPM(K)

Note, for the Small M5 instance type, we used a data + log volume with 1 gp2 (2TB) and 2 gp2 (1 TB) drive configurations. It’s interesting to see that for 3 and 5 users, the results looked ok. However, you can clearly see, that some level of striping with 5 or more disks yields better performance than larger drives.

AWS Medium Striping drives TPM(K)

AWS Large Striping drives TPM(K)
The following chart aggregates the test results for the Large, Medium, Small, and Small M5 instance types.

Comparing disk configurations for large, medium, small, and small m5 cost per one billion transactions per month

As you can see, it’s a toss up across the configurations between the 10×200 and 7×256 disk volume configurations. This is especially true with the high performing EC2 instances. Note, the trick is not to go too small for the EBS disks used to create striped volumes. For EC2 instances that support EBS bursting, AWS advises that volumes greater than 170 GiB up to 214 GiB deliver a maximum throughput of 160 MiB/s if burst credits are available.

Location can impact performance results

In our original baselining exercise, we were curious to understand if the HammerDB client had to be in the same cluster placement group as the test server. To test this assumption, we ran tests for the “best” configuration results for the following configurations with the HammerDB client in US-West-2B and test systems in 2A, 2B (not in the same placement group), and 2C:

  • SmallM5 – 5×2 gp2 (256 GB)
  • Medium – 7 gp2 (256 GB)
  • Large – 10 gp2 (200 GB)

Here are the charts from our baseline runs for the average CPOBTPM.

AWS small m5 AZ differences CPOBTPM

AWS Medium AZ differences CPOBTPM

AWS Large AZ differences CPOBTPM

As you can see, for each instance type we tested, the systems running in availability zone 2A were consistently less performant and would have been flagged as a NO_GO instance. For the other availability zones, not running the HammerDB client in the same Availability Zone would have been accepted as a GO instance.

In talking with engineers at AWS, they told us there are times when different availability zones may have intermittent latency issues. Fortunately, AWS has tools to detect and repair these issues before customers may even know the problem existed. As such, we cannot conclude that Availability Zone US-West-2A will always have issues. Instead, if we have repeated issues during our benchmark validation with a specific Availability Zone or Region, we will look at hosting the database solution in another zone.

Conclusion

By having baseline values for different EC2 instance types and disk configurations, we can establish a way to determine based on the cost per month for one billion transactions if the AWS infrastructure is suitable for our customers. By using a dollar figure, it’s relatively easy to understand the impact of having poor performing infrastructure.

Broader applications of this technique

There are other scenarios where we can use HammerDB to create baseline results for scenarios like:

  • Comparing results against an Always On Availability Group deployment
  • Testing results of SQL Server running in Docker containers on Windows or Linux
  • Comparing results against single and replicated Amazon RDS for SQL Server for PaaS versus IaaS cost analysis
  • Testing of DW workloads against the baseline for SQL Server 2017 on IaaS

Our benchmarking framework is designed to work against all of these scenarios.

Key benefits of this approach

In summary, our customers benefit in the following ways:
• The virtual machine and storage configurations meet or exceed the baseline configuration for their desired throughput in transactions per minute and value.
• For customers who subscribe to our managed service, we use our DBMSys platform to continually monitor the EC2 Instance and perform health checks to validate that the instance is running within our customer’s KPIs.
• With the combination of our Data Management Optimization (DMO) analysis and benchmark baseline, we can give our customers a better understanding of the cost per transactions per month up front to show the immediate value of running SQL Server on AWS.

If you are interested in how you can benefit from our benchmarking validation, please contact us today!