What do you do when your data science team needs to perform read-only data warehouse queries against your mission-critical online transaction and batch processing database? Our customer Getty Images had exactly this problem with their production AWS deployment of SQL Server 2014 AlwaysOn on EC2 instances. Their data science team was performing read-only queries over 4 terabytes of historical data. These queries are ad-hoc and look at multiple months of historical data for analysis. The queries would consume all available I/O on the production databases environment and impede the performance of the primary workload. This blog talks about how we helped them solve this problem by nightly copying and restoring their daily backup files to an Amazon Linux 2 server using the i3.8xlarge EC2 instance with SQL Server 2017 for Linux.
Our customer tried using Resource Governor to control the performance hit by the competing workloads. Although the primary workload was able to maintain their mission-critical recovery time and recovery point objective service level agreements, the data science team was frustrated due to slow query performance. They were running with SQL Server 2014 AlwaysOn Availability Groups on EC2 instances with passive secondary replicas. The “obvious” solution would be to convert the passive replica to a read-only replica, but not without cost and risk. Here were some of the solutions we considered before deciding on a different route:
- The automatic failover of the primary AlwaysOn system was at risk due to possible data loss of uncommitted transactions on the replica.
- The licensing costs associated with using a read-only replica for the secondary workload.
- The I/O throughput would have the customer pay for a large EC2 host and exceed their monthly budget.
- Due to the size of the databases and tables being queried, we could not create any covering indexes due to bloating the size of the production databases.
DB Best, together with Getty Images, asked the data science team a simple question. If we could deliver great performance but with a one-day delay in the data, are you interested? Here is the solution we came up for deploying a cost-effective and performant solution using SQL Server 2017 on Amazon Linux 2 with the i3.8xlarge instance featuring 32 vCPUs, 244 GB of RAM, and 4 x 1,900 GB NVMe SSD drives for temporary storage.
Why use SQL Server on Linux? We’ve been using SQL Server 2017 on Linux for several projects with Getty and our benchmarks for data warehouse type queries tend to show faster performance over Windows Server with a lower price point. As of September 2, 2019, the best price/performance TPC-H data warehouse results using SQL Server 2017 use Linux as the operating system!
Step-by-step for getting SQL Server 2017 running on Linux with Active Directory
- First, we needed to clone the permissions for the new server. In this case, the end-users can use the same authentication method regardless of the platform where they access data (Windows or Linux). So, we created and set up read-only access via a new NT security group for the data science team on the source SQL Server 2014 AlwaysOn environment. We also removed the NT group after setting up the permissions.
- Then, we installed and configured an Amazon Linux 2 on an i3.8xlarge EC2 server via Terraform. This server comes with four 1.7 terabytes of ephemeral NVMe SSD disks and only utilizes EBS storage for its boot volume to reduce costs. We tagged this EC2 server to the data science team.
- After that, we modified the boot and user data scripts to build – actually to rebuild – a RAID 0 logical volume across the four 1.7 terabytes of NVMe SSD disks. The new volume was approximately 6.8 terabytes. And we formatted it as an xfs filesystem. Overall, we needed to fit around 4.5 TB of data. This was the total weight of all data files for all databases. This perfectly fits into the i3.8xlarge server
- We created a bash script to install and configure SQL Server 2017 on Linux. The script completes Standard edition in about 2.5 minutes and configures the necessary permissions on the Linux filesystem including NVMe SSD xfs volume.
- The primary database environment used the Enterprise edition of SQL Server on a Windows AMI host. We installed SQL Server on Linux as a secondary application on the Linux AMI. Using Standard edition on Linux AMI allows for lowering their runtime and licensing costs. We intentionally traded off losing 8 vCPUs and 128 GB of RAM with Standard Edition knowing that the NVMe throughput would make up the difference and more.
- Then we created a bash script to install and configure Microsoft Active Directory Authentication on Linux. The script completes the process in just 1 minute.
- After that, we installed and configured the read-only CIFS mounts to access the database backup files from the source SQL Server 2014 HA environment.
- We also used rsync to create a bash script that allows for copying the latest database backup files. You can opt for FULL and DIFFERENTIAL copy from the source SQL Server 2014 HA environment to a restore folder under the NVMe SSD xfs volume.
- Then we scheduled a cron job to perform rsync using the script from above. A sample cron job looks as follows:
00 04 * * 0-5 /var/opt/mssql/red_october/sdrdw_copy_rename_backups.sh >> /var/opt/mssql/red_october/sdrdw_copy_rename_backups.log 1>&1 2>&1
- We then run post-installation SQL Server scripts to setup administrative and data science access. We used NT security groups for the SQL Server 2017 on Linux environment.
- Finally, we set up a SQL Server Agent job to restore the user databases stored on the restore folder.
Now the data scientists run their queries with a dramatic improvement in their processing and runtime. The reasons for this improvement include:
- The primary and secondary workload are separated logically across different systems with no real-time reliance of data. This separation later resulted in the consolidation of 2 SQL Server primary database environments into a new system.
- The data scientist queries run against the NVMe SSD xfs volume. It provides up to ~6500 GB/sec I/O throughput and ~1,500,000 IOPS per second for their ad-hoc queries.
- Also, the data science team agreed to the once a day restore process given the amazing results and lower cost!