Blog: A better way of deleting historical data from a large, highly concurrent SQL Server table to meet GDPR compliance

One of our customers, a global digital media company, came to us with an interesting problem. As part of their General Data Protection Regulation (GDPR) compliance program, our client identified two tables that contained hundreds of millions of rows of historical user login data. GDPR states that you can process personal data while there is a contractual obligation with your client. However, once the contractual obligation ends and you have no legitimate interest in the data, deleting the historical data protects against potential fines from the European Union (EU).


The first table, with 800 million records, contained login data going back to when the company started. However, our client never really processed much of this data. To remedy this, we truncated the table, dropped it from the database, and then removed the code that transferred login records from the transaction table.
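For reference, the cleanup of that first table boiled down to two statements along these lines (the table name below is a placeholder for illustration only, since the real object name was specific to the client's schema):

-- Hypothetical table name, for illustration only
TRUNCATE TABLE SiteActivity.HistoricalUserLogins   -- minimally logged, releases all data pages
GO
DROP TABLE SiteActivity.HistoricalUserLogins       -- removes the now-empty table definition
GO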

The second table, called SiteActivity.UserLogins, presented an interesting challenge. This table contained 560 million rows, of which 500 million were older than their privacy statement limit of 90 days. Since GDPR dictates that you must store data for the shortest time possible, those 500 million records were in violation of the regulation and our client was vulnerable to significant fines. To make matters worse, every time a user logged into their system, the system wrote a record to the UserLogins table.
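To put a number on the exposure, a quick count along the lines of the sketch below is enough to show how many rows sit outside the retention period (this uses the SecuritySchema.tbApplicationUsers name the table carries in the scripts later in this post, and assumes the 90-day window is evaluated against the LastLoginDate column):

-- Sketch: size up the rows that fall outside the 90-day retention window
SELECT COUNT_BIG(*) AS ExpiredRows
FROM SecuritySchema.tbApplicationUsers WITH (NOLOCK) -- dirty read is acceptable for a rough estimate
WHERE LastLoginDate <= GETUTCDATE() - 90
GO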

Problems deleting historical data from a large, highly concurrent database

So, what could possibly go wrong with the following simple SQL statement?

DELETE FROM SecuritySchema.tbApplicationUsers WHERE LastLoginDate <= GETDATE() - 90

Essentially, for a delete of this size SQL Server escalates its row locks to an exclusive lock on the whole table and holds it until the transaction completes. This means that no one can log into the system until SQL Server finishes deleting 500 million rows. In addition, SQL Server fully logs the operation in the transaction log. To make the problem even more interesting, the database participated in an Always On Availability Group with cross-region replication on AWS EC2 instances. In essence, the application would die a certain death as the EBS volumes supporting the database on the primary and secondary replicas exceeded their IOPS thresholds!

As part of their research on the topic, our developers came across a blog post titled Deleting Historical Data from a Large Highly Concurrent SQL Server Database Table that looked promising. The author pointed out that deleting row by row, using a WHERE clause on the primary key, avoids blocking issues: SQL Server processes each single-row delete very quickly and never holds locks long enough to block other sessions. The author's proposed solution generates a large SQL script, containing one DELETE statement per row, which SQLCMD then executes.
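For illustration, the generated script looks roughly like this sketch. The single-column primary key (UserLoginID below) is our own placeholder, since the table in our engagement did not expose one; the referenced post builds the script dynamically and then runs it through SQLCMD:

-- Sketch: generate one single-row DELETE per expired row, keyed on a
-- hypothetical primary key column named UserLoginID
SET NOCOUNT ON
SELECT 'DELETE FROM SecuritySchema.tbApplicationUsers WHERE UserLoginID = '
     + CAST(UserLoginID AS varchar(20)) + ';'
FROM SecuritySchema.tbApplicationUsers
WHERE LastLoginDate <= GETDATE() - 90
GO
-- Save the output as a .sql file and execute it with SQLCMD, for example:
--   sqlcmd -S <server> -d <database> -E -i DeleteExpiredLogins.sql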

Since we had a pre-production environment, our development team gave the approach a shot. It soon became clear that the author had never tested this approach with 500 million delete statements: our team estimated it would take almost two days to complete the operation. To make matters worse, we still needed to create a new non-clustered index for the remaining 60 million rows, which could have caused blocking issues for new transactions all over again.

Another approach for deleting historical data

One of our solution architects proposed a solution that could potentially avoid the blocking problem, limit extra disk activity, and reduce network traffic between the primary and secondary replica servers.

The solution goes something like this:

  • Create a clone _Temp table with the exact same structure, including the column names, data types, and foreign key constraints.
  • For this empty _Temp table, create indexes identical to those on the original table, and then add the brand new non-clustered indexes (per the new requirement).
  • Run an INSERT with a SELECT statement to copy the last 90 days’ worth of data from the current table into the new _Temp table.
  • Rename the tables to swap their places. The current production table becomes the _Old table and the _Temp table becomes the active production table.
  • Take the delta data that came into the _Old table while we were renaming it and insert it into the new production table.

By using this approach, inserting 60 million records into the _Temp table, along with the rename process, took only 30 minutes. Since we had already created all of the indexes on the empty table, we were in business as soon as the insert statement completed. We experienced a brief hiccup during the rename sequence: for about one second during the rename there was a single failure, but the application's retry logic ensured that every transaction made it into the system. As soon as we renamed the _Temp table to the original table name, it became active and started accepting new inserts from the application. The only thing left was the data that had found its way into the old production table after the insert completed but before the tables were renamed. We inserted that data into the new active table and we were done.

A closer look at the procedure

For those who want to dive deeper into the technical process, we prepared a step-by-step guide with the code samples for the second scenario, which proved to be the most efficient for us. So follow the White Rabbit.

1. Create the stored procedures that copy data into the _Temp table and, later, from the _Old table.

IF OBJECT_ID ('[dbo].[PreRenameMigration_tbApplicationUsers]') IS NULL
EXEC('CREATE PROCEDURE [dbo].[PreRenameMigration_tbApplicationUsers]
@StartDateTime datetime
,@EndDateTime datetime
AS
'
)
GO
ALTER PROCEDURE [dbo].[PreRenameMigration_tbApplicationUsers]
@StartDateTime datetime
,@EndDateTime datetime

AS
BEGIN
/*------------------------------------------------------------------------------
PROCEDURE: PreRenameMigration_tbApplicationUsers

PURPOSE:
To insert existing records from SecuritySchema.tbApplicationUsers to what will be the new table, SecuritySchema.tbApplicationUsers_Temp

EXAMPLE:
DECLARE @EndDate datetime = GETDATE()
Exec PreRenameMigration_tbApplicationUsers @StartDateTime = '1900-01-01 00:00:01.000', @EndDateTime = @EndDate

-----------------------------------------------------------------------------*/

SELECT @StartDateTime AS 'Start Date', @EndDateTime AS 'End Date\Start Date for Post inserts'

INSERT INTO SecuritySchema.tbApplicationUsers_Temp
SELECT * FROM SecuritySchema.tbApplicationUsers
WHERE LastLoginDate BETWEEN @StartDateTime AND @EndDateTime

END
GO

IF OBJECT_ID ('[dbo].[PostRenameMigration_tbApplicationUsers]') IS NULL
EXEC('CREATE PROCEDURE [dbo].[PostRenameMigration_tbApplicationUsers]
@StartDateTime datetime
,@EndDateTime datetime
AS
'
)
GO
ALTER PROCEDURE [dbo].[PostRenameMigration_tbApplicationUsers]
@StartDateTime datetime
,@EndDateTime datetime

AS
BEGIN
/*------------------------------------------------------------------------------
PROCEDURE: PostRenameMigration_tbApplicationUsers

PURPOSE:
To insert existing records from the now renamed SecuritySchema.tbApplicationUsers_old to the new renamed table, SecuritySchema.tbApplicationUsers

EXAMPLE:
Exec PostRenameMigration_tbApplicationUsers @StartDateTime = '2018-09-26 13:00:00.000', @EndDateTime='2018-09-26 13:32:00.000'
-----------------------------------------------------------------------------*/

SELECT @StartDateTime AS 'Start Date', @EndDateTime AS 'End Date'

INSERT INTO SecuritySchema.tbApplicationUsers
SELECT * FROM SecuritySchema.tbApplicationUsers_old
WHERE LastLoginDate BETWEEN @StartDateTime AND @EndDateTime

END
GO

2. Create the _Temp table with the required constraints and indexes.

IF OBJECT_ID('[SecuritySchema].[tbApplicationUsers_Temp]') IS NULL
BEGIN
CREATE TABLE [SecuritySchema].[tbApplicationUsers_Temp](
[LastLoginDate] [datetime] NOT NULL,
[ApplicationID] [INT] NOT NULL,
[UserName] [nvarchar](40) NOT NULL,
[IpAddress] [nvarchar](50) NOT NULL,
[Success] [bit] NOT NULL,
[SessionID] [nvarchar](64) NULL,
[StatusID] [INT] NULL)
END
GO

IF NOT EXISTS(SELECT * FROM sys.sysobjects WHERE name = 'tbApplicationUsers_Temp_StatusID_FK')
BEGIN
ALTER TABLE [SecuritySchema].[tbApplicationUsers_Temp] WITH CHECK ADD CONSTRAINT [tbApplicationUsers_Temp_StatusID_FK] FOREIGN KEY([StatusID])
REFERENCES [dbo].[tbStatusDescription] ([StatusID])
END
GO
IF EXISTS(SELECT * FROM sys.sysobjects WHERE name = 'tbApplicationUsers_Temp_StatusID_FK')
BEGIN
ALTER TABLE [SecuritySchema].[tbApplicationUsers_Temp] CHECK CONSTRAINT [tbApplicationUsers_Temp_StatusID_FK]
END
GO

/* Clustered Index */
IF NOT EXISTS(SELECT * FROM sys.indexes WHERE name = 'IX_LastLoginDate' AND object_id = object_id('[SecuritySchema].[tbApplicationUsers_Temp]'))
BEGIN
CREATE CLUSTERED INDEX [IX_LastLoginDate] ON [SecuritySchema].[tbApplicationUsers_Temp]
(
[LastLoginDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
END
GO

/* First non-clustered index */
IF NOT EXISTS(SELECT * FROM sys.indexes WHERE name = 'IX_LastLoginDate_Application_UserName' AND object_id = object_id('[SecuritySchema].[tbApplicationUsers_Temp]'))
BEGIN
CREATE NONCLUSTERED INDEX [IX_LastLoginDate_Application_UserName] ON [SecuritySchema].[tbApplicationUsers_Temp]
(
[LastLoginDate] ASC,
[ApplicationID] ASC,
[UserName] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
END
GO

/* Second non-clustered index */
IF NOT EXISTS(SELECT * FROM sys.indexes WHERE name = 'IX_LastLoginDate_UserName' AND object_id = object_id('[SecuritySchema].[tbApplicationUsers_Temp]'))
BEGIN
CREATE NONCLUSTERED INDEX [IX_LastLoginDate_UserName] ON [SecuritySchema].[tbApplicationUsers_Temp]
(
[LastLoginDate] ASC,
[UserName] ASC
)WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80)
END
GO

/* Third non-clustered index */
IF NOT EXISTS(SELECT * FROM sys.indexes WHERE name = 'IX_UserName_LastLoginDate' AND object_id = object_id('[SecuritySchema].[tbApplicationUsers_Temp]'))
BEGIN
CREATE NONCLUSTERED INDEX [IX_UserName_LastLoginDate] ON [SecuritySchema].[tbApplicationUsers_Temp]
(
[UserName] ASC,
[LastLoginDate] ASC
)WITH (PAD_INDEX = ON, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80)
END
GO

3. Grant permissions to this table.

GRANT INSERT ON object::SecuritySchema.tbApplicationUsers_Temp TO Role_UserManager
GRANT REFERENCES ON object::SecuritySchema.tbApplicationUsers_Temp TO Role_UserManager
GRANT SELECT ON object::SecuritySchema.tbApplicationUsers_Temp TO Role_UserManager
GRANT UPDATE ON object::SecuritySchema.tbApplicationUsers_Temp TO Role_UserManager
GO

4. Run the insert with select via the stored procedure to copy the last 90 days of data into this table.

DECLARE
@StartDate datetime = GETUTCDATE() - 90
,@EndDate datetime = GETUTCDATE()
EXEC PreRenameMigration_tbApplicationUsers @StartDateTime = @StartDate , @EndDateTime = @EndDate

5. Monitor the Availability Group replication traffic, the size of the database transaction log, and the IOPS usage reported in Splunk.

This monitoring was important to make sure we didn’t fill up the transaction log and that we stayed within the IOPS limits of the io1 storage.
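While the insert ran, a couple of lightweight checks like the sketch below gave us the numbers on the SQL Server side (the IOPS figures themselves came from Splunk rather than from SQL Server):

-- Transaction log usage for every database on the instance
DBCC SQLPERF(LOGSPACE)
GO
-- Log usage for the current database only
SELECT total_log_size_in_bytes / 1048576.0 AS LogSizeMB,
       used_log_space_in_bytes / 1048576.0 AS LogUsedMB,
       used_log_space_in_percent           AS LogUsedPercent
FROM sys.dm_db_log_space_usage
GO
-- Send and redo queue sizes (KB) for the Availability Group replicas
SELECT ar.replica_server_name,
       drs.log_send_queue_size,
       drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
GO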

6. Rename the tables with the user login information. Run the following statements one at a time.

IF OBJECT_ID('[SecuritySchema].[tbApplicationUsers]') IS NOT NULL
AND OBJECT_ID('[SecuritySchema].[tbApplicationUsers_Old]') IS NULL
EXEC sp_rename 'SecuritySchema.tbApplicationUsers', 'tbApplicationUsers_Old'
GO

IF OBJECT_ID('[SecuritySchema].[tbApplicationUsers_Temp]') IS NOT NULL
AND OBJECT_ID('[SecuritySchema].[tbApplicationUsers]') IS NULL
EXEC sp_rename 'SecuritySchema.tbApplicationUsers_Temp', 'tbApplicationUsers'
GO

7. Verify that the new records are getting inserted into the new table via the service.

-- This will provide the enddate for proc PostRenameMigration_tbApplicationUsers
SELECT LastLoginDate, COUNT(*) FROM SecuritySchema.tbApplicationUsers
WHERE LastLoginDate > '' -- Coming back from the output of proc PreRenameMigration_tbApplicationUsers.
GROUP BY LastLoginDate
ORDER BY LastLoginDate

8. Run the insert with select via the stored procedure to insert the delta data that might have been missed during the table renames.

DECLARE
@StartDate datetime = '' -- Coming back from the PreEndDate from the output of proc PreRenameMigration_tbApplicationUsers.
,@EndDate datetime = '' -- Coming from the first date from script Check_Service_Integrity.sql.

EXEC PostRenameMigration_tbApplicationUsers @StartDateTime = @StartDate, @EndDateTime = @EndDate

9. Drop all the stored procedures from the database now that the archive is complete.

Like any good DBA, you want to remove the stored procedures you no longer need.
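A minimal cleanup script, using the same existence-check pattern as the scripts above:

IF OBJECT_ID('[dbo].[PreRenameMigration_tbApplicationUsers]') IS NOT NULL
DROP PROCEDURE [dbo].[PreRenameMigration_tbApplicationUsers]
GO
IF OBJECT_ID('[dbo].[PostRenameMigration_tbApplicationUsers]') IS NOT NULL
DROP PROCEDURE [dbo].[PostRenameMigration_tbApplicationUsers]
GO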

Becoming a perfect solution

Based upon the smooth experience in pre-production, we went ahead with this more involved approach in production. Just in case, while importing the data into the _Temp table, we ran the transaction log cleanup SQL Agent job every 2 minutes instead of every 15 minutes, so we were sure that the database transaction log would not fill up.

To make sure the problem doesn’t recur, our client now runs a SQL Agent job every day during their maintenance cycle that uses the simple DELETE statement to remove the much smaller daily volume of expired data.
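With only a day’s worth of expired rows to clear at a time, the original statement is now safe to run as-is during the maintenance window. A minimal sketch of the job step follows (whether the cutoff uses GETDATE() or GETUTCDATE() depends on how LastLoginDate is stored; the migration procedures above use GETUTCDATE()):

-- Daily cleanup: only about one day of expired rows remains to delete
DELETE FROM SecuritySchema.tbApplicationUsers
WHERE LastLoginDate <= GETUTCDATE() - 90
GO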

This is how the DB Best team achieved SQL Server database GDPR compliance by deleting millions of rows of customer data, with zero downtime and no disruption of business processes.

This is just one story out of hundreds of our successful collaborations. Stay tuned to hear about the rest! Contact DB Best to start your own data management or application development project.