When you tell SQL Server to create a regular clustered index on a table, it will completely rebuild your table. SQL Server will take your data rows and rearrange them on new pages. Your table’s pages become the leaf level of the clustered index, so sometimes we call the table a clustered table.
People usually think of clustered as meaning ‘sorted’, and for the most part I’ll go along with this. But sorting can be logical or physical, and there has been much confusion and misinformation, even from Microsoft’s own documentation.
This document says:[If you want to return your data in sorted order] it can be a good idea to cluster, that is physically sort, the table on that column to save the cost of a sort operation every time the column is queried.
It also says:
After the row with the first value is found by using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent.
Well, this certainly seems to imply that the sorting is physical sorting. Right after index creation, it may well be that the data is physically sorted. But what happens when you start adding rows so that new pages need to split, or deleting rows so that pages are removed? SQL Server does not reorganize the whole table just because you need to insert one new row that won’t fit.
Your data is guaranteed to be logically sorted, not physically. This means that SQL Server maintains pointers or other means of specifying data location that allows SQL Server to retrieve the data in order efficiently, but the data is not necessary stored on contiguous pages or in contiguous bytes on a page.
SQL Server keeps track of the order of pages by NEXTPAGE and PREVPAGE pointers within a page header. For example, if you have a clustered index on the LastName column in the Person table, there might be a page containing the LastName values from Price to Rodriguez and in the header the NEXTPAGE value might be a page number for a page that contains Rodriguez to Sai. The NEXTPAGE value might be one more than the page number for the page containing Price to Rodriguez, but it might not be. There is no guarantee. In the table I just looked at, Price to Rodriguez were on page 12027 and Rodriguez to Sai were on 12028. But, the next page after that was 12064, which contains LastName values Salah to Smith. So for some pages the logical order is also physical and for some it is not.
Figure 1 shows a section of a clustered table that is both physically and logically sorted. Each page has a pointer to the next page and a pointer to the previous page in sorted order. The page numbers are consecutive as we follow the pointers and the data values increase.
Figure 1: Clustered index pages that are both logically and physically in order
Figure 2 shows a few pages from a clustered table that is not physically sorted. The pages are still logically sorted in that the next pointer from each page points to the next page in sorted sequence. But the page numbers are not consecutive.
Figure 2: Clustered index pages that are only logically sorted
In addition to the pages not guaranteed to be physically contiguous, the rows on a page are also not guaranteed to be physically in order. On the page that has Rodriguez to Sai, Sackhoff might actually occur in the bytes immediately after the page header (in the first ‘slot’), to be followed by a row with Rosenthal. The rows on the page can actually be placed in any order, but SQL Server maintains a set of offset pointers at the bottom of the page in a structure called a ‘slot array’ that allow it to access the rows in sorted order. Again, the ordering is logical, meaning we can get to the rows in order via pointers or offsets, but not necessarily physical.
Figure 3 shows a page from a clustered table. All pages have a 96-byte header so the first row always starts at byte offset 96. If we assume each row is 100 bytes, the next row will start at 196, the next at 296 etc. The slot array shows the logical order of the rows. The row at offset 396 is the first one in sorted order (Rodriguez), the row at 196 is the second in sorted order (Rosenthal). To retrieve the rows in order, SQL Server will follow the offsets in the slot array, starting with the one at the very end of the page.
Figure 3 Rows on a page of a clustered table
Another common misconception that people have about clustered indexes, is that because the data is sorted, it is assumed the data will always be returned in that sorted order. So for the table with the clustered index on LastName, we should always get the data back in LastName order. You may execute a query 1000 times and get sorted results back every time, but this does not guarantee it will be returned in the same order for the 1001st execution. If you have to get data in a specific order, you need to use the ORDER BY clause.
Here is an example you can try in the AdventureWorks2014 database. (You can download a backup of the database here and restore it on any SQL Server 2014 or SQL Server 2016 instance.) I’m going to make a copy of the Person.Person table and build a couple of indexes on it. The clustered index is a composite index on LastName and FirstName. However, when I select just those 2 columns, they do not come back in order.
SELECT * INTO dbo.Person
CREATE CLUSTERED INDEX name_index ON dbo.Person(LastName, FirstName);
CREATE UNIQUE NONCLUSTERED INDEX ID_index ON dbo.Person(BusinessEntityID);
SELECT LastName, FirstName
Figure 4 shows the first few rows of my results:
Figure 4 Names are returned in no discernible order
Neither the LastName values nor the FirstName values are sorted. The order appears random.
But even if I couldn’t give an example that returned rows out of order, and every single query I ran returned the rows in clustered index order, we need to remember that it is not guaranteed, and another time we might NOT get the expected order. The only way to be sure we get the rows in the order we want is to use ORDER BY.
Next week, we’ll look at clustered COLUMNSTORE indexes and will need to rethink yet again exactly what ‘clustered’ means. (The week after that, I’ll show you how I actually look inside the pages to see the NEXTPAGE and PREVPAGE pointers, and the actual order of rows on a page.)