A Clustered Index is not another term for Primary Key, and more thought should be put into the key columns of the index than always allowing them to default to the PK.
First of all, the primary key is the main way you uniquely identify a row in a table, enforcing data integrity, while the clustered index is the order in which the table is stored and retrieved. Although they are typically the same column(s), they are completely separate concepts. They are the same so often that this should be your default, but they are different often enough that you should remember that's only a default and not a rule.
Let’s have an exception in mind as we go through the details.
- We have an Orders table which has OrderID as the PK, and 11 other columns including OrderDate, CustomerID, and Status.
- 90% of our queries are filtered by OrderID; these queries are typically already narrowed down to a single record, but sometimes we have a small handful of OrderIDs in a table where we're getting more than one at a time.
- When we query by a CustomerID we could be getting up to 5% of the table, and only getting four columns (CustomerID, OrderID, OrderDate, Status).
- Throughout the day, several people run queries to see what orders were placed for a day, week, or month. These queries vary in which other columns they filter by and how many columns they return, and several return practically every column in the table.

It seems very obvious to pick OrderID as the clustered index. It's an identity column, it's added sequentially, it's the Primary Key, and 90% of our queries are filtered by it. What more could we hope for, right?
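To make the discussion concrete, here's a minimal sketch of what this table might look like. Every data type here is an assumption; only OrderID, OrderDate, CustomerID, and Status come from the scenario above.

```sql
-- Hypothetical sketch of the scenario's Orders table; the data types
-- and the trailing columns are assumptions, not a real schema.
CREATE TABLE dbo.Orders
(
    OrderID    INT IDENTITY(1,1) NOT NULL,
    OrderDate  DATETIME          NOT NULL,
    CustomerID INT               NOT NULL,
    Status     TINYINT           NOT NULL
    -- ...plus the other eight columns from the scenario
);
```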
It's not what I'd pick, though.
OrderID is the Primary Key, and we're not even going to consider changing that. However, did you notice that the queries filtering by OrderID are only pulling back a couple records each, typically only one? That means the performance of those queries wouldn't be noticeably hurt if they had to do some key lookups to get the rest of the information. While it's true that a key lookup will add about 4 reads per row (assuming an index depth of 4), a handful of rows means that adds up to only 20 or 40 reads total. That's not an issue.
CustomerID may seem like a logical choice. We could be pulling back 5% of the table for a single customer, and that's a lot. However, the screen in our app and standard reports only require 4 columns, so we'll just make a covering nonclustered index on this. Since these queries consistently use only a third of the table's columns, it'll be quicker as a nonclustered index anyways.
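As a sketch, a covering index for those CustomerID queries could look like this; the index name is hypothetical, and the column list comes from the scenario above.

```sql
-- Covering nonclustered index for the 4-column CustomerID queries.
-- The index name is made up for illustration.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderID, OrderDate, Status);
```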
OrderDate is a little different. It’s rather common to pull back a lot of records, and it’s not too rare to ask for a lot of columns when we do that. This means we have four choices.
- Narrow Nonclustered Index – Do key lookups to get the other columns. They're only 4 reads each, but that's 200,000 reads if the query returns 50,000 rows.
- Covering Nonclustered Index – It would have to include practically every column to avoid key lookups.
- Clustered Index (or Table) Scan – Just let it read the entire table. It may be cheaper to read it this way even if you have a nonclustered index because those key lookups add up when you get too many rows.
- Make this the first key field of the clustered index – Quick seek with all your columns. There are other things to keep in mind, but this is looking good.
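For reference, the kind of query we're weighing these options against looks something like this; the date range and the exact column list are made up for illustration.

```sql
-- A typical "orders for a month" report query from the scenario;
-- the dates and columns shown are hypothetical.
SELECT OrderID, CustomerID, OrderDate, Status -- ...and often many more columns
FROM dbo.Orders
WHERE OrderDate >= '20160101'
  AND OrderDate <  '20160201';
```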
Ok, so we want to look into making OrderDate our clustered index. It's not a unique column, so if it were our only key field this would not be a unique index. That's something you want to avoid, because all indexes are unique; it's just a matter of whether you make it that way or SQL Server has to do it behind your back.
Here’s a rabbit hole that Lisa Jackson-Neblett (b) started me on a couple years ago while I was attending one of David Pless’s (b|t) classes.
A nonunique clustered index will get a new, hidden 4-byte column called the uniquifier that you can't do anything useful with. Its value is 0 when there are no duplicates, then 1 for the first duplicate, and so on. Ken Simmons (b|t) gets into the details with his post Understanding and Examining the Uniquifier in SQL Server. There, with that column added on, now it's unique!
A nonunique nonclustered index will use the clustered index's key fields to make its own key unique. See this in action in Kendra Little's (b|t) blog post How to Find 'Secret Columns' in Nonclustered Indexes. That means if you made OrderDate the only key field of your nonunique clustered index, then a nonunique nonclustered index with CustomerID as its only explicit key field would actually have three key fields, in this order: CustomerID, OrderDate, Uniquifier.
A unique nonclustered index like we’re talking about making on OrderID to enforce the primary key would still need the clustered index’s key, just not as key fields. Instead it would add OrderDate and Uniquifier as included columns.
Make it Unique
While there are times having a nonunique clustered index is a good idea, this isn’t one of them. In fact, almost any time you think it’s a good idea it’s because you’re missing something.
In this case it'd be easy to add a second key column of OrderID and call the index unique, eliminating the 4-byte uniquifier by adding what's probably a 4-byte integer column, leaving the clustered index key the same size. Even if it was a bigger company that needed a bigint for the column, at least you'd have a useful column.
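Put together, the change being described might be sketched like this. The index and constraint names are hypothetical, and creating the clustered index first avoids rebuilding the nonclustered indexes afterward.

```sql
-- Unique clustered index on OrderDate plus OrderID, created first so
-- the nonclustered indexes are only built once. Names are made up.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderDate_OrderID
    ON dbo.Orders (OrderDate, OrderID);

-- The primary key still lives on OrderID, just enforced by a
-- nonclustered index now.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID);
```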
The nonclustered index on CustomerID will now have the key columns CustomerID, OrderDate, and OrderID. That's not that big of a deal because you were planning on including those two columns anyways. You'll still declare the index with OrderID, OrderDate, and Status as included columns. SQL Server will promote OrderID and OrderDate to key columns, but you want them in your definition so they're still in the index if someone changes the clustered index down the road. The net effect is that the included columns of this nonclustered index just got 4 bytes smaller because you made the clustered index unique.
The nonclustered index on OrderID will still have just the one key field because it's unique, and it will include OrderDate automatically. Just like the index on CustomerID, the included columns are 4 bytes smaller now.
Now that you have these indexes, when you query by OrderID you have to do key lookups on the few rows you return. It’s more expensive than it was before, but the time this adds is usually not an issue because you’re talking about so few extra reads.
Since you’re sorted by OrderDate and it’s typical to want to know about your recent orders more than historical orders, you’re also being very efficient with your memory. It’s an advanced piece to look at, but worth mentioning. The first 10,000,000 leaf-level pages of the clustered index are historical data that is rarely read (probably not in cache), but the last 100,000 pages are more current and read a lot (in cache). Not only that, but each page has many orders that we need to have in cache, and that difference adds up. Yes, it’s true we would have had the same effect if we left OrderID as the clustered index, but it’s good to know we didn’t hurt anything.
With a clustered index on an incremental identity column it's always adding data to the end of the table, so page splits are rare. Yes, it's true that adding another page at the end of the table counts as a page split, but it's not the kind that slows us down. Changing the clustered index like this makes page splits a real concern, making it more likely that this index will need to be cleaned up by our index maintenance task.
We made the first key field OrderDate. While a new row isn't guaranteed to land at the end of the table, I'd expect that to be the norm, which avoids fragmentation. If, however, you have a single order in there with an OrderDate of 2099-01-01, you're doomed. Well, not doomed, but every new page added will split the last page 50/50 and write to the second page. Not only does that do more work splitting the page instead of just allocating a new one, it also leaves behind a 50% full page while starting you out on a page halfway to requiring another page split.
This isn’t a deal-breaker, but it’s another cost we have to keep in mind.
What was that again?
So the primary key wasn’t the best clustered index in this case because of how the table is queried. By the numbers it’s queried more by the primary key, but we had to look at what the queries were doing as well. Nothing beats knowing your data.
Although I just talked you through this without running the scripts and testing every piece, it does not mean you can make a change like this without testing (I'm not allowed to, I wrote it into my mental contract with myself). Get a typical load and run it in non-prod, make this change in non-prod, then run it in non-prod again. Feel free to measure twice and cut once; I hear that's a good idea.
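One simple way to measure in non-prod is to compare logical reads before and after the change, for example:

```sql
-- SET STATISTICS IO reports logical reads per table in the Messages
-- pane, making before/after comparisons easy. The CustomerID value
-- here is made up.
SET STATISTICS IO ON;

SELECT CustomerID, OrderID, OrderDate, Status
FROM dbo.Orders
WHERE CustomerID = 42;

SET STATISTICS IO OFF;
```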
They Can’t All Be Clustered
You only have one clustered index on your table because, well, it IS your table. However, you can have lots of nonclustered indexes. I will say that if you can tell me how many nonclustered indexes you can have then you’re doing it wrong, but you’re probably doing it wrong if you don’t have any, too. Look into my next post, Indexing Strategy, to start to get an idea of what you want to do with your nonclustered indexes.
The more you learn about indexing the more intelligent your decisions are going to be, so keep learning. Know the details of what’s in an index, why it’s there, and how it affects you. Practically everything you do with SQL Server is either modifying or retrieving data from an index.
Here are a couple things to think about.
The clustered index is the table itself ordered by the key columns, and you can pretty much think of it like a nonclustered index that automatically includes every other column in the table. Although you can have a table without a clustered index, it’s typically a heap of trouble you don’t want.
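If you want to check whether any of your tables are heaps, a quick query against sys.indexes will do it; an index_id of 0 means heap.

```sql
-- List user tables without a clustered index (heaps).
SELECT OBJECT_SCHEMA_NAME(i.object_id) AS SchemaName,
       OBJECT_NAME(i.object_id)        AS TableName
FROM sys.indexes AS i
WHERE i.index_id = 0  -- index_id 0 = heap
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1;
```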
Because all of the columns in the clustered index are at least in the leaf-level pages of all nonclustered indexes, any update to any of the key columns of the clustered index will be an update to all of the nonclustered indexes. I’ve seen times where this was an issue, I’ve seen times where it wasn’t an issue, and I’ve seen many more times where the clustered index key just isn’t on columns that get updated. In this case there may be rare times when an order date is changed, but nothing to worry about. Besides, most indexes on this table would want the OrderDate column in there anyways.
I may have left you with the impression that data is physically stored in the order of the key field, but that's not how SQL Server does it. Read Gail Shaw's (b|t) post Of clustered indexes and ordering to see what SQL Server is really doing. Basically, SQL Server knows which order to retrieve pages in, although they're not stored in that order because fragmentation happens. Also, the rows on each page aren't actually stored in order, either.
A couple days after I originally released this post, Matan Yungman (b|t) released When Should You Use Identity as a Clustered Index Key?, which looks at this from the insert side. It's different from my discussion, but still the Clustered Index. View things from as many sides as possible if you want the best answer possible…read his post!
I’m sure there’s more to say about the key columns of a clustered index I’m not thinking about right now. Let me know in the comments below and I’ll add to this list as needed.
Another example of when you (usually) don't want to use the identity column as the clustered index would be a detail table such as an OrderPositions table, because you'll usually query that table filtered by OrderID and only very seldom by OrderPositionID.
Of course, you should be aware that this can lead to page splits. Orders and their positions are usually inserted in a limited time window, so it's not a big deal here, but if you have a CustomerPhoneNumbers table that is (or should be) clustered by CustomerID, you'd face this more often and should consider using a lower fill factor (depending on how often new phone numbers are inserted).
These are some good points, and it all goes to show that you need to be able to find patterns, understand how the data is used, and know some of how SQL works in the background to really make a database efficient.