Indexes – Understanding basic types and their components

The biggest problem developers and newer DBAs have with understanding indexes is that you don’t realize when you’re using the exact same thing away from your computer.  Pick up any reference style book and you have one clustered index and one nonclustered index. The clustered index is also split into the b-tree (table of contents) and the leaf levels (actual book, which is also why you can only have one clustered index).  The nonclustered index in the back of the book is, however, very basic.

Here’s how the clustered index to table of contents comparison works.  The table of contents will tell you exactly what page to start on for a specific subject in the book.  The b-tree of a clustered index is the table of contents, and will tell SQL Server exactly what page to start at to look at a specific value or range of values you’re looking for.  Then when you turn to that page in the book you can flip to the next page to continue reading until you found everything you’re looking for.  SQL Server does the same thing where one page tells you where the next logical page is, which is typically the next physical page on the disk.  If you get too many instances where the next physical and logical pages aren’t the same then it’s time to rebuild the index to fix your fragmentation.

The key columns in the clustered index are best viewed in the context of a phone book.  The clustered index there has two key fields, last name and first name, in that order.  The “in that order” is more important than people newer to indexing would guess, and here’s why.  If I asked you to find every person with the last name “Hood” in a phone book, you’d flip through until you found the H’s and find what you were looking for rather quickly.  On the other hand, if I asked you to find every person with the first name “Steve” then you’d get mad at me and I’d have an impression of a phone book on my head.  SQL Server uses more technical terms, where it will seek (clustered index seek) for the last name, and scan (clustered index scan) the entire table for the first name.  Luckily, SQL Server rarely gets mad and just does the clustered index scan reading the entire table.

A nonclustered index follows the same rules for the key columns, and if you’re not searching by the first key column then you’re reading the whole index.  A nonclustered index, however, is a bit different from a book in that it has more information.  In the book example you have a single key field and it automatically includes the clustered index key (the page number) as a key field so you can look up the rest of the information.  In SQL Server, your nonclustered index can have more than one key value and can include more information than just the clustered index key field(s).  The clustered index key will always be implicitly added to every nonclustered index, and SQL Server will use that information to both look up the any fields not in the nonclustered index and ensure each row is actually unique, even in a nonunique index.  Since you’re duplicating all the key fields of a clustered index like this, it really makes you think twice before making a wide clustered index that has multiple key fields. Kendra Little did an amazing job at taking a deep dive into this on her post “How to Find Secret Columns in Nonclustered Indexes“.

Why would you want more than one key field?  If you have a query where it’s looking at the employee table for the entire federal government and you constantly run queries looking for people according to what part of the government they work for, their last name, and gender, then a single key field on a nonclustered index will leave you searching through hundreds of thousands of records.  On the other hand, if you have three key fields in the order I listed above, you’d easily get down to the mere hundreds of records that work for the Army, have a last name of Hood, and are male very quickly.

In addition, you can include fields in a nonclustered index.  There’s two reasons for this.  First, you may be able to filter down your results even further.  Second, you may not have to flip through your clustered index (key lookup for those looking at your execution plans) to get the rest of the information you require.  Say in the example above you only had the key fields for the department and last name in your index, but you included the gender column.  The nonclustered index would still store the gender, but it wouldn’t store it in order.  Therefore SQL would have to read through all the records where the department and last name were the same, but at least you would be able to filter it out without going back to the clustered index.

If the only thing returned by this query outside of those fields was the salary of the employee, you could also include that field in the index.  This is called a covering index, because the index covers all of your needs.  That means that your entire query would show up in an execution plan as an index seek, no key lookups.  This is awesome, and here’s why.  In my book context above, this is the difference between looking at the index in the back of the book and seeing that the information you need is on pages 2, 5, 8, 11, 45, 88, 128, 224, and 455, or looking in the back of the book and seeing all of the information is right there.

At this point you may be thinking this is the best thing ever, include everything, make everything a key field.  Picture what that would do to the size of a book if absolutely everything was included in the index, it’s be as big as you’re about to make your SQL tables.  Not only that, but while a book may have static values after it’s published, a SQL table keeps on changing.  Every time you change a field that’s in an index (key field or included field) you have to change that index as well.  If you have 5 indexes that have the salary column and you want to give someone a raise, you’re really doing 6 updates.  One clustered index, and 5 nonclustered indexes.

I think all of us have heaps out there…did I ignore them in this post?  No.  Every table has a clustered index as far as I’m concerned, it’s just handled implicitly (heap) or explicitly (clustered index).  A heap has a RID, which is a unique number for each row in a table, and it’s implicity included in every nonclustered index on a table.  If anyone can point out an advantage to a RID over a clustered index on a bigint identity column, I’d like to hear it.  The only difference I can point out is that you can use the bigint column to reference that row from another table, where you can’t do that with a RID.

Also, another pet peeve of mine, every clustered index is unique.  Even if you create one by saying “CREATE NONUNIQUE CLUSTERED INDEX”, it’s unique.  SQL does this by including a uniquifier (it’s a real word, even if spell check disagrees) that has a 0 for the first value and is incrementally higher for each duplicate value.  This, like a heap, puts an overhead on SQL Server that neither you nor your developers can take advantage of.  My answer, make sure every table you care about has a unique clustered index, and make sure you care about every table that has over 50 rows.