Indexing Strategy

What do I care about when I’m playing with indexes? That’s easy. I want as few indexes as possible efficiently referenced by as many pertinent, well-tuned, consistently written queries as is reasonable. It’s explaining that last sentence that’s the hard part.

The thing that will jump out to most people is that my goal doesn’t mention a specific, single query that you want to run great.  Although that’s something I hope to achieve, it only becomes a priority as a last resort.  I’m more concerned with already having that data in memory because the index is being used by many queries, and also having fewer indexes to make data modifications more efficient. There’s more to it than that, but the detail belongs in the body of this post, not the intro.

If I were writing about how to make the perfect index for a single reference to a table in a single query, this post could be done rather well in a couple of paragraphs. Even though I’m focusing only on OLTP (ruling out columnstore indexes), in 99.999% of environments (ruling out in-memory hash indexes), and not getting into the details of filtered indexes or indexed views, there’s still a lot to consider, to the point that the first few paragraphs will just be what we’re going to keep in mind. I hope you didn’t have other plans today…

Does this advice apply to you?

It depends! Just kidding, I hate that (non) answer.

There are two target audiences for this.  While it’s useful to everyone, you’d have to hit both of them for this to be perfect for you. The first target is the person: this is written for someone who is comfortable working with indexes for single queries but wants a better view of the big picture. The second target is the database, which is a rather typical in-house OLTP database:

  • More data than you have memory
  • Writes throughout the day, especially in your larger tables
  • Read and write performance matter more than disk space
  • No extreme write loads, such as truncating and repopulating your largest table (easy fix, disable and rebuild your indexes around this action)
  • You have the ability to make indexing changes (this goes beyond what you can typically do with vendor databases)
  • Hopefully, you have the ability to make some code changes

If you or your database aren’t the perfect audience, don’t worry about it.  The most important things to know are what to keep in mind and how those things are interconnected.  The real goal is making more intelligent decisions for your databases, not fitting into a specific mold.

PreReqs!!!

Because this is an advanced look at the fundamentals of indexing strategy, you shouldn’t take offense if you have to do some prerequisite work for all of this to come together. If I give you a severe headache, consider reading this stuff first; the book on the list is well worth a second read cover-to-cover.

PreReqs:

Even with that I’ll probably still give you a headache (a common theme for me), and you’ll probably have questions. Keep in mind that some of the greatest compliments you can give someone are asking them a question and offering them large sums of cash. Although I don’t hand out my LLC’s address to send checks to unless I’ve actually looked over your indexes, I do have a free comments section below that I’d like to see used more often.

Something to consider

Here are all the things we’ll be considering. This is a great list, but nothing is ever going to be 100% all-inclusive or perfect.

Tune your queries: The ideal set of indexes for a poorly written query may be 100% different than the same query after it’s refactored.  We don’t want to constantly tinker with indexes, so this process works best if you tune your most expensive queries first.  An old, unpublished concept for this post had creating an index as a 13-step process, with steps 1-11 avoiding indexes through tuning, step 12 making an index, and step 13 seeing if you could delete any other indexes.  That really wasn’t too different from Brent Ozar’s (b|t) Be Creepy method.  Indexing is not the only answer, and it’s not the first answer either.

Query Importance: Some queries just need to complete, some need to run decently well, and some need to run as close to instantly as possible. Knowing how critical your query is will weigh in on how you index, but no query, not even the most critical one, should have its performance be the only deciding factor in what indexes you have. Some outliers will require you to go back and create a specific index for them as a last resort, but there’s no reason to concern ourselves with last resorts before we get started.  If an index would work, even somewhat inefficiently, and it’ll already be in cache, do we want to create another index that will fight it for space in cache?  Don’t create another index to get your most critical query to 0.1 seconds when it’s running at 0.2 seconds and the business is happy with anything under 1.0 second.

Query Consistency: Are you querying the table the same way in all of your queries? If not, do you know you are stressing me out? Do you not care?  Joining on the same columns where possible matters, even if you could live without one of them in some cases, because consistent queries mean index reusability.

Query Frequency: Some queries run five times a second, some run once a year. They aren’t even close to the same, and shouldn’t be treated the same.

Query Timing: It is different from frequency. Once-a-day is not just once-a-day. An “8-5 is critical” shop where all your users are sleeping at 3 AM means that we care much less about collateral damage from a 3 AM query. If we do a table scan on a huge clustered index at 3 AM that kicks everything out of cache it might not matter much, but do that same thing at 3 PM and we may want to consider an index even if it’s only used for a single query.

Query Justification: That query timing example threw up a red flag for me. Do we need to run that query at all? Does it need to run in prod, or is there a reporting database I can run it against? Should I consider making a reporting database? Does it need to run at 3 PM? Question the outliers that would change your indexing strategy, both on whether they need to run at all and on whether they could use a little T-SQL help.

Insert / Update / Delete performance: The more indexes you have, the slower your data modifications will be because they have to be written more than once. Wider indexes are even more overhead: updates hit them more often, there’s a larger index to maintain, and there are fewer rows per page of data.

Reusability: How many queries can use this index, and how will each of them use it? Is a query using it differently because it’s not referencing the table consistently or because it’s legitimately doing something different? This part is HUGE, and is really going to be a major focus. I didn’t give you a hard time on that query consistency point asking if you cared for no reason!

Memory usage: How much memory is being used, where is it being used, and why? Is that memory being used to fulfill multiple queries (see Reusability, which references Query Consistency, which goes back to Query Tuning)? Could we use less memory if we had a decent index? Is the query that requires all that memory justified and timed properly? These points are starting to mix together and reference themselves, aren’t they? Is indexing strategy an infinite loop?!?!? Yes, yes it is.

Key Lookups: For the queries that use this index, is this a covering index? If not, what would it need to be a covering index? We’ll look at these things: how critical is each query, how often is that query run, how many key lookups does it do, how wide are the total columns we would need to add to be covering, how often are each of those columns updated, what other queries would take advantage of having those columns in there, and is there any filtering being done on those columns?

Maintenance: It’s easy to see that having fewer, narrower indexes would make index rebuilds, index reorgs, and database backups quicker and easier. How about key column order and compression?

TDE: What’s this feature doing in an indexing article?

SQL Edition: Index compression is going to be the biggest one for us today. Online rebuilds can make a big difference, too, but they rarely change what indexes you want.

Pick a table, any table

We’re not going to change the entire database at once.  Partially because it’s overwhelming to you, but mostly because it’s lower risk and easier to troubleshoot and roll back if needed. So we’ll pick a single table that we want to have run more efficiently, make a change or two to it, then do it again with either the same table or a different one.

I’m not consistent on how I pick tables.  I usually pick one of the largest tables in the buffer pool that I haven’t already made as efficient as I could, which you can find using the query in my Cleaning Up the Buffer Pool post.  However, that’s not always how I pick a table.  I’ll also start this off with a query that I wish was running faster, run it with SET STATISTICS IO, TIME ON to see what tables are getting hit in the slowest part, and work on a table that’s getting hit inefficiently there.  There’s no magic to it; just pick a table and reassure the other tables that they’ll get their turn later.
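
If you just want the gist of that buffer pool query without leaving this page, here’s a minimal sketch of the idea (the version in that post does more, and every name here is generic). It groups the data cache by table and index for the current database:

-- A minimal sketch, not the full query from the Cleaning Up the Buffer Pool post.
-- Run it in the database you care about; LOB allocation units are skipped for simplicity.
SELECT OBJECT_NAME(p.object_id) AS table_name,
       i.name                   AS index_name,
       COUNT(*) * 8 / 1024      AS cached_mb
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units         AS au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions               AS p  ON p.hobt_id = au.container_id
                                       AND au.type IN (1, 3)  -- IN_ROW_DATA, ROW_OVERFLOW_DATA
JOIN sys.indexes                  AS i  ON i.object_id = p.object_id
                                       AND i.index_id  = p.index_id
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id, i.name
ORDER BY cached_mb DESC;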

It looks like an infinite loop at first glance because I’ll keep picking tables and loop back to revisit tables later, but it’s better to think of it as an upward spiral. That first trip around will give you all kinds of awesome, the second trip will add on to that, and each trip around yields less of an improvement. You could stop wherever you wanted if it wasn’t addictive.

Also, as the size of your data changes, the queries hitting your database change, and more, it’s unreasonable to expect your indexing strategy to stay 100% the same.  This isn’t a job where you can ever say you’re really done, just in a better place than you were last week.

I have a table, now what?

At the times of day you want your database to perform great, what’s happening with your table? This may be anything that ever runs on the database for some places, and it may be anything that runs between 8 AM and 5 PM for others.

I’m being as all-inclusive as possible by looking at everything that touches the table, so this won’t be as quick and easy as you’d think. Yes, my target audience for this post can create an index for a single query in minutes while I typically spend well over an hour on a single table; how fast you make it through this project isn’t my primary concern.

Once you’ve picked a table to work on, look in the proc cache to see what references the indexes on that table.  My query to do that in the same Cleaning Up the Buffer Pool post is good for this, but the one in Querying the Plan Cache is better for viewing an entire table at once. This has every cacheable plan that ran since the last restart of services and hasn’t been forced out of memory. Remember, this is a really CPU-intensive query that will take several minutes to run and needs to run against prod at a time of day you care about to provide what you need. If you have an extremely busy hour or two, run this as soon as things start to calm down.
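
If you don’t have those posts handy, here’s a minimal sketch of the core idea; the real scripts do far more, like pulling the predicates out of the XML for you. The index name below is a placeholder, and note that plans store index names with the brackets:

-- Hedged sketch: find cached plans that reference a specific index.
-- This shreds plan XML for every cached plan, so it is CPU intensive.
DECLARE @IndexName sysname = N'[IX_Orders_Customer_Status_OrderDate]';  -- hypothetical name, brackets included

;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (50)
       DB_NAME(st.dbid)                  AS database_name,
       OBJECT_NAME(st.objectid, st.dbid) AS object_name,
       cp.usecounts,
       qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle)   AS st
WHERE qp.query_plan.exist('//Object[@Index = sql:variable("@IndexName")]') = 1
ORDER BY cp.usecounts DESC;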

Note: there are three different ways something can avoid showing up in the proc cache, and that’s not counting whether you turned on the typically recommended setting “optimize for ad hoc workloads” that Kimberly Tripp (b|t) wrote about, where you can miss the first run of ad-hoc queries in favor of keeping your memory cleaner. It’s also possible that a query is getting a different plan each time it gets compiled due to stats changing or parameter sniffing, but that affects us a little less since we’re going to make index changes that will change those plans anyways.

The proc cache query is also only capturing index usage. It does capture RID lookups, but not table scans against a heap. You’ll have to look at the modified scripts I put in Querying the Plan Cache to see the table scans because they’re stored differently in the XML.

For now, let’s focus on how things can sneak past our cache and how we can find them.

1 & 2: The query hasn’t run since the last restart of services or was forced out of memory. It can be in cache; it’s just not there right now. For that, we’re going to check back on the cache multiple times, and we’re also going to make one or two index changes at a time, which will again have us checking back multiple times.

3: Uncacheable plans can happen for several reasons, with the most popular in my experience being temp tables where data was loaded into it then an index was created on the temp table. Whatever the reason, start up extended events or a trace and watch for sql_statement_recompile to help hunt them down. Take this list and search for references of your table to know which ones are relevant. To get bonus points (I’m not keeping score), find out why it’s not getting along with your cache and see if it’s something that should be fixed.
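
If you want a starting point for catching those recompiles, here’s a minimal Extended Events sketch; the session name and file target are assumptions, so adjust them for your server before using it:

-- Hedged sketch: watch for statement recompiles so uncacheable or constantly
-- recompiling statements can be hunted down. Filter the results for your table.
CREATE EVENT SESSION [TrackRecompiles] ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
(
    ACTION (sqlserver.sql_text, sqlserver.database_name)
)
ADD TARGET package0.event_file
(
    SET filename = N'TrackRecompiles.xel'  -- lands in the default LOG directory
)
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);
GO

ALTER EVENT SESSION [TrackRecompiles] ON SERVER STATE = START;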

To make things a little more difficult in that step, you’ll also have to look for references to views and functions that reference the table. The views and functions will show up in my Proc Contains Text query, and you’ll have to iterate through that process again.

Keep in mind, this will never be perfect and 100% all-inclusive. I know I said that before, but I need some of the important details repeated to me before they sink in and I have to assume there are others like me. It will be very rare for this to pick up an ad-hoc query that runs for year-end processing. You can use your imagination to find 10 other ways you’ll miss something and still be shocked when a new way pops up.

However, we have enough to move forward, and we’re going to accept the rest as acceptable risk. If you don’t accept some risk you’ll never accept any rewards, it’s just a matter of reducing the risk and knowing enough to make an intelligent decision.

Now that you know what’s running, how is each one referencing the table? Looking at the proc cache, you’ll see predicates and seek predicates, which you’ll combine into a list. You’re going to have to run the stuff that didn’t make it into the proc cache manually on a test server and add it to the list as well.

Doing all of this is completely overwhelming.  The more you do, the more accurate your results will be, but it’s not actually reasonable.  Do what’s reasonable, understand that you’re trading off some level of accuracy for time, and also understand that if you don’t make that tradeoff then you’ll never have time for anything else…not even going home at night.

Here’s what the list could use:

  • Proc or name of SQL Batch
  • How important is it
  • How often does it run
  • When does it run
  • Predicates and Seek Predicates (let’s just call them predicates)
  • Equality columns
  • Range columns
  • Inequality columns
  • Columns returned
  • Rows returned

If there was a RID or Key Lookup on a reference to a nonclustered index, add the output columns and predicate (not the seek predicate for this case only) from the lookup on here as well.  The seek predicate on a lookup is just the clustered index key or RID bookmark added as hidden key columns on your nonclustered index; they will not help you here.

Now look at this list and look for consistencies. What equality predicates are used a lot? You should be able to find different groups of equality predicates that can accommodate most of your queries, and those are going to be the key columns you’ll consider for your indexes. The first key column is going to be the column that all of the queries you want to use this index have in common as an equality column; then keep iterating through the columns as they’re used less and less.

This is not the traditional “order of cardinality” advice I’m sure you’ve heard when creating an index for a specific query, but we’re designing an index for your database, not your query. I’ll go one step further and say that if it’s a toss-up as to what the first key column should be, consider making it one that’s added sequentially, such as DateAdded or ID, on tables that see more updates, because that will reduce page splits and fragmentation.

A query can take advantage of the chain of key columns starting with the first one. The chain can continue after each equality use. An inequality or range can take advantage of a key column as well, but the first one of these is the end of your chain. Once the chain is broken, everything else can still be useful, but only as unordered values, where it doesn’t matter whether they’re key columns or included columns.
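
Here’s a hedged sketch of that chain using made-up table, column, and index names. The point isn’t the exact index; it’s which predicates can be seek predicates given the key order:

-- Hypothetical index: two equality-friendly columns, then a date for ranges.
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Status_OrderDate
    ON dbo.Orders (CustomerId, StatusCode, OrderDate);

-- Uses the full chain: equality on CustomerId and StatusCode,
-- so the range on OrderDate is still a seek predicate.
SELECT OrderId
FROM dbo.Orders
WHERE CustomerId = 42
  AND StatusCode = 'SHIPPED'
  AND OrderDate >= '20160101';

-- Breaks the chain: no predicate on StatusCode, so the filter on OrderDate
-- can only be applied as a residual predicate, not a seek predicate.
SELECT OrderId
FROM dbo.Orders
WHERE CustomerId = 42
  AND OrderDate >= '20160101';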

You can stop putting in key columns when either queries stop being able to take advantage of them being ordered or the values you’re getting are either unique or close enough. These key columns aren’t free as Paul Randal (b|t) points out in his post On index key size, index depth, and performance.  If a key column is not very useful, then it’s very useful not to have it there.

I should note that if you’re using an index to enforce uniqueness then it will use all the key columns and none of the included columns to do so.  Based on the last paragraph you don’t want any key columns after it’s unique anyways, so don’t even consider that.  However, included columns aren’t used to calculate uniqueness, so you can make this a covering index if it helps you without hurting the unique constraint.

This process, like any other indexing process, isn’t going to be perfect. You’ll have to weigh your decisions, with queries that are more critical or called more often carrying more weight.

Now that you have your key columns figured out, look at the queries that use more than just those columns. When they reference this index how many rows are they going to return where they have to get more information from the table itself through a lookup? How wide are those columns, and how many other queries are going to do the same? This is the balancing act between adding included columns and accepting key lookups. A key lookup is going to be a nested loop operation making separate calls to get the missing columns from the clustered index (or heap, for those who wish to anger me), so 10,000 key lookups is 10,000 separate calls in a loop. If you have to add a large number of columns to eliminate 10 key lookups then it’s almost never worth it. If you have to add one small column to eliminate 1,000,000 key lookups then it’s almost always worth it. Notice I didn’t use determinate language on those…you’ll have to find your balance, test it, and balance it again.
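
As a hedged sketch of that balancing act, using the same hypothetical Orders index from above: if queries on this index keep doing lookups just to get one small, rarely updated column, covering it may be worth the extra width. DROP_EXISTING rebuilds the index in place with the new definition:

-- Add OrderTotal as an included column to eliminate the key lookups that
-- were only fetching that one column from the clustered index.
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Status_OrderDate
    ON dbo.Orders (CustomerId, StatusCode, OrderDate)
    INCLUDE (OrderTotal)
    WITH (DROP_EXISTING = ON);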

Some things like adding a column to avoid key lookups may make more of a difference to you than the user, but that doesn’t make it less important. For instance, I just said adding a small column to eliminate 1,000,000 key lookups is almost always worth it. If those 1,000,000 key lookups were from a single execution of a query then the user would probably notice, they might even buy you lunch if you’re lucky. If it was a single key lookup on a query run 1,000,000 times that day it’s still a drop in CPU utilization and a potential 1,000,000 pages from the clustered index that didn’t have to be read into cache. You’re doing this because it adds up to a better running server, not always because it’s noticed.

Your goal here is making an index as reusable as is reasonable. Here are the reasons you’re doing that:

  • Every index will fight to be in cache (assuming you don’t have vastly more memory than databases to fill it); an index that is reusable by many queries will be more likely to already be in cache, and that space in cache will be more versatile.
  • Every index is another write for every insert, update, and delete; you’re trying to cut down on those.
  • Every index needs to be maintained; you’re cutting down on that, too.
  • Every index adds disk space, backup size, backup duration, restore duration, etc.
  • If you use TDE, every time a page is read from disk into memory it is decrypted. A reusable index tends to stay in memory more, reducing the number of times the CPU has to decrypt it. See, I TOLD you this belonged in an indexing strategy post!

Nothing’s free, so here’s what you’re giving up:

  • The index isn’t going to be the perfect index for most queries. Test the performance of your critical queries, but also keep in mind that these indexes are more likely to be in cache, which could eliminate physical reads from the execution of those queries.
  • These indexes will tend to be wider than the query needs, which is basically restating that this isn’t going to be the perfect, most efficient index for a query. It also means that physical reads on these indexes will tend to be more expensive as there are fewer rows per page.  Again, keep in mind they’re more likely to be in memory because you’re going with fewer indexes that are shared by more queries.

Once you decide on an index or two to add, there are a couple things to consider.

  • What indexes don’t you want anymore? If a query could use another index slightly more efficiently, it will.  However, if it’s close enough then you want to get rid of that other index because of all those benefits of reusability I just mentioned (weren’t you paying attention?). It’s not a question of if a query would use the other index, it’s a question of if you want it to use it.
  • Some queries “should” use this index based on the key columns, but instead of it showing up as a seek predicate it shows up as a predicate. In these cases either your chain of key columns was broken (if column 2 wasn’t an equality column, column 3 will not be a seek predicate) or this column is not being referenced in a SARGable way.
  • Test in non-prod, not prod. Then test it again.
  • Know you’re accepting risk and understand everything involved the best you can. Have a healthy fear of that risk while also understanding that you took a risk just driving to work this morning.

Once these changes go through, keep an eye on how they’re being used over the next couple weeks. If you’re in a rush to make a big impact, start a second table as the first change or two are in progress on the first table. Just don’t get too many changes in motion for a single table at once, as that’s typically adding more risk and hiding which changes had positive and negative impacts. This is a process, and the longer it takes you to go through it the better the chance is that you’re doing it right.

If I’m doing this process for someone else who wants consistent improvement without taking on too much time or risk at once, then I like to get on their servers once or twice a month, find one or two changes to suggest, then have those go through testing and implementation.  Afterwards review the results and come up with the next suggestion.  It’s hard to be that patient as a full-time employee, but try.

Cluster It

All of that was talking about nonclustered indexes, but you get to pick a clustered index for your table as well.  Keep in mind this is a bigger change and involves more risk, but it’s also a bigger reward.

So, what do I care about that’s special for a clustered index?

  • Uniqueness
  • Key width
  • Width of columns being queried
  • Column types being returned (some can’t be in nonclustered indexes)
  • Number of rows being returned

The size of your key columns on your clustered index is the MINIMUM size of the key columns on a nonunique nonclustered index, and it’s also the MINIMUM width of the page level of any nonclustered index.  You need to keep that in mind.

However, just because your table has an identity column in it doesn’t mean that’s the best clustered index.  What is the best clustered index is going to vary wildly from table to table; there’s not always going to be a clear answer, and the answer will partially depend on how the table is queried.  I get into that a lot more in my last post, Picking a Clustered Index.  Yes, I wrote that post specifically to keep this one shorter…with mixed results.

If a table is often queried by a relatively small column that’s not unique, but the queries tend to pull back most of the columns in the table and a large number of rows then it’s worth considering using this as part of the clustered index key.

If you don’t, then you’re faced with two options: you can make a really wide nonclustered index to cover these queries, or you can let the queries decide if they want to do a ton of key lookups or just scan the clustered index.  Neither of those sounds like fun to me.

You still have to worry about the integrity of your data, so if you’re dropping the unique clustered index with a single column to do this then you almost definitely want to add a unique nonclustered index with that single key column to maintain your data integrity.
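
Here’s a hedged sketch of that kind of change with made-up names. It assumes the existing clustered index is a plain unique clustered index rather than a primary key constraint (if it backs a PK you’d drop and re-add the constraint instead, which is more involved), and remember that re-keying a clustered index rebuilds every nonclustered index on the table, so script it, test it, and schedule it:

-- Assume dbo.Orders currently has a unique clustered index named CIX_Orders on
-- OrderId alone, but the big queries filter by CustomerId and pull back most columns.

-- 1. Keep enforcing uniqueness of OrderId on its own.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Orders_OrderId
    ON dbo.Orders (OrderId);

-- 2. Re-key the clustered index the way the big queries filter, with OrderId
--    tacked on the end to keep the key unique.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders
    ON dbo.Orders (CustomerId, OrderId)
    WITH (DROP_EXISTING = ON);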

Compress It

Index compression is an Enterprise-ONLY feature.

Compression is a very big point to hit on here, even if I’m only giving you the compressed version.  It makes your data smaller on disk (less I/O), smaller in memory (less need for I/O), and actually tends to lower your CPU usage instead of raising it.  I get into a lot more detail in my Data Compression post because I didn’t want to have too much space dedicated to a feature not everyone can use here.
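
As a hedged sketch of what turning it on looks like (made-up names again, and this assumes an edition that supports it), estimate first, then rebuild with compression:

-- Estimate the savings first; it samples the data, so treat it as a rough guide.
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'Orders',
     @index_id         = NULL,   -- all indexes
     @partition_number = NULL,   -- all partitions
     @data_compression = 'PAGE';

-- Then rebuild the index you care about with page compression.
ALTER INDEX IX_Orders_Customer_Status_OrderDate
    ON dbo.Orders
    REBUILD WITH (DATA_COMPRESSION = PAGE);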

Don’t Forget the Outliers

Go back to that list you made of all the queries hitting a specific table. Were some of the queries different than the rest? There are usually a couple, and that doesn’t necessarily mean there’s an issue. However, I look at these to determine if they are using the table properly.

Are they joining on all the fields they should be? Sometimes you can get the correct results by joining on 3 of the 4 fields you technically should, so why join on the 4th? Well, index reusability is one reason, because the query may not be able to use the proper index if someone skipped a column that happens to be the first key field of the perfect index for this query.

Is the query SARGable? Sometimes you’re joining or filtering on the right fields, but something is written in a way that SQL couldn’t do a direct comparison.
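
A quick, hedged example of what I mean, using a hypothetical OrderDate column:

-- Not SARGable: wrapping the column in a function means SQL Server can't seek
-- on an index keyed on OrderDate, so it reads everything and filters afterwards.
SELECT OrderId
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2015;

-- SARGable: the same logic rewritten as a range on the bare column, which can seek.
SELECT OrderId
FROM dbo.Orders
WHERE OrderDate >= '20150101'
  AND OrderDate <  '20160101';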

Are you returning too many columns? I’ve seen queries returning 20 columns (or using *, which is a more obvious version of the same thing) to populate a screen that uses 3 of them, and on the SQL side you have a DBA trying to figure out if they should add included columns to an index to make that run more efficiently. The most efficient fix for this and many other examples is refactoring, not reindexing.

Remember, your goal is to make your server run more efficiently, and tweaking indexes is simply one of your tools. While you’re going through this process keep your eyes open towards how other tools can be used.  SSMS is never going to come up with a warning telling you that you should read a book or two by Itzik Ben-Gan (b|t) or Kalen Delaney (b|t), but I would welcome that change.

Does this negate my previous advice?

If you follow my blog at all, which is suggested in my very biased opinion, you may have seen me talk about Unused and Duplicate Indexes, but I make no mention of them here. Did I forget about them?

No, I did not. This is designing every index you want to have on your table in a reusable way. If that index was not on the list then you’ll want to consider getting rid of it. It’s two ways of looking at the same thing. A complete understanding of both of these methods will help you make intelligent indexing decisions and go as far as you need to for the situation you’re in.

Talk to me

This isn’t a short or easy process, and perhaps I could have worded some of it better.  I enjoy what I do, both writing this post and playing with indexes, and having someone think enough of me to ask me questions on this makes it all the more enjoyable.

I may be rewriting parts of this post as I find ways to reword concepts better, especially as I finalize and tweak my presentation with the same name for which this post is my guide. That presentation will make its debut at SQL Saturday Cleveland on February 6, 2016.

If you feel you can help me improve, please don’t hold back.  I’d rather feel that I’m improving than falsely believe I’m infallible.

Free SQL Server Training (not a sales pitch)

My company cut the training budget this year, and it’s not that uncommon for that to happen anymore. However, I’m still getting my week of training in, and more, without paying much.

Blogs of people who talk too much – FREE

There’s so much for free online, but you’re reading my blog so I don’t have to dive too deep here. Follow a couple blogs in an RSS reader, and make sure you sign up for freebies they may offer like Paul Randal giving away training!

Personally, I try to read a majority of everything written by:

You can include me on this list if you’re feeling generous. However, they’re all smarter than me, and I’ll understand if you don’t.

Local User Groups – FREE

I know most DBAs don’t go to the local user groups. Some never go, some go when the topic is right, but you can benefit from these every time.

Even the presentations that are outside of your normal job duties are valuable. If someone’s teaching SSIS and you don’t use it, knowing what you can learn from an hour-long presentation will put you in a situation to have more intelligent conversations when these topics come up. Read over job descriptions; they expect DBAs to know everything, and here’s your chance to know a little about it.

Also, you can learn from every presentation. I don’t care if you could give that presentation, people in a field known for introverts could use practice and examples of how to present ideas to others. You’ll see good, you’ll see bad, and you’ll be better because of it. Also, one or two small details you might pick up on a topic you know extremely well will make a big difference.

Speaking of our field being known as introverts, this is your opportunity to step out of your comfort zone and present. That’s what I did. Some present because they were born to do it, and I started presenting because I was born to blend into a cube farm! I’ll be presenting at my local user group in April, have already done a SQL Saturday this year, and would like to do at least two more SQL Saturdays, which drags me (kicking and screaming) into improving my soft skills where a typical DBA, like me, is behind the curve.

Our last presentation by Bill Wolf was an excellent presentation on avoiding common coding mistakes. It wouldn’t have been too much of a stretch for me to make a similar presentation, but I still learned a couple things and met some good people. The next meeting on March 25th will be about the interdependency between SQL Server and .NET, which would be amazing for any DBA or .NET developer to go to, even if Dell wasn’t giving away free laptops as door prizes!

Finally, the presentations aren’t the real reason you’re going there, are they? You’re getting to know the other DBAs in the area, and you’re sitting there talking person-to-person to the ones that want to share everything they know with you. You get to know them, they get to know you, and when one of you wants a new job or new coworker then you have a better chance at finding a good fit that you can be comfortable with from the start.

If you’re not sure where your local user group is or if you even have one, start by looking at this link.  Most groups are affiliated with PASS, and that means they’ll be listed there.

SQL Saturday – $10 Lunch

SQL Saturday is a cross between the local user group and the PASS Summit. In fact, the last one I was at has been referred to as a mini PASS Summit because of the number of high-quality speakers there, including multiple names from the list of bloggers I mentioned above.

The cities that host these events typically do so as an annual event, and there’s probably one not too far from you each year. I live in Pittsburgh, PA, which hosts one just a couple minutes from my house. I’m also trying to make it to ones reasonably close to me, such as Cleveland, OH a couple weeks ago, which is about 2 hours away, and Philadelphia, PA coming up in June, which is a 5-hour drive and where I have family I can stay with. I think I’ll skip Columbus, OH because it’s the week after Philly and I have three small kids. Although I won’t make it this year, I may go to Richmond, VA, which is a 6-hour drive where I also have family to stay with.

It’s been a cold winter here in Pittsburgh, and now I’m starting to think about how much I miss my relatives in Texas, California, Florida and South Carolina. It knocks this idea out of the free or low cost training blog post, but that doesn’t mean it’s a bad idea.

While my company did cut the budget, they still paid mileage and gave me Friday off as a travel day. Even without a budget, still ask to see what you can get.

Virtual PASS Chapters – FREE

PASS also does virtual online events. I know I’m risking making this sound like a PASS commercial, covering the local PASS chapters, the PASS SQL Saturdays, and now the Virtual PASS chapters, but it’s hard to talk about free or low-cost SQL Server training without talking about these people. So, why would I avoid this sounding like a commercial for them? They deserve to be known, and with all this free junk they can’t have much of an advertising budget!

Anyways, the virtual chapters each have their own focus, and I’m a member of the PowerShell, Performance, Database Administration, and I think a couple of other chapters. They do presentations about once a month, and I try to watch them when I have time. Worst case, they usually make the recordings available later.

There are also a couple “24 hours of PASS” events throughout the year where there are 24 back-to-back sessions. The next one on the schedule is the Russian version, which I’ll pass on.

BrentOzar.com – FREE

In addition to their blogs, the Brent Ozar Unlimited team (Brent’s just a member, not their leader!!!!) does weekly webinars that are usually very useful to me. I’d say I watch 3 of them a month, missing one either due to my schedule or because it’s a rare one that I don’t feel bad about missing. They put up the recordings later, although I have to admit I usually don’t watch them if they aren’t live. I know I don’t make any sense, but you’ll get used to it.

Twitter – FREE

Seriously, if you can fit your question and #sqlhelp in 140 characters or less, you’ll have someone pointing you in the right direction within minutes. They’ll point you to the posts that you didn’t know what key words to Google to find, but keep in mind that they’re doing this for free. Always be grateful, and understand when they say that digging into the issue would mean a trip to the consultant.

The End – Sad, but still free

So, yeah, my company cut my training budget. That means that so far this year (March 3rd here) I had to stick to learning from Steve Jones, Grant Fritchey, Erin Stellato, Stacia Misner, Tom LaRock, and others in-person. Then I watched free videos by Jes Borland, Jeremiah Peschka, Brent Ozar, Kendra Little, and Doug Lane. I have a free month of the SQL Skills Pluralsight classes with a code that Paul Randal personally emailed me, with which I’m in the middle of a class by Jonathan Kehayias right now. I’m planning on taking full advantage of this trial with classes by Joe Sack, Erin Stellato, Paul Randal, Bob Beauchemin, and Kimberly Tripp, although I’ll end up running out of time and buying a subscription for $30 / month.

So, is a lack of a training budget stopping you? You can say it is, but it’s not.

I gave you a lot and I missed a lot. This is not all-inclusive, and a quick look around the internet will prove that point. There are still Free e-Books and many other resources out there. I have to start skipping stuff or this post will never make it out to you, and what good would that be?

Hey, look!!! There’s a section right below here marked “Comments” that’s perfect for your addition to the list. Maybe it can be all-inclusive!!!

Indexing Fundamentals

First, let’s understand what the types of indexes are. If you take a reference book, you have both types of indexes. The book itself is a clustered index; it contains all the data and has the key field of page number. The index in the back is your nonclustered index; it’s in order by the key field of subject, then has the implied second key field in there for page number, which nonclustered indexes in SQL Server have as well. On top of that, the table of contents at the beginning of the book is just like the b-tree built on every index that helps you get to the right page faster.

The key fields I mentioned are the fields everything is ordered by, just as you would expect a book to be ordered by page number. Nonclustered indexes also have implied key fields, which are all of the clustered index’s key fields that weren’t explicitly made key fields already. That’s a great reason to keep your clustered indexes as narrow as possible in most cases, which means they have fewer columns.

If you have a phone book then you have a clustered index with the key fields last_name, first_name, address, phone_number. That means it’s ordered by each one, so any duplicate last names are still in order by first name, and so on. It’s wide for a clustered index, but doesn’t have any nonclustered indexes to worry about, so wide is fine here.

However, the important thing to notice is that if I asked you for people with the last name “Hood”, you’d find them rather quickly. If I asked you whose phone number is 412-555-6789, you’d cringe. The same thing happens in SQL Server where it will do the work you ask of it, but it won’t be cheap, quick and easy. This is why the internet has reverse phone number lookups where they have a nonclustered index on phone number which lets you find the name quickly by just knowing the number.

Indexes can also have included columns, which are stored in the index, but no sorting is done on them. These fields don’t even exist in the b-tree for finding the information, only in the page level where everything is stored. To simplify the process, this is where you put fields that you have in your SELECT clause but not your WHERE clause, since you don’t need them sorted to search quicker; you just need them to fulfill your query’s needs.
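
If the phone book were a table, a hedged sketch of those indexes might look like this (all names made up):

-- The book itself: a wide clustered index ordered by name.
CREATE TABLE dbo.PhoneBook
(
    last_name    varchar(50)  NOT NULL,
    first_name   varchar(50)  NOT NULL,
    address      varchar(200) NOT NULL,
    phone_number varchar(20)  NOT NULL
);

CREATE CLUSTERED INDEX CIX_PhoneBook
    ON dbo.PhoneBook (last_name, first_name, address, phone_number);

-- The reverse lookup: a nonclustered index keyed on phone_number. The clustered
-- key columns ride along as implied columns, so this index alone answers
-- "whose number is this?" without touching the clustered index.
CREATE NONCLUSTERED INDEX IX_PhoneBook_PhoneNumber
    ON dbo.PhoneBook (phone_number);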

How are they used, and what do they cost?

A nonclustered index is used exactly the same way you read a reference book. You look up the key field in the back of the book and see what information is there. If you covered all the information that you needed just by looking at that index then you’re done. However, if you want to know more that wasn’t listed in the index then you take the key field of the clustered index (the page number) and do a key lookup by flipping to those pages and reading more about the subject at hand.

If you noticed, reading all that information in the back of the book seems pretty simple and efficient, especially when all of the information is there so you don’t have to touch the book itself. If you don’t have to touch the book then that’s called a covering index, which can be accomplished by adding the fields you want to retrieve to an index as either included or key columns. Depending on how many of these key lookups you have to do and how much extra information it has to get from the clustered index, it may make sense to make your index a covering index.

However, none of this is free. Every index you make has to be maintained. If you ignore filtered indexes, which I’ll get to soon, every time you insert or delete rows on a table then you have to write to each index. So if you have 6 nonclustered indexes, it doesn’t matter if you have a clustered index or a heap (merely a lack of a clustered index), you have to do one write to the table itself and one write to each nonclustered index, so 7 writes for that one row.

Add on top of that maintaining those indexes for fragmentation, the extra storage needed on disks and backups, and more data competing for space in the cache, and this stuff really starts to get expensive.

How many indexes should I have?

There’s no magic number of indexes you should have on any single table. Too few makes it hard to find the data, and too many makes a maintenance nightmare with data modifications taking longer. What indexes and how many will vary wildly, but there are methods to figure out what’s appropriate for each table. Chances are, many of your tables will benefit from both dropping indexes and creating them.

Consolidate what you have

First, look at the indexes themselves. If you have indexes that have the same first key field as the clustered index then you can consider removing those indexes with little to no detrimental effect. This is regardless of whether these indexes are being used or not.

Now, look at what indexes are never used. This can be done by looking at sys.dm_db_index_usage_stats to get you up to a month of data (assuming a monthly reboot for patching), or, my preferred method, by querying your tables you have set up that monitor this DMV. If an index is never being used it means it’s never helping you, but it is extra weight that’s adding up and slowing your database down. Just make sure you have enough data to say that the index really isn’t needed before relying on this alone.
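
A hedged sketch of that check is below. Usage numbers reset when the instance restarts, so make sure the server has been up long enough, or that you’ve been snapshotting this DMV to a table, before trusting it:

-- Nonclustered indexes in the current database with no reads recorded but
-- writes still happening. These are candidates to drop, not automatic drops.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       us.user_seeks,
       us.user_scans,
       us.user_lookups,
       us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE i.index_id > 1                                 -- nonclustered only
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC;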

Also, look for patterns in the indexes themselves by comparing the first couple key fields. If those first key fields are the same, these are candidates for having the indexes combined into a single index. Depending on how quickly those key fields make the data semi-unique, you may be talking one to three key fields matching before I’m itching to reduce the number of indexes. You can do this by taking the columns that are in the second index that aren’t in the first and adding them to the first index as included columns, then dropping the second index. Queries that used to hit either one of these indexes should run slower by an unnoticeable amount, yet you’re only storing data, doing updates, and maintaining a single index instead of two.
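
Here’s a hedged sketch that lists the leading key column of every index so you can eyeball the overlaps; real comparison scripts go several key columns deep, but this is enough to spot the obvious candidates:

-- Leading key column of each index in the current database, sorted so indexes
-- on the same table with the same first key land next to each other.
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       i.type_desc,
       c.name                   AS leading_key_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id   = i.object_id
 AND ic.index_id    = i.index_id
 AND ic.key_ordinal = 1
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY table_name, leading_key_column, index_name;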

When you’re going through this process you just have to be careful that you understand the implications of dropping a unique index. It is rarely a good idea to get rid of a unique index unless that uniqueness is enforced by other means as well.

Take this example.  I have a heap with a unique index on Col1 that gets scanned, but no seeks.  Also, there’s a NC index on Col2 that has a lot of seeks then is obviously doing a lot of lookups to get more information from the heap.  In this case I happen to know that the “unique” part of the index on Col1 is because that’s an identity column, so I’m not too concerned about maintaining the unique constraint here.  After seeing the usage stats over a month I’m able to restructure this table.  Both nonclustered indexes were dropped, a new unique clustered index was created on Col4 and Col1.  User Lookups were eliminated since it’s all just a clustered index now, we saved about 700 MB on disk, and the compressed equivalent on backups as well.

[Image: Heap_With_Unique_NC_Identity]

Take another example where you can see that an index is duplicated.  No, none of these are 100% the same, but the second index is the unique (see the is_unique field), clustered (the * for included columns is just my way of denoting that) index, and it has a single key column on Col1, which is the same as the first nonclustered index.  Seeing the usage like this, I’m going to drop the first index, saving 435 MB of space in my database.  This will have a negative impact in that it will move 626 index scans done throughout the month to an index that’s twice the size, and that’s a problem I’ll revisit later when I’m tuning queries that have high physical reads.

Notice you can also see two issues on this table that I’ll need to look into.  First, I have a nonclustered index that’s larger than my clustered index, something is wrong here.  Perhaps I should have put the fill factor on here to show that wasn’t the issue.  The real answer is that I’m working on a table that somehow got missed on the index maintenance, and I now know that this is possible and needs to be fixed.  Another issue is that there is a filtered index on here that has about 0.1% fewer rows than the unfiltered indexes, which means that you’re getting no benefits while alienating queries that may have used this otherwise.

[Image: Unused NC2]

And one last example.  The 2nd and 3rd indexes both have a single key field on Col2 and just vary on the included columns.  This is a simple task of combining the two indexes by making one of them include Col1, Col2, Col6, Col7, Col8, and Col9, then dropping the other.  That saves us about 200 MB on disk, and avoids us having duplicated data in the cache.  Ask a SAN or backup admin and they’ll tell you the disk savings is great, but ask an unrealistically informed user and you’ll realize that the cache savings is what makes a difference, dropping your physical reads, upping your PLE, and making your app run faster.

Then, there’s another simple task on this table of dropping the final index, which is nothing more than 112 MB of dead weight that was being updated an average of 30 times a day.  Not much to say here, they obviously didn’t use this as much as they thought they would.

Finally, you’re left with the info that User_Lookups is almost as high as the total number of seeks and scans on the nonclustered indexes.  This means that almost every query that hits the nonclustered indexes needs to do a key lookup on the clustered index.  Knowing that, if I was going to get to that level of tuning, I would be looking into the queries that hit this table and determine if I want to add more included columns to actually cover the query or admit that I’ll never cover it efficiently and remove some included columns.  My answer here will vary partially depending on how many records are involved – key lookups are ok for 10 rows, but not so much for 1,000,000 rows.

[Image: Duplicated Index]

None of these are hypothetical situations I forced AdventureWorks to do.  This is a real, production database, and I wrote two queries against it.  One to help me find these problems so I can put a change request in to fix them, and a second query that does the exact same thing while obfuscating the names to protect my employment.  So when you see the duplicated indexes or the sizes I’m working with, you can’t say that would never happen in production….it just did.

You reduced the load, now add to it

Now that you have fewer indexes in your database, it’s time to add to them. There are several ways to figure out what indexes you want to add, and I prefer to do everything manually. That means there’s no DTA (Database Tuning Advisor) here, it’s all scripts with proof that what I’m doing is making a difference.

Where are you finding the queries to tune

There are a couple approaches to finding the queries in most need of attention. Query Stats will give you the queries that take the most resources during your peak times, which will reduce the load on your server the quickest. Traces on duration will give you the queries that take the longest along with sample calls with all the parameters you need. Blocking monitors will give you the queries other queries are waiting on the most. Then, one of my favorites, job shadowing gives you what the pain points are for the users.

The first thing I look at is blocking, since that silently sits in the background making queries that could run extremely fast sit there and wait for a single slow query. If there is a lot of consistent blocking then take the head blockers and start tuning them. This won’t eliminate the fact that they’re grabbing locks, but it will cause those locks to be released quicker. Locks by themselves are not an issue and shouldn’t be eliminated, but the impact the locks have on other queries are an issue that is best to be handled this way. If you don’t have many locks that go too long, you can skip this and revisit it later. I say “too long” instead of giving you a number because it varies a lot; 5 seconds is too long for a web app while a minute isn’t much for a data warehouse.
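
A hedged sketch of a quick look at current blocking is below; monitoring software or a blocking trace gives you the history, but this shows what’s being blocked right now and by whom:

-- Requests currently being blocked, who is blocking them, and the statement
-- each blocked request is running.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time              AS wait_time_ms,
       DB_NAME(r.database_id)   AS database_name,
       t.text                   AS blocked_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;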

Once that is out of the way, I’ll start diving into the most expensive queries on the server. They may not be the ones with the longest average run-time, but they add up to the most on the server. If you can cut the reads from 100,000 to 1,000 on a proc called every second then you’re doing more for the server than you would by tuning an hourly query that does 10,000,000 reads. Also, focusing on physical reads will help raise your PLE if that’s a concern on your server, which it usually is. Don’t stick on this for too long, you’re just trying to get the easy tuning out of the way and hopefully change the look of your Top-10 most expensive queries list a bit.
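
A hedged sketch of pulling those top offenders from the plan cache by logical reads; swap the ORDER BY to total_worker_time or total_elapsed_time for the CPU and duration views:

-- Top cached statements by total logical reads since they entered the cache.
SELECT TOP (25)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
       qs.total_physical_reads,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;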

Now I move on to looking at the traces to see if any huge queries are running that are easy tuning opportunities. Again, skim through it, pick up the low hanging fruit and move on. Right now you’re still in the phase of reducing the load on the server, but you may not have hit what the users see yet.

Another place to look is in the cache to find out what indexes are taking up the most space there. If you know your data well enough then you should know what to expect from your cache. What you see there will probably surprise you, and you’ll probably see an Index_id of 0 or 1 (Heap or Clustered Index) sitting there taking 10 times what you would expect it to do. These are cases where there are clustered index scans taking place that you need to find. I was recently able to clear up 12 GB of cache on a server doing a full index scan on a 12 GB table where a 10 MB filtered index was able to do the job, which made PLE almost double.

So, now that you have the server running a little (or a lot) better in general, I get into the work that never ends. Get a hold of the people using the app, either by yourself or asking customer service to arrange it, and find out what they hate the most about the app’s performance. What you need here is for someone to write a list of when anything took a while, what screen they were on, and approximately how long it took. Don’t go requesting anyone to use stopwatches, you’re looking for “a couple seconds”, “about a minute”, or “I click OK then go get a cup of coffee.” Your goal here is to be able to correlate their pain points to something captured in your traces, giving you the procs they’re using along with real-life example calls to dive into.

Something you may find odd is that I’m diving into the part that affects the end users the most last, but I do it on purpose. The other pieces are things I skim through and do what I can in a week or so, but they’ll make the bigger impact on the server in general along with resolving some of the customers’ complaints along the way. In the process, you’re learning more about the server and how everything goes together, so that when you start talking to your customers you can make a more immediate impact on their complaints.

After getting through the list once I let my ADD take over, and man do I have ADD. The customers have an idea of what a DBA can do for them now, to the point that they may even be contacting you directly when something annoys them, developers realize a SQL Server specialist can be a good person to lean on, and your server’s biggest pain point hurts a lot less. Once you get to this point, jump around on this list on your servers and take up a query or two when you feel like it.

Fixing Page Life Expectancy (PLE)

What is Page Life Expectancy (PLE), what makes it drop, and how can I manage memory better? Abusing disks slows many database servers, and it’s often something you can fix with tuning and not spending extra money on better disks or more memory. It takes a very long post to get through all of that, but if you stick with me through this then you’ll be looking at your servers from new angles with an effort that will be noticed by the Sys Admins, SAN Admin, the users, and your boss.

Before we begin, there are some ground rules we need to get out of the way: defining PLE and understanding that there are external memory factors. There’s no shame in skipping that and going straight to the focus of this post. Personally, I find the next two sections a little dry, but I’m also my harshest critic!

Define PLE

Before we get too deep into it, let’s make sure we’re on the same page on a couple things.

Page Life Expectancy is the number of seconds the average page of data has been in the buffer pool.  Keeping the data in memory gives SQL Server quicker access to it instead of making the long, slow trip to disk.  While none of the counters in SQL Server means everything on its own, this one can open your eyes and lead you towards issues that can be resolved.

Keep in mind that SQL reads the data pages from the buffer pool, always from the buffer pool. If the data you need isn’t there then SQL Server does a physical read to put it there. After that’s done it will do a logical read to use the page that’s now in memory. If you want to dive into the detail you can do it here, specifically with the reading pages and writing pages links on that page.

That physical read goes to disk, the slowest part of your server, to read the page(s), have them processed by the CPU, possibly decrypted if you’re using TDE, and then placed into memory. This makes PLE critical because, even if you ignore the extra load you just placed on the I/O system and CPU, you’re waiting for an additional, slower action to take place.

The speed of your disk does matter, but it also doesn’t really matter.  This is an EXTRA step with SQL Server doing a physical read THEN a logical read, not instead of a logical read.  Also, your persisted storage is the slowest part of your server, be it spinning disks, SSD, flash, or anything else.  The expensive stuff just isn’t AS slow.

You can find your page life expectancy in sys.dm_os_performance_counters using my post on OS Perf Counters. That post will also help you realize how much load you take off of your disks by raising PLE, then you’re really going to start to understand the burning drive picture you find there.  While it’s great that I have the code out there to get this yourself, the tracking of this counter should be coming from your monitoring software.

The problem is that the 300 value for Page Life Expectancy that many people see, and that you can still find documented, is wrong, very wrong. If you have a server with 64 GB of memory with 56 GB allocated to SQL Server, that means you’re reading about 56 GB of data from disk every 300 seconds. If you look at Page 36 of Troubleshooting SQL Server – A Guide for the Accidental DBA by Jonathan Kehayias and Ted Krueger you’ll see an actual scalable version of this: PLE should be 300 for every 4 GB of RAM on your server. That means for 64 GB of memory you should be looking at closer to 4,800 as your critical point.

The reason you see 300 written everywhere is because we were limited by 32-bit servers for so long while at the same time SQL Server documentation was really being developed well.  Those servers could only have 4 GB of memory, thus making this number make sense as well as the Buffer Pool / 4 GB * 300 formula.  I’ll go so far as to say that you should be cautious of anything that doesn’t have a formula to it because our servers keep getting faster, our databases keep getting bigger, and static values…don’t.
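
A hedged sketch of checking the counter against that scaled formula in one shot; it sizes the buffer pool by counting pages in sys.dm_os_buffer_descriptors, which can take a few seconds on servers with a lot of memory:

-- Current PLE compared to a (Buffer Pool GB / 4 GB) * 300 target.
DECLARE @BufferPoolGB decimal(10, 2) =
    (SELECT COUNT(*) * 8 / 1024.0 / 1024.0
     FROM sys.dm_os_buffer_descriptors);

SELECT pc.cntr_value                          AS current_ple_seconds,
       CAST(@BufferPoolGB / 4.0 * 300 AS int) AS scaled_ple_target,
       @BufferPoolGB                          AS buffer_pool_gb
FROM sys.dm_os_performance_counters AS pc
WHERE pc.counter_name LIKE 'Page life expectancy%'
  AND pc.object_name  LIKE '%Buffer Manager%';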

Jonathan Kehayias also put this in his blog post Finding What Queries in the Plan Cache Use a Specific Index, and that had a great series of comments going back and forth between him and Brent Ozar. Brent is discussing not putting so much faith in PLE and watching wait stats instead as different speed I/O systems can greatly sway the impact of low PLE or even make it so a high PLE just isn’t high enough. They’re both right, and if you can understand the intent of each one then you’ll be in a much more comfortable place when tuning your servers.

If you force me to take a side on that post, Jonathan wins out over Brent here. This is because Jonathan is talking about having a higher PLE, which in turn reduces load on the disks and typically reduces the amount of work that needs to be done to execute a query. Brent is basically implying that money may have already been thrown at the issue, which negates the current symptoms with what could be a less scalable solution. However, you can’t discount Brent’s logic here, as the things that are limiting your server’s performance at this time lie in the Wait Stats, not this counter.

Quick Rant: We’re Not Alone

So, you want higher PLE to show that you’re using your resources better, and the only way PLE goes up is by leaving data in the data cache once it’s there. The problem is, there’s a lot going on that wants to work the other way. Even if you have 64 GB of memory dedicated to SQL Server to host a single 64 GB database, it won’t all fit in memory; that’s because other things want to play as well.

First, your memory isn’t completely dedicated to the buffer pool. Memory pressure from the OS or VMs changing size can drop the size of the buffer pool has unless you’re using “lock pages in memory”, which is not recommended in most situations especially on VMs. If you’re not on a completely dedicated box, which means a physical box with no apps, no GUI, no RDP sessions, no SQL Server, no network card driver, etc., then this can happen. Wait a minute…you can’t uninstall SQL Server to make sure SQL Server’s buffer pool remains untouched, yet SQL Server uses memory outside of this area for any purpose that isn’t specifically assigned to the buffer pool and that memory usage can get out of control. If there’s memory pressure then some pages are forced out of data cache to drop the size of the buffer pool, leaving you open to more contention and a lower PLE.

Second, your buffer pool isn't dedicated to your data cache, as it also houses your proc cache, which holds execution plans stored so they can be reused. These plans take a lot of CPU to create, and reusing them saves you a lot of resources, so you can't complain that you're sharing space here. SQL Server determines how much of the buffer pool to assign to each, and it's possible for either side to hold data that's not needed. I'm going to focus on the data cache here, so I'll defer the abuse of the proc cache conversation to Kimberly Tripp. There's more to it than just her post, but it's a great place to start understanding what's there and not in use.
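
If you want to see how that split looks on your own server, the memory clerks give a rough view. This is a minimal sketch; the clerk names below are the standard ones for the data cache and the two main plan cache stores, and on older builds the pages_kb column is split into single_pages_kb and multi_pages_kb instead:

SELECT type
    , Size_MB = SUM(pages_kb) / 1024
FROM sys.dm_os_memory_clerks
WHERE type IN ('MEMORYCLERK_SQLBUFFERPOOL' --data cache
    , 'CACHESTORE_SQLCP'                   --ad hoc and prepared plans
    , 'CACHESTORE_OBJCP')                  --stored procedure plans
GROUP BY type
ORDER BY Size_MB DESC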

Focus, Focus, Focus

Ok, so we know what PLE is and that external factors can affect it, but that's not really why you're here, is it? Are you here because you have a data cache and you want to manage it just to keep your PLE going up? Maybe a little. My guess is that you're here because you heard that PLE is critical to your server's performance and you want to increase performance by raising this counter the right way. You do that in two ways.

First, make sure you have as much room as possible. If your server can handle more memory, it's often cheaper to buy that memory than it is to intervene on the database side. Tuning and purging scale well, will help a lot more than memory management alone, and are the best things you can do given the time budgeted to them, but they just aren't as quick and easy as a memory upgrade in many cases. Even if you buy memory now and tune later, the memory will continue to help. 64 GB of memory (it's 128 GB max on Standard Edition now), as Brent Ozar pointed out here, is only a couple hundred dollars.  FYI, I absolutely hate throwing money at issues, and I still listed this first.  Partially because it's an easy fix to an often neglected issue, and partially because every fix is throwing money at the problem because your time isn't worthless.

Second, keep as little in memory as possible to make efficient use of the space you have. Yeah, it's that easy. Well, at least it's that easy to say, but doing it takes a little more work. Here's a list of ways you can trim down on the space you need.

  • Drop Unused Indexes
  • Merge Duplicate Indexes
  • Use Your Indexes – SARGability
  • Watch for Big Queries
  • Look in Your Proc Cache for Opportunities
  • Know What’s in Your Buffer Pool
  • Index Maintenance – Defrag
  • Index Maintenance – Statistics
  • Purge Your Data
  • Other

Through all of this, please remember that your initial thought may be to raise a counter, but is that your real goal? I'm betting you want your server to run faster, and PLE is merely one counter that helps put a value on performance. Servers with high PLE can run horribly; they just don't need to keep running back to disk. Servers with low PLE can run great; they just need to go back to disk to get what they need.

A little bit ago I mentioned buying more memory; that will raise your PLE without making any queries actually run more efficiently. Sure, they're more likely to have their data in cache, which is great, but they're doing just as many logical reads and using just as much CPU to do their joins. The rest of this post isn't like that; the rest is about making queries more efficient in a way that just happens to make your PLE go up.

Drop Unused Indexes

This part’s tricky. SQL 2005-2008 R2 told you how much your indexes were used since the last time SQL Server was started in the DMV dm_db_index_usage_stats. It’s still there in SQL 2012 and beyond, but now these statistics also get reset when you rebuild an index. That means that, best case scenario on a server that’s patched monthly, you have a month’s worth of data to go off of, and you can’t even rely on that being the whole picture anymore. I get into more detail about how to keep track of this over time in my post Indexes – Unused and Duplicates.
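
The full scripts are in that post, but as a rough sketch of the idea, this finds nonclustered indexes in the current database with no seeks, scans, or lookups recorded since the counters were last reset, sorted by how often you're paying to keep them updated:

SELECT TableName = OBJECT_NAME(i.object_id)
    , IndexName = i.name
    , us.user_seeks
    , us.user_scans
    , us.user_lookups
    , us.user_updates
FROM sys.indexes i
    INNER JOIN sys.objects o ON i.object_id = o.object_id
    LEFT JOIN sys.dm_db_index_usage_stats us ON us.object_id = i.object_id
        AND us.index_id = i.index_id
        AND us.database_id = DB_ID()
WHERE o.is_ms_shipped = 0
    AND i.type_desc = 'NONCLUSTERED'
    AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC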

This has a minor implication on PLE because, as the pages of your unused indexes are updated, SQL Server has to read part of the B+tree into memory to find the page, read the page being updated into memory, update it in memory and mark it as a dirty page, then eventually write it back to disk. The key here is that pieces are read into memory, AKA the buffer pool, the part of SQL Server you’re trying to clean up.

Merge Duplicate Indexes

This one’s not as tricky, but the performance issues are much worse. The problem is that you have two indexes, we’ll call them ix_A and ix_B, that are very similar. They have the same first two key fields and stray a little after that. The differences mean that ix_A will be better for one query and ix_B will be better for another, and that means SQL Server will be dragging them both into cache. Sure, if you take the columns that are in ix_B but not in ix_A, include them in ix_A, then drop ix_B, the queries that used to use ix_B will probably need to do a little more work, but you’ll end up with a more efficient use of the cache. Your typical testing will show you the worst case scenario here, because you’ll run it each way multiple times and see what it’s like when the indexes are completely in cache. However, the real world is friendlier than that (that’s a first), and the query that used to use ix_B may actually run faster because now it’s using an index that’s more likely to be in cache, cutting down on physical reads. Going back to the same blog post, Indexes – Unused and Duplicates, you can see details on how to help relieve this problem.
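
As a hypothetical example (the table and column names here are made up for illustration), merging ix_B into ix_A could look like this, where ix_B's extra column becomes an included column on ix_A before ix_B is dropped:

--ix_A was (CustomerID, OrderDate) INCLUDE (OrderTotal)
--ix_B was (CustomerID, OrderDate, SalesRepID)
CREATE NONCLUSTERED INDEX ix_A ON dbo.Sales (CustomerID, OrderDate)
    INCLUDE (OrderTotal, SalesRepID)
    WITH (DROP_EXISTING = ON) --add ONLINE = ON if you're on Enterprise Edition

DROP INDEX ix_B ON dbo.Sales

Test the queries that favored ix_B before and after; as mentioned above, the worst case in testing often turns out better in production because of the cache.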

There’s a bit of an art to this at first, but you’ll get it down to a science with practice.

Use Your Indexes – SARGability

SARGability (Search ARGument capable) is its own topic, and it deserves at least one post of its own. You need to understand how indexes work, as stated in my post Indexing Fundamentals, and make sure that your queries can take advantage of indexes. If you look in a phone book for people with a last name like ‘%ood’, it’s no good because you need the first letter of the last name to take advantage of that index. If you look up people with the first name ‘Steve’, that’s also useless because the first key field is last name. Functions, implicit conversions, and many other techniques can cause you to need to scan an entire index.
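
Here's a quick hypothetical example of the difference (the table and column names are made up). Both queries return the same rows, but only the second lets SQL Server seek on an index keyed on OrderDate:

--Not SARGable: the function on the column forces a scan of the whole index
SELECT OrderID FROM dbo.Orders WHERE YEAR(OrderDate) = 2016

--SARGable: a range on the bare column can use an index seek
SELECT OrderID FROM dbo.Orders WHERE OrderDate >= '20160101' AND OrderDate < '20170101'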

In terms of PLE, that means you read the entire index into memory instead of just the pieces you need. That can add up to a lot of data being kicked out of cache to make room for a lot of physical reads from disk.

This isn’t an easy topic, and I’m not sure of any resources that say everything that needs to be said on the topic. While I have plans to write a post on it and will update this one when I do, it’s still an unwritten post. Until that time comes, Google “SQL Server SARGable” and see what pops up.

If you know how to read execution plans, there are two sections on a seek, scan, or lookup that you need to know here: Predicate and Seek Predicate. The Predicate is what it had to scan for, while the Seek Predicate is what it could find efficiently using the fact that the key fields are sorted.

Watch for Big Queries

It doesn’t matter if you use Extended Events or Profiler traces; you need to know when large queries are being run on your servers. If you want to watch this at a statement level then you’ll want to make the jump to Extended Events, but I’ve never seen it hurt to watch SQL Batch Completed and RPC Completed filtered only by duration over 10 seconds.
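
If you go the Extended Events route, a session like this is a minimal sketch of that batch/RPC capture. Duration is in microseconds for these events, and the session name and target file name are assumptions you'd change for your environment:

CREATE EVENT SESSION [LongRunningQueries] ON SERVER
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE duration > 10000000), --10 seconds, in microseconds
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE duration > 10000000)
ADD TARGET package0.event_file (SET filename = N'LongRunningQueries.xel', max_file_size = 50)
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS)

ALTER EVENT SESSION [LongRunningQueries] ON SERVER STATE = START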

The point of this when it comes to PLE is that the queries that aren’t SARGable or don’t have an index to take advantage of will be doing a lot of scans. The bigger the scan the longer the query will take to run, and if you care about PLE then you’ll know what queries those are. Sure, some are legitimately doing enough work where they’ll always take that long to run, but most of the queries that take a while can be tuned. Be it changes to the query itself (should be your first instinct) or indexing changes, there’s often a way to have a query require less data in cache.

This is an extremely useful technique to use in conjunction with watching for drops in PLE. If PLE drops, check to see what was running at the time. This shows you what queries were running, which will account for most of your issues, but the cause could also be outside of the query itself, such as a large auto update statistics task being kicked off that you won’t see here.

Know What’s in Your Buffer Pool

Very few people do this, yet it’s so obvious once you start. You want to raise PLE, right? PLE is a measurement of how long stuff stays in your buffer pool, and drops because something else needed to be put in there. So, what’s in there that’s taking up all your space and how’d it get there? I felt like a fool that it took me years to come up with that question, but then I realized that most people never ask it at all.

It was actually one of my biggest tuning revelations since I read Grant Fritchey’s Execution Plans book. And you always know it’s a great revelation when you find yourself yelling at yourself for not realizing it earlier…years earlier.

So, how do you do this? Query your cache to find the indexes that take up the most space there, pick one that is an excessive amount of space or doesn’t look like it belongs there, query the proc cache to find out where that index is used, and tune that query either through code changes or indexing changes. The scripts to do this and a more detailed description of the process can be found in my post Cleaning up the Buffer Pool to Increase PLE, and a newer post of mine, Query the Buffer Pool, has an even better script to see what’s in cache across all databases on the instance at once in a more efficient query.
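
Those posts have the full scripts, but as a rough sketch of the idea for a single database, this groups the buffer pool by object and index. Be warned that sys.dm_os_buffer_descriptors can take a while to query on servers with a lot of memory:

SELECT TableName = OBJECT_NAME(p.object_id)
    , IndexName = i.name
    , Cached_MB = COUNT(*) * 8 / 1024 --8 KB pages
FROM sys.dm_os_buffer_descriptors bd
    INNER JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
    INNER JOIN sys.partitions p ON au.container_id = p.hobt_id AND au.type IN (1, 3) --in-row and row-overflow data
    INNER JOIN sys.indexes i ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE bd.database_id = DB_ID()
    AND OBJECTPROPERTY(p.object_id, 'IsMsShipped') = 0
GROUP BY p.object_id, i.name
ORDER BY Cached_MB DESC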

Note that if your PLE is low then what’s in your buffer pool will be changing quickly.  That does NOT mean the scripts in the posts I just mentioned are useless, it means you’ll come up with new opportunities each time you run them.

Look In Your Proc Cache for Opportunities

Let’s start by saying this isn’t perfect. These numbers get reset throughout the day and some queries never make it in here at all. That’s saying this isn’t an all-inclusive, one-stop shop. It is in no way saying that you can’t make amazing improvements on the queries you find here.

There are two ways you can use the cache. Most people know that you can get your most expensive queries, such as with the example on the MSDN sys.dm_exec_query_stats page, and you can sort them by any of the counters available there. That will give you an idea of what needs tuning if you need somewhere to start. I love taking this a step further and monitoring that exact information so I have a historical view instead of just what happens to be in cache now. You can read all about that in my post Query Stats.
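
As a starting point, here's a minimal sketch of that first approach, pulling the top statements from the cache by logical reads; swap the ORDER BY to CPU, duration, or whatever counter you care about:

SELECT TOP (25) qs.execution_count
    , qs.total_logical_reads
    , Avg_Logical_Reads = qs.total_logical_reads / qs.execution_count
    , Total_CPU_Sec = qs.total_worker_time / 1000000.0
    , StatementText = SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
        (CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text) ELSE qs.statement_end_offset END
            - qs.statement_start_offset) / 2 + 1)
FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_logical_reads DESC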

The other way is parsing through the XML in the plans to find specific items. I heard that grunt when I mentioned XML, and you’re not alone. Not many of us are good at XML; Jonathan Kehayias is an extreme exception here. He gives you his queries so you can run with it, do what he does, and alter it to go even further. This is exactly what I did in my post Cleaning up the Buffer Pool to Increase PLE that I just mentioned in the last section.

Index Maintenance – Defrag

Many people only think about defragmenting their indexes to help get contiguous reads on their disk, but that’s only half the story. In fact, the better you do everything in this post the less you’ll have to worry about your disks.

However, you also have to keep in mind how full each page of your indexes is. If you have an index page that’s 100% full then you have a full 8kb of data there. Add another row and you now have 8.1kb of data that’s split between two 8kb pages, so you’re wasting almost 50%. 50% isn’t even your worst case scenario because pages are never automatically removed or merged if records are deleted unless you delete every row stored in that page. Paul Randal’s post Performance Issues From Wasted Buffer Pool Memory takes a deeper look at this problem, and Ola Hallengren’s trusted scripts can give you a solution to clean up the fragmentation.
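
To see both fragmentation and how full your pages are, sys.dm_db_index_physical_stats in SAMPLED mode is a reasonable sketch of the check. The table name here is a placeholder; passing NULL for the object instead checks the whole database, which can take a while:

SELECT IndexName = i.name
    , ps.avg_fragmentation_in_percent
    , ps.avg_page_space_used_in_percent --how full the pages are
    , ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Sales'), NULL, NULL, 'SAMPLED') ps
    INNER JOIN sys.indexes i ON ps.object_id = i.object_id AND ps.index_id = i.index_id
ORDER BY ps.page_count DESC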

Index Maintenance – Statistics

It may seem obvious to do your index defrag job off hours. While I’ve seen servers without the job, I’ve never seen one scheduled for mid-day. That’s a good thing, because a lot of data needs to be pulled into cache to play around with indexes like that.

However, do you have auto update statistics turned on? Do NOT turn it off because of this, but understand that it updates the statistics by reading about 1% of the index into cache. If that’s a 100 GB index then you’re reading 1 GB of data, which is a decent portion of your cache. To help avoid this, you should be updating your stats off hours, and Ola Hallengren’s scripts can help with this, too.

You need to update your stats because the more changes are made to a table, the less accurate your stats are, making your execution plans less accurate. Once about 20% of the rows of a table have been modified since the last time your statistics were updated, they’ll automatically be updated again if you leave the default option turned on. Again, don’t go turning it off because of this post alone. Instead, schedule your stats to be updated off-hours; then only the tables that have 20% of their rows modified throughout the day will get their stats updated automatically, and that will stop most auto-updates on the larger tables that would cause issues.
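
To see which statistics are getting stale, sys.dm_db_stats_properties (available in SQL 2008 R2 SP2 and 2012 SP1 and later) shows when each stat was last updated and how many modifications it has seen since. Here's a minimal sketch:

SELECT TableName = OBJECT_NAME(s.object_id)
    , StatName = s.name
    , sp.last_updated
    , sp.rows
    , sp.modification_counter
FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE OBJECTPROPERTY(s.object_id, 'IsMsShipped') = 0
ORDER BY sp.modification_counter DESC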

Purge Your Data

We talked a lot about what to do with the data you have, but do you need all of your data? Step away from SQL Server for a minute and go talk to an accountant. Ask them where they have paperwork from last month and they’ll point to a filing cabinet. Now, knowing they’re legally obligated to keep a lot of stuff for 7 years, ask them where a 5-year-old document is. It’s not right in front of them, but they know where to look for it. It’s not taking up valuable space in their office (read: not on the main production server), often not in the same building they’re in (read: not online), and it will take a bit to get to it. They no longer have an active business use for it, and they can get to it in the rare case that it’s needed.

Ask them where something is that’s 8 years old, and they’ll find a friendly way to tell you that they bought an OCD pyromaniac a pair of scissors and a book of matches. We’re more eloquent with deleting things in batches, but their way sounds like a lot more fun.

If they kept everything in one room would you call that room a cluttered, unmanageable mess? If they kept historical records forever, would you think they were being obsessive and wasteful with purchasing storage?

So, can you pull up a detailed sales report for February 29th, 2000 from Prod for me?

Shrink TempDB

This one’s an eye opener, and one that I believe is enough of a bug that I opened a Connect item on the issue.  The larger your TempDB data files, the more space you’ll find in your buffer pool used by unallocated pages in TempDB.

I go into details in my post TempDB Memory Leak?, but here are the basics.  TempDB can use memory up to the size of the data files, not just the size of the used space in the data files.  The only workaround I know of right now is to make the size of the data files smaller.

I’m not talking about making TempDB tiny and letting it grow, there’s no excuse for that.  Look at your monitoring software to see how large TempDB has to be to avoid hitting autogrowth outside of accidental issues, then size TempDB about 20% larger than that.  I specifically avoid the best practice of presizing TempDB to fill a dedicated drive for this reason alone.
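
To see how your current TempDB data files compare to the space actually used in them, a quick sketch like this works (sizes come back in 8 KB pages, converted to MB here):

USE tempdb
GO
SELECT FileName = name
    , File_Size_MB = size * 8 / 1024
    , Used_MB = FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024
FROM sys.database_files
WHERE type_desc = 'ROWS' --data files only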

Other…Am I Giving Up?

There are too many things to list in one blog post. You could write a book on this subject. I listed what I feel will help you the most, but I also wanted to take time to let you know that my list isn’t some magical, all-inclusive, everything you can do to make your servers run better. It’s a start. Hopefully a good one you found to be productive, but a mere start no matter how you look at it.

Speaking of “you could write a book on the subject”, well, they did. A lot of them did. Some even made them free PDFs. Not bootleg bit-torrent copies, but actual real, legal, free PDFs made possible by Red Gate in their book selection on SQL Server Central.

Some books you have to buy, if that’s your thing:

Also, I’m not the only person blogging about this topic:

Results

If you did all of this and there’s something running on your servers to trend performance, you’ll notice:

  • Page Life Expectancy: Raised significantly, but you saw that coming
  • Page Reads/sec: (Physical Reads) Dropped because we’re not cycling data in and out of cache as much
  • % Processor Time: Dropped due to lower I/O, more efficient queries, less pressure on procedure cache, etc.
  • Critical query execution time: Typically less due to less chance of waiting for physical reads and lower CPU stress.
  • PageIOLatch wait types: Dropped due to fewer physical reads leading to fewer waits on physical reads.
  • DBA Pay Rate: If this isn’t on the list, try using a chart of the above counters

Keep Reading

The largest part of memory management in SQL Server is indexing.  This can be changing the indexes themselves or how your queries interact with them.  Doing either requires a great understanding of what indexes are and how they work.  I’ve written several posts on the topic and recently added them to my Indexing page to help you browse them easier.

SQL Server Running Slow

A user calls to say the app or server is slow today.  Here’s a quick summary of how I get started.  It depends on monitoring where you can compare baselines to recent activity, and I include links to all of that code I use.

The initial question is too vague.  You can’t blame this individual user because this is how 95% of issues make it to you.   My initial follow-up is the following questions I ask the user, along with the corresponding question your boss is asking you:

  • Can anyone else duplicate your issue?
    • Is this the user’s machine or a server issue?  Not a 100% conclusive answer, but it’s a start.
  • Is everything running slow or just one process or screen?
    • Is the whole server running slow or just one stored procedure?
  • Were any changes made to the system recently?  Depending on the user, this may be a question for someone else.
    • Did we cause it by making a change or did this just come up?

However, this is enough for me to do a very quick overall health check while I wait for those answers.  It seems like a lot at first glance, but it’s practically automated with the scripts I have on hand.  This is only opening 5 pre-written scripts, running them, and peeking at the results.

Quick Health:

  • Look at historical wait stats for the last two hours and the same time window both 1 and 7 days ago.  Documented in the Wait Stats post.
  • Look at memory and CPU contention for the same times as wait stats.  You’re specifically looking for high CPU usage, PLE drops, and changes to target memory. Documented in the OS Performance Counters post.
  • Look for blocking in the last hour, even though it would typically show up in wait stats.  You can compare this to previous days if needed.  Documented in the Blocking post.
  • Look at the large queries that completed in the last two hours along with the queries currently running to see if anything pops out.  This is just a quick glance looking for glaring issues, and you can look for details after getting answers to the follow-up questions.  Unfortunately, this information isn’t as cut-and-dry as the rest, and it’s going to rely on you being a little familiar with what normal is on this server.  Completed queries are documented in the Tracing Creation and Reading posts, and currently running queries are documented in the Running Processes post.
  • Look at the queries recompiled recently to see if a new execution plan is taking excessive resources.  Documented in the Recently Recompiled Resource Hogs post.

So, where do you go from here?  Yeah, you get the typical “it depends” answer from me.  The information you just looked at will normally come up with something that doesn’t look right and needs to be looked into further.  However, there are so many directions this can go from here that a single document can’t cover it.  The purpose of this post is simply to answer the question “Where do I start looking with the most common question a user throws at me?”, nothing more.  Also, I wanted a way to tie all of the random posts together so you could see them working together as the full solution.

Running Processes

Sp_who2 is not used by this DBA.  That info is amazingly good to have, but I want more.  Don’t get me wrong, sp_who2 is kinda hidden in this one as it’s really just a proc that does little more than “SELECT * FROM SysProcesses“, and I use SysProcesses where I couldn’t find the same info in the DMVs that were supposed to replace it.  Don’t go thinking that SysProcesses is just a deprecated dinosaur, even MVPs still use it: http://sqlblog.com/blogs/kalen_delaney/archive/2008/06/29/why-i-still-need-sysprocesses.aspx

Threads and RunningThreads are new for me here.  It’s good to see when a query is split into 100 threads and only 1 is doing anything.  It’s also easier on the eyes to see the number 100 instead of seeing 100 separate rows for one SPID.

The percent complete and estimated completion time on this script are typically blank, with the major exceptions being backups and restores where these columns are the best thing ever! 

BlockedBy is something overlooked by many people.  If your query seems to be running forever when it normally runs quick, there’s about a 75% chance you’re just waiting for another process to complete. It’s also why you’ll see the notorious (nolock) in almost everything I write for myself, but that’s another conversation.  The HeadBlocker attempts to find the start of a blocking chain, although I know my logic here isn’t bulletproof.

My wait types used to be perfect: if the wait time was 0 then it was lowercase, otherwise it was all caps.  It seems I broke it slightly, but it’s almost perfect.  My OCD will eventually get the best of me here, but it has to compete with the rest of my life for that time.

The rest is pretty self-explanatory with the exception of query plans.  They are the last line of the SELECT and FROM clauses that are commented out.  It’s somewhat rare to need the plan, and it’s not too uncommon to have this script run noticeably slower when looking for plans.  Because of that, I uncomment that column and the cross apply that makes it possible only when needed.

SELECT SPID = er.session_id
    , ot.Threads
    , RunningThreads = coalesce(rsp.RunningThreads,0)
    , Pct_Comp = er.percent_complete
    , Est_Comp_Time = CASE er.estimated_completion_time WHEN 0 THEN NULL ELSE dateadd(ms, er.estimated_completion_time, getdate()) END 
    , er.status
    , er.command
    , database_name = sd.name
    , BlockedBy = wt.blocking_session_id
    , HeadBlocker = coalesce(hb5.session_id, hb4.session_id, hb3.session_id, hb2.session_id, hb1.session_id)
    , wait_type = coalesce(CASE er.wait_type WHEN 'CXPACKET' THEN 'CXPACKET - ' + sp.lastwaittype1 ELSE sp.lastwaittype1 END, lower(er.last_wait_type)) --Lowercase denotes it's not currently waiting, also noted by a wait time of 0.
    , Wait_Time_Sec = Cast(er.wait_time/1000.0 as DEC(20,3))
    , er.wait_resource
    , Duration_Sec = Cast(DATEDIFF(s, er.start_time, GETDATE()) as DEC(20,0))
    , CPU_Sec = Cast(er.cpu_time/1000.0 as DEC(20,3))
    , Reads_K = Cast(er.reads/1000.0 as DEC(20,3))
    , Writes_K = Cast(er.writes/1000.0 as DEC(20,3))
    , [Statement] = SUBSTRING (st.text, er.statement_start_offset/2 + 1, --offsets are 0-based and in bytes, so shift to 1-based character positions
        abs(CASE WHEN er.statement_end_offset = -1
               THEN LEN(CONVERT(NVARCHAR(MAX), st.text)) * 2 
            ELSE er.statement_end_offset END - er.statement_start_offset)/2 + 1)
    , st.text as Query
    , es.login_time
    , es.host_name
    , program_name = CASE LEFT(es.program_name, 29)
                    WHEN 'SQLAgent - TSQL JobStep (Job '
                        THEN 'SQLAgent Job: ' + (SELECT name FROM msdb..sysjobs sj WHERE substring(es.program_name,32,32)=(substring(sys.fn_varbintohexstr(sj.job_id),3,100))) + ' - ' + SUBSTRING(es.program_name, 67, len(es.program_name)-67)
                    ELSE es.program_name
                    END  
    , es.client_interface_name
    , es.login_name
    , es.status
    , es.total_scheduled_time
    , es.total_elapsed_time
    , er.start_time
    , es.last_request_start_time
    , es.last_request_end_time
    , er.database_id  
    --, qp.query_plan 
FROM sys.dm_exec_requests er
    INNER JOIN sys.dm_exec_Sessions es on er.session_id=es.session_id
    LEFT JOIN sys.databases sd on er.database_id=sd.database_id
    INNER JOIN (SELECT session_id, count(1) Threads FROM sys.dm_os_tasks GROUP BY session_id) ot on er.session_id=ot.session_id
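    --sysprocesses is still the easiest place to get per-thread wait types; filtering out cxpacket exposes the real wait hiding behind parallelism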
    LEFT JOIN (SELECT spid, LastWaitType1 = MIN(lastwaittype), LastWaitType2 = MAX(lastwaittype) FROM sysprocesses sp WHERE waittime > 0 AND lastwaittype <> 'cxpacket' GROUP BY spid) sp ON er.session_id = sp.spid
    LEFT JOIN (SELECT spid, RunningThreads = COUNT(1) FROM sysprocesses sp WHERE waittime = 0 GROUP BY spid) rsp ON er.session_id = rsp.spid
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt WHERE wt.blocking_session_id <> wt.session_id GROUP BY session_id) wt ON er.session_id = wt.session_id 
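    --Walk up to five levels of blockers; the furthest non-NULL session_id found is reported as the HeadBlocker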
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb1 ON wt.blocking_session_id = hb1.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb2 ON hb1.blocking_session_id = hb2.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb3 ON hb2.blocking_session_id = hb3.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb4 ON hb3.blocking_session_id = hb4.session_id 
    LEFT JOIN (SELECT session_id, max(blocking_session_id) blocking_session_id FROM sys.dm_os_waiting_tasks wt GROUP BY session_id) hb5 ON hb4.blocking_session_id = hb5.session_id 
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS st  
    --CROSS APPLY sys.dm_exec_query_plan(er.plan_handle) qp
WHERE er.session_id <> @@SPID
    --AND es.host_name like '%%'
    --AND er.session_id = 2702
ORDER BY er.percent_complete DESC, er.cpu_time DESC, er.session_id

--Use the below command to get the last input of an open session id
--dbcc inputbuffer(61)