Disaster Recovery and Business Continuity


Fires, SAN failures, tornados, and car accidents all came up today, yet it was a very good day at work.  I just started my job two weeks ago and made a list of questions to go through with them. After asking the one about when they last pulled an offsite backup and restored each Tier-1 database, I ended up in the middle of a project to restructure the disaster recovery and business continuity plan.

As a DBA I thought my job was to ensure that a database was recoverable from any disaster and that if you focused completely on the worst disasters possible then you would be safe from anything.  In the short-sighted version of things, perhaps I’m still right.  However, in the realistic version I had much to learn, and still do.

What is my job?

First, my job isn’t to make sure the databases are recoverable, it’s to make sure that database recoverability is part of a full business continuity plan.  That goes well beyond restoring a database to some server that I have no idea where it’s at or how it got there, connected to an unknown network, without an app server.  There is so much more to a DBA’s responsibility in this aspect than I once believed.

If the app isn’t restored with the database then there’s no purpose for the database.  If there’s nowhere for the employees to sit and use the app then there’s no purpose for the app.  Finally, if you restore too far from your current location then no one except the most senior and dedicated employees will travel to work.  All of this has to fit together for the mere responsibilities of a DBA to matter at all.

Second, the worst of disasters isn’t the most common, and a multiple-day turnaround time is unreasonable for something as simple as a SAN failing or a building burning down.  Save the week-long turnaround times for the disasters that remove your ability to work in the city you’re in.  Small disasters shouldn’t cost more than a couple days of data and productivity loss.

You need to look at disasters in levels, and you’ll realize that as you go up the levels they become increasingly less common.  Therefore you can’t have a single, simple business continuity plan and expect it to cover it all.  Also, if you don’t have anything in place today, don’t wait until you’ve figured out every level before you implement the first ones.

Missing Data

At the most common, you have a processing mistake.  The server and infrastructure are perfectly fine, someone just did something they shouldn’t have done.  Forget about the details; it doesn’t matter to you at this point if it was a developer mixing up < and > (haha, brings back memories of lost weekends), a user deleting the wrong records, or a DBA making a simple mistake.  No matter what, your challenge at this point is the same: you’re missing data and the business wants life back to normal as quickly as possible.  For this the business needs to define what level of recoverability you need to have and help ensure you have the budget to do it.  If they never want to lose more than 2 minutes of data, you need log backups.  If an hour is ok then you can get by with Volume Shadow Copy (VSS) snapshots on VMs or possibly differentials on physical boxes.  If a whole day of data loss is acceptable then you’re home free with daily fulls or differentials.  No matter what you’re talking about here, this should be easily accessible by the DBA without outside help anytime day or night.
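
To put rough commands behind those recovery points, here’s a minimal sketch of the three backup types; the database name and backup share are made up, so adjust them for your environment and schedule each at the interval the business agreed to.

-- Hypothetical database name and backup path; COMPRESSION needs an edition that supports it.
-- Daily full backup: up to a day of data loss is acceptable.
BACKUP DATABASE [YourDB]
    TO DISK = N'\\BackupShare\YourDB\YourDB_Full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;

-- Differential backup: cheaper than a full, restores on top of the last full.
BACKUP DATABASE [YourDB]
    TO DISK = N'\\BackupShare\YourDB\YourDB_Diff.bak'
    WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, INIT;

-- Log backup every couple minutes (requires the FULL recovery model):
-- keeps data loss down to roughly the backup interval.
BACKUP LOG [YourDB]
    TO DISK = N'\\BackupShare\YourDB\YourDB_Log.trn'
    WITH COMPRESSION, CHECKSUM, INIT;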

Small Hardware Failures

Now let’s move on to local hardware failures.  A single drive shouldn’t be a concern because there’s no excuse not to have a minimum of RAID-5 on absolutely every file on a server.  Not saying you don’t have to double-check that the server “Prod-012” isn’t a laptop sitting under a stack of papers on an engineer’s desk, I’m just saying there’s no excuse.  Anyways, motherboards don’t last forever, RAM fails, and if your RAID-5 array is too large then it’s no safer than a stand-alone drive.  If you lose something, how long will it take to have something else take over?  If you lose a couple drives do you have enough spare space to get you by until you can replace them?  As for servers, Active/Passive clusters or VMs are great; one loss and you’re up and running in 5 minutes without too many complaints.  Active/Active clusters are also great if they’re set up right, as long as you’re not double-dipping on the RAM with two instances each set up to use 50 GB of RAM on a box with 64 GB of RAM, where the OS will lose the fight quickly.  Standalones, however, are out of luck and need to have a plan in place.  How long is the business able to deal without that server?  Did the business give you that answer in writing, and is the person in the business responsible for that server currently aware of that potential downtime as of your last annual review of disaster recovery and business continuity?  Wait, do you do an annual review with the business of your disaster recovery and business continuity???  Many companies don’t, and it’s your responsibility to start that conversation.

Shared Hardware Failures

Shared hardware and services also fail, no matter how fault-tolerant you think they may be.  Power goes out, do you have a generator?  Networks go down, do you have two connections that can handle the traffic at an acceptable speed coming in through separate hardware?  SANs spontaneously combust, were your data files and backup files on that same rack?  Walk through your server room one day with a system admin and point to each device, discuss what happens when it stops working.  Telephone poles and transformers take time to replace, were your fault-tolerant connections coming in on the same pole?  Now we’re at the point where I’m not expecting you to have a spare on hand, but you still need a plan.  SANs are expensive, so you at least need to know where the money is going to come from if you need a new one overnight.  Again, know your risks, know your recovery time, and make sure the business has agreed to that in writing.

Building Loss

Now for what people tend to have in mind: fires, tornados, floods, blah, blah, blah, they all happen.  I’m in Pittsburgh, so hurricanes are covered under floods and there is no specific earthquake plan.  Is your server room in an area that would get flooded more than once a century?  If so, can it be moved?  If the building your server room is in is no longer there, what do you have to recover with?  Offsite backups of databases aren’t enough.  You aren’t restoring databases at this point.  If that’s what you were thinking, you’re in a heap of trouble here.

First, you’re panicked; everyone’s panicked.  Where is your plan?  It wasn’t in that building, was it?  Who knows where to find the plan, since this isn’t the kind of disaster where you can count on everyone being available anymore?  Can everything be restored to a functioning state with acceptable data loss from data stored offsite, and was it tested to prove that it could be?  Remember, if you’ve only restored the database then you’ve restored a lot of valuable, yet useless, information.  Are the app servers in that same plan and up-to-date so the users can use the data?  Where are those users going to sit?  I know, this is on a DBA’s blog, so mostly DBAs are reading this; we don’t tend to think about a user’s cube, or whether their temporary cube is further than an hourly employee would even consider traveling to work.  However, right now our job is business continuity, not strictly databases.

So, say you do have an offsite, easily accessible plan, written so that panicked and stressed employees, mixed in with temps covering for the employees who had to tend to their families instead of showing up for work, could understand and implement it flawlessly.  Even then, what does the business expect?  I keep coming back to this because the answer is usually that they think we have a million-dollar plan while giving us a small budget.  Realistically it may take us a week to have it all together, and we’d never know the full picture without testing it.  However, the business may be there saying that if it’s down for more than four days then we’re out of business, which would really suck after putting all that work into this awesome, yet lower-budget, plan.  Make sure everyone knows the expectation and have that offsite plan not only signed off by the business, but the proof that they signed off on it stored offsite as well.  The IT department isn’t the only department stressed out when the data center burns to the ground, and stressed out people like to blame others for everything no matter who is at fault.

Oh no, I never thought that would happen

Finally, the largest of disasters.  What if the entire city your data center is in is inaccessible?  I know, how often does that really happen?  Hurricanes took out New Orleans for months and parts of New York City for weeks, earthquakes threaten to do the same to parts of the country, terrorist attacks are possible, plagues can happen (although I don’t foresee the zombie apocalypse), and I now know where the nearest nuclear power plant is.  Do you have recent backups stored outside of the region that can recover from these types of disasters, and has the business signed off on this?  For some companies it is acceptable for the answer to be that those backups don’t exist and that going out of business in these disasters is acceptable, and that’s ok.  Again, get it in writing, and store that piece of information in your personal offsite storage (yay, Google Drive).  If this is the case, you’re protecting your rehirability when a company goes out of business saying their IT department failed them in a disaster.

I provide advice, not business continuity plans

Now it’s time for my disclaimer.  This is by no means all inclusive, and nothing you get from ANYONE’s blog should be taken as such.  Some companies have someone who has the job title VP of Business Continuity, and some don’t.  No matter what the case, it’s your responsibility as a DBA to make sure the DBA team has a part in this process and that the process is implementable for all levels of disaster with restorability that the business is aware of.  Unless you’re the only person in the IT department you shouldn’t have to do this alone, but it is your responsibility to start it if no one else has taken that responsibility.

Monitoring and Baselines Presentation


Here’s my full presentation for SQL Saturday #250 in Pittsburgh this past Saturday along with some notes on what I got out of it.

My Lessons

I learned a lot doing this, and I hope people learned a lot from it.  For the people on the other side of the podium from me, the main lessons were in what I put in the abstract; they learned mostly about SQL Server.  On my side of the podium, though, the lessons were far from the same: public speaking, creating presentations, prepping for unknown audiences, and seeing the gratitude of the SQL Server community in person.  Anything that was on the abstract I learned slightly better, but that didn’t compare to what wasn’t on it.

Public speaking was always a fear of mine, and I tend to be someone who avoids my fears.  However, throwing myself out there at the public as a whole was never my thing either, and I’ve been doing half decent, at least in my own mind.  That being said, I decided to go all out and push myself a little further.  After all, what were the chances of them picking me to present with all those professional teachers, consultants, and MVPs out there throwing up their abstracts?  Best case I could say I tried, worst case I was going to throw myself out of my comfort zone and hope for the best.

They picked me, which I didn’t let myself expect.  Everyone who knew I put in my abstract was also told by me that I only had a 50/50 chance at best, which was more optimistic than I really was, but I don’t like people knowing when I’m being dismissive of myself.  It turns out I was wrong a lot, and in every case I was wrong I was glad I was wrong.

Being Forced to Get Better

I’m not one to learn things for the sake of learning them, which is why I sucked in school. To really learn things I need a real-life use case, somewhere I’m going to apply it. I didn’t practice speaking ever before because I didn’t speak in front of large groups. I didn’t know how to put together a presentation properly because that’s not my thing…well, wasn’t is probably a better word now. Just like a couple months ago I wasn’t quite sure how to blog, but I’ve made it past that point.

Like everything in my life, blogging went from unknown to addiction quick.  I’m not sure that I can say the same thing about speaking, but I can see the possibility.  SQL Saturday only comes to Pittsburgh once a year and the local user group, which I’ll be joining soon, only meets once a month and has a single speaker.  However, I can’t say that I didn’t look up when the SQL Saturdays in Cleveland and Washington DC were.  Also, I took my own thoughts on my presentation along with the feedback I received and immediately started thinking about how I could make that presentation better.

What’s below is the original, not touched up at all in the last 4 days. In part to show where I went wrong and how I’m going to fix it, and in part because Monday was my first day at a new job. By the way, starting a new job when between jobs is the best and worst thing you could ever do to yourself. You don’t have time to be too nervous about anything because you’re too overworked, but you’re also too overworked. Eh, you can’t win them all.

Well, on to the presentation.  What you have here is my script that I talked through and taught myself before the presentation.  However, if you were there, you’ll notice this isn’t exactly what I said.  I didn’t read it, I presented it.  There were no demos (the biggest complaint from the crowd) to keep me from stumping myself, and the PowerPoint was just a whimsical picture for the start of each paragraph to keep me on track, because I knew I’d be nervous and lose my way without a paragraph-by-paragraph guide helping me along.

I must have done at least decent because my reviews came back with two 3s and the rest 4s and 5s out of 5.  You’re just asking me to get off topic and start talking about why I hate ratings that are odd numbers, but no, I’m not getting off topic, I’m getting on to my presentation!!!

The Presentation

Download slide deck here.

I’ve been working on databases for over a decade, and most of that without having good standards or monitoring in place. My job was rough and I was much less effective. Now I know what I’m doing and I want to share that knowledge. In this presentation I am going to tell you what I watch and why I watch it. A presentation is a poor format to go through how to do this; having it in writing is much more beneficial to you. That is why I’m pretty much skipping that portion here and diverting you to my blog at SimpleSQLServer.com. Not so much to promote a blog that I lose money on, but instead to give you the resources in the best format.

As a DBA you will hit problems with the performance of your databases. It doesn’t matter if it’s one process or across the board, or if it’s just today or it has always been that way. No matter what it is, the more you know about your servers the easier it’s going to be to fix them.

It’s not easy; there’s no single spot to watch, no single solution to all of your problems, and there are no “normal” values for these counters that you can apply across every server you manage (Note: Thank you, Mike John, you stressed this point a lot).  I watch several things on all the important servers, and most of these on every server.  The important part is that you watch them continuously, even when you aren’t expecting to use the data.  Some parts are cumulative and you can’t tell what happened during an incident versus what came from overnight maintenance.  Other parts are snapshots and there’s no looking back.

(Note: Now I’m on to the stuff Brian Castle taught me. He’s the best you could hope for in a manager, and last I checked he was still hiring at EDMC in Robinson Township near Pittsburgh, PA)

To me, monitoring and baselining are the same thing.  I know others will do a specific baselining process on a specific day and keep that.  I feel you lose too much doing that, and I watch enough to say that I have a continuous baseline going back 13 months in most instances.  I’m not all that worked up about lucky numbers or anything; it’s just nice to be able to say “that annual process we ran last week ran like this last year”.  If you’re superstitious or have OCD, 400 days makes me just as happy.

Traces – Snapshot
Wait Stats – Cumulative – Resets on restart
Blocking – Snapshot
Query Stats – Cumulative – Resets on recompile
Index Stats – Cumulative – Resets on restart
OS Perf Counters – Varies
Database Sizes – Snapshot
Table Sizes – Snapshot

Traces – What long running processes have completed on the server?

You have to be careful, this one can kill your server. I have, on more than one occasion (sorry, Brian), caused a large server to reboot in the middle of the day because I filled the drives writing 40 GB of traces in 5 minutes. That being said, there is a safer way to trace, just don’t do an unfiltered trace catching everything.

I personally feel that absolutely every server should have a trace capturing RPC:Completed and SQL:BatchCompleted events that ran over X seconds in duration.  What is X?  Well, that depends on the server.  I’ve seen the best number for X be as low as 100 ms and as high as 10 seconds.  Start high, and work your way down.  You can add or do a separate trace for the Statement Completed versions of each of these if you want more detail.

You can get all of this except for SQL:BatchCompleted with the text data from Extended Events, and that is a very good alternative.  I haven’t made that jump for several reasons, although I would recommend it.  First, this level of tracing hasn’t hurt me.  Second, I still support several important servers that are still on SQL 2005, and I want to be as consistent as possible to provide the most consistent support possible.
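
If you are on a newer version and want to go the Extended Events route, here’s a minimal sketch of what that session could look like; the session name, 5-second threshold, and file settings are placeholders to adjust for your environment.

-- Hypothetical session name and thresholds; the duration filter is in microseconds
-- on current versions, so 5000000 is roughly a 5 second cutoff.
CREATE EVENT SESSION [LongRunningCompleted] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_hostname)
    WHERE duration >= 5000000),
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_hostname)
    WHERE duration >= 5000000)
ADD TARGET package0.event_file (
    SET filename = N'LongRunningCompleted.xel',
        max_file_size = 100,        -- MB per file, mirroring the 100 MB trace files below
        max_rollover_files = 10)    -- keep several days of history
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS);
GO

ALTER EVENT SESSION [LongRunningCompleted] ON SERVER STATE = START;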

What you’re trying to do here is answer what ran long, get some hints as to why it ran long, and be able to say how often it has been running that way in recent history.  A long duration doing very little work was waiting on something else, and other monitoring will help solve that problem.  A lot of CPU, reads, or writes shows that you may need to look into tuning or statistics.  Just keep in mind that reads can be reads of work tables, and writes can be writes to tempdb.

If you do it right you should have 4 or more days’ worth of history, and by doing it right I mean having 5 to 10 rollover files at a reasonable size that you can send off if needed (100 MB at most; they zip well) and capturing over the best duration threshold for your environment.

Wait Stats – What’s slowing you down?

The traces tell you what work was done, and the wait statistics tell you what went on when the query couldn’t actively do its work.  This is cumulative over the server, so you can’t get details down to an individual process; however, you can see how much time is wasted and where it’s wasted.

There are almost 500 distinct wait types in 2008 R2, but you don’t need to worry about that.  The important thing is to know what your biggest waits are, and you can look them up if you don’t recognize them.  In the last slide I have links to a free eBook written by Jonathan Kehayias and Ted Krueger for the Accidental DBA that does an amazing job documenting what you can ignore and the meaning of the big ones you’re most likely to see.

If you’re looking for overall server health then you would want to look at waits over a long period of time, specifically the times of day you’d like to see better performance.  For incidents, you want to look at what your wait stats are now and compare them to your running baseline.  For me, the typical baseline I use is the same timeframe yesterday and 7 days ago.  These are actual relevant days that have everything but the changes you put in over the last couple of days, and it’s a true baseline for this specific server.  If this server never waits on PAGEIOLATCH, but it’s in your top 3 today, then you instantly have a direction to head in your troubleshooting.
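
Capturing that running baseline can be as simple as snapshotting sys.dm_os_wait_stats on a schedule and diffing consecutive rows later; a minimal sketch is below, with a made-up table name.

-- Hypothetical table name; schedule this every 15 to 60 minutes and diff consecutive
-- snapshots to see what the server actually waited on during each window.
IF OBJECT_ID('dbo.WaitStats') IS NULL
    CREATE TABLE dbo.WaitStats (
        DateAdded           smalldatetime NOT NULL,
        wait_type           nvarchar(60)  NOT NULL,
        waiting_tasks_count bigint        NOT NULL,
        wait_time_ms        bigint        NOT NULL,
        signal_wait_time_ms bigint        NOT NULL
    );

INSERT INTO dbo.WaitStats
SELECT GETDATE(), wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;  -- skip the wait types that have never fired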

Blocking – Who’s in your way?

Locking occurs when one query is using data, and blocking occurs when another query needs to do something that is incompatible with that lock. If you don’t keep an eye on it then you’re looking back at a trace and seeing that a query took 1 second of CPU, did 1,000 reads, and no writes, yet it figured out a way to take 5 minutes in duration. The users complain the app is slow or unusable, and you’re giving them the answer that you’ll try to figure it out next time it happens and THEN try to find out the root cause.

That’s a bad idea.  There are three types of users: those who don’t know you because things are running smooth, those who like you because you can say “I see exactly what happened and I can start working on avoiding reoccurrences”, and those users that you keep telling that you’re not sure what happened and you’ll try to get a better idea the next time it happens.  I wish I could tell you how to have more users not know your name, but watching blocking is a very easy way to get the users who like you to outnumber the users who hate you.

So, what can you do to tell them you know what just happened?  My answer is to capture everything that’s blocking on the server every minute.  Sure, there’s going to be a lot you capture that is actually only blocking for 10 ms, and there are going to be things that were blocking for 59 seconds that were timed perfectly so you never saw them happen.  The trick is to know that there is no perfect solution, and capture what you can.  The DMVs are there to provide all the information, and being able to capture it on the fly, along with having a proactive monitoring setup, will make you look amazing to the users.  If something blocked for 5 minutes, you have no excuse not to say “This query blocked this query, and this was the head blocker”.  Sure, you may have an excuse to say “It’s a vendor database and I passed it on to their support”, but at least you can tell the users something, which is always better than nothing.
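
Here’s a minimal sketch of that once-a-minute capture; the table name is made up, and a fuller version would also grab the blocking session’s text and the resource being waited on.

-- Hypothetical table name; schedule this every minute so you can answer
-- "who blocked whom" after the fact.
IF OBJECT_ID('dbo.BlockingSnapshot') IS NULL
    CREATE TABLE dbo.BlockingSnapshot (
        DateAdded           smalldatetime NOT NULL,
        session_id          int           NOT NULL,
        blocking_session_id int           NOT NULL,
        wait_type           nvarchar(60)  NULL,
        wait_time_ms        int           NULL,
        blocked_sql         nvarchar(max) NULL
    );

INSERT INTO dbo.BlockingSnapshot
SELECT GETDATE(),
       r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text
FROM sys.dm_exec_requests r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;  -- only the sessions currently being blocked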

Query Stats

The DMV sys.dm_exec_query_stats is used by many DBAs, even if all we realized was that we were running one of the “Top 10 Most Expensive Queries” scripts off the internet.  The problem is that this is the wrong way to use it.  This DMV holds everything from when the query first went into cache and loses that information when the query goes out of the cache, even just for being recompiled.  So the big process you ran overnight could still be in cache, leading you down a path that’s actually low priority for you.  However, that relatively large process that runs every 5 minutes and just recompiled a couple minutes ago due to stats auto-updating isn’t even in there.  This is too volatile to say that you’re getting good numbers out of it.

The answer isn’t to find somewhere else to get this information, it’s to capture this information more often and more intelligently.  Capture it once and you have a snapshot of what it looked like; capture it again an hour later and you know what work has been done in that hour.  Repeat the process over and over again, constantly adding another hour’s worth of data and saving off the latest snapshot in the process, and you have some real information to go off of.  This isn’t so much what happened to be in cache at the time; it’s a pretty good summary of what ran and when it ran.  Now you can query in detail, specifically saying “I want to know what queries are doing the most physical reads between 9:00 AM and 5:00 PM on weekdays so I can have the biggest impact when I tune this database.”
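
To make the snapshot-and-diff idea concrete, here’s a rough illustration using a temp table; a real collection job uses permanent tables like the ones in the Query Stats post further down, and also compares creation_time so a recompiled plan doesn’t show up as negative numbers.

-- A rough illustration only: snapshot now, wait an hour (or run the second half from a
-- scheduled job), then the difference is the work actually done during that hour.
SELECT sql_handle, plan_handle, statement_start_offset, statement_end_offset,
       creation_time, execution_count, total_worker_time, total_physical_reads
INTO #QueryStatsSnapshot
FROM sys.dm_exec_query_stats;

WAITFOR DELAY '01:00:00';

SELECT qs.sql_handle,
       executions     = qs.execution_count      - ISNULL(s.execution_count, 0),
       cpu_time       = qs.total_worker_time    - ISNULL(s.total_worker_time, 0),
       physical_reads = qs.total_physical_reads - ISNULL(s.total_physical_reads, 0)
FROM sys.dm_exec_query_stats qs
    LEFT JOIN #QueryStatsSnapshot s
        ON s.sql_handle = qs.sql_handle
        AND s.plan_handle = qs.plan_handle
        AND s.statement_start_offset = qs.statement_start_offset
        AND s.creation_time = qs.creation_time  -- a recompile resets the counters
ORDER BY physical_reads DESC;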

Index Stats

The DMV sys.dm_db_index_usage_stats isn’t too different from sys.dm_exec_query_stats except that it’s cumulative since the time SQL services were started.  Looking at a snapshot still leaves you half blind, unable to see what happened before the last time you restarted the server and unable to tell when that index was used.  Personally, seeing when it was used is less important to me than how much it was used long-term.  Tracking this can do amazing things for you if you’re really fine-tuning a database.

This DMV is really under-used in my experience.  As time goes on with a database the data changes and you add more indexes to make it run faster, but you don’t know if or when the index it used to use is still being used by anything else, so it stays there.  Over time that means you have more and more unused indexes, and there’s no solid proof that they aren’t being used without you doing the work to collect that proof.  This, in addition to duplicated indexes, adds to the workload for inserts and updates, adds to the workload of index maintenance, adds to the database size, and puts additional strain on your cache and thus the PLE.
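
Collecting that proof can start with something as small as the sketch below, run per database; the reads-versus-writes framing is my own shorthand, and remember the counters reset when SQL Server restarts, so check the uptime before trusting a low number.

-- Indexes that take writes but give back few reads are candidates to investigate.
SELECT DatabaseName = DB_NAME(),
       TableName    = OBJECT_NAME(i.object_id),
       IndexName    = i.name,
       Reads        = ISNULL(u.user_seeks, 0) + ISNULL(u.user_scans, 0) + ISNULL(u.user_lookups, 0),
       Writes       = ISNULL(u.user_updates, 0)
FROM sys.indexes i
    LEFT JOIN sys.dm_db_index_usage_stats u
        ON u.object_id = i.object_id
        AND u.index_id = i.index_id
        AND u.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND i.index_id > 0  -- skip heaps
ORDER BY Reads ASC, Writes DESC;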

OS Performance Counters

This is one of the most misunderstood DMVs while also being one of the most useful.  It gives you critical counters that show how SQL Server is interacting with the hardware, but it does so with several different types of counters that have to be measured differently.  There are a couple great posts out on the internet explaining how each type needs to be measured.  As for right now, we’re focusing on why you want to watch it and what it will do for you.  And wouldn’t you know that I saved the hardest one to describe for last, because each value is different and there will be controversy on which ones you should be watching and which ones are just a waste of your time to monitor.

PLE is one there’s no controversy about; you should watch it.  This is the average age of the pages in cache, and a good measurement of when you’re doing too many physical reads.  Peeking in on this from time to time is good; you can make sure it’s high enough.  However, watching this will let you know when it’s dropping and help you dive into why it’s dropping.  If you’re also running a trace you can see what did a lot of reads in that timeframe, and if you’re capturing Query Stats then you can find out which one of your large read queries is doing all the physical reads that destroy your PLE.

Server Target and Total memory are also great to watch even though they rarely change after the server is back up to speed following a restart of SQL services.  The target memory is how much SQL Server would like to have, and Total memory is how much it actually does have.  If target memory drops then there’s OS pressure you need to worry about.  Also, while Total memory is still increasing, it means the server is still filling the cache and PLE can pretty much be ignored.  Of course it’s low, everything in there is new.

Page Lookups, Page Reads, Page Writes and Lazy Writes give you a better idea of how SQL Server is interacting with your cache and disks. If you want a real picture of what SQL Server is doing behind the scenes, this is much more useful than glancing at cache hit ratio.

Deadlocks and Memory Grants Pending are two things you’d like to always see at zero.  That may be out of the question with deadlocks, but if you’re getting above zero on memory grants then you need to find out when that happened and everything that was running at that time.

SQL Compilations and Recompilations are hidden CPU hogs.  They’re never going to be at zero because very few servers have everything stay in cache forever, ad-hoc code runs, and several other reasons.  However, when one of these counters jumps up, you may have a hidden cause.  For instance, if a piece of code that runs every minute or more was written in a way that it can’t be stored in cache, these numbers will be noticeably higher.  When this happens, your CPU will be noticeably more stressed with no other indicators as to why, and this query won’t even show up in your Query Stats, since that relies on showing the stats of everything that’s currently in cache.
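
Here’s a sketch of pulling the counters discussed above out of sys.dm_os_performance_counters; the list of names is just the subset I’ve been talking about, and the cntr_type column is the reminder that some of these are point-in-time values while others are cumulative since restart and have to be diffed to mean anything.

SELECT RTRIM(object_name)  AS object_name,
       RTRIM(counter_name) AS counter_name,
       cntr_value,
       cntr_type  -- tells you whether the value is point-in-time or cumulative
FROM sys.dm_os_performance_counters
WHERE (object_name LIKE N'%Buffer Manager%'
        AND counter_name IN (N'Page life expectancy', N'Page lookups/sec',
                             N'Page reads/sec', N'Page writes/sec', N'Lazy writes/sec'))
    OR (object_name LIKE N'%Memory Manager%'
        AND (counter_name LIKE N'Target Server Memory%'
             OR counter_name LIKE N'Total Server Memory%'
             OR counter_name = N'Memory Grants Pending'))
    OR (object_name LIKE N'%SQL Statistics%'
        AND counter_name IN (N'SQL Compilations/sec', N'SQL Re-Compilations/sec'))
    OR (object_name LIKE N'%Locks%'
        AND counter_name = N'Number of Deadlocks/sec'
        AND instance_name = N'_Total');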

Database Sizes

This seems simple, and it is.  Most people know how to find the sizes of their databases, but do you track it?  If not, you should.  Some questions you should be able to answer readily are how fast your databases are growing, when you will run out of space, and whether the steady decline in PLE is justified by the growth of the database.  Getting into more detail, you can touch on when the database grows, such as an accounting app jumping in size every April, or whether the data is growing steadily or exponentially.  These are important things to know when it comes time to budget for new servers and disk space.

On the other side of things, you also have information to push back on application teams about how much space they’re using and whether that’s really necessary.  A lot of times you’ll find that they’re surprised by their growth, to the point that you have to show them the numbers for them to believe it.  It’s not uncommon for the response to be a cleanup project that helps keep the databases smaller and running faster, which is a big goal accomplished.

Table Sizes

This is just building off of watching your database sizes in more detail.  When a database starts filling up, and it’s filling up quick, it’s good to be able to tell an application team which table is doing it.  In my current environment it makes sense to watch every table that is over 0.5% of the database size AND over 100 MB.  This means that in some databases we don’t watch anything, and in other databases we watch roughly the 20 biggest tables.  The point is that we’re not trying to watch everything, just anything big enough to make a difference to us.

If a database is filling a lot faster than normal then a lot of the time there’s a process that isn’t running like it should. The biggest table in the database may be a rather static value, but the third biggest table in the database wasn’t even on your radar two weeks ago. The app teams love it when you can tell them that not only is the database growing out of control, but we also see the growth in table X which has been growing at 200 MB per day starting on the first Saturday of last month. By the way, wasn’t last Saturday the day you changed a couple procs around or implemented an upgrade?

Now you just graduated from “Hey, there’s a problem” to also include “and here’s a huge lead to finding the root cause.” It’s rare to use this information, but it’s lightweight to capture for something that gives you a heavyweight appearance when you reference this knowledge to other teams.

My Critiques

I’d love to hear what you think about this presentation.  Before we get to that, here are the complaints from the harshest person in the room when I was presenting…

I didn’t do enough to say what each thing was.  Although I marked the presentation as intermediate, you don’t know who’s going to be there.  In this case, I knew a couple people in the crowd, and they ranged from data analyst to database manager.  It should have been presented more as “If you’re not ready for intermediate then you still have everything you need, but it will take some effort to keep up”.

My slides were lacking.  I stick with the idea that there should be few words on the screen, but I took it too far.  I admitted above that the slides were there more to keep me on track than to help the audience, which I probably needed for my first public speaking venue.  However, next time I’ll be less nervous and more prepared to gear it toward the audience.  Instead of random pictures, I should have had more graphs and data that I could talk through to give the audience visuals.

Nothing was given to the audience, and there should have been something. To be fair, none of the presentations that I know of handed anything to the audience. However, I set my own bar, and I feel that people would like a sheet of paper with the outline on it with links to online sources. My blog, of course, because it shows how to grab everything. Also, an outside link for each item. This gives a physical reminder to turn this knowledge into action.

There was no demo, and that was the audience’s biggest complaint.  While I’m still not sure that I would do a live demo on my second go-around, I’m going to hold myself to the remark above about my slides having graphs and data.  That would also drag me away from the podium to talk through the slides, making it a more dynamic and interesting presentation; that would really help the audience take things in.

Your Turn

Now I need you to do two things.  Putting this information to work in your environment is what will help you the most.  What will help me the most is you commenting on this post both before and after you implement any of this, so you can help me become a better presenter.

Thank you for taking the time to read this!

Monitoring Database and Table Sizes


Trending database and table sizes helps give you an idea of what to expect and, sometimes, points out problems and their root causes.  I even go so far as to monitor the disk space and file growth rates, and I have a report going out telling me that “according to your current growth trends, you’ll run out of space on this day”.  That last part will have to be another post, since getting disk space can be kinda tricky depending on how your disks are set up; you need to go outside of SQL Server to grab the size of mountpoints if you use them.  Don’t worry about mountpoints right now; if you don’t know the term, that usually means you can swap out the word “drive” and have everything I say be accurate for your environment.

First, this is lightweight stuff, there’s really no reason to capture it more than once a day, and it will only add up to several MB for a year’s worth of data. You’re not going to kill yourself doing this, you’re not going to stare at the data every day taking up your time, but you’re going to look like a heavyweight when you put this into action and reference this data to others.

Second, I do have other similar posts doing snapshots of this same info for Table and File sizes. It’s basically the same thing just different in how you’re using it and if you’ll have a historical view of it.

Database sizes

Let’s start with the databases because, well, they’re bigger.  I capture it all: the database name, logical file name, file type (rows, logs, etc.), filegroup, allocated space, used space, max size, growth rate, whether it’s percent growth, and the drive/mountpoint it’s on.  Other than dividing the appropriate numbers by 1024 twice to store all my sizes in MB, everything gets stored in native format.  The reason for that is you have raw data in a format easy for you to verify and easy for any outsider (consultant, new employee, etc.) to understand.  I may make some of you cringe because I don’t normalize this; it’s small data and not worth the effort, especially with page compression (see Brad McGehee’s post on compression) turned on.

What I like doing is having automated reports with logic behind it saying “if drive size is X, file growth is Y, then you can grow to a potential size of Z”, follow that up with “the used space in this database has grown X in the last 90 days and Y in the last 30 days, so according to those trends I’ll hit the potential size in Z days”. One of my favorite reports takes that last Z as a parameter and tells me anything that will run out of space before a specific number of days. I get that report daily for anything that will run out of space in the next two weeks and a separate instance of that report monthly telling me what will run out of space in the next 100 days. For me to run out of space without having a couple automated emails sitting in my inbox yelling at me saying “See, I told you that would happen!!!”, a database has to do something crazy and out of character. The only part of that last sentence that doesn’t really happen is the automated email yelling, those voices don’t exist outside of my head (do they?).
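
Here’s a rough sketch of that projection logic against the Files and Drives tables created below; it assumes a daily collection where both tables share the same DateAdded stamp, and it ignores max file sizes, so treat it as a starting point rather than the finished report.

DECLARE @Today    smalldatetime,
        @Baseline smalldatetime;

SELECT @Today = MAX(DateAdded) FROM Files;
SELECT @Baseline = MIN(DateAdded) FROM Files WHERE DateAdded >= DATEADD(DAY, -30, @Today);

SELECT f.DatabaseName,
       f.FileName,
       f.DriveLetter,
       GrowthMBPerDay = (f.UsedMB - old.UsedMB) * 1.0 / NULLIF(DATEDIFF(DAY, @Baseline, @Today), 0),
       DaysUntilFull  = CASE
                            WHEN f.UsedMB <= old.UsedMB THEN NULL  -- flat or shrinking
                            ELSE ((f.SizeMB - f.UsedMB) + d.FreeSpaceMB)  -- room left in the file plus the drive
                                 / ((f.UsedMB - old.UsedMB) * 1.0 / NULLIF(DATEDIFF(DAY, @Baseline, @Today), 0))
                        END
FROM Files f
    INNER JOIN Files old
        ON old.DatabaseName = f.DatabaseName
        AND old.FileName = f.FileName
        AND old.DateAdded = @Baseline
    INNER JOIN Drives d
        ON d.DriveLetter = f.DriveLetter
        AND d.DateAdded = @Today
WHERE f.DateAdded = @Today
ORDER BY DaysUntilFull;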

Even without monitoring disk space, you can predict autogrowths.  This is useful for trying to prevent autogrowths altogether if you’re into that, but with instant file initialization this has never been a spot I’ve chosen to spend my time on.  However, I do care about fragmentation, and I have a report that will tell me that if we continue growing according to our 30 or 90 day trends then the data files will have to grow X times.  If X is higher than 4 and filegrowth is below 2 GB chunks, it shows up on my report and I change the filegrowth to make it more reasonable.  If you allow percent filegrowths in your environment (just say no), or even if one sneaks through, things get complex and you have to stroll through financial websites to find the calculations for compounding interest to figure out how many times the files will grow.

Cast(CEILING(
        LOG(
            (Cast((ProjectedGrowthInAMonthMB - FileGroupFreeMB) as Dec(20,4)) + Cast(FileSizeMB as Dec(20,4)))
            / (Cast(FileSizeMB as Dec(20,4)))
        )
        / LOG(1 + (Cast(Growth as Dec(20,4)) / 100))
    ) AS Int)

Something inside that database is growing, but what is it?

Now if a database is growing, the first thing someone will ask you is “why is it growing”. That’s an answer I can’t give you, but “where is it growing” I can do. This is why I’m also tracking the table sizes in a database. Be fair to yourself here and admit it up front that you don’t care if a table changes size radically if the largest size is still insignificant to that database. So make up rules on “I want to watch tables that are bigger than…” and only store the stats for those tables to keep your little collections from being the source of your size issues. For me, that means that a table has to be both over 100MB in total size including indexes and over 0.5% of the size of the database. I’ve used these sizes for a while now and have never had an issue on either side; no one needed info I didn’t have for space trending issues and the collection time and space I used never caused any complaints. This isn’t saying that someone won’t ask “did that table have 10 or 11 records this time last month”, but I’m not willing to collect enough to answer every little question.

If a database is growing at an alarming rate it’s easy to assume the largest table is causing it.  That’s not always the case, and it’s not too uncommon for a database change to either create a new table that intentionally grows quickly or accidentally cause an old table to stop cleaning itself up.  No matter what the reason is, if you want to see how a database is growing, look at your table stats documenting the growth.  Developers always listen better when you can tell them “this is what happened and here’s how I know,” while the words “I think” are viewed as instant discreditation.

How am I supposed to do this stuff?

You’ll need somewhere to store all the info.  I like to store two weeks of info on the local server and pull everything to a central server nightly, where data is held for a full 13 months.  However, to keep me under control, let’s just focus on getting everything to the local server in the following tables.

CREATE TABLE [Drives](
    [DateAdded] [smalldatetime] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
    [CapacityMB] [int] NULL,
    [FreeSpaceMB] [int] NULL,
 CONSTRAINT [DBInven_Drives_DriveLetter_MountPoint_DateAdded] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DriveLetter] ASC
    --, [MountPoint] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Files](
    [DateAdded] [smalldatetime] NOT NULL,
    [DatabaseName] [sysname] NOT NULL,
    [FileName] [nvarchar](128) NOT NULL,
    [TypeDesc] [nvarchar](60) NULL,
    [FileGroup] [sysname] NULL,
    [SizeMB] [int] NULL,
    [UsedMB] [int] NULL,
    [FreeMB] [int] NULL,
    [MaxSizeMB] [int] NULL,
    [Growth] [int] NULL,
    [IsPercentGrowth] [bit] NOT NULL,
    [DriveLetter] [char](1) NOT NULL,
    --[MountPoint] [smallint] NOT NULL,
 CONSTRAINT [DBInven_Files_DatabaseID_FileID_DateAdded_U_C_IX] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [FileName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [Tables](
    [DateAdded] [datetime] NOT NULL,
    [DatabaseName] [nvarchar](128) NOT NULL,
    [SchemaName] [nvarchar](128) NOT NULL,
    [TableName] [nvarchar](128) NOT NULL,
    [RowCounts] [bigint] NULL,
    [AllocatedMB] [int] NULL,
    [DataSizeMB] [int] NULL,
    [IndexSizeMB] [int] NULL,
    [PercentOfDB] [decimal](5, 2) NULL,
 CONSTRAINT [PK_Tables] PRIMARY KEY CLUSTERED 
(
    [DateAdded] ASC,
    [DatabaseName] ASC,
    [SchemaName] ASC,
    [TableName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If you use mountpoints, and if you call them “mountpoint” followed by an incrementing digit, and you never have more than can fit into a tinyint, uncommenting those pieces of the tables will do wonders for you. In case you didn’t guess, they’re commented out because the people who do use them probably don’t use them that way.

DECLARE @RunTime SmallDateTime
DECLARE @SQL NVarChar(max)

SET @RunTime = (SELECT Max(DateAdded) FROM Perf.DBInven.Drives)

SET @SQL = N'use [?]; 
    INSERT INTO Perf..Files 
    select ''' + cast(@RunTime as nvarchar(50)) + ''' 
        , DatabaseName = db_name()
        , FileName = f.Name
        , TypeDesc = f.Type_Desc
        , FileGroup = fg.Name
        , SizeMB = ((f.size*8/1024)) 
        , UsedMB = (FileProperty(f.name, ''SpaceUsed'')*8/1024) 
        , FreeMB = ((f.size-FileProperty(f.name, ''SpaceUsed''))*8/1024) 
        , MaxSizeMB = Case f.Max_Size
            WHEN -1 THEN -1
            ELSE f.Max_Size / 128
            END
        , f.Growth
        , f.Is_Percent_Growth
        , DriveLetter = left(f.physical_name,1) 
        /*, MountPoint = CASE substring(f.physical_name,4,10)
            WHEN ''MountPoint'' THEN 
                Case isnumeric(substring(f.physical_name,14,4)) --MountPoint >= 1000
                WHEN 1 THEN substring(f.physical_name,14,4)
                ELSE Case isnumeric(substring(f.physical_name,14,3)) --MountPoint >= 100
                    WHEN 1 THEN substring(f.physical_name,14,3)
                    ELSE Case isnumeric(substring(f.physical_name,14,2)) --MountPoint >= 10
                        WHEN 1 THEN substring(f.physical_name,14,2)
                        ELSE substring(f.physical_name,14,1) -- MountPoint single digit
                        END
                    END
                END
            ELSE Cast(0 as varchar(10)) --Root drive is MountPoint 0 
            END*/
    from sys.database_files f
        LEFT JOIN sys.filegroups fg ON f.data_space_id = fg.data_space_id
    '

exec sp_MSforeachdb @SQL

And now to grab the tables. Keep in mind that I specifically only want to grab stuff over 0.5% of the database size AND over 100 MB. You’ll see that towards the end of the code.

DECLARE @SQL NVarChar(max)
DECLARE @DateAdded SmallDateTime

SELECT @DateAdded = MAX(DateAdded) FROM Perf.DBInven.Drives 

SET @SQL = ' Use [?];
    DECLARE @DatabaseSize Float

    IF OBJECT_ID(''tempdb..#TableSizes'') IS NOT NULL BEGIN
          DROP TABLE #TableSizes
    END

    SELECT DatabaseName = DB_NAME()
    , TableName = a2.name
    , SchemaName = a3.name
    , RowCounts = a1.rows
    , TableSizeMB = (a1.reserved + ISNULL(a4.reserved,0)) / 128
    , DataSizeMB = a1.data / 128
    , IndexSizeMB = (CASE WHEN (a1.used + ISNULL(a4.used,0)) > a1.data 
                        THEN (a1.used + ISNULL(a4.used,0)) - a1.data 
                        ELSE 0 
                    END) /128
INTO #TableSizes
FROM (SELECT ps.object_id
            , [rows] = SUM(CASE
                                WHEN (ps.index_id < 2) THEN row_count
                                ELSE 0
                            END)
            , reserved = SUM(ps.reserved_page_count)
            , data = SUM(CASE
                            WHEN (ps.index_id < 2) 
                                THEN (ps.in_row_data_page_count + ps.lob_used_page_count + ps.row_overflow_used_page_count)
                            ELSE (ps.lob_used_page_count + ps.row_overflow_used_page_count)
                        END)
            , used = SUM (ps.used_page_count) 
        FROM sys.dm_db_partition_stats ps
        GROUP BY ps.object_id) AS a1
    INNER JOIN sys.all_objects a2  ON a1.object_id = a2.object_id
    INNER JOIN sys.schemas a3 ON a2.schema_id = a3.schema_id
    LEFT JOIN (SELECT it.parent_id
            , reserved = SUM(ps.reserved_page_count)
            , used = SUM(ps.used_page_count)
        FROM sys.dm_db_partition_stats ps
            INNER JOIN sys.internal_tables it ON it.object_id = ps.object_id
        WHERE it.internal_type IN (202,204)
        GROUP BY it.parent_id) AS a4 ON a4.parent_id = a1.object_id
WHERE a2.type <> ''S'' and a2.type <> ''IT''

    SELECT @DatabaseSize = Sum(TableSizeMB) FROM #TableSizes

    INSERT INTO Perf..[Tables]
    SELECT ''' + Cast(@DateAdded as VarChar(50)) + '''
        , DatabaseName
        , SchemaName
        , TableName
        , RowCounts
        , TableSizeMB
        , DataSizeMB
        , IndexSizeMB
        , (TableSizeMB/@DatabaseSize) * 100 
    FROM #TableSizes
    WHERE TableSizeMB > .005 * @DatabaseSize 
        AND TableSizeMB > 100

    DROP TABLE #TableSizes'

exec sp_MSforeachdb @SQL

So this leaves us with the size of the drives.  This isn’t an easy subject.  You can get everything you need through SQL Server if you don’t use mountpoints, but if you do use mountpoints you’re left with either PowerShell or VBScript querying WMI to capture it.  Another post, another day.  I’d actually prefer to rewrite my VBScript as PowerShell, to both learn PowerShell and put a better solution out here for you.
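
One middle ground worth mentioning: if you’re on SQL Server 2008 R2 SP1 or later, sys.dm_os_volume_stats can report capacity and free space without leaving T-SQL, although it only sees volumes that already hold database files.  Here’s a minimal sketch feeding the Drives table above, collapsing everything to a drive letter and ignoring the commented-out mountpoint column.

INSERT INTO Drives (DateAdded, DriveLetter, CapacityMB, FreeSpaceMB)
SELECT CAST(GETDATE() AS smalldatetime),
       LEFT(vs.volume_mount_point, 1),
       CAST(MIN(vs.total_bytes) / 1048576 AS int),
       CAST(MIN(vs.available_bytes) / 1048576 AS int)
FROM sys.master_files mf
    CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) vs
GROUP BY LEFT(vs.volume_mount_point, 1);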

This was my last post I needed to prep for my Monitoring and Baselines Presentation for SQL Saturday where I talk about what you want to watch and why. Since the presentation didn’t get into how to watch everything, I made sure that everything was on my blog ahead of time.

Why Blog?


The main reason I started blogging was a community thing.  I leeched from the community to get me to where I am in my career, and it was to the point that the community was owed something in return.  Everyone can give back in their own way: some answer forum questions, some blog, and some people find their own special way.  Trolling for forum questions that I could answer that weren’t already answered adequately wasn’t for me; nothing against those folks, it’s just my preference.  So I decided that I was going to put things in my own words.

My words were on the tip of my tongue, but there was a road block. Please, don’t laugh at me too much, but I didn’t know how to get my words out there. All these people had cool blogs where they programmed ways for people to leave comments, they programmed this odd little word cloud thing for links to keywords, WOW…they programmed ways to have keywords. I’m out of my league, I’m a SQL nerd, I don’t do this stuff, I can’t do this stuff. When I’m not too busy being bitter, it’s actually quite humorous how I took the long way around to discover WordPress. Those “geniuses” out there knew something I didn’t, and the thing they knew that I didn’t wasn’t how to be a master programmer of cool toys. Man, I feel like a fool!

I’m not selling WordPress here, it’s what I use.  Other platforms are comparable; I know that now.  The important thing to know is that if you can write an email, you can write a blog.  The only trick is copy/pasting code from your editor to WordPress, and I found a cool tool to help you do that.  There, my secrets are out; I just told you how to bypass all the big bumps in the road that looked like brick walls to me.

WAIT!!! It’s not that simple.  I started and it wasn’t what I expected; same as everything in life, I guess.  Blogging wasn’t a small little thing you do on the side and carry on with your life.  It becomes a part of your life.  10 people viewed this stuff, and this is stuff that would have revolutionized my career 10 years ago if I had read it back then.  How can I get more viewers?  I’m obsessed with view counts, I want this message to go EVERYWHERE!!!  How can I write it better?  I want the message to mean something and be something people want to read.  How can I get myself to stop writing in spurts with 4 posts this week, no posts that month?

I hunted down my favorite blogger and begged for help even though he didn’t even know my name.  He was quick to give advice, which was to search his site with the following Google searches: “site:brentozar.com blogging”, “site:brentozar.com writing”, and “site:brentozar.com presenting”.  AMAZING and SCARY all at the same time.  You’ll get information ranging from starting blogs to who your target audience really is (and shooting yourself) to the 51 steps to make an awesome presentation.  Some people show you the rabbit hole and let you go as deep as you want.  I felt like I hit the bottom of the bottomless rabbit hole with a solid thud.  Brent, I’m not sure if you’re reading this or not, but you’re the only person out there without a college degree who deserves a doctorate in communications on top of the Microsoft MVP and MCM titles you have earned.  This is a cross between a “Thank You” and a “Why Me???”.  More of a thank you, though.  You taught me a lot.

[Image: Twitter_BrentO_BlogAdvice, a tweet from Brent Ozar with blogging advice]

Don’t go thinking that you’re going to hear Brent’s point of view restated here though.  I enjoy what he has out there, but my point of view is my own.  For instance, I just received an automated email from his group saying “But if you want to be a professional DBA, you should leave the developing to the developers.” in reference to creating tools to monitor the servers, leaving me almost with the impression of not reinventing the wheel, while my blog is largely a blueprint for a wheel.  He continues with “I’m not saying you shouldn’t learn the dynamic management views (DMVs), how to dig through them to find the evidence you need, and how to turn that data into actions. Hopefully, I’m going to teach you how to do a lot of that over the course of the next six months.”, so maybe I’m mistaken.  You know what, I’m off subject now and you really just need to go to his free email-based Accidental DBA class and make up your own mind.

Back on topic now… The blog is now syndicated, so it’s showing up on SQLServerCentral.com and ToadWorld.com.  Moreover, about 1/4 of my posts are highlighted by SQLServerCentral.com’s emails that go out to thousands of people and make my views go through the roof.  At this point, on September 3rd, I have already had more views on my blog for September than I did for the entire month of July!  I’m also writing ahead of time, scheduling posts to go out at 4:00 AM my time (ET) on the second business day of each work week (Wednesday for this post), giving me time to review my work while also providing some consistency and a scheduled commitment to keep myself in line.  So many things have changed, all for the better, and I’ve only been at this for about 4 months by the time this is scheduled to go public.

It’s still not that simple, it’s not what I expected on so many other levels as well. I was just trying to give back to a community that gave to me. That’s part of it still, look at how that small part just blends into the full picture:

  • Community
    • I took a lot, giving a little is the least I can do
  • Expertise
    • I’m learning the little details to be able to explain it to others without misleading them.
    • If you can’t explain it simply, you don’t know it well enough. – Albert Einstein
      • This is my blog’s namesake, in case you were wondering.
  • Confidence
    • What I’m writing is being read by and used by the community. People look up to me, even if that’s only in my own mind.
    • Learning it well enough to teach makes you really learn it and be more confident.
  • Credibility
    • It is on my resume and employers LOVE it.
      • I’m excited about the technology I use at work enough to spend my free time on it.
      • There are examples of my work beyond the one-liner on my resume saying “I know how to watch for performance issues.”
      • You can feel confident giving me any employee and know that I can and will teach them.
      • Careful here, this means you have to write even more betterer to not make fool you out of. :-)
  • Warm Fuzzies
    • I know if I had this early in my career, I would have been down the rabbit hole of system tables and DMVs years ago. It would have changed things for me, and I can only imagine that I’ve had that effect on one or two people.
  • Vengeance
    • Once you know about the DMVs, there is no going back. You took the wrong pill and you’ll never wake up knowing that the world was what you once thought it was. Welcome to the Matrix!
  • Obsession
    • Ok, it couldn’t all be good. I want more read counts, I want the message out there more, I keep looking for read counts, I keep looking to better myself, I keep looking to better my posts, aaaahhhhhh, I’m trapped, help me!!!!

Also, without this, I wouldn’t have the confidence and background to say that I’m ready to be a presenter. However, here I am putting the last details into my part of SQL Saturday #250! See an outline of my presentation right here on my blog.

SQL Server Central highlighted my posts on weeks 20, 21, 25, 26, and 34. Steady growth otherwise.

So, here I am, 4 months and 10,000 reads (WordPress + SQLServerCentral Syndication) later, and I’ve learned a thing or two.  By no means am I perfect; this is a continual learning experience.  For all those who have something to say out there, stop saying you have something to say and start saying it.  Syndicate it if you want others to read it.  The readers will come with time, just keep writing and strive for something worth a syndicator highlighting in their newsletters.  Who knows, maybe I’ll even make it to the SQL Rockstar Rankings someday. Read about writing to improve yourself.  Then, finally, be ready for a better you.

If you do blog, what has it meant to you? If you don’t, why not? I’d love to hear more people’s take on this as everyone has a unique experience and thought process.

Query Stats


DBAs rarely use the full potential of sys.dm_exec_query_stats.  It’s common to see the queries for looking at the most expensive queries according to any of the stats contained within the current cache, which is great to see.  However, if you grab snapshots of this information the proper way then you can get more detail that will really make a difference in your job.  Capturing this every 4 hours is probably enough, and here’s what it’s going to do for you.

First, use it the way most DBAs do and look at the most expensive queries according to a specific metric.  It works the same for them all, so we’ll just say we’re interested in physical reads in this case because our PLE isn’t what we’re hoping it could be.  You’ve probably seen the queries that directly query sys.dm_exec_query_stats to give you this, similar to the one on MSDN for this DMV.  That will give you everything that’s in the cache right now, which could contain queries that recompile every hour in combination with other queries that have remained in the cache since the server was rebooted two weeks ago.  This means that when you’re querying to see what you can do to improve your PLE during business hours, you’re getting anything that happens to be there now, not what ran in the time window you’re looking to improve.

Now look at it from the point of view that you capture this information every 4 hours.  You can get the total number of physical reads grouped by sql_handle where the interval_end time is in the last month, on weekdays, between 9:00 AM and 5:00 PM.  This is both a more complete and more accurate picture of what you care about, and you’re more likely to be focusing on the queries you should be tuning from the start.
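
A sketch of that “9 to 5 on weekdays for the last month” report against the QueryStats table defined below could look like the following; the weekday math assumes the default DATEFIRST setting, and you’d join the handles back to the query text as a second step.

SELECT qs.sql_handle,
       executions     = SUM(qs.execution_count),
       physical_reads = SUM(qs.total_physical_reads)
FROM QueryStats qs
WHERE qs.interval_end >= DATEADD(MONTH, -1, GETDATE())
    AND DATEPART(WEEKDAY, qs.interval_end) BETWEEN 2 AND 6  -- Monday through Friday
    AND DATEPART(HOUR, qs.interval_end) BETWEEN 9 AND 17    -- roughly 9:00 AM to 5:00 PM
GROUP BY qs.sql_handle
ORDER BY SUM(qs.total_physical_reads) DESC;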

Once you do start tuning, you want to know what effect you really had, and not just rely on the couple sets of test parameters you used in your NonProd environments.  A couple days after your changes made it to production, it’s a good idea to make sure you had the desired effect.  Query the data you have saved off for the last three days and for the same days last week.

What I’ve gone over so far takes the traditional use of this DMV to a new level, but there’s more functionality available now that the data is being saved off.  This becomes extremely useful in incident response.  Before, you could see if anything was hogging resources as seen in my Recently Recompiled Resource Hogs post, but now you can take it a step further and verify whether what’s running long now has always run that way or if something is different.  Also, with the plan_handles saved off you can even see if it’s getting a different execution plan, which will help you determine if this was a change in data or if it’s a bad execution plan.
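
A minimal sketch of that plan-change check against the saved data could look like this; more than one plan_handle for the same statement over the window you’re looking at is a hint that something changed.

SELECT qs.sql_handle
    , qs.statement_start_offset
    , plan_count = COUNT(DISTINCT qs.plan_handle)
    , first_seen = MIN(qs.interval_start)
    , last_seen = MAX(qs.interval_end)
FROM QueryStats qs
WHERE qs.interval_end >= DATEADD(DAY, -7, GETDATE())
GROUP BY qs.sql_handle
    , qs.statement_start_offset
HAVING COUNT(DISTINCT qs.plan_handle) > 1
ORDER BY plan_count DESC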

Not to shoot myself in the foot, but this isn’t perfect.  If you capture this data every 4 hours starting at 12:00 and it recompiles at 3:59 then you’ll only capture one minute for that interval.  Also, if you’re dealing with a query that can’t be cached then it won’t be found here at all; this is why you’re also running traces to find the big queries and watching sys.dm_os_performance_counters to know when there are a lot of compiles and recompiles.
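
If you want a quick look at those compiles and recompiles, the counters are right there in the DMV; keep in mind these values are cumulative since the server started, so you need two samples to turn them into a rate.

SELECT object_name
    , counter_name
    , cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%SQL Statistics%'
    AND counter_name IN ('Batch Requests/sec', 'SQL Compilations/sec', 'SQL Re-Compilations/sec')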

To start capturing this, here are the tables you’ll need to capture everything.

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

/*
IF OBJECT_ID('QueryStats') IS NOT NULL BEGIN
    DROP TABLE QueryStats
END

IF OBJECT_ID('QueryStatsLast') IS NOT NULL BEGIN
    DROP TABLE QueryStatsLast
END

IF OBJECT_ID('QueryText') IS NOT NULL BEGIN
    DROP TABLE QueryText
END
--*/

CREATE TABLE [dbo].[QueryStats](
    [interval_start] [smalldatetime] NOT NULL,
    [interval_end] [smalldatetime] NOT NULL,
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [execution_count] [int] NOT NULL,
    [total_elapsed_time_ms] [int] NOT NULL,
    [min_elapsed_time_ms] [int] NOT NULL,
    [max_elapsed_time_ms] [int] NOT NULL,
    [total_worker_time_ms] [int] NOT NULL,
    [min_worker_time_ms] [int] NOT NULL,
    [max_worker_time_ms] [int] NOT NULL,
    [total_logical_reads] [int] NOT NULL,
    [min_logical_reads] [int] NOT NULL,
    [max_logical_reads] [int] NOT NULL,
    [total_physical_reads] [int] NOT NULL,
    [min_physical_reads] [int] NOT NULL,
    [max_physical_reads] [int] NOT NULL,
    [total_logical_writes] [int] NOT NULL,
    [min_logical_writes] [int] NOT NULL,
    [max_logical_writes] [int] NOT NULL,
    [creation_time] [smalldatetime] NOT NULL,
    [last_execution_time] [smalldatetime] NOT NULL
) ON [PRIMARY]
GO

--Wide tables are bad if you have a lot of nonclustered indexes.
--This is a high-write, low-read table, so a single clustered index is probably all I want.
--With interval_end being the first key field, rows are inserted in order, which avoids
--  mass fragmentation as well as bad write performance.
CREATE UNIQUE CLUSTERED INDEX IX_QueryStats_intervalend_sqlhandle_statementstartoffset_planhandle_U_C ON QueryStats
(
    interval_end
    , sql_handle
    , statement_start_offset
    , plan_handle
)

CREATE TABLE [dbo].[QueryStatsLast](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryStatsLast_sqlhandle_planhandle_statementstartoffset_U_C] ON [dbo].[QueryStatsLast] 
(
    [sql_handle] ASC,
    [plan_handle] ASC,
    [statement_start_offset] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

CREATE TABLE [dbo].[QueryText](
    [sql_handle] [varbinary](64) NOT NULL,
    [QueryText] [nvarchar](max) NOT NULL,
    [DatabaseName] [nvarchar](128) NULL,
    [objtype] [nvarchar](20) NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [IX_QueryText_sqlhandle_U_C] ON [dbo].[QueryText] 
(
    [sql_handle] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

Now that the tables are created, here’s how I populate them.  Set this up to run once every 4 hours to keep the data small during typical running, and kick it up to once an hour or more when you need to pinpoint an issue.

DECLARE @interval_start smalldatetime
    , @interval_end smalldatetime

IF OBJECT_ID('tempdb..#QS') IS NOT NULL BEGIN
    DROP TABLE #QS 
END 

CREATE TABLE [dbo].[#QS](
    [sql_handle] [varbinary](64) NOT NULL,
    [plan_handle] [varbinary](64) NOT NULL,
    [statement_start_offset] [int] NOT NULL,
    [statement_end_offset] [int] NOT NULL,
    [objtype] [nvarchar](20) NOT NULL,
    [execution_count] [bigint] NOT NULL,
    [total_elapsed_time_ms] [bigint] NOT NULL,
    [min_elapsed_time_ms] [bigint] NOT NULL,
    [max_elapsed_time_ms] [bigint] NOT NULL,
    [total_worker_time_ms] [bigint] NOT NULL,
    [min_worker_time_ms] [bigint] NOT NULL,
    [max_worker_time_ms] [bigint] NOT NULL,
    [total_logical_reads] [bigint] NOT NULL,
    [min_logical_reads] [bigint] NOT NULL,
    [max_logical_reads] [bigint] NOT NULL,
    [total_physical_reads] [bigint] NOT NULL,
    [min_physical_reads] [bigint] NOT NULL,
    [max_physical_reads] [bigint] NOT NULL,
    [total_logical_writes] [bigint] NOT NULL,
    [min_logical_writes] [bigint] NOT NULL,
    [max_logical_writes] [bigint] NOT NULL,
    [creation_time] [datetime] NOT NULL,
    [last_execution_time] [datetime] NOT NULL,
    [DateAdded] [datetime] NOT NULL
) 
/*
--This sounded like a great idea, but it just slowed it down several seconds.
CREATE UNIQUE CLUSTERED INDEX TempQS_Cluster ON #QS  
(
    sql_handle
    , plan_handle
    , statement_start_offset
)
*/

--The Cached Plans Object Type is in here in case you want to treat ad-hoc or prepared statements differently
INSERT INTO #QS 
SELECT qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , cp.objtype
    , qs.execution_count 
    , total_elapsed_time_ms = qs.total_elapsed_time/1000
    , min_elapsed_time_ms = qs.min_elapsed_time/1000
    , max_elapsed_time_ms = qs.max_elapsed_time/1000 
    , total_worker_time_ms = qs.total_worker_time/1000 
    , min_worker_time_ms = qs.min_worker_time/1000
    , max_worker_time_ms = qs.max_worker_time/1000
    , qs.total_logical_reads
    , qs.min_logical_reads
    , qs.max_logical_reads
    , qs.total_physical_reads
    , qs.min_physical_reads 
    , qs.max_physical_reads
    , qs.total_logical_writes 
    , qs.min_logical_writes 
    , qs.max_logical_writes
    , qs.creation_time
    , qs.last_execution_time
    , DateAdded = getDate()
FROM sys.dm_exec_query_stats AS qs
    INNER JOIN sys.dm_exec_cached_plans cp on qs.plan_handle = cp.plan_handle
--WHERE cp.objtype NOT IN ('Adhoc')

INSERT INTO QueryText (sql_handle, QueryText, DatabaseName, objtype)
SELECT QS.sql_handle
    , QueryText = qt.text
    , DatabaseName = DB_NAME(max(qt.dbid))
    , max(QS.objtype)
FROM (SELECT #QS.sql_handle
        , #QS.objtype  
        FROM #QS
            LEFT JOIN QueryText QST ON #QS.sql_handle = QST.sql_handle
        WHERE QST.sql_handle IS NULL) QS
    CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) qt
GROUP BY QS.sql_handle
    , qt.text

--All the values are the same in each table
SELECT TOP 1 @interval_start = dateadded FROM QueryStatsLast
SELECT TOP 1 @interval_end = dateadded FROM #QS

IF @interval_start IS NULL BEGIN
    SELECT @interval_start = create_date
    FROM sys.databases
    WHERE name = 'tempdb'
END

INSERT INTO QueryStats (interval_start
    , interval_end
    , sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time)
SELECT @interval_start
    , @interval_end
    , qs.sql_handle
    , qs.plan_handle
    , qs.statement_start_offset
    , qs.statement_end_offset
    , qs.execution_count - ISNULL(qsl.execution_count, 0)
    , qs.total_elapsed_time_ms - ISNULL(qsl.total_elapsed_time_ms, 0)
    , qs.min_elapsed_time_ms 
    , qs.max_elapsed_time_ms 
    , qs.total_worker_time_ms - ISNULL(qsl.total_worker_time_ms, 0)
    , qs.min_worker_time_ms
    , qs.max_worker_time_ms 
    , qs.total_logical_reads - ISNULL(qsl.total_logical_reads, 0)
    , qs.min_logical_reads 
    , qs.max_logical_reads 
    , qs.total_physical_reads - ISNULL(qsl.total_physical_reads, 0)
    , qs.min_physical_reads 
    , qs.max_physical_reads 
    , qs.total_logical_writes - ISNULL(qsl.total_logical_writes, 0)
    , qs.min_logical_writes 
    , qs.max_logical_writes 
    , qs.creation_time
    , qs.last_execution_time
FROM #QS qs
    LEFT OUTER JOIN QueryStatsLast qsl ON qs.sql_handle = qsl.sql_handle
        AND qs.plan_handle = qsl.plan_handle
        AND qs.statement_start_offset = qsl.statement_start_offset
        AND qs.creation_time = qsl.creation_time 
WHERE qs.execution_count - ISNULL(qsl.execution_count, 0) > 0 --Only keep records for when it was executed in that interval

TRUNCATE TABLE QueryStatsLast

INSERT INTO QueryStatsLast
SELECT sql_handle
    , plan_handle
    , statement_start_offset
    , statement_end_offset
    , objtype
    , execution_count 
    , total_elapsed_time_ms
    , min_elapsed_time_ms 
    , max_elapsed_time_ms
    , total_worker_time_ms  
    , min_worker_time_ms 
    , max_worker_time_ms 
    , total_logical_reads
    , min_logical_reads
    , max_logical_reads
    , total_physical_reads
    , min_physical_reads 
    , max_physical_reads
    , total_logical_writes 
    , min_logical_writes 
    , max_logical_writes
    , creation_time
    , last_execution_time
    , DateAdded
FROM #QS

SQL Saturday 250 – Pittsburgh – Advice for My Presentation on Monitoring and Baselines


I’ll be presenting at SQL Saturday 250 in Pittsburgh on Baselining and Monitoring. This is my first public presentation, and I could use some advice to make sure my presentation is the best it can possibly be for both myself and the audience.

First, my presentation. Anyone who’s been following my blog won’t be surprised by anything they see. I’m going to hit all of the statistics I watch and why I watch them. The whole thing is a pep rally for what, why, and how it will change your job. How to do it will be completely skipped for the reasons below.

  • Brent Ozar told me to in point 13.
  • Trying to take notes on the how would be too intense for any audience member.
  • What and why are more exciting, life’s too short for boring!!!
  • I already did the how, it’s on my blog, and I’ll tell the audience all of the hows are already listed here in both my opening and closing statements.

The points I’m hitting are:

  • Traces (with mention of Extended Events)
  • Wait Stats (sys.dm_os_wait_stats)
  • Blocking (sys.dm_os_waiting_tasks with LCK% type)
  • Query Stats (sys.dm_exec_query_stats)
  • Index Stats (sys.dm_db_index_usage_stats)
  • OS Perf Counters (sys.dm_os_performance_counters)

In all of these areas I’ll go through why I watch these and how monitoring them over time, instead of looking at the values you can get at any single point in time, has changed my life as a DBA. After going through each section I would like to ask the audience for questions in the middle of the presentation so the topic of the questions is fresh in everyone’s minds and more likely to be taken to heart. How you capture these will be completely documented on my blog to the point that you can practically copy/paste the code to develop a customized monitoring setup on your servers from one source.

Baselining is built directly into my monitoring and how I use my results. Instead of having a static baseline that I took on a certain day, I’m a huge fan of comparing today’s incidents to yesterday and last week with a rolling baseline. I find it to be an impressive way to handle things.

The slide deck still needs a bit more work, but it’s literally just going to be a couple words (Steve Jobs style!!!) with the information coming from what I say instead of the audience’s attention being divided between me and the screen behind me. They’ll see the topic on the screen with a picture of what it means to me while hearing my views.

This presentation is almost complete on my end, but I’m flexible enough to say I’m wrong on anything and make changes. Some questions I have for you are:

  • Is this a presentation you would enjoy seeing? Why or why not?
  • Do you feel this is a complete list of items to monitor? What would you change?
  • Are questions in the middle of a presentation a good idea?
  • How do I make this public, both assisting the audience as much as possible while keeping in mind my rights and possible future use of this presentation? I would consider posting the full thing on the internet, even a recording of me doing it, but I don’t want to think about it a year from now and think that I just shot myself in the foot doing it. I don’t know what to expect, thus the call for help.
  • Anything else you have to add will be greatly appreciated. Adding that information before September 14th would raise the level of my appreciation! :-)

I’m extremely motivated to do this, and I want to absolutely kill this presentation! The only way to hit my expectations is to prep for it, and the best way to prep for it is to take in as much positive and negative criticism before it goes live as I possibly can.

Thank you for taking the time to read this!
Steve

Fixing Page Life Expectancy (PLE)


There are several reasons Page Life Expectancy (PLE) can drop, and understanding those root causes will help you improve SQL Server performance. This is a very long post to go with a very important counter showing how to make a real impact on your server’s performance. If you stick with me through this, you’ll be looking at your servers from new angles and that effort will be noticed by the Sys Admins, SAN Admin, the users, and your boss.

Before we begin, there are some ground rules we need to get out of the way: defining PLE and understanding that there are external memory factors. There’s no shame in skipping that and going straight to the focus of this post. Personally, I find the next two sections a little dry, but I’m also my harshest critic!

Define PLE

Before we get too deep into it, lets make sure we’re on the same page on a couple things.

Page Life Expectancy is the number of seconds the average page of data has been in the data cache (aka buffer cache). The data cache is part of the buffer pool (the similar name is why I use data cache instead of buffer cache, although they’re the same thing), which is the memory you allocated to SQL Server. SQL Server uses this area to give quicker access to data instead of taking the long, slow trip to persisted storage (disk).

Keep in mind that SQL reads the data pages from the cache, always from cache. If the data you need isn’t there then SQL Server does a physical read to put it there. After that’s done it will do a logical read to use the page that’s now in memory. If you want to dive into the detail you can do it here, specifically with the reading pages and writing pages links on that page.

That physical read is going to the slowest part of your server (even expensive flash is the slowest part, it’s just not as slow) to read the page(s) from persisted media, and it also uses the CPU in the process. This makes PLE critical because, even if you ignore the extra load you just placed on the I/O system and CPU, you’re waiting for an additional, slower action to take place.

You can find your page life expectancy in sys.dm_os_performance_counters using my post on OS Perf Counters. That post will also help you realize how much load you take off of your disks by raising PLE, and then you’ll really start to understand the burning drive picture you find there.
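
If you just want the current value without going through that post, it’s a one-liner; note that on a NUMA server you’ll also see per-node values under the Buffer Node object.

SELECT object_name
    , counter_name
    , PLE_seconds = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
    AND object_name LIKE '%Buffer Manager%'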

The problem is that the 300 value for Page Life Expectancy you can still find documented is wrong, very wrong. If this is simplified to be how long the average data will stay in memory and you have a server with 64 GB of memory with 56 GB allocated to SQL Server, that means you’re reading about 56 GB of data from disk every 300 seconds. If you look at Page 36 of Troubleshooting SQL Server – A Guide for the Accidental DBA by Jonathan Kehayias and Ted Krueger you’ll see an actual scalable version of this; PLE should be 300 for every 4 GB of RAM on your server. That means for 64 GB of memory you should be looking at closer to 4,800 as your critical point.
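
As a rough sketch of that formula, you can let the server do the math for you. This assumes SQL Server 2012 or higher, where sys.dm_os_sys_info exposes the memory manager’s target as committed_target_kb, which is close enough to the buffer pool size for a rule of thumb like this.

SELECT target_memory_gb = committed_target_kb / 1024.0 / 1024.0
    , suggested_minimum_ple = (committed_target_kb / 1024.0 / 1024.0) / 4.0 * 300
FROM sys.dm_os_sys_info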

I’ll go so far as to say that you should be cautious of anything that doesn’t have a formula to it because our servers keep getting faster, our databases keep getting bigger, and static values…don’t. Jonathan Kehayias and Ted Krueger giving a formula is great, because the old value became outdated the day SQL Server went 64-bit, or even before that with the AWE setting.

Jonathan Kehayias also put this in his blog post Finding What Queries in the Plan Cache Use a Specific Index, and that had a great series of comments going back and forth between him and Brent Ozar. Brent is discussing not putting so much faith in PLE and watching wait stats instead as different speed I/O systems can greatly sway the impact of low PLE or even make it so a high PLE just isn’t high enough. They’re both right, and if you can understand the intent of each one then you’ll be in a much more comfortable place when tuning your servers.

If you force me to take a side on that post, Jonathan wins out over Brent here. This is because Jonathan is talking about having a higher PLE, which in turn reduces load on the disks and typically reduces the amount of work that needs to be done to execute a query. Brent is basically implying that money may have already been thrown at the issue, which masks the current symptoms of what could be a less scalable solution. However, you can’t discount Brent’s logic here, as the things that are limiting your server’s performance at this time lie in the Wait Stats, not this counter.

Quick Rant: We’re Not Alone

So, you want higher PLE to show that you’re using your resources better, and the only way PLE goes up is by leaving data in the data cache once it’s there. The problem is, there’s a lot going on that wants to work the other way. Even if you have 64 GB of memory dedicated to SQL Server to host a single 64 GB database, it won’t all fit in memory because other things want to play as well.

First, your memory isn’t completely dedicated to the buffer pool. Memory pressure from the OS or VMs changing size can drop the size of the buffer pool unless you’re using “lock pages in memory”, which is not recommended in most situations, especially on VMs. If you’re not on a completely dedicated box, which means a physical box with no apps, no GUI, no RDP sessions, no SQL Server, no network card driver, etc., then this can happen. Wait a minute…you can’t uninstall SQL Server to make sure SQL Server’s buffer pool remains untouched, yet SQL Server uses memory outside of this area for any purpose that isn’t specifically assigned to the buffer pool, and that memory usage can get out of control. If there’s memory pressure then some pages are forced out of data cache to drop the size of the buffer pool, leaving you open to more contention and a lower PLE.

Second, your buffer pool isn’t dedicated to your data cache, as it also houses your proc cache: execution plans stored so they can be run again. These plans take a lot of CPU to create and reusing them saves you a lot of resources, so you can’t complain that you’re sharing space here. SQL Server determines the amount of the buffer pool to assign to each, and it’s possible for either side to have unnecessary data in it. I’m going to focus on the data cache here, so I’ll defer the abuse of proc cache conversation to Kimberly Tripp. There’s more to it than just her post, but this is a great place to start understanding what’s there and not in use.
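
If you’re curious how that split looks on your server right now, here’s a quick sketch comparing the size of the data cache to the size of the plan cache; counting buffer descriptors can take a moment on servers with a lot of memory.

SELECT DataCache_MB = (SELECT COUNT(*) / 128 FROM sys.dm_os_buffer_descriptors) --8 KB pages, so pages / 128 = MB
    , PlanCache_MB = (SELECT SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 FROM sys.dm_exec_cached_plans)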

Focus, Focus, Focus

Ok, so we know what PLE is and that external factors can affect it, but that’s not really why you’re here, is it? Are you here because you have your data cache and you want to manage it to keep your PLE going up? Maybe a little. My guess is that you’re here because you heard that PLE is critical to your server’s performance and you want to increase performance by raising this counter the right way. You do that in two ways.

First, make sure you have as much room as possible. If your server can handle more memory it’s often cheaper to buy that much memory than it is to intervene on the database side. Tuning and purging scales well, will help a lot more than just memory management, and is the best thing you can do given the time budgeted to it, but it just isn’t as quick and easy as a memory upgrade in many cases. Even if you buy memory now and tune later, it will continue to help. 64 GB of memory, as Brent Ozar pointed out here, is only a couple hundred dollars. FYI, I absolutely hate throwing money at issues, and I still listed this first.

Second, keep as little in memory as possible to make efficient use of the space you have. Yeah, it’s that easy. Well, at least it’s that easy to say, but doing it takes a little more work. Here’s a list of ways you can trim down on the space you need.

  • Drop Unused Indexes
  • Merge Duplicate Indexes
  • Use Your Indexes – SARGability
  • Watch for Big Queries
  • Look in Your Proc Cache for Opportunities
  • Know What’s in Your Buffer Cache
  • Index Maintenance – Defrag
  • Index Maintenance – Statistics
  • Purge Your Data
  • Other

Through all of this, please remember that your initial thought may be to raise a counter, but is that your real goal? I’m betting you want your server to run faster, and PLE is merely one counter that helps put a value on performance. Servers with high PLE can run horribly, they just don’t need to keep running back to disk. Servers with low PLE can run great, they just need to go back to disk to get what they need.

A little bit ago I mentioned buying more memory; this will raise your PLE without making any queries really run more efficiently. Sure, they’re more likely to have their data in cache, which is great, but they’re doing just as many logical reads and using just as much CPU to do their joins. The rest of this post isn’t like that, the rest is making queries more efficient in a way that just happens to make your PLE go up.

Drop Unused Indexes

This part’s tricky. SQL 2005-2008 R2 told you how much your indexes were used since the last time SQL Server was started in the DMV dm_db_index_usage_stats. It’s still there in SQL 2012 and beyond, but now these statistics get reset when you rebuild an index. That means that, best case scenario on a server that’s patched monthly, you have a month’s worth of data to go off of, and you can’t even rely on that being the whole picture anymore. I get into more about how to keep track of this over time in my post Indexes – Unused and Duplicates.

This has a minor implication on PLE because as the pages of your unused indexes are updated SQL Server has to read part of the B+tree into memory to find the page, then it has to read the page to be updated into memory, it updates it in memory and marks it as a dirty page, then eventually writes it back to disk. The key here being that pieces are read into memory, AKA the buffer pool, the part of SQL Server you’re trying to clean up.
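
That post has the full process, but as a hedged sketch, here’s the kind of query it’s built around: nonclustered indexes in the current database that have been written to but not read since the usage stats were last cleared.

SELECT TableName = OBJECT_NAME(i.object_id)
    , IndexName = i.name
    , user_updates = ISNULL(ius.user_updates, 0)
    , user_reads = ISNULL(ius.user_seeks, 0) + ISNULL(ius.user_scans, 0) + ISNULL(ius.user_lookups, 0)
FROM sys.indexes i
    LEFT JOIN sys.dm_db_index_usage_stats ius ON i.object_id = ius.object_id
        AND i.index_id = ius.index_id
        AND ius.database_id = DB_ID()
WHERE i.type_desc = 'NONCLUSTERED'
    AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND ISNULL(ius.user_seeks, 0) + ISNULL(ius.user_scans, 0) + ISNULL(ius.user_lookups, 0) = 0
ORDER BY ISNULL(ius.user_updates, 0) DESC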

Merge Duplicate Indexes

This one’s not as tricky, but the performance issues are much worse. The problem is that you have two indexes, we’ll call them ix_A and ix_B, that are very similar. They have the same first two key fields, and stray a little after that. The differences mean that ix_A will be better for one query and ix_B will be better for another, and that means SQL Server will be dragging them both into cache. Sure, if you take the columns that are in ix_B but not in ix_A, include them in ix_A, and then drop ix_B, the queries that used to use ix_B will probably need to do a little more work, but you’ll end up with a more efficient use of the cache. Your typical testing will show you the worst case scenario here, because you’ll run it each way multiple times and see what it’s like when the indexes are completely in cache. However, the real world is more friendly than that (that’s a first), and the query that used to use ix_B may actually run faster because now it’s using an index that’s more likely to be in cache, cutting down on physical reads. Going to the same blog post, Indexes – Unused and Duplicates, you can see details on how to help relieve this problem.

There’s a bit of an art to this at first, but you’ll get it down to a science with practice.
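
To get a feel for where the candidates are, here’s a simplified sketch that lists pairs of indexes on the same table sharing the same first two key columns; the scripts in that post go further by handling more key fields and included columns.

WITH IndexKeys AS (
    SELECT i.object_id
        , i.index_id
        , IndexName = i.name
        , Key1 = MAX(CASE WHEN ic.key_ordinal = 1 THEN c.name END)
        , Key2 = MAX(CASE WHEN ic.key_ordinal = 2 THEN c.name END)
    FROM sys.indexes i
        INNER JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_id
        INNER JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
    WHERE ic.is_included_column = 0
        AND i.index_id > 0
        AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    GROUP BY i.object_id, i.index_id, i.name
)
SELECT TableName = OBJECT_NAME(a.object_id)
    , IndexA = a.IndexName
    , IndexB = b.IndexName
    , a.Key1
    , a.Key2
FROM IndexKeys a
    INNER JOIN IndexKeys b ON a.object_id = b.object_id
        AND a.index_id < b.index_id
        AND a.Key1 = b.Key1
        AND ISNULL(a.Key2, '') = ISNULL(b.Key2, '')
ORDER BY TableName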

Use Your Indexes – SARGability

SARGability (Search ARGument capable) is its own topic, and it deserves at least one post of its own. You need to understand how indexes work, as stated in my post Indexing Fundamentals, and make sure that your queries can take advantage of indexes. If you look in a phone book for people with a last name like ‘%ood’ it’s no good because you need the first letter of the last name to take advantage of that index. If you look up people with the first name ‘Steve’ then that’s useless because the first key field was last name. Functions, implicit conversions, and many other techniques can cause you to need to scan an entire index.

In terms of PLE, that means you read the entire index into memory instead of just the pieces you need. That can add up to a lot of data being kicked out of cache to make room for a lot of physical reads from disk.

This isn’t an easy topic, and I’m not sure of any resources that say everything that needs to be said on the topic. While I have plans to write a post on it and will update this one when I do, it’s still an unwritten post. Until that time comes, Google “SQL Server SARGable” and see what pops up.

If you know how to read execution plans, there are two sections on a seek, scan, or lookup that you need to know here. Predicate and Seek Predicate. Predicate is what it had to scan for, while a seek predicate is what it could find efficiently using the fact that the key fields are sorted.
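
Here’s a tiny, hypothetical example of the difference; dbo.Person and its index leading on LastName are made up for illustration. The first predicate wraps the column in a function and forces a scan, while the second can seek.

--Non-SARGable: the function on the column hides it from the index, so the whole index gets scanned
SELECT PersonID, LastName, FirstName
FROM dbo.Person
WHERE LEFT(LastName, 4) = 'Wood'

--SARGable: a LIKE without a leading wildcard can seek on an index that leads with LastName
SELECT PersonID, LastName, FirstName
FROM dbo.Person
WHERE LastName LIKE 'Wood%'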

Watch for Big Queries

It doesn’t matter if you use Extended Events or Profiler Traces, you need to know when large queries are being run on your servers. If you want to watch this at a statement level then you’ll want to make the jump to Extended Events, but I’ve never seen it hurt to watch SQL Batch Completed and RPC Completed filtered only by duration over 10 seconds.
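
If you go the Extended Events route, a minimal sketch of that session could look like the following; the session name and file name are just placeholders, and duration is in microseconds for these events.

CREATE EVENT SESSION [BigQueries] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE duration > 10000000), --10 seconds, in microseconds
ADD EVENT sqlserver.sql_batch_completed (
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    WHERE duration > 10000000)
ADD TARGET package0.event_file (SET filename = N'BigQueries') --event_file is the SQL 2012+ name; 2008 used asynchronous_file_target
GO

ALTER EVENT SESSION [BigQueries] ON SERVER STATE = START
GO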

The point of this when it comes to PLE is that the queries that aren’t SARGable or don’t have an index to take advantage of will be doing a lot of scans. The bigger the scan the longer the query will take to run, and if you care about PLE then you’ll know what queries those are. Sure, some are legitimately doing enough work where they’ll always take that long to run, but most of the queries that take a while can be tuned. Be it changes to the query itself (should be your first instinct) or indexing changes, there’s often a way to have a query require less data in cache.

This is an extremely useful technique to use in conjunction with watching for drops in PLE. If PLE drops, check to see what was running at the time. This shows you what queries were running, which would account for most of your issues, but the cause could also be outside of the query itself and actually be something like a large auto update statistics task being kicked off that you won’t see here.

Know What’s in Your Buffer Cache

Very few people do this, yet it’s so obvious once you start. You want to raise PLE, right? PLE is a measurement of how long stuff stays in your buffer cache, and drops because something else needed to be put in there. So, what’s in there that’s taking up all your space and how’d it get there? It took me years to come up with that question, and I feel like a bit of a fool for that.

It was actually one of my biggest tuning revelations since I read Grant Fritchey’s Execution Plans book. And you always know it’s a great revelation when you find yourself yelling at yourself for not realizing it earlier…years earlier.

So, how do you do this? Query your cache to find the indexes that take up the most space there, pick one that takes an excessive amount of space or doesn’t look like it belongs there, query the proc cache to find out where that index is used, and tune that query either through code changes or indexing changes. The scripts to do this and a more detailed description of the process can be found in my post Cleaning up the Buffer Pool to Increase PLE.
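
The scripts in that post go further, but a simplified sketch of the core idea looks like this; the join through hobt_id is a shortcut that covers in-row and row-overflow data, and you run it in the database you care about.

SELECT TableName = OBJECT_NAME(p.object_id)
    , IndexName = i.name
    , Cached_MB = COUNT(*) / 128 --8 KB pages, so pages / 128 = MB
FROM sys.dm_os_buffer_descriptors bd
    INNER JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
    INNER JOIN sys.partitions p ON au.container_id = p.hobt_id
    INNER JOIN sys.indexes i ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id, i.name
ORDER BY COUNT(*) DESC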

Look In Your Proc Cache for Opportunities

Let’s start by saying this isn’t perfect. These numbers get reset throughout the day and some queries never make it in here at all. That’s saying this isn’t an all-inclusive, one-stop shop. It is in no way saying that you can’t make amazing improvements on the queries you find here.

There are two ways you can use the cache. Most people know that you can get your most expensive queries, such as on the MSDN sys.dm_exec_query_stats page, and you can use your imagination to sort this by any of the counters available here. That will give you an idea of what needs tuning, if you needed somewhere to start. I love taking this a step further and monitoring that exact information so I have a historical view instead of just what happens to be in cache now. You can read all about that in my post Query Stats.

The other way is parsing through the XML in the plans to find specific items. I heard that grunt when I mentioned XML, and you’re not alone. Not many of us are good at XML; Jonathan Kehayias is an extreme exception here. He gives you his queries so you can run with it, do what he does, and alter it to go even further. This is exactly what I did in my post Cleaning up the Buffer Pool to Increase PLE that I just mentioned in the last section.
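
As a small, hedged example in that spirit, here’s one way to search the plan cache for plans that reference a specific index; the index name is a placeholder, and note that showplan XML stores object names wrapped in brackets.

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT cp.usecounts
    , cp.objtype
    , qp.query_plan
FROM sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE qp.query_plan.exist('//Object[@Index="[IX_YourIndex]"]') = 1 --placeholder index name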

Index Maintenance – Defrag

Many people only think about defragmenting their indexes to help get contiguous reads on their disk, but that’s only half the story. In fact, the better you do everything in this post the less you’ll have to worry about your disks.

However, you also have to keep in mind how full each page of your indexes is. If you have an index page that’s 100% full then you have a full 8kb of data there. Add another row and you now have 8.1kb of data that’s split between two 8kb pages, so you’re wasting almost 50%. 50% isn’t even your worst case scenario because pages are never automatically removed or merged if records are deleted unless you delete every row stored in that page. Paul Randal’s post Performance Issues From Wasted Buffer Pool Memory takes a deeper look at this problem, and Ola Hallengren’s scripts can help you with a solution.
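
To see how full your pages actually are, sys.dm_db_index_physical_stats will tell you, as long as you use SAMPLED or DETAILED mode (LIMITED leaves avg_page_space_used_in_percent NULL). This is a sketch for the current database and can take a while on large ones.

SELECT TableName = OBJECT_NAME(ips.object_id)
    , IndexName = i.name
    , ips.avg_fragmentation_in_percent
    , ips.avg_page_space_used_in_percent
    , ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
    INNER JOIN sys.indexes i ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.page_count > 1000 --ignore tiny indexes
ORDER BY ips.avg_page_space_used_in_percent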

Index Maintenance – Statistics

It may seem obvious to do your index defrag job off hours. While I’ve seen servers without the job, I’ve never seen one scheduled for mid-day. That’s a good thing, because a lot of data needs to be pulled into cache to play around with indexes like that.

However, do you have auto update statistics turned on? Do NOT turn it off because of this, but understand that it updates the statistics by reading about 1% of the index into cache. If that’s a 100 GB index then you’re reading 1 GB of data, which is a decent portion of your cache. To help avoid this, you should be updating your stats off hours, and Ola Hallengren’s scripts can help with this, too.

You need to update your stats because as more and more updates are made to a table the less accurate your stats are, making your execution plans less accurate. Once you update about 20% of the rows of a table since the last time your statistics were updated then they’ll automatically be updated again if you leave the default option turned on. Again, don’t go turning it off because of this post alone. Instead, schedule your stats to be updated off-hours and only the tables that have 20% of their rows updated throughout the day will get their stats updated automatically, and that will stop most auto-updates on the larger tables that would cause issues.
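
If you want to see which statistics are getting stale, here’s a sketch using sys.dm_db_stats_properties, which is available in SQL 2008 R2 SP2 and 2012 SP1 and later; on older builds you’d fall back to STATS_DATE and sys.sysindexes.rowmodctr.

SELECT TableName = OBJECT_NAME(s.object_id)
    , StatName = s.name
    , sp.last_updated
    , sp.rows
    , sp.modification_counter --rows changed since the last update
FROM sys.stats s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC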

Purge Your Data

We talked a lot about what to do with the data you have, but do you need all of your data? Step away from SQL Server once and go talk to an accountant. Ask them where they have paper work from last month and they’ll point to a filing cabinet. Now, knowing they’re legally obligated to keep a lot of stuff for 7 years, ask them where a 5-year-old document is. It’s not right in front of them, but they know where to look for it. It’s not taking up valuable space in their office (read: not on the main production server), often not in the same building they’re in (read: not online), and it will take a bit to get to it. They no longer have an active business use for it, and they can get to it in the rare case that it’s needed.

Ask them where something is that’s 8 years old, and they’ll find a friendly way to tell you that they bought an OCD pyromaniac a pair of scissors and a book of matches. We’re more eloquent with deleting things in batches, but their way sounds like more fun.

If they kept everything in one room would you call that room a cluttered, unmanageable mess? If they kept historical records forever, would you think they were being obsessive and wasteful with purchasing storage?

So, can you pull up a detailed sales report for February 29th, 2000 from Prod for me?

Other…Am I Giving Up?

There are too many things to list in one blog post. You could write a book on this subject. I listed what I feel will help you the most, but I also wanted to take time to let you know that my list isn’t some magical, all-inclusive, everything you can do to make your servers run better. It’s a start. Hopefully a good one you found to be productive, but a mere start no matter how you look at it.

Speaking of “you could write a book on the subject”, well, they did. A lot of them did. Some even made them free PDFs. Not bootleg bit-torrent copies, but actual real, legal, free PDFs made possible by Red Gate in their book selection on SQL Server Central.

There are also books you have to buy, if that’s your thing, and I’m not the only person blogging about this topic.

Results

If you did all of this and there’s something running on your servers to trend performance, you’ll notice:

  • Page Life Expectancy: Raised significantly, but you saw that coming
  • Page Reads/sec: (Physical Reads) Dropped because we’re not cycling data in and out of cache as fast
  • Page Lookups/sec: (Logical Reads) Will drop with some of these
  • % Processor Time: Dropped due to lower I/O, more efficient queries, less pressure on procedure cache, etc.
  • Critical query execution time: Typically less
  • DBA Pay Rate: If this isn’t on the list, try using a chart of the above counters

Rewritten Post

Some readers may have realized that I rewrote this post, and you can find the original version here. I did this because I didn’t feel the old post lived up to my expectations, and I’d love to hear feedback on what you think of the differences.