Optional Parameters Causing Index Scans


Optional parameters in a stored procedure often lead to scans in the execution plan, reading through the entire table, even if it’s obvious that the right index is in place. This is extremely common in procs that are used behind search screens where, for example, you could search for a person according to their first name, last name, city, state, zip code, gender, etc.

To take a simplified look at this, make a quick and easy table consisting of an indexed identity column for us to search by and a fluff column to make the row size bigger.

CREATE TABLE Test 
(
    ID Int NOT NULL Identity(1,1) PRIMARY KEY
    , Value CHAR(500) NOT NULL
)
GO

INSERT INTO Test (Value)
SELECT '';
GO 250

INSERT INTO Test (Value)
SELECT TOP (1000) Value 
FROM Test 
GO 100

Now we’ll create a proc to get data. In this proc we’ll get the data several different ways, showing the advantages and disadvantages of each method.  You’ll notice that I put my comments as PRINT statements, which will make it easier for you to read through the messages tab and see what’s going on.

CREATE PROC proc_Testing 
    @ID int = 125
    , @MinID int = 100
    , @MaxID int = 150
AS 

SET STATISTICS IO ON
SET NOCOUNT ON

PRINT 'Clustered index scan, because the query optimizer has to be prepared for both null and non-null values'

SELECT *
FROM Test 
WHERE @ID = Test.ID OR @ID IS NULL

PRINT ''
PRINT 'It''s a small mess, but the query optimizer knows exactly what to expect for each query.'

IF @ID IS NOT NULL BEGIN
    SELECT *
    FROM Test 
    WHERE @ID = Test.ID 
END ELSE BEGIN
    SELECT * 
    FROM Test
END 

PRINT ''
PRINT 'Expand that to two possible parameters and it still looks simple right now, when it''s doing a scan'
SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
    
PRINT ''
PRINT 'Make it optimized with IF statements and it quickly becomes a big mess.'  
PRINT 'Many "search screen" queries will have 20 or more optional parameters, making this unreasonable'

IF @MinID IS NOT NULL BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID BETWEEN @MinID and @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
        WHERE ID >= @MinID 
    END
END ELSE BEGIN
    IF @MaxID IS NOT NULL BEGIN
        SELECT * 
        FROM Test 
        WHERE ID <= @MaxID 
    END ELSE BEGIN
        SELECT *
        FROM Test 
    END
END

PRINT ''
PRINT 'However, the query optimizer can get around that if it''s making a one-time use plan'

SELECT *
FROM Test 
WHERE (ID >= @MinID OR @MinID IS NULL)
    AND (ID <= @MaxID OR @MaxID IS NULL)
OPTION (RECOMPILE)

PRINT ''
PRINT 'And again with the single parameter'

SELECT * 
FROM Test
WHERE ID = @ID OR @ID IS NULL 
OPTION (RECOMPILE)

PRINT ''
PRINT 'However, this leaves two nasty side effects.'
PRINT 'First, you create a new plan each time, using CPU to do so.  Practically unnoticed in this example, but it could be in the real world.'
PRINT 'Second, you don''t have stats in your cache saying how indexes are used or what your most expensive queries are.'
PRINT ''
PRINT 'Another option is dynamic SQL, which still has most of the first flaw, and can also bloat the cache'

DECLARE @Cmd NVarChar(4000)

SELECT @Cmd = 'SELECT * FROM Test WHERE 1=1'

IF @ID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID = ' + CAST(@ID AS VarChar(100))
END

IF @MinID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID >= ' + CAST(@MinID AS VarChar(100))
END

IF @MaxID IS NOT NULL BEGIN
    SELECT @Cmd = @Cmd + ' AND ID <= ' + CAST(@MaxID AS VarChar(100))
END

EXEC (@Cmd) 

PRINT ''
PRINT 'Yes, you can do sp_executesql which can be parameterized, but it gets more difficult.'
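To give you an idea of what that looks like, here’s a rough sketch (separate from the proc above, not part of it) of the same dynamic SQL built for sp_executesql, assuming the same @ID, @MinID, and @MaxID parameters:

DECLARE @ParamCmd NVarChar(4000) = N'SELECT * FROM Test WHERE 1=1'

IF @ID IS NOT NULL    SELECT @ParamCmd = @ParamCmd + N' AND ID = @ID'
IF @MinID IS NOT NULL SELECT @ParamCmd = @ParamCmd + N' AND ID >= @MinID'
IF @MaxID IS NOT NULL SELECT @ParamCmd = @ParamCmd + N' AND ID <= @MaxID'

-- sp_executesql allows parameters in the definition list that the final
-- string never references, so every optional parameter can be declared here.
EXEC sp_executesql @ParamCmd
    , N'@ID Int, @MinID Int, @MaxID Int'
    , @ID = @ID
    , @MinID = @MinID
    , @MaxID = @MaxID

Each plan gets reused for the same combination of parameters, so you avoid both the single-use plans of OPTION (RECOMPILE) and the string concatenation of plain EXEC, at the cost of a little extra typing.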

Now that we have the data and proc, just execute it. It will be best if you add the actual execution plans to this, which you can turn on from the menu by going to Query / Include Actual Execution Plan. If you don’t know how to read execution plans then this is even better! No experience is necessary with plans this simple, and you’ll start to feel more comfortable around them.

EXEC Proc_Testing

Even with the default values for the parameters, the query optimizer knows that you could have easily passed in NULL for the values, and it has to plan for everything. Let’s step through the results to see what this all means. Part of that planning for everything is accounting for the possibility that @ID IS NULL will be satisfied instead of ID = @ID, so the query optimizer can’t say for sure that it can just perform a seek on that column.

In the messages tab we have the number of reads done by each statement, courtesy of me adding SET STATISTICS IO ON in the proc itself (a bad idea in prod, by the way). For the first one we have 7,272 logical reads, meaning we read 7,272 pages of 8 KB each, so we had to ask the CPU to play around with over 56 MB of data to find the records that matched.

Looking at the execution plan we can see why that happened. The Predicate you can think about as the “Scan Predicate”, since it’s saying what you couldn’t seek for and had to scan instead. In this case it’s showing the entire “@ID = Test.ID OR @ID IS NULL” in there, because it compiled a reusable plan that might be used with a NULL.

Execution Plan of Query 1

In the next statement we were able to do a seek, doing only 3 reads. This is because the B+tree on this index is three levels deep, so it read the root page, then an intermediate-level page, and finally the single leaf-level page of the index that contained the data we wanted.
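If you want to confirm the depth of the B+tree for yourself, sys.dm_db_index_physical_stats in DETAILED mode will show one row per level along with the overall index_depth:

SELECT index_depth
    , index_level
    , page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Test'), 1, NULL, 'DETAILED')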

Take a look at the execution plan here to see the difference with not only the Seek itself, but the Seek Predicate in the details. You’ll see that it uses @ID in there still showing that it’s a reusable plan ready for any ID to be passed in, but the WHERE clause is telling the query optimizer that it can always use a seek to find this data.

Execution Plan of Query2

Now that I know you can read through a couple yourself, I’ll skip ahead to the 6th query where we used the same “@ID = Test.ID OR @ID IS NULL” that caused us trouble before, but added OPTION (RECOMPILE) at the end of the statement. Again it did the same 3 reads I just mentioned, still using a seek, but the details of the execution plan look different. The seek predicate used to say @ID so it could be reused, but the query optimizer knows nothing with OPTION (RECOMPILE) will get reused. Knowing that, it simply converted @ID into a constant value of 125 and was able to eliminate the possibility of 125 being NULL.

Execution Plan of Query 6

If you want to get more into the details, you can look at the XML version of the execution plan by right-clicking on the plan and selecting “Show Execution Plan XML…”. Here’s what you’ll find for the parameter’s value in the first two queries. It shows that the value that was sniffed when it was compiled was 125, and also that it was run this time with a value of 125. If you ran this again passing 130 in then you’d see the sniffed value staying at 125 because it’s not going to use CPU resources to recompile it, but the runtime value will be 130 for that run.

<ParameterList>
  <ColumnReference Column="@ID" ParameterCompiledValue="(125)" ParameterRuntimeValue="(125)" />
</ParameterList>

The OPTION(RECOMPILE) versions don’t look like this though, they’re changed to a constant value even at this level.

<RangeColumns>
  <ColumnReference Database="[tempdb]" Schema="[dbo]" Table="[Test]" Column="ID" />
</RangeColumns>
<RangeExpressions>
  <ScalarOperator ScalarString="(125)">
    <Const ConstValue="(125)" />
  </ScalarOperator>
</RangeExpressions>

I hope I left you with a better understanding of what happens when you write code for parameters that could be null. We’ve all done this, and many of us (me especially) have been frustrated with the performance. This is a change I put into production not too long ago that ended scans on a rather large table, and the results were noticed in query execution time, the page life expectancy counter since the entire table was no longer being read into memory, and I/O activity levels since that index and the indexes it kicked out of memory didn’t have to be read off of disk nearly as often.

Let me know what you think, and especially if you have a better way of handling this issue.

Why worry about CXPACKET


Many people see CXPACKET at the top of their waits and start trying to fix it. I know this topic’s been done before, but people ask me about it enough that it should be done again, and again. In fact, Brent Ozar wrote an excellent article about it, and Jonathan Kehayias wrote an article about tweaking it using the cost threshold of parallelism. Even with those, it took Grant Fritchey warning us to be careful about giving guidance to lead me to Bob Ward’s take on it. After all of these people and more saying CXPACKET isn’t what you think it is, enough people still have it all wrong that I felt this was the biggest piece of public opinion to speak out against for Michael Swart’s T-SQL Tuesday #052.
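If you want to see where CXPACKET sits on your own server, it only takes a quick look at the wait stats. Just remember these numbers accumulate from the last restart (or the last time someone cleared them), so they’re a starting point, not a verdict:

SELECT TOP (10) wait_type
    , waiting_tasks_count
    , wait_time_ms / 1000 AS wait_time_sec
    , signal_wait_time_ms / 1000 AS signal_wait_sec
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC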

BOL still doesn’t state it completely clear when it defines it as “Occurs with parallel query plans when trying to synchronize the query processor exchange iterator. If waiting is excessive and cannot be reduced by tuning the query (such as adding indexes), consider adjusting the cost threshold for parallelism or lowering the degree of parallelism.” The true meaning of this wait is much easier, with Bob Ward’s explanation of “This wait type only means a parallel query is executing.”

Yes, that’s it, it’s not really all that complex. If you have a MAXDOP of 4 then you could end up with 5 threads. Think of it as having one manager who can delegate work out to 4 subordinates. The subordinates do their pieces of the job and the manager marks down CXPACKET on his time sheet while he waits for them. When they finish, the manager compiles their work (repartitions the streams if you want). There was a little extra work done here because the manager wouldn’t have to combine all the work if he did it all on his own (MAXDOP = 1), and it’s possible the subordinates did more work than they had to. However, in most cases, it does get done faster, and duration is what all your users talk about.

The advice in BOL above saying “consider adjusting the cost threshold for parallelism or lowering the degree of parallelism” is incomplete at best. Upping the cost threshold is exactly what it sounds like, making it so only more expensive tasks can be assigned to those subordinates. Then lowering the degree of parallelism is just dropping how many subordinates you can use.

Cost Threshold for Parallelism is something Brent Ozar makes fun of a bit, because its default of 5 is measured in units that don’t really have a definition anymore. Jeremiah Peschka recommends setting this to 50 and tweaking it from there. I’d recommend 25 for most OLTP because I like parallelism and my procs aren’t stressed. However, the point is that you should only tweak what you understand, and now I hope you have a decent understanding of this. If you have a lot of moderately large queries you wish could break out the work to get it done faster, drop it. If all of your CPUs are running hot and you wish less work was being done repartitioning streams or compiling more complex plans, raise it.
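If you do decide to change it, it’s a plain sp_configure setting. Here it is with the 25 I just mentioned, which you should treat as a starting point rather than gospel:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'cost threshold for parallelism', 25;  -- starting point, tweak from here
RECONFIGURE;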

Max Degree of Parallelism is what people really like to tweak, and it’s usually a bad idea. My personal recommendation is to set it to 4 in OLTP (higher in OLAP, very dependent on your load) and forget it unless you have a really, really good reason to change it. The default of 0 is absolutely horrible for OLTP on larger servers, where this could mean that you’re breaking out a task and assigning it to 80 or more subordinates, then trying to put all of their work back together, and killing yourself doing it. This shouldn’t be above 8 unless you really like to be overworked putting your subordinates’ work back together and dealing with those performance problems. However, it’s great to have subordinates who can help you out when needed, so please don’t drop this below 4 unless you have an amazing reason to do so, and only on the query level if you only have one amazing reason.
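For reference, here’s the server-level setting with the 4 I just recommended, along with the query-level hint for that one statement with an amazing reason (the table in the example is made up, it’s just there to show where the hint goes):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Query-level override for the rare statement that doesn't play well with parallelism
SELECT OrderID
    , OrderDate
FROM dbo.Orders   -- hypothetical table
WHERE OrderDate >= DATEADD(Day, -1, GETDATE())
OPTION (MAXDOP 1);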

So, what do you do when CXPACKET is taking over? The trick is to think of it in terms of how a business would work. You have a ton of work to do, enough where your managers are going crazy trying to find subordinates to do it for them. Do you tell your managers to do it themselves OPTION(MAXDOP=1), do you tell them to go out and find as many subordinates as they can OPTION(MAXDOP=0), or do you tell them you’ll see if you can help them find a more efficient way to do their jobs?

If you have the Cost Threshold for Parallelism set to 25 or 50 you have a decent limit before you consider spreading the load around. So, what takes that long for your server to handle? A good portion of the time you’ll find this when you’re scanning a large index or table, or even doing a very expensive seek. If you look at your execution plan, is it doing a scan? If it’s a seek that’s starting your parallelism, does it have a small “Seek Predicate” and a lot to do in the “Predicate” in the properties for that operator? It could be that you could use better indexes, it could be that you could find a more SARGable approach, or a variety of other performance tuning opportunities.

The thing is that parallelism isn’t bad and CXPACKET is saying you’re using a good thing. A high CXPACKET could mean you’re trying to use too much of a good thing because you’re “overworking your managers”, but the workload is typically more of an issue than the fact that you let your manager delegate their work when there’s enough to go around. A CXPACKET a lot higher than your baseline could mean that you have poor statistics or parameter sniffing that’s causing a bad execution plan, or it could mean you have a new query that could use some attention.  When you’re talking about parallelism it’s also possible that the statistics were off and one thread (subordinate) ends up doing 99% of the work, making CXPACKET jump up, and that can be found in the actual execution plans; read more about that in Grant Fritchey’s book.

If you have a MAXDOP around 4 (between 2 and 8, reading these recommendations) and a Cost Threshold of 25 or higher, then I’d be looking into the workload more than the fact that you’re using parallelism. Sure, some tasks just don’t take well to parallelism and there are reasons to use a MAXDOP=1 hint, but it’s rare to need to take it down to 2 at the server level and even more rare to turn it off by taking it down to 1.

Whatever you do, when you have “a big query using parallelism”, try to focus on “a big query” first; it’s usually the problematic part of that statement.


Free SQL Server Training (not a sales pitch)


My company cut the training budget this year, and it’s not that uncommon for that to happen anymore. However, I’m still getting my week of training in, and more, without paying much.

Blogs of people who talk too much – FREE

There’s so much for free online, but you’re reading my blog so I don’t have to dive too deep here. Follow a couple blogs in an RSS reader, and make sure you sign up for freebies they may offer like Paul Randal giving away training!

Personally, I try to read a majority of everything written by:

You can include me on this list if you’re feeling generous. However, they’re all smarter than me, and I’ll understand if you don’t.

Local User Groups – FREE

I know most DBAs don’t go to the local user groups. Some never go, some go when the topic is right, but you can benefit from these every time.

Even the presentations that are outside of your normal job duties are valuable. If someone’s teaching SSIS and you don’t use it, knowing what you can learn from an hour-long presentation will put you in a situation to have more intelligent conversations when these topics come up. Read over job descriptions, they expect DBAs to know everything, and here’s your chance to know a little about it.

Also, you can learn from every presentation. I don’t care if you could give that presentation, people in a field known for introverts could use practice and examples of how to present ideas to others. You’ll see good, you’ll see bad, and you’ll be better because of it. Also, one or two small details you might pick up on a topic you know extremely well will make a big difference.

Speaking of our field being known as introverts, this is your opportunity to step out of your comfort zone and present. That’s what I did. Some present because they were born to do it, and I started presenting because I was born to blend into a cube farm! I’ll be presenting at my local user group in April, have already done a SQL Saturday this year, and would like to do at least two more SQL Saturdays, which drags me (kicking and screaming) into improving my soft skills where a typical DBA, like me, is behind the curve.

Our last presentation by Bill Wolf was an excellent presentation on avoiding common coding mistakes. It wouldn’t have been too much of a stretch for me to make a similar presentation, but I still learned a couple things and met some good people. The next meeting on March 25th will be about the interdependency between SQL Server and .NET, which would be amazing for any DBA or .NET developer to go to, even if Dell wasn’t giving away free laptops as door prizes!

Finally, the presentations aren’t the real reason you’re going there, are they? You’re getting to know the other DBAs in the area, and you’re sitting there talking person-to-person to the ones that want to share everything they know with you. You get to know them, they get to know you, and when one of you wants a new job or new coworker then you have a better chance at finding a good fit that you can be comfortable with from the start.

If you’re not sure where your local user group is or if you even have one, start by looking at this link.  Most groups are affiliated with PASS, and that means they’ll be listed there.

SQL Saturday – $10 Lunch

SQL Saturday is a cross between the local user group and the PASS Summit. In fact, the last one I was at has been referred to as a mini PASS Summit because of the number of high-quality speakers there, including multiple names from the list of bloggers I mentioned above.

The cities that host these events typically do so as an annual event, and there’s probably one not too far from you each year. I live in Pittsburgh, PA, which hosts one just a couple minutes from my house. I’m also trying to make it to ones reasonably close to me such as Cleveland, OH a couple weeks ago that’s about 2 hours away, Philadelphia, PA coming up in June that’s a 5 hour drive where I have family I can stay with. I think I’ll skip Columbus, OH because it’s the week after Philly and I have three small kids. Although I won’t make it this year, I may go to Richmond, VA, which is a 6 hour drive where I also have family to stay with.

It’s been a cold winter here in Pittsburgh, and now I’m starting to think about how much I miss my relatives in Texas, California, Florida and South Carolina. It knocks this idea out of the free or low cost training blog post, but that doesn’t mean it’s a bad idea.

While my company did cut the budget, they still paid mileage and gave me Friday off as a travel day. Even without a budget, still ask to see what you can get.

Virtual PASS Chapters – FREE

PASS also does virtual online events. I know I’m risking making this sound like a PASS commercial, covering the local PASS chapters, the PASS SQL Saturdays, and now the Virtual PASS chapters, but it’s hard to talk about free or low cost SQL Server training without talking about these people. So, why would I avoid making this sound like a commercial for them? They deserve to be known, and with all this free junk they can’t have much of an advertising budget!

Anyways, the virtual chapters each have their own focus, and I’m a member of the PowerShell, Performance, and Database Administration chapters, and I think a couple others. They do presentations about once a month, and I try to watch them when I have time. Worst case, they usually make the recordings available later.

There are also a couple “24 hours of PASS” events throughout the year where there are 24 back-to-back sessions. The next one on the schedule is the Russian version, which I’ll pass on.

BrentOzar.com – FREE

In addition to their blogs, the Brent Ozar Unlimited team (Brent’s just a member, not their leader!!!!) does weekly webinars that are usually very useful to me. I’d say I watch 3 of them a month, missing one either due to my schedule or because it’s a rare one that I don’t feel bad about missing. They put up the recordings later, although I have to admit I usually don’t watch them if they aren’t live. I know I don’t make any sense, but you’ll get used to it.

Twitter – FREE

Seriously, if you can fit your question and #sqlhelp in 140 characters or fewer, you’ll have someone pointing you in the right direction within minutes. They’ll point you to the posts that you didn’t know what key words to Google to find, but keep in mind that they’re doing this for free. Always be grateful, and understand when they say that digging into the issue would mean a trip to the consultant.

The End – Sad, but still free

So, yeah, my company cut my training budget. That means that so far this year (March 3rd here) I had to stick to learning from Steve Jones, Grant Fritchey, Erin Stellato, Stacia Misner, Tom LaRock, and others in-person. Then I watched free videos by Jes Borland, Jeremiah Peschka, Brent Ozar, Kendra Little, and Doug Lane. I have a free month of the SQL Skills Pluralsight classes with a code that Paul Randal personally emailed me, with which I’m in the middle of a class by Jonathan Kehayias right now. I’m planning on taking full advantage of this trial with classes by Joe Sack, Erin Stellato, Paul Randal, Bob Beauchemin, and Kimberly Tripp, although I’ll end up running out of time and buying a subscription for $30 / month.

So, is a lack of a training budget stopping you? You can say it is, but it’s not.

I gave you a lot and I missed a lot. This is not all-inclusive, and a quick look around the internet will prove that point. There are still Free e-Books and many other resources out there. I have to start skipping stuff or this post will never make it out to you, and what good would that be?

Hey, look!!! There’s a section right below here marked “Comments” that’s perfect for your addition to the list. Maybe it can be all-inclusive!!!

The Approachable DBA


DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate. It’s common to think of DBAs and developers being polar opposites that don’t speak to each other. This is wrong! It’s not one of the database-themed answers of “it depends”, this is just flat wrong.

DBA with a bat

I may look calm, but in my head I’ve killed you 3 times.

The issue with this is that when you take a polarizing position you get a polarized result. Take Grant Fritchey’s use of his Hickory Stick of Education for example. He goes on a rant about databases needing to be backed up, which, as protector of the data, is the proper view to have. However, the harsh approach many DBAs take to this role, with this being an exaggerated extreme, alienates their coworkers.

His coworkers on the other end of this conversation, we’ll call them developers to go with the theme here, will take one of two initial views. “I better do what he says, or he’ll beat me”, or “I don’t need this, I don’t need him, I have my reasons, and I’ll do it my way without discussing the details with him.” Unfortunately, both of these answers are wrong, and the ensuing conversation with HR may also be unpleasant.

Although they’re opposite responses, they’re both wrong for the same reason. Neither developer is going to discuss the details. The first developer is going to start backing up the database at any cost, but is a full backup and restore the best recovery option? The second developer isn’t going to back anything up, but are pieces of this database unique and in need of recovery in a disaster? After all, your responsibility as a DBA isn’t making sure you’re backing everything up, it’s making sure you can put it back to where it was in accordance with the written agreement you have with the business owners. In this case it could be data that is pulled from 30 different locations and the best option for recovery is to pull that data again, or the data could be 90% static and the best option is to have a static backup of one filegroup and nightly backups of another, or…talk to the developer.

The number of details they tell you is directly proportional to how approachable you are. If you use a polarizing approach that sets your approachability to 0 then you’ll get 0 details, even to the extreme that people will try to spin up production servers without your knowledge. I’ve seen it happen. I’ve also seen those same developers come back later and ask us to restore something that was never backed up because we never knew it was production. You can’t allow the recovery conversation to be the first time the developers talk to you because that will end as well as trying to do a restore without a backup. It’s your responsibility to make sure there is communication going on.

So, would I hire a DBA with a reputation for not working well with developers? Maybe, after a long conversation about how to work well with others, not as my only DBA, and it would take a lot more in other areas to convince me I was making the right choice. I need to have a DBA that will talk to the developers, go out to lunch with them, and have the developers accept someone protecting our data as one of their own. When they get to that point the developers start asking questions that help them use the database as smoothly as possible, and the DBAs start asking questions to find out why the bad practices are so tempting in the first place.

Stop and think about this. Would I hire you? I would hire Grant because I’ve met him outside of an interview and feel he is much more approachable than his nicknames and this blog post suggest. Just look at his posts directly before this post, where he acknowledges the issue saying “I’m pretty sure most people don’t talk to their Dev teams if they can help it, and it does seem like most Dev teams seem to be on less than a perfectly chatty basis with their DBAs”, and after this post, where he mentions the struggle to resolve this issue stating “I spent years at my previous job trying to do a better job interacting with the development teams.” The thing is, Grant’s an odd one; he’s a SQL Server MVP, which means I can find more than one piece of information about him. Does your future manager know other details about you, or is it possible that one rant is the only piece of your reputation they’ve heard? Chances are, if I ever heard pieces of this about you then it’d be an uphill struggle to get to an interview, let alone a job offer. Even worse, IT is a small world, and if you live in the same city as me then I can find someone who has worked with you before that’s not on your list of references.

Now think about it from the side of your current job. Yeah, you just kinda won the fight about making sure this database is backed up, but the developers are starting another project as well. This time they’re making a new database from scratch, didn’t involve you, and it’s an ORM reading and writing directly to tables. Everything’s a clustered index scan or table scan, but it works great with 1,000 rows of sample data. There are no cases where they’re reading from a view, no writes are being done through a proc, and they’ve built a lot of stuff on this foundation before you knew any details of the project. You know that if you had been involved from the start it would have been done better, but, still licking their wounds from their last conversation with you, the developers avoided you as long as possible. There’s a big mess, it’s your own fault, and a lot of people are going to be paying for it for a long time.

I’m not asking for you to be a pushover. Instead the point is to discuss the situations, find all of the details, give them proof of your logic when you differ in opinion, educate them as you accept their invitations to educate you, and go out of your way to be viewed as approachable. The result you’re going for is the same, which is recoverability of data, and I’m not advocating letting off until you achieve that goal. However, that’s not the only interaction with the developers you’ll ever have, and this approach will help you “win” this interaction quicker at the expense of not being able to work together through every other interaction. The only thing I’m going for is that DBAs as a whole need to develop the proper soft-skills to work with developers. This is going to help a lot of people in the long run, including you, your family, the developers, and me.

You: Your job will be easier. Things that would be issues will be handled before they’re big issues, and a lot more things will be set up right from the first place. Your reputation will be where you want it, allowing you to enjoy the job you have and get the job you want.

Your Family: You’ll be coming home from work less stressed, actually enjoying your interactions with developers. Beyond that, you’ll be more and more desirable, getting better and better jobs, bringing home a better income to help provide for your family.

The Developers: Problems are avoided, they have a mentor on the database side, and going to work is more enjoyable.

Me: I get really tired of this whole “Us vs. Them” mentality, hearing DBAs talk and only half jokingly say “and maybe even, ‘gasp’, talk to your developers”, and developers, rightfully so, assuming DBAs are their enemy from the start. If for no other reason, all of the DBAs in the world should be more approachable to make my life easier!!! The only thing you bullheaded DBAs do is give developers ways to slow down! Why do I always have to be here to fix your problems? The whole profession is nothing more than a bunch of socially awkward nerds on a power trip going out and telling developers “NO” with a smile on their face every chance they get. Who needs you?!?!

You know what? The developers that don’t talk to DBAs are usually right in their logic. Why ask someone a question if you know the answer is no, you just don’t know how you’ll get belittled on your way to that answer? In many cases developers aren’t talking to their current DBAs because of experiences at past jobs, but it’s still your problem to fix.

I’ll end this post the way I started it. DBAs are the gatekeepers, but if we make it an unpleasant process then people will find a way around the gate.

ApproachableDBA.com


Cleaning up the Buffer Pool to Increase PLE


Chances are you have extra information in the buffer pool for a bad query and it’s dragging down your PLE, causing SQL Server to run slower because it’s reading more from disk. Although this approach is taking a 180 from my post Fixing Page Life Expectancy it has the same effect, with that post focusing on fixing your worst queries and this one focused on fixing your most misused indexes.  One approach doesn’t replace the other, it’s more like burning the candle at both ends, except that you end up with a better running database instead of getting burnt out.

With this approach we start with what’s in the cache.  You’ll see the same types of issues in almost any database, and this just happens to be a production database I’m going through today.

(Screenshot: the top indexes in the buffer pool by cached MB)

The results were found with this query:

SELECT count(1)/128 AS cached_MB 
    , name 
    , index_id 
FROM sys.dm_os_buffer_descriptors AS bd with (NOLOCK) 
    INNER JOIN 
    (
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id)
            --name = 'dbo.' + cast(object_id as varchar(100))
            , index_id 
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK) 
                ON au.container_id = p.hobt_id 
                    AND (au.type = 1 OR au.type = 3)
        UNION ALL
        SELECT name = OBJECT_SCHEMA_NAME(object_id) + '.' + object_name(object_id) 
            --name = 'dbo.' + cast(object_id as varchar(100))   
            , index_id
            , allocation_unit_id
        FROM sys.allocation_units AS au with (NOLOCK)
            INNER JOIN sys.partitions AS p with (NOLOCK)
                ON au.container_id = p.partition_id 
                    AND au.type = 2
    ) AS obj 
        ON bd.allocation_unit_id = obj.allocation_unit_id
WHERE database_id = db_id()
GROUP BY name, index_id 
HAVING Count(*) > 128
ORDER BY 1 DESC;

Service Broker Errors Table

First, there is an errors table here with Service_Broker_Errors, and that should never be in the top 10. What are we doing wrong? The index ID of 1 tells me right away that this is a clustered index, the fact that the table is at least 1.5 GB tells me that we’re probably not purging old data, and 1.5 GB in memory on this table is hinting that we’re probably doing a clustered index scan.

I’ll start by looking at the proc cache to see what’s going on. There’s only one execution plan that used that index and it is, in fact, doing a clustered index scan as I expected.

(Screenshot: the cached execution plan doing a clustered index scan on Service_Broker_Errors)

Predicates or Seek Predicates

In an execution plan you’ll have a seek predicate and just a plain old predicate. The seek predicate is what you were able to do taking advantage of the index being in order, and the predicate is what you had to scan for.

(Screenshot: the Predicate property of that clustered index scan)

This case is easy because we’re only searching by a single column, but others could have both a seek predicate and a predicate. For instance, if I had an index on my customers table with the key columns Active, First_Name, Last_Name and searched where Active = 1 and Last_Name = ‘Hood’, then it would show up as an index seek with a seek predicate of Active = 1 and a predicate of Last_Name = ‘Hood’.
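Here’s a minimal sketch of that customers example (the table and index names are made up) so you can see where each filter lands:

-- Hypothetical table and index to match the example above
CREATE TABLE dbo.Customers
(
    CustomerID Int NOT NULL Identity(1,1) PRIMARY KEY
    , Active Bit NOT NULL
    , First_Name VarChar(50) NOT NULL
    , Last_Name VarChar(50) NOT NULL
)

CREATE NONCLUSTERED INDEX IX_Customers_Active_First_Last
    ON dbo.Customers (Active, First_Name, Last_Name)

-- Seek Predicate: Active = 1 (the leading key column keeps the seek cheap)
-- Predicate: Last_Name = 'Hood' (checked on every row the seek returns,
--   because First_Name was skipped in the WHERE clause)
SELECT Active
    , First_Name
    , Last_Name
FROM dbo.Customers
WHERE Active = 1
    AND Last_Name = 'Hood'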

Anyways, let’s get back on topic with the issue of my Service_Broker_Errors table. Its predicate sounds like a reasonable query looking for errors; I’m looking for the errors that occurred in the last so many days. The CONVERT_IMPLICIT(datetime,[@1],0) shows me that someone typed this the lazy way of GetDate()-1, which isn’t as efficient as DateAdd(Day, -1, GetDate()), but you’re getting me off topic again.

Fixing a useless index

Looking at the indexes on this table I realize there is only one, and it has the single key column of ID. For uniqueness you can’t do much better than an ID column, but you have to ask yourself if you’ll ever use it to query by.  In this case the index has never had a seek against it, only scans.  Although the table is rarely queried, with only 4 uses in the last 2 months (I limited my historical data for this query), it’s still pulling 1.5 GB into cache for every use.  After a couple seconds of shaking my head I start to create a change request to add TimeStamp as the first key column of the clustered index.

(Screenshot: index usage stats for Service_Broker_Errors – scans only, no seeks)

However, I then remembered that this table is 1.5 GB. Is that right? It’s an error table, so if it’s really 1.5 GB then I should either be cleaning up old data or there are so many problems that there is no reason I should be spending my time tuning. Seeing that it has 0 updates in the last 2 months, I already know it’s old data.  To double-check on this I run a simple query, keeping in mind the ID is still the clustered index, to find the oldest record, and discover that we haven’t removed anything from this table in years.

SELECT timestamp
FROM Service_Broker_Errors
WHERE id = (SELECT Min(ID) FROM Service_Broker_Errors)

So I have to talk to people about how old an error can be before we say we just don’t care. It was determined that we probably don’t care about anything more than a month old. I’m paranoid, it comes with the job title, so I made it three months with my change being this:

DECLARE @ID Int

SELECT @ID = MAX(ID) FROM Service_Broker_Errors WHERE TimeStamp < GETDATE()-90 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (10000)
    FROM Service_Broker_Errors
    WHERE ID <= @ID 
END 

IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Service_Broker_Errors]') AND name = N'PK_Service_Broker_Errors') 
    ALTER TABLE [dbo].[Service_Broker_Errors] 
    DROP CONSTRAINT [PK_Service_Broker_Errors] 
GO 

ALTER TABLE [dbo].[Service_Broker_Errors] 
ADD CONSTRAINT [PK_Service_Broker_Errors] PRIMARY KEY CLUSTERED 
( 
    [TimeStamp] ASC
    , [id] ASC 
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] 
GO 

The reason I did it like this is because I don’t want to fool around with the junk values as I drop the clustered index (rebuilding the table) then recreate the clustered index (rebuilding the table) with all of that information in there, then delete it all and wreak havoc on the new index. Sure, the fragmentation at the page level would be fine since I’m deleting in order, but that’s a lot of changes to the b-tree.

Finally, I add these lines to my nightly cleanup job to keep things trimmed up from now on:

DECLARE @BatchSize Int 

SELECT @BatchSize = 10000 

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (@BatchSize)
    FROM Service_Broker_Errors
    WHERE TimeStamp < GETDATE()-90
END

I know, I just got done saying that GetDate()-90 was less efficient than DateAdd(Day, -90, GetDate()), but it’s an implicit conversion that’s done once per call. I admit this is a bad habit of mine, and any time there’s even a chance of this being called once per row I have to write it differently than I normally do. Nobody’s perfect, and the most important thing is that I’m getting rid of all those records using batches, right?

In Conclusion with Service Broker Errors

Now that all of this is done, I took a 1.5 GB table with all of it in cache down to a 20 MB table with only 1 MB in cache. Cleaning up the data had more of an impact than my index change, but it’s usually not that easy to clean stuff up. Even if I wasn’t able to clean up the data, the index change alone would have allowed the cache usage to be decreased by at least 1.4 GB.

On to the next one

That’s not the only issue I’m seeing here, in fact I bet I could take half of these off the top 10 list. However, today I’m picking up low-hanging fruit and moving on to the next task on my list. In this case, External_Messages is the next one that doesn’t look right to me. Once again, it’s a clustered index (index_id = 1) of a table that isn’t part of the primary focus of the database, which is handling orders.

Starting with the same steps, I look in the proc cache to see what put this here. Once again I see a single query in cache referencing the table, but this one is different. It’s the typical X = @X OR @X IS NULL that we’ve all written, and it’s being used as the predicate.

(Screenshot: the cached plan for External_Messages with the optional-parameter predicate)

I check the indexes on the table and it has the ID as the only key column of the clustered index, so that’s not an issue. Why isn’t it able to use the clustered index with a seek predicate? After all, I’m rather certain that they almost always, if not always, call this with a parameter where it should be filtered down.

Let’s look at the execution plan to make sure. The sniffed parameter value from the first time it was called, as found at the bottom of that statement in the XML version of the execution plan, is, in fact, a non-null value.

(Screenshot: the sniffed, non-null parameter value in the XML version of the plan)

However, SQL Server can’t guarantee that you’re going to pass it a non-null value, and it has to make an execution plan that can account for either possibility. I’ve seen this before, so I know the basic options for a single optional parameter (there are more options, with increasing complexity). I can either add OPTION (RECOMPILE) to the query or I can rewrite it to be two separate queries.

OPTION (RECOMPILE)

Here’s the option I didn’t choose. This will recompile the statement every run, which isn’t too much of a problem because it’s a quick compile on something that only runs a couple times a day. However, it will make it so this doesn’t show up in the proc cache anymore, which I take advantage of quite a bit (for instance, look at the blog post you’re currently reading). Also, it goes against my rule of using hints as a last resort.
That’s not saying you can’t do it this way, just that I didn’t. The reason it works is because SQL Server knows when it makes this execution plan that it will only be used once, so it doesn’t have to account for the possibility of it being a NULL value next run. In fact, it just makes the execution plan with your parameter as a constant value.

(Screenshot: the OPTION (RECOMPILE) plan with the parameter folded into a constant)

Two queries

Here’s the method I decided to go with. Assuming third-normal form doesn’t apply to query text, you should be good to go here. If it’s simple enough then it will be very obvious to anyone editing this at a later point that they need to make any changes in two places.

CREATE PROC NotTheRealProcName
    @id INT = NULL
AS

IF @id IS NULL BEGIN 
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
END ELSE BEGIN
    SELECT ID
        , Subject
        , Message
        , DateAdded 
    FROM ExternalMessage EM
    WHERE ID = @ID
END

This solution isn’t too complex with just a single parameter as it only creates two queries, but if you get just three parameters and try to do this then you’re up to 8 queries. The OPTION (RECOMPILE) method starts to look pretty good as a quick and easy fix before too long. I would still prefer one of the long and drawn out methods, such as getting interim results to a temp table, making it look a little more complex, but getting it to run efficiently with a cached plan.
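That temp table approach deserves a quick sketch. This is a rough version rather than anything I deployed, and it leans on the assumption that the search screen always supplies at least one selective filter (I’ll pretend a date filter on DateAdded is always filled in, and I’ve made up a couple extra parameters to show the point): narrow the rows down with one simple, plan-friendly statement, then let the messy optional filters run against the small temp table where a scan costs next to nothing.

CREATE PROC NotTheRealProcName_Interim   -- hypothetical variation, not the real proc
    @id INT = NULL
    , @DateFrom DATETIME = NULL          -- made-up extra search parameters
    , @Subject VARCHAR(100) = NULL
AS
SET NOCOUNT ON

-- Step 1: seed the interim results with a simple, seekable statement on the
-- filter we can count on (pretend the screen always supplies @DateFrom).
SELECT ID
    , Subject
    , Message
    , DateAdded
INTO #Interim
FROM ExternalMessage
WHERE DateAdded >= @DateFrom

-- Step 2: the optional filters only touch the small interim set, so the
-- optimizer can do whatever it wants here without hurting anyone.
SELECT ID
    , Subject
    , Message
    , DateAdded
FROM #Interim
WHERE (ID = @id OR @id IS NULL)
    AND (Subject = @Subject OR @Subject IS NULL)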

The two-query version worked as I expected, with no one actually getting all of the IDs. I left the code in place to make it possible to get all of them, to avoid digging through all of the application code to find where it could return everything and then making sure it gets changed there. The result I was looking for was cleaning up the cache, which I got when this dropped from 1.4 GB down to 25 MB in cache.

The End

Overall I was able to drop about 3 GB out of cache, making room for other data while raising the PLE, in two simple changes that were rather easy to track down just by saying “that doesn’t look like it belongs here”. Even the ones that do look like they belong there probably don’t need to have that large of a presence in the cache.

Although I didn’t actually measure the impact that it would have on the end users in these cases, because I didn’t start with a performance complaint, I would imagine that SQL Server tends to run faster processing 1 row it found quickly instead of reading through a couple million. So, measured or not, I’ll count that as a success as well.

PS. Jonathan Kehayias helped me

My queries looking into the proc cache are slightly modified versions of Jonathan’s work on his blog. Honestly, it didn’t need to be tweaked or touched for any reason other than the fact that I learn by tinkering. I did help myself by adding in filtering by database because I have a couple servers that have multiple copies of the same database, but, again, it worked great before I touched it, which is something you should expect from Jonathan’s work.

Here’s my altered version:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @IndexName SYSNAME = '[PK_ExternalMessage]'; 
DECLARE @DatabaseName SYSNAME;

SELECT @DatabaseName = '[' + DB_NAME() + ']';

WITH XMLNAMESPACES
   (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS sql_text,
    n.query('.'),
    cp.plan_handle,
    i.value('(@PhysicalOp)[1]', 'VARCHAR(128)') AS PhysicalOp,
    i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') AS IsLookup,
    i.value('(./IndexScan/Object/@Database)[1]', 'VARCHAR(128)') AS DatabaseName,
    i.value('(./IndexScan/Object/@Schema)[1]', 'VARCHAR(128)') AS SchemaName,
    i.value('(./IndexScan/Object/@Table)[1]', 'VARCHAR(128)') AS TableName,
    i.value('(./IndexScan/Object/@Index)[1]', 'VARCHAR(128)') as IndexName,
    i.query('.'),
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./OutputList/ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS output_columns,
    STUFF((SELECT DISTINCT ', ' + cg.value('(@Column)[1]', 'VARCHAR(128)')
       FROM i.nodes('./IndexScan/SeekPredicates/SeekPredicateNew//ColumnReference') AS t(cg)
       FOR  XML PATH('')),1,2,'') AS seek_columns,
    RIGHT(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'), len(i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)')) - charindex('.', i.value('(./IndexScan/Predicate/ScalarOperator/@ScalarString)[1]', 'VARCHAR(4000)'))) as Predicate,
    cp.usecounts,
    query_plan
FROM (  SELECT plan_handle, query_plan
        FROM (  SELECT DISTINCT plan_handle
                FROM sys.dm_exec_query_stats WITH(NOLOCK)) AS qs
        OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
      ) as tab (plan_handle, query_plan)
INNER JOIN sys.dm_exec_cached_plans AS cp 
    ON tab.plan_handle = cp.plan_handle
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/*') AS q(n)
CROSS APPLY n.nodes('.//RelOp[IndexScan/Object[@Index=sql:variable("@IndexName") and @Database=sql:variable("@DatabaseName")]]' ) as s(i)
--WHERE i.value('(./IndexScan/@Lookup)[1]', 'VARCHAR(128)') = 1
OPTION(RECOMPILE, MAXDOP 1);

Indexing Fundamentals


First, let’s understand what the types of indexes are. If you take a reference book, you have the two types of indexes. The book itself is a clustered index; it contains all the data and has the key field of page number. The index in the back is your nonclustered index; it’s in order by the key field of subject, then has the implied second key field of page number, which nonclustered indexes in SQL Server have as well. On top of that, the table of contents at the beginning of the book is just like the b-tree built on every index that helps you get to the right page faster.

The key fields I mentioned are the fields everything is ordered by, just as you would expect a book to be ordered by page number. Nonclustered indexes also have implied key fields, which are all of the clustered index’s key fields that weren’t explicitly made key fields already. That’s a great reason to keep your clustered indexes as narrow as possible in most cases, meaning they have fewer columns.

If you have a phone book then you have a clustered index with the key fields last_name, first_name, address, phone_number. That means it’s ordered by each one, so any duplicate last names are still in order by first name, and so on. It’s wide for a clustered index, but doesn’t have any nonclustered indexes to worry about, so wide is fine here.

However, the important thing to notice is that if I asked you for people with the last name “Hood”, you’d find them rather quickly. If I asked you whose phone number is 412-555-6789, you’d cringe. The same thing happens in SQL Server where it will do the work you ask of it, but it won’t be cheap, quick and easy. This is why the internet has reverse phone number lookups where they have a nonclustered index on phone number which lets you find the name quickly by just knowing the number.

Indexes can also have included columns, which are stored in the index, but no sorting is done on them. These fields don’t even exist in the b-tree used for finding the information, only in the leaf level where everything is stored. To simplify the process, this is where you put fields that you have in your SELECT clause but not your WHERE clause, since you don’t need them sorted to search quicker; you just need them to fulfill your query’s needs.

How are they used, and what do they cost?

A nonclustered index is used exactly the same way you read a reference book. You look up the key field in the back of the book and see what information is there. If you covered all the information that you needed just by looking at that index then you’re done. However, if you want to know more that wasn’t listed in the index then you take the key field of the clustered index (the page number) and do a key lookup by flipping to those pages and reading more about the subject at hand.

If you noticed, reading all that information in the back of the book seems pretty simple and efficient, especially when all of the information is there so you don’t have to touch the book itself. If you don’t have to touch the book then that’s called a covering index, which can be accomplished by adding the fields you want to retrieve to an index as either included or key columns. Depending on how many of these key lookups you have to do and how much extra information it has to get from the clustered index, it may make sense to make your index a covering index.
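To put that in code, here’s a minimal sketch (the table and index names are made up). The query below filters on Last_Name and only needs two more columns, so including them makes the index covering and skips the key lookups entirely:

-- Hypothetical table with a narrow clustered key
CREATE TABLE dbo.Employees
(
    EmployeeID Int NOT NULL Identity(1,1) PRIMARY KEY
    , First_Name VarChar(50) NOT NULL
    , Last_Name VarChar(50) NOT NULL
    , Phone_Number VarChar(20) NOT NULL
    , Notes VarChar(4000) NULL
)

-- Key column for the WHERE clause, included columns for the SELECT list.
-- The included columns live only at the leaf level, so they don't widen the b-tree.
CREATE NONCLUSTERED INDEX IX_Employees_LastName
    ON dbo.Employees (Last_Name)
    INCLUDE (First_Name, Phone_Number)

-- Covered query: an index seek with no key lookups back to the clustered index
SELECT First_Name
    , Last_Name
    , Phone_Number
FROM dbo.Employees
WHERE Last_Name = 'Hood'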

However, none of this is free. Every index you make has to be maintained. If you ignore filtered indexes, which I’ll get to soon, every time you insert or delete rows in a table you have to write to each index. So if you have 6 nonclustered indexes, it doesn’t matter if you have a clustered index or a heap (merely the lack of a clustered index), you have to do one write to the table itself and one write to each nonclustered index, so 7 writes for that one row.

Add on top of that maintaining those indexes for fragmentation, extra storage needed on disks and backups, and more data that can possibly be held in the cache, and this stuff really starts to get expensive.

How many indexes should I have?

There’s no magic number of indexes you should have on any single table. Too few makes it hard to find the data, and too many makes a maintenance nightmare with data modifications taking longer. What indexes and how many will vary wildly, but there are methods to figure out what’s appropriate for each table. Chances are, many of your tables will benefit from both dropping indexes and creating them.

Consolidate what you have

First, look at the indexes themselves. If you have indexes that have the same first key field as the clustered index, then you can consider removing those indexes with little to no detrimental effect. This is regardless of whether these indexes are being used or not.

Now, look at what indexes are never used. This can be done by looking at sys.dm_db_index_usage_stats to get you up to a month of data (assuming a monthly reboot for patching), or, my preferred method, by querying the tables you have set up to monitor this DMV. If an index is never being used it means it’s never helping you, but it is extra weight that’s adding up and slowing your database down. Just make sure you have enough data to say that the index really isn’t needed before relying on this alone.
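If you don’t have those monitoring tables in place yet, a quick version straight from the DMV looks something like this. It only knows about activity since the last restart, which is exactly why I keep warning you to make sure you have enough history:

-- Nonclustered indexes in the current database with no seeks, scans, or lookups
SELECT OBJECT_SCHEMA_NAME(i.object_id) + '.' + OBJECT_NAME(i.object_id) AS table_name
    , i.name AS index_name
    , us.user_updates    -- writes you're paying for with no reads to show for it
FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us
        ON us.object_id = i.object_id
            AND us.index_id = i.index_id
            AND us.database_id = DB_ID()
WHERE i.index_id > 1   -- skip heaps and clustered indexes
    AND i.is_primary_key = 0
    AND i.is_unique_constraint = 0
    AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    AND ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC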

Also, look for patterns in the indexes themselves by comparing the first couple of key fields. If those first key fields are the same, the indexes are candidates for being combined into a single index. Depending on how quickly those key fields make the data semi-unique, you may be talking one to three key fields matching before I’m itching to reduce the number of indexes. You can do this by taking the columns that are in the second index that aren’t in the first and adding them to the first index as included columns, then dropping the second index. Queries that used to hit either one of these indexes should run slower by an unnoticeable amount, yet you’re only storing data, doing updates, and maintaining a single index instead of two.
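As a hypothetical sketch of that consolidation (the table and index names are made up), rebuild the index you’re keeping with the extra included columns, then drop the one it replaces:

-- Two indexes that share the leading key column CustomerID:
--   IX_Orders_CustomerID             (CustomerID)
--   IX_Orders_CustomerID_OrderDate   (CustomerID) INCLUDE (OrderDate)

-- Rebuild the first index with the second one's extra column included
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate)
    WITH (DROP_EXISTING = ON)

-- Now the second index is redundant
DROP INDEX IX_Orders_CustomerID_OrderDate ON dbo.Orders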

When you’re going through this process you just have to be careful that you understand the implications of dropping a unique index. It is rarely a good idea to get rid of a unique index unless that uniqueness is enforced by other means as well.

Take this example.  I have a heap with a unique index on Col1 that gets scanned, but no seeks.  Also, there’s an NC index on Col2 that has a lot of seeks and is obviously doing a lot of lookups to get more information from the heap.  In this case I happen to know that the “unique” part of the index on Col1 is because that’s an identity column, so I’m not too concerned about maintaining the unique constraint here.  After seeing the usage stats over a month I’m able to restructure this table.  Both nonclustered indexes were dropped, and a new unique clustered index was created on Col4 and Col1.  User Lookups were eliminated since it’s all just a clustered index now, we saved about 700 MB on disk, and the compressed equivalent on backups as well.

Heap_With_Unique_NC_Identity

Take another example where you can see that an index is duplicated.  No, none of these are 100% the same, but the second index is unique (the is_unique field) and clustered (the * in the included columns is just my way of denoting that), with a single key column on Col1, which is the same as the first nonclustered index.  Seeing the usage like this, I’m going to drop the first index, saving 435 MB of space in my database.  This will have a negative impact in that it will move 626 index scans done throughout the month to an index that’s twice the size, and that’s a problem I’ll revisit later when I’m tuning queries that have high physical reads.

Notice you can also see two issues on this table that I’ll need to look into.  First, I have a nonclustered index that’s larger than my clustered index, something is wrong here.  Perhaps I should have put the fill factor on here to show that wasn’t the issue.  The real answer is that I’m working on a table that somehow got missed on the index maintenance, and I now know that this is possible and needs to be fixed.  Another issue is that there is a filtered index on here that has about 0.1% fewer rows than the unfiltered indexes, which means that you’re getting no benefits while alienating queries that may have used this otherwise.

Unused NC2

And one last example.  The 2nd and 3rd indexes both have a single key field on Col2 and just vary on the included columns.  This is a simple task of combining the two indexes by making one of them include Col1, Col2, Col6, Col7, Col8, and Col9, then dropping the other.  That saves us about 200 MB on disk, and avoids us having duplicated data in the cache.  Ask a SAN or backup admin and they’ll tell you the disk savings is great, but ask an unrealistically informed user and you’ll realize that cache savings is what makes a difference, dropping your physical reads, upping your PLE, and making your app run faster.

Then, there’s another simple task on this table of dropping the final index, which is nothing more than 112 MB of dead weight that was being updated an average of 30 times a day.  Not much to say here, they obviously didn’t use this as much as they thought they would.

Finally, you’re left with the info that User_Lookups is almost as high as the total number of seeks and scans on the nonclustered indexes.  This means that almost every query that hits the nonclustered indexes needs to do a key lookup on the clustered index.  Knowing that, if I was going to get to that level of tuning, I would be looking into the queries that hit this table and determine if I want to add more included columns to actually cover the query, or admit that I’ll never cover it efficiently and remove some included columns.  My answer here will vary partially depending on how many records are involved – key lookups are ok for 10 rows, but not so much for 1,000,000 rows.

Duplicated Index

None of these are hypothetical situations I forced AdventureWorks to do.  This is a real, production database, and I wrote two queries against it.  One to help me find these problems so I can put a change request in to fix them, and a second query that does the exact same thing while obfuscating the names to protect my employment.  So when you see the duplicated indexes or the sizes I’m working with, you can’t say that would never happen in production….it just did.

You reduced the load, now add to it

Now that you have fewer indexes in your database, it’s time to add to them. There are several ways to figure out what indexes you want to add, and I prefer to do everything manually. That means there’s no DTA (Database Tuning Advisor) here, it’s all scripts with proof that what I’m doing is making a difference.

Where are you finding the queries to tune

There are a couple of approaches to finding the queries most in need of attention. Query stats will give you the queries that take the most resources during your peak times, which will reduce the load on your server the quickest. Traces on duration will give you the queries that take the longest, along with sample calls containing all the parameters you need. Blocking monitors will give you the queries other queries are waiting on the most. Then there’s one of my favorites: job shadowing, which shows you what the pain points are for the users.

The first thing I look at is blocking, since it silently sits in the background making queries that could run extremely fast sit there and wait on a single slow query. If there is a lot of consistent blocking, take the head blockers and start tuning them. This won’t eliminate the fact that they’re grabbing locks, but it will cause those locks to be released quicker. Locks by themselves are not an issue and shouldn’t be eliminated, but the impact the locks have on other queries is an issue, and it’s best handled this way. If you don’t have many locks that go on too long, you can skip this and revisit it later. I say “too long” instead of giving you a number because it varies a lot; 5 seconds is too long for a web app while a minute isn’t much for a data warehouse.
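When I want a quick look at what’s blocked right now, a minimal sketch like this against sys.dm_exec_requests shows who is waiting and on whom (a monitoring tool or sp_WhoIsActive will do the same job more thoroughly):

-- Sessions currently waiting on another session
SELECT r.session_id
    , r.blocking_session_id
    , r.wait_type
    , r.wait_time AS wait_time_ms
    , t.text AS blocked_query
FROM sys.dm_exec_requests r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;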

Once that is out of the way, I’ll start diving into the most expensive queries on the server. They may not be the ones with the longest average run-time, but they add up to the most on the server. If you can cut the reads from 100,000 to 1,000 on a proc called every second, then you’re doing more for the server than you would by tuning an hourly query that does 10,000,000 reads. Also, focusing on physical reads will help raise your PLE if that’s a concern on your server, which it usually is. Don’t stay on this too long; you’re just trying to get the easy tuning out of the way and hopefully change the look of your top-10 most expensive queries list a bit.
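A minimal sketch for pulling those expensive queries out of the plan cache; sort by whichever column matters most to you (logical reads, physical reads, CPU, or executions), and remember the numbers only cover plans still in cache:

SELECT TOP (10)
    qs.execution_count
    , qs.total_logical_reads
    , qs.total_physical_reads
    , qs.total_worker_time AS total_cpu_microseconds
    , SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_logical_reads DESC;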

Now I move on to the traces to see if any huge queries are running that are easy tuning opportunities. Again, skim through them, pick up the low-hanging fruit, and move on. Right now you’re still in the phase of reducing the load on the server, but you may not have hit what the users see yet.

Another place to look is the cache, to find out what indexes are taking up the most space there. If you know your data well enough, you should know what to expect from your cache. What you see there will probably surprise you, and you’ll probably find an index_id of 0 or 1 (heap or clustered index) sitting there taking 10 times the space you would expect it to. These are cases where clustered index scans are taking place that you need to find. I was recently able to clear up 12 GB of cache on a server doing a full index scan on a 12 GB table where a 10 MB filtered index was able to do the job, which almost doubled PLE.
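A minimal sketch for seeing what’s sitting in the buffer pool for the current database; it can take a while on servers with a lot of memory, so be a little careful where and when you run it:

SELECT OBJECT_NAME(p.object_id) AS object_name
    , p.index_id             -- 0 = heap, 1 = clustered index
    , COUNT(*) * 8 / 1024 AS cached_mb
FROM sys.dm_os_buffer_descriptors bd
    INNER JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
    INNER JOIN sys.partitions p ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2 AND au.container_id = p.partition_id)
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id, p.index_id
ORDER BY cached_mb DESC;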

So, now that you have the server running a little (or a lot) better in general, I get into the work that never ends. Get hold of the people using the app, either yourself or by asking customer service to arrange it, and find out what they hate the most about the app’s performance. What you need here is for someone to write down when anything took a while, what screen they were on, and approximately how long it took. Don’t go asking anyone to use a stopwatch; you’re looking for “a couple seconds”, “about a minute”, or “I click OK then go get a cup of coffee.” Your goal here is to be able to correlate their pain points to something captured in your traces, giving you the procs they’re using along with real-life example calls to dive into.

Something you may find odd is that I’m saving the part that affects the end users the most for last, but I do it on purpose. The other pieces are things I skim through and do what I can with in a week or so, and they’ll make a bigger impact on the server in general while resolving some of the customers’ complaints along the way. You’re also learning more about the server and how everything fits together, so that when you start talking to your customers you can make a more immediate impact on their complaints.

After getting through the list once, I let my ADD take over, and man do I have ADD. The customers now have an idea of what a DBA can do for them, to the point that they may even contact you directly when something annoys them; developers realize a SQL Server specialist can be a good person to lean on; and your server’s biggest pain point hurts a lot less. Once you get to this point, jump around on this list on your servers and take up a query or two when you feel like it.

New Job Checklist


I started a new job a month ago, but didn’t want to just figure things out as they came to the surface. Instead, I walked in with a list of questions I wanted to know about their environment. For both my side and theirs, this went over very well.

For those used to my blog, you’d expect all the performance monitoring to be a part of it, but that’s not all there was to it. In fact, the most important questions had to do with backups, recoverability, and nonprod environments. Let’s go through it all in the order I feel is important. The list is already long enough just saying what to watch for without getting into the details, so I’ll do deep dives on each section either over time or upon request.

Staying Organized

Do we know what SQL Servers are out there and which ones are prod? Who do we contact if one goes down on its own, or if we’re going to intentionally take it down for maintenance? And when you do take one down intentionally, what’s the typical maintenance window for each server? Also, what builds of SQL Server and Windows are you running, including build number and architecture for both? There are a lot of things we should know about the servers we’re working on, right up to basic hardware stats like current and max RAM and the number of cores for licensing. Know what you’re working with, and make sure you have all the access you need on all the servers you’re supporting.
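As a minimal sketch, most of the SQL Server side of that inventory can be pulled from SERVERPROPERTY on each instance; how you collect it centrally is up to you:

-- Run on each instance you support
SELECT SERVERPROPERTY('MachineName') AS machine_name
    , SERVERPROPERTY('Edition') AS edition
    , SERVERPROPERTY('ProductVersion') AS build_number
    , SERVERPROPERTY('ProductLevel') AS service_pack_level;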

Backups and Recoverability

When was the last time each Tier-1 system was recovered from offsite backups? I’m looking for two things here. First, that they are being backed up offsite. Second, that we know those backups work because we test them regularly. I don’t care if you use tape or disk, and I don’t care if the “test” was an emergency restore to get back data that someone accidentally deleted a week ago. Do they exist, and do they work? In my case, I found a bit of a surprise here and ended up in the middle of a business continuity plan, as seen in last week’s post, Disaster Recovery and Business Continuity.
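As a starting point, a quick sketch against msdb on each instance will at least tell you when the last full backup finished for every database; it won’t tell you whether that backup made it offsite or whether a restore was ever tested:

SELECT d.name AS database_name
    , MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases d
    LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name
        AND b.type = 'D'    -- full backups only
WHERE d.name <> 'tempdb'    -- TempDB is never backed up
GROUP BY d.name
ORDER BY last_full_backup;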

Continuing on with business continuity, I wanted to get the typical misconceptions out of the way. So I continued with questions about how long it would take to recover from different types of disasters and what the business thinks those expectations are. Also, how much potential data loss does the business think is possible compared to the current recoverability? I know from my last blog post that some people feel I’m overstepping the boundaries of a DBA with some of this, and I’m OK with that. My immediate boss is the VP of Infrastructure, and I think people would do much better trying to view things from their boss’s point of view instead of their current position’s point of view.

Security

Now that I have basic recoverability covered, I moved on to security. Most companies have too many people with access to prod, along with sensitive data sitting in all environments, and this is easy to overlook. First things first: protect prod at all costs. Who has what permissions, and specifically, who has sysadmin? It used to default to BUILTIN\Administrators, and your old boxes will often still have that default unchanged. Don’t trust them to tell you who; query it yourself.
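A minimal sketch for answering the sysadmin question yourself, assuming you have permission to read the server principals:

SELECT p.name
    , p.type_desc
FROM sys.server_role_members rm
    INNER JOIN sys.server_principals r ON r.principal_id = rm.role_principal_id
    INNER JOIN sys.server_principals p ON p.principal_id = rm.member_principal_id
WHERE r.name = 'sysadmin';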

The next concern is the sensitive information that needs to be in prod and gets carried down to nonprod. Information such as SSNs doesn’t need to be accurate in nonprod, and there shouldn’t be a way to recreate it from the nonprod data. However, you have to be careful when making this data useless, since many apps join on this field, which means you need a consistent algorithm to obfuscate the data. Hopefully you have another field such as EmployeeID or ClientID that you can substitute consistently, and the developers will never know the difference.
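Purely as a hypothetical sketch of that substitution approach (the Employees table and its EmployeeID and SSN columns are placeholders, not anything from a real schema), deriving the fake value from another consistent key keeps the joins intact:

-- Hypothetical nonprod-only obfuscation; table and column names are placeholders
UPDATE dbo.Employees
SET SSN = RIGHT('000000000' + CAST(EmployeeID AS VARCHAR(9)), 9);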

Server Builds

Is there any consistency to the servers? Are min and max server memory set up properly, are they using Instant File Initialization where possible, how many TempDB files are there and how big are they, what is MAXDOP set to, what accounts does SQL Server run under, and is it a different account for each server? You need consistency if you’re going to consistently manage these servers. Yes, the values will vary, but it’s good to know they’re set up right before you start getting into the issues. Jonathan Kehayias and Brent Ozar both have good server build docs out there, and I think most DBAs could benefit from reviewing what they have.
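A minimal sketch for checking a few of those settings per instance; sp_configure shows the same values, this just reads sys.configurations directly:

SELECT name
    , value_in_use
FROM sys.configurations
WHERE name IN ('min server memory (MB)'
    , 'max server memory (MB)'
    , 'max degree of parallelism'
    , 'cost threshold for parallelism');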

Depending on how they do it, the DBA will have some role in setting up failure notifications for SQL Agent as well. I’ve seen this done with Database Mail, where the DBA does almost everything, and with an external program that reads the error logs, where the DBA just makes sure the jobs are set to write to the application log on failure. Either way, you have to query the system tables on all the servers to make sure the settings are what you expect them to be.
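A rough sketch of that check against msdb; for the notify_level columns, a value of 2 means the action fires on failure:

SELECT j.name AS job_name
    , j.enabled
    , j.notify_level_email       -- 2 = e-mail on failure
    , j.notify_level_eventlog    -- 2 = write to the event log on failure
    , o.name AS email_operator
FROM msdb.dbo.sysjobs j
    LEFT JOIN msdb.dbo.sysoperators o ON o.id = j.notify_email_operator_id
ORDER BY j.name;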

You also can’t forget about the maintenance jobs, and for those I believe everyone can benefit from Ola Hallengren’s Maintenance Solution. You may not use all of it, but you won’t write the parts you do use as thoroughly yourself. Ola did something amazing here and put it out there for free, and you’d be crazy not to at least look at his generous offerings.

Staying Up-To-Date

It’s best to find out up front what the plans are for getting everything to SQL 2012 or 2014. Even below that level, it’s good to know the process for getting all servers to the latest service pack and what the plan is for hotfixes and cumulative updates. The quicker you know this information, the quicker you’ll know what’s on your plate for the next couple of months.

How are we doing?

How do other departments view the DBA group? How do our customers (end users of any type) view our overall performance? How will you ever know if you don’t ask? The answers to questions like these are going to make a big impact on where your focus is. I would love to have questionnaires go out to senior management, mid-level management, and end users to find out their thoughts on a lot of different things: a mix of open-ended questions, such as their three favorite and three least favorite things in different areas, and simple rating questions on a scale of 1 to 4 (there is no average; there’s horrible, bad, good, and great). I want to know what they hope the app can do differently a year from now, how they view performance, and what their biggest headaches are.

In my case I made a mistake here when starting my new job, because I came in knowing they had severe performance issues. When I looked at the different servers I saw a 2 TB OLTP database with PLE dipping below 1,000 at times and a 100 GB OLTP database with PLE staying above 10,000. I KNEW which one they were complaining about and started making some huge improvements. I was wrong, and that 100 GB database needed a lot more help than it was letting on. That’s not to say the improvements I made weren’t needed or appreciated; it’s that I was fixing stuff internal users were complaining about while waiting for another manager to ask me why I wasn’t focusing on making life better for the people who were paying our bills. It flew under my initial radar because it was smaller waits and coding issues, while it was a constant struggle for the users who could pick one of our competitors over us.

I made a couple large changes that the SAN admin should notice on the 2 TB database and I got a couple “good job” comments out of it. Then I made a smaller change on the 100 GB database and got genuine gratitude and appreciation from multiple people. It wasn’t that I was working harder, it’s that my effort was focused in the right place after knowing where we stood and where our pain points actually were.

Finally, Monitoring

Once you know what you have, that it’s being backed up properly, that it has security, that it’s built consistently, that there are plans to stay up-to-date, and where the databases stand from others’ point of view, you are finally ready to start looking into monitoring your servers. Look at my Monitoring and Baselines Presentation to get an idea of what I’m looking to watch on each server. To be fair, we have Idera at my new job and I think that works out rather well, so there are things I don’t have custom code watching. What I do is make sure there is something accessible that watches everything I need and most of the things I want, without duplicating too much work.

I do have complaints about Idera, and perhaps I’ll get into the pros and cons of the product in a later post. As far as this type of product goes, it’s by far the one I have the most pros to talk about, so if any of the folks at Idera are reading this, you shouldn’t be sweating too much.

Wrap it up already!

This is a quick overview of the things that I’d prefer to talk about in my first two weeks on a job. Nothing was hit in-depth, and I could definitely write a whole post on each subject. However, don’t count on that happening. If you want a dive into a subject, ask.

I’d love to hear your comments on this, with the warning that I will stick to my guns completely if asked about wandering outside of the DBA role. I used to stick closer to my job description and requested duties, but I didn’t feel like I fully bloomed until I tried to see everything from a higher view, most often my boss’s view. Every organization can benefit from more leadership, and leadership has nothing to do with your position in the company. I truly hope this post encourages at least one or two people to take the lead and get the full picture under control, no matter whose responsibility it was.