The DMV sys.dm_os_performance_counters is awesome, if you can understand it. This is how I make it easy to read and use, and these are the values I watch and why I watch them. My list isn’t going to be perfect and you’re not going to agree with it 100%, and I’m ok with that. First, there is no perfect. Second, if we agreed 100% then one of us would just be mindlessly following the other, which isn’t good.
- Cache Hit Ratio – I ignore this value, but I still monitor it. I will, ideally, never be the only DBA on a team again, and everyone seems to think this value is cool.
- Page Life Exp – My favorite! When you read a page from disk into memory, how many seconds will it stay there? Just don’t use the outdated “300” rule or your disks will catch on fire!!!
- Page Lookups/Sec – How many pages are read from memory.
- Page Reads/Sec – How many pages are read from disk.
- Page Writes/Sec – How many pages are written to disk.
- Lazy Writes/sec – How many pages are written to disk outside of a checkpoint due to memory pressure.
- Batch Requests/sec – How busy is the server?
- Trans/sec – How busy is the server?
- Total Server Memory – How much memory SQL Server is currently using. It typically ramps up to the Target value, and PLE is low while it ramps up since the new pages in memory drop the average.
- Target Server Memory – How much memory SQL Server is allowed to use. Should be the same as the max memory setting, but memory pressure can cause this to decrease.
- Memory Grants Pending – How many processes aren’t able to get enough memory to run. Should be 0, always 0, if not then find out why.
- Deadlocks – How many deadlocks are we getting. Most apps handle deadlocks gracefully, but they still lose time doing it. If this number starts going up, start looking into it.
- SQL Compilations/sec – This is a hidden performance killer! Some queries can’t be cached so they’re compiled every time they’re run. I’ve seen this with a query being run once a second and a big server was running slower than my laptop. It’s normal for things to compile throughout the day, it’s not normal for this number to be 10x higher than before that last upgrade.
- SQL Re-Compilations/sec – Same goes here. The counters aren’t that much different.
If you know a little about this DMV then you know these values are cryptic. There are several ways this data is stored, and each type has to be retrieved differently to be useful. Then many of these are cumulative since the server was restarted, which isn’t going to help too much on its own. Even worse, MSDN failed us on this one, and figuring out this DMV required help outside of that site. Now for the good news: the script below will take care of all of that for you and leave you with easy reading and values you can filter to the time periods you care about. If you want to add some of your own counters then just follow my lead on one that has the same cntr_type, or you can go to Rabin’s blog post that I learned from.
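To see what I mean by how the data is stored, you can peek at cntr_type directly. This is just an exploratory sketch; the constants in the comments are the standard Windows performance counter types these map to:

```sql
--65792     = point-in-time value, use as-is (Page life expectancy, Memory Grants Pending)
--272696576 = cumulative since restart; diff two samples and divide by the seconds between them
--537003264 = ratio; divide by its matching "base" counter, which is type 1073939712
SELECT object_name, counter_name, instance_name, cntr_value, cntr_type
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy'
    , N'Batch Requests/sec'
    , N'Buffer cache hit ratio'
    , N'Buffer cache hit ratio base')
```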
```sql
IF OBJECT_ID('tempdb..#OSPC') IS NOT NULL BEGIN
    DROP TABLE #OSPC
END

DECLARE @FirstCollectionTime DateTime
    , @SecondCollectionTime DateTime
    , @NumberOfSeconds Int
    , @BatchRequests Float
    , @LazyWrites Float
    , @Deadlocks BigInt
    , @PageLookups Float
    , @PageReads Float
    , @PageWrites Float
    , @SQLCompilations Float
    , @SQLRecompilations Float
    , @Transactions Float

DECLARE @CounterPrefix NVARCHAR(30)
SET @CounterPrefix = CASE WHEN @@SERVICENAME = 'MSSQLSERVER'
        THEN 'SQLServer:'
        ELSE 'MSSQL$' + @@SERVICENAME + ':'
    END

--Grab the current values from dm_os_performance_counters
--Doesn't do anything by instance or database because this is good enough and works unaltered in all environments
--Counter names below are cased exactly as the DMV stores them, so this also works on case-sensitive and binary collations
SELECT counter_name, cntr_value --, cntr_type --I considered dynamically doing each counter type, but decided manual was better in this case
INTO #OSPC
FROM sys.dm_os_performance_counters
WHERE object_name LIKE @CounterPrefix + '%'
    AND instance_name IN ('', '_Total')
    AND counter_name IN ( N'Batch Requests/sec'
        , N'Buffer cache hit ratio'
        , N'Buffer cache hit ratio base'
        , N'Free pages'
        , N'Lazy writes/sec'
        , N'Memory Grants Pending'
        , N'Number of Deadlocks/sec'
        , N'Page life expectancy'
        , N'Page lookups/sec'
        , N'Page reads/sec'
        , N'Page writes/sec'
        , N'SQL Compilations/sec'
        , N'SQL Re-Compilations/sec'
        , N'Target Server Memory (KB)'
        , N'Total Server Memory (KB)'
        , N'Transactions/sec')

--Just collected the second batch in the query above
SELECT @SecondCollectionTime = GETDATE()

--Grab the most recent values, if they are appropriate (no reboot since grabbing them last)
SELECT @FirstCollectionTime = DateAdded
    , @BatchRequests = BatchRequests
    , @LazyWrites = LazyWrites
    , @Deadlocks = Deadlocks
    , @PageLookups = PageLookups
    , @PageReads = PageReads
    , @PageWrites = PageWrites
    , @SQLCompilations = SQLCompilations
    , @SQLRecompilations = SQLRecompilations
    , @Transactions = Transactions
FROM OSPerfCountersLast
WHERE DateAdded > (SELECT create_date FROM sys.databases WHERE name = 'tempdb')

--If there was a reboot then all these values would have been 0 at the time the server came online (AKA: tempdb's create date)
SELECT @FirstCollectionTime = ISNULL(@FirstCollectionTime, (SELECT create_date FROM sys.databases WHERE name = 'tempdb'))
    , @BatchRequests = ISNULL(@BatchRequests, 0)
    , @LazyWrites = ISNULL(@LazyWrites, 0)
    , @Deadlocks = ISNULL(@Deadlocks, 0)
    , @PageLookups = ISNULL(@PageLookups, 0)
    , @PageReads = ISNULL(@PageReads, 0)
    , @PageWrites = ISNULL(@PageWrites, 0)
    , @SQLCompilations = ISNULL(@SQLCompilations, 0)
    , @SQLRecompilations = ISNULL(@SQLRecompilations, 0)
    , @Transactions = ISNULL(@Transactions, 0)

SELECT @NumberOfSeconds = DATEDIFF(ss, @FirstCollectionTime, @SecondCollectionTime)

--I put these in alphabetical order by counter_name, not column name. It looks a bit odd, but makes sense to me
--Deadlocks are odd here. I keep track of the number of deadlocks in the time period, not the average number of deadlocks per second.
--AKA, I keep track of things the way I would refer to them when I talk to someone.
--"We had 2 deadlocks in the last 5 minutes", not "We averaged .00002 deadlocks per second there"
INSERT INTO OSPerfCounters (DateAdded, Batch_Requests_Sec, Cache_Hit_Ratio, Free_Pages, Lazy_Writes_Sec, Memory_Grants_Pending
    , Deadlocks, Page_Life_Exp, Page_Lookups_Sec, Page_Reads_Sec, Page_Writes_Sec, SQL_Compilations_Sec, SQL_Recompilations_Sec
    , ServerMemoryTarget_KB, ServerMemoryTotal_KB, Transactions_Sec)
SELECT @SecondCollectionTime
    , Batch_Requests_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec') - @BatchRequests) / @NumberOfSeconds
    , Cache_Hit_Ratio = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio') * 1.0 --multiply by 1.0 so this isn't integer division, which would round the ratio to 0 or 1
        / (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base')
    , Free_Pages = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Free pages')
    , Lazy_Writes_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy writes/sec') - @LazyWrites) / @NumberOfSeconds
    , Memory_Grants_Pending = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Memory Grants Pending')
    , Deadlocks = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec') - @Deadlocks)
    , Page_Life_Exp = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page life expectancy')
    , Page_Lookups_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec') - @PageLookups) / @NumberOfSeconds
    , Page_Reads_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
    , Page_Writes_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec') - @PageWrites) / @NumberOfSeconds
    , SQL_Compilations_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec') - @SQLCompilations) / @NumberOfSeconds
    , SQL_Recompilations_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec') - @SQLRecompilations) / @NumberOfSeconds
    , ServerMemoryTarget_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Target Server Memory (KB)')
    , ServerMemoryTotal_KB = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Total Server Memory (KB)')
    , Transactions_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec') - @Transactions) / @NumberOfSeconds

TRUNCATE TABLE OSPerfCountersLast

--Note, only saving the last value for the ones that are calculated per second.
INSERT INTO OSPerfCountersLast (DateAdded, BatchRequests, LazyWrites, Deadlocks, PageLookups, PageReads
    , PageWrites, SQLCompilations, SQLRecompilations, Transactions)
SELECT DateAdded = @SecondCollectionTime
    , BatchRequests = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Batch Requests/sec')
    , LazyWrites = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy writes/sec')
    , Deadlocks = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Number of Deadlocks/sec')
    , PageLookups = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page lookups/sec')
    , PageReads = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec')
    , PageWrites = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page writes/sec')
    , SQLCompilations = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Compilations/sec')
    , SQLRecompilations = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'SQL Re-Compilations/sec')
    , Transactions = (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Transactions/sec')

DROP TABLE #OSPC
```
Throw that code above in a proc, schedule it to run every so often (I like 5 minutes) and it’ll… fail. It kinda relies on a couple of tables you should create first. Here ya go.
```sql
CREATE TABLE OSPerfCounters (
    DateAdded datetime NOT NULL
    , Batch_Requests_Sec int NOT NULL
    , Cache_Hit_Ratio float NOT NULL
    , Free_Pages int NOT NULL
    , Lazy_Writes_Sec int NOT NULL
    , Memory_Grants_Pending int NOT NULL
    , Deadlocks int NOT NULL
    , Page_Life_Exp int NOT NULL
    , Page_Lookups_Sec int NOT NULL
    , Page_Reads_Sec int NOT NULL
    , Page_Writes_Sec int NOT NULL
    , SQL_Compilations_Sec int NOT NULL
    , SQL_Recompilations_Sec int NOT NULL
    , ServerMemoryTarget_KB int NOT NULL
    , ServerMemoryTotal_KB int NOT NULL
    , Transactions_Sec int NOT NULL
)

--You'll typically only query this by one value, which is added sequentially. No page splits!!!
CREATE UNIQUE CLUSTERED INDEX IX_OSPerfCounters_DateAdded_U_C ON OSPerfCounters
(
    DateAdded
) WITH (FILLFACTOR = 100)

--Only holds one value at a time, indexes are a waste
CREATE TABLE OSPerfCountersLast (
    DateAdded datetime NOT NULL
    , BatchRequests bigint NOT NULL
    , LazyWrites bigint NOT NULL
    , Deadlocks bigint NOT NULL
    , PageLookups bigint NOT NULL
    , PageReads bigint NOT NULL
    , PageWrites bigint NOT NULL
    , SQLCompilations bigint NOT NULL
    , SQLRecompilations bigint NOT NULL
    , Transactions bigint NOT NULL
)
```
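Once the tables exist, step 1 of the job is just the collection script wrapped in a proc, something like this sketch (the proc name dbo.OSPerfCounters_Collect is my own placeholder, use whatever fits your conventions):

```sql
CREATE PROCEDURE dbo.OSPerfCounters_Collect
AS
BEGIN
    SET NOCOUNT ON

    --Paste the full collection script from above here, unchanged

END
```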
The important part of all this is how you use it. It’s tempting to just look at the last 7 records and say you know what’s going on, but that makes me want to slap you. Every server is different, every server has different loads and baselines, and you’re either underworked or you don’t know what those baselines are for every server you manage. I do simple baselines every time I look at an incident: the last hour, the same time yesterday, and the same time a week ago. That gives you a chance to see what’s normal for this server and what’s different right now. This query is so simple you’ll wonder why I even posted it, but it’s effective, which is why it’s here. The 7 records per time period, that’s because 21 records show up on my screen without me scrolling; it is NOT a magic number!
```sql
SELECT 'Today', *
FROM (SELECT TOP 7 * FROM OSPerfCounters ORDER BY DateAdded DESC) X
UNION
SELECT 'Yesterday', *
FROM (SELECT TOP 7 * FROM OSPerfCounters WHERE DateAdded <= GETDATE() - 1 ORDER BY DateAdded DESC) Y
UNION
SELECT 'Last Week', *
FROM (SELECT TOP 7 * FROM OSPerfCounters WHERE DateAdded <= GETDATE() - 7 ORDER BY DateAdded DESC) Z
ORDER BY DateAdded DESC
```
And, well, here’s something I’ve been skipping in my posts, telling people to handle cleanup on their own… This is step 2 of my jobs that populate my monitoring tables, to keep your data from becoming the ever-growing data you’re struggling with in every other app. I delete in batches according to the clustered index. It’s overkill for something deleting one row at a time, or, even if you put this in a separate daily job, 288 rows if the process is scheduled every 5 minutes. So, why the batches? Because I copy/paste my own code everywhere, batches are reusable, and this is how I chop off the tail end of EVERYTHING!
```sql
SELECT 'Start' --Give me a rowcount of 1

WHILE @@ROWCOUNT > 0 BEGIN
    DELETE TOP (100000)
    FROM OSPerfCounters
    WHERE DateAdded < (GETDATE() - 400)
END
```
In the beginning I mentioned that if you agreed with me 100% then one of us is a mindless monkey. Look, I put this out there first, so I’m obviously not the mindless monkey here, am I? There’s a box below that gives you a chance to show that you’re not a mindless monkey either! Tell me I’m wrong, how I can do better, and how everyone else reading this can benefit from it even more! I’ll promote you from mindless monkey to talking monkey!
Related articles
- SQL Server Running Slow (simplesqlserver.com)
- Fixing Page Life Expectancy (PLE) (simplesqlserver.com)
brilliant
Thank you!
Steve,
Great post! Have you tested this on SQL 2012? I can get it to work on 08R2, but on 2012 I can’t get the following columns to populate data.
[Batch_Requests_Sec], [Free_Pages], [Lazy_Writes_Sec], [Page_Lookups_Sec], [Page_Reads_Sec], [Page_Writes_Sec], [SQL_Compilations_Sec], [SQL_Recompilations_Sec]
The OSPerfCountersLast table populates, but the OSPerfCounters table stays NULL.
I think the script works on SQL 2012, I’m just having problems getting it to work on a different collation. Can you update the script where it will work on a Latin1_General_BIN collation?
I have to admit I’ve never tried this on different collations, but I couldn’t imagine it not working properly as long as everything is consistent.
I’ll test this out tomorrow and let you know what I find.
Thanks Steve, I have a server using the Latin1_General_BIN collation and can’t get this to work for anything. I’ve tried inserting the COLLATE clause but no luck.
From doing more research, it almost seems as if something is going wrong in the calculation part, for example: Page_Reads_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Page reads/sec') - @PageReads) / @NumberOfSeconds
Somewhere the collation names are inconsistent, probably between master database and the database that you’re storing this information in. I haven’t had a chance to reproduce this yet, but that would be the first thing I would do.
Yes, the collation I’m running the query against is Latin1_General_BIN. I’m storing it in a database collated SQL_Latin1_General_CP1_CI_AS
Mixing collations is a bad idea if you can avoid it. When you have one database talking to another then you’ll need to do hints to get almost anything to work properly. Is it possible to have the database you’re storing this in and the master database to have the same collation?
Well, I’m capturing data from multiple servers and storing them on a warehouse server. All my servers are using SQL_Latin1_General_CP1_CI_AS except one. 😦
Upon more research, it looks like it’s not holding the values for the variables. For example: Lazy_Writes_Sec = ((SELECT cntr_value FROM #OSPC WHERE counter_name = N'Lazy writes/sec') - @LazyWrites) / @NumberOfSeconds. Once it gets to this point in the script, there are no values for @LazyWrites or @NumberOfSeconds, so it results in a NULL. Have you ever seen this before?
There should at least be a 0 in there for those with this already being run: @LazyWrites = ISNULL(@LazyWrites, 0)
Is the error you’re getting a collation error or something related to nulls? Also, I’ll probably have to see how you modified this to know what’s going on.
I haven’t made many adjustments to the code. Just inserting over a linked server. @LazyWrites = ISNULL(@LazyWrites, 0) is still in here. I haven’t removed anything.
There are no errors. Just NULL values in these columns: [Batch_Requests_Sec], [Free_Pages], [Lazy_Writes_Sec], [Page_Lookups_Sec], [Page_Reads_Sec], [Page_Writes_Sec], [SQL_Compilations_Sec], [SQL_Recompilations_Sec]
It’s really strange.
I’m sorry, I haven’t had time to play with this recently. My advice, if you haven’t figured this out already, would be to make sure the queries are working locally then start putting in collation hints everywhere for that server.
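If it helps, here’s a hedged sketch of what a collation hint on one of those comparisons might look like. One other thing worth checking on a BIN collation: binary collations are case-sensitive, so every counter name in the script has to match the DMV’s casing exactly (e.g. N'Page lookups/sec', not N'Page Lookups/Sec'), or the lookup quietly returns nothing.

```sql
--A sketch only: force the comparison to a known collation by hinting the literal.
--DATABASE_DEFAULT resolves to the current database's collation.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Lazy writes/sec' COLLATE DATABASE_DEFAULT
```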
The best way would probably be to have a service outside of SQL Server read the data from each server then put it on a central server, which I believe would eliminate worries about different collations, but that’s not exactly a quick and easy fix.
How often do you run this to collect the data?
This data is so lightweight in both resources to retrieve it and store it that every minute wouldn’t hurt, but it also wouldn’t help much. The question is how often will you look at it and what will you be using it for. This isn’t one that you’ll kill your system by over-monitoring, you’d just have more results to filter through when you want to use it.
If you’re trying to find out what’s causing PLE to drop and you want to know right when it’s dropping, every 5 minutes makes sense. If you’re looking for general baselines then every hour is fine.
Other areas of monitoring are a lot worse. Specifically, for things like seeing what indexes are in the cache, where SQL Server has to aggregate a lot of data, you can damage your performance by watching it too much. In all cases with all monitoring, know how much each query costs and put thought into whether you’d take advantage of the data being that granular.
A good starting point is every 30 minutes for this one. Check back on it in a week or two, look through the data and ask if you need more detail or if you’re collecting more than you’d ever use.
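If you’re scheduling it with SQL Agent, the every-30-minutes version looks something like this sketch; the job name, proc name, and database name are all placeholders for wherever you put the collection code:

```sql
USE msdb

EXEC dbo.sp_add_job @job_name = N'Collect OS Perf Counters'

EXEC dbo.sp_add_jobstep @job_name = N'Collect OS Perf Counters'
    , @step_name = N'Run collection'
    , @subsystem = N'TSQL'
    , @command = N'EXEC dbo.OSPerfCounters_Collect'
    , @database_name = N'DBA' --wherever you created the tables

EXEC dbo.sp_add_jobschedule @job_name = N'Collect OS Perf Counters'
    , @name = N'Every 30 minutes'
    , @freq_type = 4             --daily
    , @freq_interval = 1
    , @freq_subday_type = 4      --units of minutes
    , @freq_subday_interval = 30

EXEC dbo.sp_add_jobserver @job_name = N'Collect OS Perf Counters'
```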
Read this before you apply any monitoring: http://www.brentozar.com/archive/2014/03/the-worst-database-user/
Nice post, very similar to what we are doing.
I have set up a CACTI poller with a custom PHP script that runs every 5 minutes, pulls these statistics, and puts them onto a CACTI graph.
I have been running these for about 2 years and also an upgrade from SQL 2008 to 2012, it was interesting to see the difference 2012 made and how the performance tweaking over the years has affected the servers.
http://www.cacti.net/
A lot of your posts I have read this evening have given me new ideas of how we can further improve our monitoring. Thanks for all the great posts.
I’m glad I was able to help. Please let me know if you run into any issues as this running in a variety of instances like this is more testing than I’m personally able to do.
Just remember that while this counter is perfectly fine to capture every 5 minutes, not everything I blog about is. Some of them risk putting you on the wrong side of the performance issues, especially the ones that look in the cache.
I had to modify the Cache_Hit_Ratio to show percentage
(SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio') * 1.0
/ (SELECT cntr_value FROM #OSPC WHERE counter_name = N'Buffer cache hit ratio base') * 100
The Free Pages counter is no longer available in SQL Server 2012. Some counters have been standardized as “KB” rather than bytes or pages, and a new Memory Node object has been added in SQL 2012. Hence, instead of the Free Pages counter, you can now use Free Memory (KB) or Free Node Memory (KB) in 2012, and instead of Total Pages you can use Total Node Memory (KB), and so on.
You’re right, and I’ll have to update this post to reflect that. Thank you for pointing this out.
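For anyone hitting this on 2012 or later before I update the post, here’s a sketch of pulling the replacement Memory Manager counters mentioned in the comment above:

```sql
--SQL 2012+ only; these KB-based Memory Manager counters replace the old page-based ones
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Free Memory (KB)'
    , N'Total Server Memory (KB)'
    , N'Target Server Memory (KB)')
```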