Data Compression

Data compression is often misunderstood to cost CPU in exchange for smaller size on disk.  Somewhat true, but that simple explanation ignores other savings that often result in net drop in CPU utilization.

Full disclosure: This is an Enterprise-ONLY feature introduced in SQL 2008.  It is engrained in the structure of your data, so it also means you can’t take a backup of a database that has a compressed index and restore it to anything other than Enterprise or Developer Editions.

Here are the simple facts we’ll play with:

  • Two levels – row-level and page-level
  • Page-level is row-level plus extra compression
  • Compression ratios vary by column types, index width, and data
  • Data is compressed both on disk and in memory
  • CPU goes both ways, and it needs to be tested
    • Uses CPU to compress on writes and index maintenance
    • Uses CPU to decompress when used by a query
    • Saves CPU with fewer physical reads
    • Saves CPU with fewer logical reads
    • And more…

Abstract Thought

This post is at a level of abstraction that doesn’t get into what happens in the background.  My goal is to encourage you to test it out, understand why it helps, and be able to explain that in your change control process.

For those of you who aren’t satisfied with my “Gas pedal make car go fast” explanation, Jes Borland (b|bob|t) wrote A Look Inside SQL Server Row and Page Compression, and Brad McGehee (b|t) wrote A Quick Introduction to Data Compression in SQL Server 2008.

You can even dive into more details such as using different levels of compression on each partition of an index, or even talking to Joey D’Antoni (b|t) about the archival levels of compression on columnstore indexes.

There’s a lot of detail on how compression can cost CPU, but the details that save CPU are typically only mentioned in passing without doing a deep dive into the topic.  Data Compression: Strategy, Capacity Planning and Best Practices mentions that less Logical I/O is less to consume CPU.  SQL Server Database Compression is indirectly mentioning having a smaller B+Tree structure.

The purpose of this post isn’t to earn anyone a doctorate (or claim that I’m at that level), it’s more of a practitioner level.

What’s it do?

Each page is the same 8kb size but contains more data per page, as per Captain Obvious.  This means less space on disk and backups.  Those are nice, but I don’t care too much about that.

Then you read the data into memory so queries can use it.  This is a physical I/O to disk that goes through the CPU (using extra CPU to decrypt it if you use TDE) to make it into memory.  It stays compressed when in memory, so all of your indexes (not just this one) have more room to hang around and avoid more physical I/Os and the costs I just mentioned.

Finally, a query needs to use the data, and that has positives (+) and negatives (-).  The data is more likely to be in cache (+) because it’s smaller and a page with more data is more likely to be referenced. It’s easier to get into cache if it wasn’t there already (+). Then it’s easier to get to the data because the smaller data may have fewer levels in the B+Tree (+). Along the way it has to decompress the root and intermediate level pages (-) which are always row-level compressed when you use any level of compression then decompress the leaf-level pages (-) which are compressed at the level you picked.  However, there are fewer pages, which results in less Logical I/O (+).

You’re not going to accurately figure out the positives and negatives of that last paragraph.  The important part is that you know there are positives AND negatives, which means you put away the calculus and just run some tests.

My experience is that if the data is compressed by 25% or more than it helps more than it hurts.  Find how much you’ll save by running sp_estimate_data_compression_savings for both row-level and page-level compression.  If you don’t get much extra compression with page-level then don’t even test it, it’s an added expense that needs to be justified.

What Compresses Well?

The hard way is to analyze each column, its data type, the data in that column, the width of the index, etc..  You can read the links in the Abstract Thought section to see what Brad and Jes have to say about it if you want.  This will be very important if you’re designing tables and keeping how compressible the data is in mind, but less so if you’re compressing already existing indexes.

The easy way (my personal favorite) is to just run sp_estimate_data_compression_savings I just mentioned and actually compress the indexes on a non-prod server.  Computers are good at math, let them do it.

How to Test

I’m not diving deep into testing here, but there are three things that stand out.

  • How much memory are you saving?
  • How do your queries perform?
  • How much faster is the data pulled from disk?

For how much memory you would save, look at my Cleaning Up the Buffer Pool post to see how much memory that index is using.  Since you’re only changing how much space the data takes and not the columns of the indexes here, you can just multiply that by the new compression ratio.  Use the actual ratio comparing the index size in prod to where you’re testing in non-prod to make sure it’s accurate.  Yes, if you have a 10 GB index which tends to be 100% in cache that you just compressed 80%, it will be like you added 8 GB of memory in many ways.

I do query performance and how much faster the data is pulled from disk together, because that’s how it’s done in the real world.  Pick your queries that hit that index, typically by looking in the plan cache or an XEvent session.  Then, on a non-prod server, run the queries both with and without DBCC DROPCLEANBUFFERS, again, on a non-prod server.

You can remove compression on any index, down to the partition level, by doing ALTER INDEX … REBUILD WITH (DATA_COMPRESSION = NONE).  Adding compression is the same statement with ROW or PAGE instead of NONE.

Sum It Up

Do this all in non-prod.

  1. See what compresses well
  2. Test it
  3. Test it again

The End

Let’s hear from you.  If you needed more data to make an informed choice, throw it in the comments where others can benefit from your experience, and I may even edit the post to add it in.  Also, if this is making a big difference for a lot of people, I’ll do what I can to tweak the SEO and help more people find this through search engines.

The best compliment is a question.  It means you value my expertise enough to want my thoughts and opinions.

Advertisements

7 thoughts on “Data Compression

  1. Everything is so dynamic that it’s almost impossible to say exactly how it will affect you. In an oversimplified view, a join is less efficient because it has to decompress data, but more efficient because it is traversing a smaller index. It leaves you with three choices: 1. Avoid a potentially beneficial technology due to uncertainty. 2. Do advanced calculus and statistics to determine a somewhat accurate estimated benefit. 3. Test, test again, implement, verify.

    Keep in mind that a lot of the same people who say to be very cautious about data compression are the same ones who say that backup compression is always worth it. I’m not saying either are always worth it (they aren’t, see TDE with backup compression as an example), but I am saying it’s worth testing. Have results to back up the changes you propose, and have a rollback plan if testing didn’t paint 100% of the picture.

  2. Pingback: Indexing Strategy | Simple SQL Server

  3. Pingback: Use Compression to Combine Data Quality and Performance | Simple SQL Server

  4. Sorry, but I can’t find the reason (neither in your post nor in the both “gas pedal” links), why data compression could save more CPU than it uses for the compressing overhead.

    Of course – compression reduces I/O (and the little CPU that Windows needs for I/O purposes) and saves memory (so it should become faster, except you are CPU bound), but as long you do not have a database that does not fit into the RAM I have no idea, why it should result in fewer CPU usage….

    • What I’m finding is simply stating that Logical I/O takes CPU, compression reduces Logical I/O, so compression can also reduce CPU.

      Data Compression: Strategy, Capacity Planning and Best Practices: Logical I/O (if data is in memory): Because logical I/O consumes CPU, reduced logical I/O can sometimes compensate for the CPU cost to compress and decompress the data.

      SQL Server Database Compression: For medium and large databases, even CPU usage goes down: The threshold, where the overhead for compressing and decompressing pages on the fly starts being smaller than the overhead of sorting out which pages to access, is reached very quickly.

      Personal experience has me testing performance on anything that compresses at least 25%, and never blindly trusting it will be better. Same goes for progressing from Row to Page level compression, if it gets another 25% more compression then test it with Page level as well. I’ll test indexes that compress a little less as well, but they’re lower priority and get scrutinized more.

      I have updated this post to have this included. Thank you for having me back up what I stated, it will help me make this more complete and trustworthy.

  5. I can’t say I understand the mechanics in the backend completely, but there are definitely savings when working with fewer pages of data. Also, as data goes in and out of memory, the only path to and from memory is through the CPU, not counting if you’re using TDE as most of us are forced to do anymore.

    Right now I can test many queries both with and without compression to show a drop in CPU, even with data in cache. I plan on getting to know these reasons better to understand the theory, but will always rely on testing it multiple times as I do now even when I get to that point.

Questions are some of the sincerest compliments

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s