I started a new job a month ago, but didn’t want to just figure things out as they came to the surface. Instead, I walked in with a list of questions I wanted to know about their environment. For both my side and theirs, this went over very well.
For those used to my blog, you’d expect all the performance monitoring to be a part of it, but that’s not all there was to it. In fact, the most important questions had to do with backups, recoverability, and nonprod environments. Lets go through it all in the order that I feel it’s important. This is already a long enough list with just saying what to watch for without getting into the details, and I’ll do deep dives on each section either over time or upon request.
Do we know what SQL Servers are out there and which ones are prod? Also, who do we contact if it either went down on its own or if we’re going to intentionally take it down for maintenance? For when you do take it down intentionally, what’s the typical maintenance window for each server? Also, what builds of SQL Server and Windows are you running, including SQL Server build number and architecture for both Window and SQL Server? There are a lot of things we should know about the servers we’re working on, right up to basic hardware stats with current and max RAM and the number of cores for licensing. Know what you’re working with and make sure you have all the access you need on all the servers you’re supporting.
Backups and Recoverability
When was the last time each Tier-1 system was recovered from offsite backups? I’m looking for two things here. First, that they are being backed up offsite. Second, that we know those backups are working by testing them regularly. I don’t care if you use tape or disk, and I don’t care if the “test” was an emergency restore to get data back that someone accidentally deleted a week ago. Do they exist and do they work? In my case, I found a bit of a surprise here and ended up getting in the middle of a business continuity plan as seen on last week’s post,Disaster Recovery and Business Continuity
Continuing on with Business Continuity, I wanted to get the typical misconceptions out of the way. So I continued with the questions of how long it would take to recovery from different types of disasters and what does the business think those expectations are. Also, how much potential data loss does the business think is possible compared to the current recoverability? I know from my last blog post that some people feel that I’m overstepping the boundaries of a DBA with some of this, and I’m ok with that. My immediate boss is the VP of Infrastructure, and I think that people would do much better trying to view things from their boss’s point of view instead of their current position’s point of view.
Now that I have basic recoverability covered, I moved on to security. First, most companies have too many people with access to prod along with sensitive data in all environments. This is easy to overlook. First thing’s first, protect prod at all costs. Who has what permissions, specifically, who has sysadmin? It used to be BUILTIN/Administrators, and your old boxes will still have that default, often unchanged. Don’t trust them to tell you who, query it yourself.
The next concern is the sensitive information that needs to be in prod and gets carried down to nonprod. Information such as SSNs don’t need to be accurate in nonprod, and there doesn’t need to be a way to recreate them from the nonprod data. However, you have to be careful when making this data useless since many apps join on this field, making you have a consistent algorithm to obfuscate the data. Hopefully you have another field such as EmployeeID or ClientID that you can substitute consistently and the developers would never know the difference.
Is there any consistency to the servers? Is the Min/Max server memory set up properly, are they using Instant File Initialization where possible, how big and how many files are there for TempDB, what is MaxDOP set as, what accounts does SQL Server run off of and is it a different account for each server? You need to have consistency if you’re going to consistently manage these servers. Yes, the values will vary, but it’s good to know that they’re set up right before you start getting into the issues. Jonathan Kehayias and Brent Ozar both have good server build docs out there, and I think most DBAs could benefit from reviewing what they have.
Depending on how they do it, the DBA will have some role in setting up failure notifications for SQL Agent as well. I’ve seen this as Database Mail where the DBA does almost everything and as an external program that reads the error logs where the DBA just makes sure the jobs are set to write to the app log on failure. Either way, you have to query the system tables on all the servers to make sure the setting are what you expect them to be.
You also can’t forget about the maintenance jobs. For which I believe everyone can benefit from Ola Hellengren’s Maintenance Solution. You may not use all of it, but the parts you do use you won’t write this thoroughly. Ola did something amazing here, put it out there for free, and you’d be crazy not to at least look at his generous offerings.
It’s best to find out up-front what the plans are to move forward with getting everything to SQL 2012 or 2014. Even below that level, it’s good to know the process of getting all servers to the latest service pack and what the plan is with hotfixes and cumulative updates. The quicker you know this information the quicker you’ll know what’s on your plate for the next couple months.
How are we doing?
How do other departments view the DBA group? How do our customers (end users of any type) view our overall performance? How will you ever know if you don’t ask? The answers to questions like these are going to make a big impact on where your focus is. I would love to have questionnaires go out to Sr. Management, Mid-Level Management, and to End Users to find out their thoughts on a lot of different things, with a mix of open-ended questions such as their 3 favorite and 3 least favorite things on different areas and simple ratings questions with a rating from 1 to 4 (there is no average, there’s horrible, bad, good, and great). I want to know what they hope the app can do differently a year from now, how they view performance, and what their biggest headaches are.
In my case I made a mistake here when starting my new job because I came in knowing they had severe performance issues. When I looked at the different servers I saw a 2 TB OLTP with PLE dipping below 1,000 at times and a 100 GB OLTP with PLE staying above 10,000. I KNEW which one they were complaining about performance on and starting making some huge improvements. I was wrong, and that 100 GB database needed a lot more help then it was letting on. That’s not to say that the improvements I made weren’t needed or appreciated, it’s that I was fixing stuff that internal users were complaining about while waiting for another manager to ask me why I wasn’t focusing on making life better for the people who were paying our bills. It flew under my initial radar because it was smaller waits and coding issues while it was a constant struggle for the users that could pick one of our competitors over us.
I made a couple large changes that the SAN admin should notice on the 2 TB database and I got a couple “good job” comments out of it. Then I made a smaller change on the 100 GB database and got genuine gratitude and appreciation from multiple people. It wasn’t that I was working harder, it’s that my effort was focused in the right place after knowing where we stood and where our pain points actually were.
Once you know what you have, if it’s being backed up properly, it has security, is built consistently, there are plans to stay up-to-date, and you know where the databases stand from others’ point-of-view, you are finally ready to start looking into monitoring your servers. Look at my Monitoring and Baselines Presentation to give you an idea of what I’m looking to watch on each server. To be fair, we have Idera at my new job and I think that works out rather well, so there are things that I don’t have custom code watching. What I do is make sure that there is something accessible that watches everything I need and most of the things I want without duplicating too much work.
I do have complaints about Idera, and perhaps I’ll get into the pros and cons about the product in a later post. As far as this type of product, it is by far the one I have the most pros to talk about, so if any of the folks at Idera are reading this then you shouldn’t be sweating too much.
Wrap it up already!
This is a quick overview of the things that I’d prefer to talk about in my first two weeks on a job. Nothing was hit in-depth, and I could definitely write a whole post on each subject. However, don’t count on that happening. If you want a dive into a subject, ask.
I’d love to hear your comments on this, with the warning that I will stick to my guns completely if asked about wandering outside of the DBA role. I used to stick closer to my job description and requested duties, but didn’t feel like I fully bloomed until I tried to see everything from a higher view, most often from my boss’s view. Everyone organization can benefit from more leadership, and leaders have nothing to do with position in the company. I truly hope that this post encourages at least one or two people into taking the lead and getting the full picture under control, no matter whose responsibility it was.