Category Archives: SQL

Building a DBA Salary Calculator, Part 0: Initial findings

I’m planning on building a salary calculator based on the data from Brent Ozar’s salary survey. I already played around with some of the numbers earlier. This time I plan to go a lot deeper.

I want your help and feedback! I want to know what would make a calculator most useful to you. Feel free to poke holes in my methodology and tell me how a real data scientist would handle this project.

In this post, I’m going to outline some initial findings as well as how I’m planning to approach this project. All of the information below is based on a narrow subset:

  • USA, full postal code
  • DBA job
  • Between $15,000 and $165,000

Regarding zip codes, some people only entered a portion of their zip for privacy’s sake. In the final analysis, I plan on taking into account the ~200 US individuals who did that.

Initial findings

The data isn’t very predictive

So I’m using something called a multiple linear regression to build a formula that predicts your salary based on specific variables. Unfortunately, the highest coefficient of determination (R²) I’ve been able to get is 0.37, which means, as far as I understand it, that at most the model explains 37% of the variation.
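
To make that concrete, here’s a minimal sketch of the kind of model fit I’m describing, written in R since that’s where this series is eventually headed. The file name and column names (Salary, YearsExperience, LocationIndex, Education, HoursWorked) are placeholders I made up, not the survey’s actual field names.

    # Minimal multiple linear regression sketch (illustrative column names).
    survey <- read.csv("salary_survey_subset.csv")  # the filtered US / DBA / $15k-$165k rows

    model <- lm(Salary ~ YearsExperience + LocationIndex + Education + HoursWorked,
                data = survey)

    # summary() reports each coefficient, its p-value, R-squared,
    # and adjusted R-squared (the ~0.37 figure mentioned above).
    summary(model)
    summary(model)$r.squared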

Additionally, the spread on the results isn’t great either. The standard deviation, a measure of spread, is about $25,000 on the original subset of data, which means (assuming a roughly normal distribution) we’d expect 68% of salaries to be within +/- $25,000 of the average and 95% to be within +/- $50,000 of the average. So what happens when we apply our model?

When we apply the model we get something called residuals, which are basically the difference between what we predicted and what the actual salary was. The standard deviation on those residuals is $20,000, which means our confidence range is going to be somewhere around +/- $20,000–$40,000. That doesn’t seem like a great range to me.
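
Continuing the sketch from above, the residuals and their spread fall straight out of the fitted model object:

    # Residuals: actual salary minus the model's prediction for each respondent.
    res <- residuals(model)

    # Standard deviation of the residuals (the ~$20,000 figure).
    sd(res)

    # A rough +/- 2 standard deviation band around a single prediction,
    # which is where the +/- $20-40k range comes from.
    predict(model, newdata = survey[1, ]) + c(-2, 2) * sd(res)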

There are a few strong indicators

Let’s take a look at what we get when we do a multiple regression with the Excel Analysis ToolPak add-in:

[Screenshot: regression output from the Analysis ToolPak]

The two biggest factors by far seem to be how long you’ve worked and where you live. In fact, we can explain 30% of the variance using those two variables:

[Screenshot: regression output using only experience and location]

The two other variables that are very strong are whether you telecommute and whether you are independent. When we add those, our adjusted R2 goes up to 33%.
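
Here’s roughly how that comparison might look in R, again with made-up column names (LocationIndex stands in for whatever location feature gets extracted from the zip codes):

    # Model 1: experience plus location only.
    m1 <- lm(Salary ~ YearsExperience + LocationIndex, data = survey)

    # Model 2: add the telecommute and independent flags.
    m2 <- lm(Salary ~ YearsExperience + LocationIndex + Telecommute + Independent,
             data = survey)

    # Adjusted R-squared penalizes extra variables, so it only goes up
    # if the new variables actually add explanatory power (~0.30 vs ~0.33).
    summary(m1)$adj.r.squared
    summary(m2)$adj.r.squared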

Then after that we have a handful of variables that are statistically significant at the 5% level (i.e., a less than 5% chance of being erroneous):

  • Gender. It’s still a bit early to jump to conclusions, but it looks like being female might cost you $6,000 per year. This is after controlling for years of experience, education, hours worked, and if this is your first job. Gender could still be tied to other factors like a gap in your career or if you negotiate pay raises.
  • First Job. I identified a “first job” as having identical values for years of experience and years in this job (see the sketch after this list). If you haven’t changed jobs, it could be costing you $4,000, which lines up with my personal experience.
  • Hours worked per week. This is basically what you would expect.
  • Education. This is the number of years of education you received outside of high school.
  • Build Scripts and automation. One of the tasks people could check was if they are automating their work. Out of all the tasks people could list, this seems to have the biggest impact.
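
Here’s the sketch promised in the “First Job” bullet — deriving the flag and adding it to the model, with the same caveat that the column names are assumptions on my part:

    # Flag respondents whose total years of experience equal their years
    # at the current job, i.e. they have never changed jobs.
    survey$FirstJob <- as.integer(survey$YearsExperience == survey$YearsAtThisJob)

    # Add the flag to the regression and look at its estimated effect.
    m3 <- lm(Salary ~ YearsExperience + LocationIndex + Telecommute +
               Independent + FirstJob, data = survey)
    summary(m3)$coefficients["FirstJob", ]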

There are some interesting correlations

Part of doing a multiple regression is making sure your variables aren’t too strongly correlated, or “collinear”. As part of that check, it’s possible to find some interesting correlations (there’s a small sketch of the check after this list).

  • If you are on-call, you are less likely to have post-secondary education. You are also probably overworked and learning PowerShell (no surprise there).
  • Certifications correlate negatively with being a dev-dba instead of a production dba.
  • If this is your first job, you are less likely to be working more than 40 hours per week. Maybe that $4,000 pay cut is worth it. ;)
  • Independents also work fewer hours per week. So maybe your second job should be going independent.
  • If you telecommute, you might make $2,000 more per year for every day of the week you telecommute; but you are going to be working more hours as well.
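
As a starting point for that collinearity check (sketched here with the same illustrative column names), a pairwise correlation matrix over the numeric candidates already surfaces most of these relationships:

    # Pairwise correlations between candidate predictors; pairs with a
    # high absolute correlation (say |r| > 0.7) are candidates for removal
    # before fitting the regression.
    vars <- survey[, c("YearsExperience", "YearsAtThisJob", "Education",
                       "HoursWorked", "TelecommuteDays")]
    round(cor(vars, use = "pairwise.complete.obs"), 2)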

Plans moving forward

So here is the current outline for this blog series:

  1. Identifying features (variables)
  2. Data cleanup
  3. Extracting features
  4. Removing collinear features
  5. Performing multiple regression
  6. Coding a calculator in JavaScript
  7. Reimplementing everything in R

So let me know what you think. I plan on making all of the data and code freely available on GitHub.

Things I like about the changes to PASS Summit submission process

So, PASS Summit is making changes to how the curriculum is managed. There’s been some conversation about these changes and whether they are good or not. I thought I would enter the conversation and add my thoughts.

Before that though, I want to thank the board for the work that they do and for working on these changes. I think that being on the board is difficult work. It seems to me that no matter what changes they make, there will be criticism. Often, the criticism is the loudest of all the commentary. Sure, I agree there is more they could do to involve community feedback; however, the fact that they have a survey and a community town hall is evidence that they are listening and value our opinion.

So, I’m going to try to focus on the positives I see, instead of criticizing. Overall, I think these changes are a good move. I wrote last PASS that I think the data platform is broadening. And while I think that the idea of a data profession presents challenges, I think it’s reflective of trends going on:

  1. The DevOps-ification of the data platform
  2. A broadening of the surface area of “data platform”
  3. Virtualization, Moore’s law, and the cloud

These are all forces that are causing more technologies to fall under a broader and broader grouping. They are also causing a blurring of what used to be more distinct roles. And as a result, PASS Summit is responding to that change.

Having a broader, more nuanced set of topics requires more control. If you look at companies like Pluralsight or Lynda, they have people whose job is to manage a curriculum. And part of that job is aggressively targeting specific topics to avoid gaps. It also means doing market analysis to see what education people want and need, not just what people want to present on. These goals make sense for Summit too.

Part of targeting specific topics is acknowledging that there are top-level experts in those content areas. If you want something on DAX, reach out to Marco Russo. If you want something on U-SQL, reach out to Michael Rys. Dealing with the changes in the platform requires a different approach. In some ways, it would be a shame to not try to get access to the top leaders of specific topic areas.

I admit that it’s reasonable to worry about Summit becoming some sort of elite club of presenters. Part of the solution is transparency, transparency, transparency. The other part of the solution is answering the question “How do I present at Summit?”

There’s a joke about industries with no entry level jobs. How do those types of industries exist? How do they not just die out? But I don’t think entry level is the issue here. There are tons of opportunities to speak at user groups. SQL Saturdays are a clear stepping stone after that. The next step is…less clear. Do you blog? Do you speak at virtual user groups? Do you go to 20 SQL Saturdays a year?

No seriously, I don’t know. Can someone tell me? You wanna know why I started presenting at virtual user groups? Because I asked someone how to get to Summit, and they suggested that as a stepping stone. I think it’s important to give “new” speakers like me a path for reaching the top.

It’s straightforward to get your white belt in speaking, even your green belt. But it’s a lot less clear how to get your brown or black belt. That’s an area I’d love to see more people talk about. A good example is Brent Ozar’s Career Internals.

That being said, I’m happy with the minimum bar of experience for applying. The current minimum is 3 presentations. I honestly think it could be 5 and it’d be fine. This is a good balance between requiring a base level, while not discouraging new speakers. Three is so low, you could trip over it. That’s two user groups and a SQL Saturday. I can promise you that when I had given only 3 presentations, I was in no way qualified to present to PASS.

One final thing I’m really happy about is that they are going to give guidance on the content they want, and they are going to ask people to specify a general content area they are focusing on. I think this is a smart move. The current system just wasn’t sustainable.

The old system seemed to encourage just throwing as much stuff against the wall as you could and seeing if anything stuck. I don’t think we want to incentivize people to submit more than 3-4 sessions. It puts an undue burden on the people reviewing the submissions. I honestly think it’d be fine to have an explicit limit on submissions, but I expect that would make some waves.

Overall, I think they are making moves in the right direction and I like it.

Runaway Transaction Log: part 1

Last week I wrote about an idea to create labs for when things go wrong in SQL Server. They are partly inspired by the TechNet Virtual Labs, which are a great resource available for free. Another motivating factor is a desire to do something with Windows Azure. I’m new to the virtualization world and want to learn more about it. For someone like me who doesn’t have a lot of lab resources at home, it offers some exciting opportunities. Finally, I’m just plain frustrated with the fact that most DBA training involves frantically struggling when something goes wrong in production.

Today, we are going to create a SQL Server virtual machine. Later on, we will misconfigure it with a runaway transaction log. This article covers just the basics. If you want to learn more, you may want to read Getting Started with SQL Server in Windows Azure Virtual Machines.

First log into your Azure account. If you don’t have one, you can create one pretty easily and you will have $200 in credits to play with the first month. Be aware that after the first month you will be paying for storage, even if your machines are turned off. Once you are logged in, go to virtual machines:

[Screenshot: the Virtual Machines section of the Azure portal]

Then, in the bottom left corner, select NEW –> Compute –> Virtual Machine –> From Gallery.

[Screenshot: NEW –> Compute –> Virtual Machine –> From Gallery]

The gallery has a lot of options available, and even more if you know the cheat codes to find the hidden VM Depot. If we take a look at the SQL Server category, we see images available for 2008 R2, 2012, and 2014 CTP. This lab should work for any of those, but in this case we will be picking SQL Server 2012 Standard.

[Screenshot: SQL Server images in the gallery]

Once you’ve chosen your image, click on the next arrow. On the next screen you are going to pick some configuration details. I would pick the latest release date. A Medium size should work for our compute needs. WARNING: Be aware that Standard on a Medium instance will cost you $0.73 per hour. If you accidentally leave it running all day, that can add up to about $18 per day. I’ve done that before, so be careful.

[Screenshot: virtual machine configuration settings]

Next we need to set the cloud service. You can think of this as a sort of container for the VMs. In our case, we’ll make a new cloud service. You are going to want to pick a region close to you.

[Screenshot: cloud service configuration]

Finally, we are going to set the endpoints so we can talk to the machine.

[Screenshot: endpoint configuration]

Once you are done with all of that, it’ll take about 10 minutes to provision the machine. Go take a walk and come back.

Are you back? Okay, let’s go back to virtual machines and take a look at our machine.

[Screenshot: the virtual machine up and running]

It looks like it’s up and running.  Next, we are going to create a very small data disk for us to fill up. If you want more information, this article covers attaching a disk in more detail.

Select the machine and click attach at the very bottom. Then select attach empty disk.

[Screenshot: the Attach empty disk option]

We want to select 1 GB as this is the smallest size we are allowed.

[Screenshot: empty data disk settings, 1 GB]

Now we have everything we need to log in and start breaking the machine. Next week we’ll configure the machine to run into some transaction log issues!

An idea: SQL Server crisis labs

Many people in the SQL Server world have become DBAs by accident, myself included.  The problem with being an accidental DBA is that you usually learn things the hard way: when something is on fire. At that point, you are under intense pressure to solve everything NOW. It’s a real pain.

Of course, there are a number of great resources if you are learning SQL Server from scratch.  Here are two of them that have helped me:

These will lead you in the right direction, but the real challenge is learning how to deal with problems while under pressure. Reading about a problem isn’t the same as dealing with it first hand. What I think is really exciting is the idea of a safe environment where you can solve a problem without any risk.

Enter Azure (or any virtualization solution). Azure provides a really fun opportunity, in my mind. You can spin up a virtual machine with SQL Server in a matter of minutes. Then you are free to break things to your heart’s content. So why not take that a step further? Why not create labs where things are intentionally broken and your job is to fix them? Your only guides are a vague error message and your Google-Fu. Just like in real life.

So, I’m going to start a series of posts on how to create intentionally broken VMs to train your junior DBAs on. Next, I’ll start creating PowerShell scripts to automate the creation process. Finally, we’ll take a look at consolidating those scripts to create a single SQL Server machine with a lot of problems that need fixing.

For the first lab, I’m going to cover how to create a machine with an out-of-control transaction log. Please feel free to suggest any ideas or provide any comments.