Building a DBA Salary Calculator, Part 0: Initial findings

I’m planning on building a salary calculator based on the data from Brent Ozar’s salary survey. I already played around with some of the numbers earlier. This time I’m planning on going a lot deeper.

I want your help and feedback! I want to know what would make a calculator most useful to you. Feel free to poke holes in my methodology and tell me how a real data scientist would handle this project.

In this post, I’m going to outline some initial findings as well how I’m planning to approach this project. All of the information below is based on a narrow subset:

  • USA, full postal code
  • DBA job
  • Between $15,000 and $165,000

Regarding zip codes, some people only entered a portion of their zip for privacy sake. In the final analysis, I plan on taking into account the ~200 US individuals who did that.

Initial findings

The data isn’t very predictive

So I’m using something called a multiple linear regression to make a formula to predict your salary based on specific variables. Unfortunately, the highest Coefficient of Determination (or R2) I’ve been able to get is 0.37. Which means, as far as I understand it, that at most the model explains 37% of the variation.

Additionally the spread on the results isn’t great either. The standard deviation, a measure of spread, is about $25,000 on the original subset of data. Which means we’d expect 68% to be within +/- $25,000 of the average and 95% to be within +/- $50,000 of the average. So what happens when we apply our model?

When we apply the model we get something called residuals, which are basically the difference between what we predicted and what the actual salary was. The standard deviation on those residuals is $20,000. Which means that our confidence range is going to be +/- 20-40k. That to me doesn’t seem like a great range.

There are a few strong indicators

Let’s take a look at what we get when we do a multiple regression with the Excel Analysis ToolPak addin:

image

The two biggest factors by far seem to be how long you’ve worked and and where you live. In fact, we can explain 30% of the variance using those two variables:

image

The two other variables that are very strong are whether you telecommute and whether you are independent. When we add those, our adjusted R2 goes up to 33%.

Then after that we have a handful of variables that have a less than 5% chance of being erroneous:

  • Gender. It’s still a bit early to jump to conclusions, but it looks like being female might cost you $6,000 per year. This is after controlling for years of experience, education, hours worked, and if this is your first job. Gender could still be tied to other factors like a gap in your career or if you negotiate pay raises.
  • First Job. “First job” I identified as having identical values for years of experience and years in this job. If you haven’t changed jobs, it could be costing you $4,000, which lines up with my personal experience.
  • Hours worked per week. This is basically what you would expect.
  • Education. This is the number of years of education you received outside of high school.
  • Build Scripts and automation. One of the tasks people could check was if they are automating their work. Out of all the tasks people could list, this seems to have the biggest impact.

There are some interesting correlations

Part of doing a multiple regression is making sure your variables aren’t too strongly correlated or “collinear”. As part of this, is possible to find some interesting correlations.

  • If you are on-call, you are less likely to have post-secondary education. You are also probably overworked and learning PowerShell (no surprise there).
  • Certifications correlate negatively with being a dev-dba instead of a production dba.
  • If this is your first job, you are less likely to be working more than 40 hours per week. Maybe that $4,000 paycut is worth it Winking smile
  • Independents also work less hours per week. So maybe your second job should be going independent.
  • If you telecommute, you might make $2,000 more per year for every day of the week you telecommute; but you are going to be working more hours as well.

Plans moving forward

So here is the current outline for this blog series:

  1. Identifying features (variables)
  2. Data cleanup
  3. Extracting features
  4. Removing collinear features
  5. Performing multiple regression
  6. Coding a calculator in Javascript
  7. Reimplementing everything in R

So let me know what you think. I plan on making all of the data and code freely available on github.

T-SQL Tuesday #98: Learning Troubleshooting from Games

 

TSQLTues

This week’s T-SQL Tuesday is about a time that you solved a difficult technical problem. Unfortunately my brain doesn’t store events that way, so I can’t think of any good stories. Instead, I want to talk briefly about how games have made me a better programmer and a better troubleshooter.

Map-making in TFC

TFC Boxart.png

The first game I want to talk about is Team Fortress Classic. It’s a team based shooter from the late 90’s. I used to play this game all the time. But I did even more than that, I would make custom levels to play on with other people.

Mapmaking for TFC, was generally a simple process. You would create simple polyhedrons and then apply textures/patterns to them. Then you would place non-terrain objects, called entities, inside of your terrain. Everything is pretty straightforward…until you get a leak.

A leak is when the outside of the level is accessible to the inside of the level. Imagine you are building a spaceship or a submarine, if you have a leak it just won’t work. The challenge is that the level editor won’t tell you where you have a leak1. So how do you solve it?

In my case, you encase half the level in solid rock, so to speak. I would just make a big cube and cover up half of the level. If the level compiled, I knew my leak was somewhere in that half. Then I just kept repeating with smaller and smaller cubes.

I do the same thing all the time in my professional life. I’ll comment whole swathes of code. I’ll jump to half-way to the data pipeline to see where the error starts. TFC taught me to keep cutting the problem in half until I find it.

Guessing the secrets of the universe with Zendo

Image result

Undoubtedly proof that I was destined to be a programmer, one of my favorite board games ever is Zendo. It was actually one of my nicknames in college. It’s got a silly theme about discerning if something has the Buddha nature. In reality, it boils down to one player making up a rule, and everyone else trying to determine what the rule is.

If it sounds easy, I dare you to play something similar over at the New York Times. Chances are you are going to get it wrong.

The biggest thing Zendo taught me, was fighting against confirmation bias. It taught me to ask “What would prove my theory wrong”. Good troubleshooting involves guessing a cause, determining a test that will give you new information, and then running that test.

That test might be running a simpler version of a query that’s failing. It might mean adding a breakpoint to your code and inspecting variables.

Learning how to think systematically about this sort of thing has been tremendously useful.

Learning outside of programming

Troubleshooting is very often a set of skills and approaches that don’t need to do anything with technology per-se. I think looking at how we can get these skills in other places, like games can be very useful.

Speaking of other sources, there are two book I can recommend wholeheartedly. The first is How to Solve It which is about how to solve mathematical problems, but it provide a number of ways to break down a problem or approach it from different angles. The second is called Conceptual Blockbusting. It focuses on the nebulous issue of how we think about problem solving. It’s very much a book about thinking and I definitely enjoyed it.

Footnotes

1 Only after writing this blog post did I find an article explaining out to get the level editor to tell you exactly were the leak is. Sigh.

Practicing Statistics: Female DBAs and Salary

Brent Ozar recently ran a salary survey for people working with databases, and posted an article: Female DBAs Make Less Money. Why?

Many of the responses were along the lines of “Well, duh.” I, personally,  felt much of the same thing.

But, I think with something like this, there is a risk for confirmation bias. If you already believed that women were underpaid, there is a chance that you’ll see this as more proof and move on, without ever questioning the quality of the data.

What I want to do is try to take a shot at answering the question: How strong is this evidence? Does this move the conversation forward, or is it junk data?

Consider this blog post an introduction into some statistics and working with data in general. I want to walk you through some of the analysis you can do once you start learning a little bit of statistics. This post is going to talk a lot more about how we get to an answer instead of what the answer is.

Data Integrity

So, first we want to ask: Is the data any good? Does it fit the model we would expect? Barring any other information I would expect it to look similar to a normal distribution. A lot of people around the center, with roughly even tails on either side, clustered reasonably closely.

So if we just take a histogram of the raw data, what do we get?

image

So, we’ve got a bit of a problem here. The bulk of the data does look like a normal distribution, or something close to it. But we’ve got some suspicious outliers. First we have some people allegedly making over a million dollars in salary. That’s why the histogram is so wide. Hold on, let me zoom in.

image

Even ignoring the millionaires, we have a number of people reporting over half a million dollars per year in salary.

image

We’ve also got issues on the other end of the spectrum. Apparently there is someone in Canada who is working as an unpaid intern:

image

Is this really an issue?

Goofy outliers are an issue, but the larger the dataset the smaller the issue. If Bill Gates walks into a bar, the average wealth in the bar goes up by a billion. If he walks into a football stadium, everyone gets a million dollar raise.

One way of looking at the issue is to compare the median to the mean. The median is the salary smack dab in the middle, whereas mean is what we normally think of when we think of average.

The median doesn’t care where Bill Gates is, but the mean is sensitive to outliers. If we compare the two, that should give us an idea if we have too much skew in either direction.

image

So, if we take all of the raw data we get a $4,000 difference. That feels significant, but could just be the way is naturally skewed. Maybe all the entry level jobs are around the same, but the size of pay raises get bigger and bigger at the top end.

Averages after removing outliers

Okay, well lets take those outliers out. We are going to use $15,000-$165,000 as a valid range for salaries. Later on I’ll explain where I got that range.

There are 143 entries outside of that range, or about 5% of the total entries. I feel comfortable excluding that amount. So what’s the difference now?

image

So the middle hasn’t moved, but the mean is about the same now. So this tells me that salaries are evenly distributed for the most part, with some really big entries towards the high end. Still, the $4,000 shift isn’t too big, right?

Wait, this actually is an issue…

Remember when I said 2 seconds ago that $4,000 was a significant but not crazy large? Well, unfortunately the skew in the data set really screws up some analysis we want to do. Specifically, our friend Standard Deviation. We need a reasonable standard deviation to do a standard error analysis, which we cover later.

Standard Deviation is a measure of the spread of a distribution. Are the numbers clumped near the mean, or are they spread far out? The larger the standard deviation, the more variation of entries.

If a distribution is a roughly normal distribution, we can predict how many results will fall within a certain range: 68% within +/- 1 standard deviation, 95% within +/- 2 deviations, 99.7% within 3 deviations.

Well, because of the way standard deviation is calculated, it is especially sensitive to outliers. In this case, it’s extremely sensitive. The standard deviation of all the raw data is $66,500 . When I remove results outside of $15-$165K, the standard deviation plummets to $32,000. This suggests that there is a lot of variability in the data being caused by 5% of the entries.

So let’s talk about how to remove outliers.

Removing Outliers

Identifying the IQR

Remember when I got that $15,000-$165000 range? That’s by using a tool called InterQuartile Range or IQR.

It sounds fancy, but it is incredibly simple. Interquartile range is basically the distance between the middle of the bottom half and the middle of the top half.

So in our case, if we take the bottom half of the data, the median salary is $65,000. If we take the top half of the data, the median salary is $115,000. The IQR is the difference between the two numbers, which is $50,000.

Using IQR to filter out outliers

Okay, so we have a spread $50,000 between the first and third quartiles. How do we use that information? Well there is a common rule of thumb that anything outside +/- 1.5 IQR is an outlier. In fact, when you see a boxplot, that is what is going on when you see those dots.

So, $50,000 *1.5 is $75,000. If we take the median ($90,000) and add/subtract 75,000 we get our earlier range of $15,000-$165,000

Standard Error Analysis

Okay, so why did we go through all that work to get the standard deviation to be a little more reasonable? Well, I want to do something called a Standard Error Analysis to answer the following question:

What if our sample is a poor sample?

What is our average female salary is lower because of a sampling error? Specifically, what are the odds that we samples a lower average salary by pure chance? “Poppycock!”, you might say. Well, standard error gives us an idea of how unlikely that is.

Importance of sample size

Let’s say there are only 1,000 female DBAs in the whole world, and we select 10 of them. What are the chances that the average salary of those 10 is representative of the original 1,000? It’s not great. We could easily pick 10 individuals from the bottom quartile, for example.

What if we sampled 100 instead of 1000? The chances get a lot better. We are far more likely to include individuals that are above average for the population as a whole. The larger the sample, the closer the sample mean will match the mean of the original population.

The larger the sample size, the smaller the standard error.

Importance of spread

Remember before we said that a reasonable standard deviation is important?  Let’s talk about why. Let’s say there are 10 people in that bar, and that the spread of salaries is small. Everyone there make roughly the same amount. As a result the standard deviation, a measure of spread, is going to be quite small.

So, let’s say you take three people at random out of that bar, bribe them with a free drink, and take the average of their salaries. If you do that multiple times, in general that number is going to be close to the true average of the whole population (the bar).

Now, Bill Gates walks in again and we repeat the exercise three times. Because he is such an outlier, the standard deviation is much larger. This throws everything out of whack. We get three samples: $50,000, $60,000, and $30,000,000,000. Whoops.

The smaller the standard deviation of a population, the smaller the standard error.

Calculating Standard Error

Getting the prerequisites

To calculate the standard error, we need mean, standard deviation and sample size. Before we calculate those numbers, I want to narrow the focus a bit.

I’ve taken the source data, and narrowed it down to US, DBA and within $15,000-$16,5000. This should give us more of an apples to apples comparison. So what do we get?

image

We’ve got a gap in average salary of about $4,500.  This seems quite significant, but soon we’ll prove how significant.

We’ve also a standard deviation of around $24,500. If salaries full under a normal distribution, this means that 95% of DBA salaries in the US should be within $52,500-$151,000. That sounds about right to me.

Calculating individual standard error

So now we have everything we need to calculate standard error for the female and male samples individually. The formula for standard error is standard deviation divided by the square root of the count.

So for females, it’s 23,493 / sqrt(123) = 2118. This means that if we were only sampling female DBAs, we would expect the average salary to be within +/- $2,118  about two thirds of the time.

image

So, if we were to randomly sample female DBA’s, then 95% of the time, that sample’s average would be less than the male average from before.

image

That seems like a strong indicator that the lower average salary for females isn’t just chance. But we actually have a stronger way to do this comparison.

Standard error of sample means

Whenever you want to compare the means from two different samples, you use a slightly different formula which combines everything together.

The formula is SQRT( (Sa^2 + Sb^2) / (na +nb)) . S is the standard deviation for samples a and b. Standard deviation squared is also known as the variance. N is the count for samples a and b.

If we combine it all together we get this:

image

The standard error when we combine samples is $1,154. This indicates that if there was no difference between the distribution in female and male salaries, then 68% of the time they would be within $1,154.

Well in this case, the difference in means is almost 4 times that. So if the difference in means is 3.88 standard deviations apart, how often would that happen by pure chance? Well, we would see this level of separation 0.01% of the time, or about 1 in 10,000.

Conclusion

I take this as strong evidence that there is a real wage gap between female and male DBAs in the USA.

What this does not tells us is why. There are a number of reasons that people speculate as to the cause of this gap, many in Brent’s original blog post and the comments below it. I’ll leave that to them to speculate what the cause is.

Do you need pro licensing to administer Power BI Premium?

A recent viewer of my new Pluralsight course had a question about data gateways and Power BI Premium. Specifically, do you need a pro license to install and administer data gateways? The short answer is probably not!

Installing data gateways

So when you install a data gateway, you need to log in as a user to register it with your tenant. Well it turns out that whoever is used there is set as the default admin for that gateway. I created a user with just a power BI free license, and I was able to install and administer that gateway just fine. I was also able to assign it to other gateways that already existed.

So, for normal usage you don’t have to be licensed with pro to setup and configure data gateways. I was honestly a bit surprised by this, but in retrospect is makes sense. Pro licensing is all about consuming reports.

What about Premium?

So, the original question was about Power BI Premium. Unfortunately, there’s no developer tier for me to test on, but I have a few guesses.

First, I reviewed the white paper and the distinction it makes between pro users and infrequent users is about producing versus consuming reports. It doesn’t really talk much about administration from what I could tell. Same thing for the faq:

Do I need Power BI Pro to use Power BI Premium?
Yes. Power BI Pro is required to publish reports, share dashboards, collaborate with colleagues in workspaces and engage in other related activities.

Next, I did some searching, and found a page about capacity admins, but that doesn’t relate to data gateways specifically.

So based on what I found, I would assume that you don’t need a pro license to manage data gateways for premium. I would assume it would be a similar experience to normal Power BI.

New presentation: just enough database theory for Power BI

I  just gave a presentation for the Excel BI virtual group on database theory, and I’m really happy with how it went. I think it’s an undeserved topic quite honestly. So many people in the excel world learn everything ad-hoc and never have a chance to learn some of the fundamentals.

A number of questions came up relating to the engine and how the performance works. If you are interested in more detail on that, I suggest checking out my talk on DAX.

Here are the slides for my talk:

Just Enough Database Theory for PowerPivot 2017-20-2017

Video is coming soon as well.

Why DAX is a PITA: part 1

  • So, I think that DAX is a pain in the butt to use and to learn. I talk about that in my intro to DAX presentation, but I think it boils down to the fact that you need a bunch of mental concepts to have a proper mental model, to simulate what DAX will do. This is very deceptive, because it looks like Excel formulas on steroids, but conceptually it’s very different.

Here is the problem with DAX, in a nutshell:

image

This example below is a perfect example of that sharp rise in learning curve, and dealing with foreign concepts like calculated columns, measures, applied filters, and evaluation contexts.

So, one of the things I’m hoping to catalog are example where DAX is a giant pain if you don’t know what you are doing. People make it look really simple and smooth, and that can be frustrating sometimes. Let’s see more failures!

How do I GROUPBY in DAX?

John Hohengarten asked me a question recently on the SQL Community Slack. He said:

I need to sum an amount column, grouped by a column
Measure 1 :=
GROUPBY (
det,
det[nbr],
    “Total AR Amt Paid calc”SUM ( det[amt] )
)
I’m getting a syntax error

So automatically, something seemed off to me. Measures are designed to return a single value, given the filter context that’s applied to them. That means you almost always need some aggregate function at an outer level. But based on the name, you wouldn’t necessarily expect GROUPBY to return a single value. It would return values for each grouping instance.

If we take a look at the definition for GROUPBY(), we see it returns a table, which makes sense. But if you are new to DAX, this is really unintuitive because DAX works primarily in columns and tables. This is a really hard mental shift, coming from SQL or Excel.

 What do you really want?

None of this made any sense to me. Why would you try to put a GROUPBY in a measure? That’s like trying to return an entire table for a KPI on a dashboard. It just doesn’t make sense. So I asked John what he was trying to do.

He sent me an image of some data he was working with. On the far left is the document id and on the far right is the transaction amount.
Pasted image at 2017_07_06 09_28 AM

He wanted to add another column on the right, that summed up all of the amounts for transactions with the same document. In SQL, you’d probably do this using a Window function with a SUM aggregate, like here.

 Calculated columns versus measures

This highlights another piece of DAX that is unintuitive. You have two ways of adding business logic: calculated columns and measures. The both use DAX, both look similar and are added in slightly different spots.

But semantically and technically, they are very different beasts. Calculated columns are ways of extending the table with new columns. They are very similar to persisted, computed columns in SQL. And they don’t care at all about your filters or front-end, because the data is defined at time of creation or time of refresh. Everything in a calculated column is determined long before you are interacting with them.

Measures on the other hand, are very different. They are kind of like custom aggregate functions, like if you could define your own version of SUM. But to carry the analogy, it would be like if you had a version of SUM that could manipulate the filters you applied in your WHERE clause. It gets weird.

My point is, if you don’t grok the difference between calculated columns and measures, you will never be able to work your way around the problem. You will be forced to grope and stumble, like someone crawling in the dark.

Filter context versus row context

So in this case we’ve determined we actually want to extend the table with a column, not create a free-floating measure. Now we run headlong into our next conceptual problem: evaluation contexts.

In DAX there are two types of evaluation contexts: row contexts and filter contexts. I won’t go too deep here, but they define what a formulas can “see” at any given time, and in DAX there are many ways to manipulate these contexts. This is how a lot of the time intelligence stuff works in DAX.

In this case, because we are dealing with a calculated column, we have only a row context, not filter context. Essentially, the formula can only see stuff in the same row. Additionally, if we use an aggregate like SUM, it only cares about the filter context. But the filter context comes from user interaction. Because this data is defined way before that, there is no filter context.

This is another area, where if you don’t understand these concepts you are SOL. Again, for the newbie, DAX is a pain.

 What’s the solution?

So what is the ultimate solution to his problem? There are probably better ways to do it, but here is a simple solution I figured out.

SUM =
CALCULATE (
    SUM ( Source_data[Amount] ),
    ALL ( Source_data ),
Source_data[Document] = EARLIER ( Source_data[Document] )
)

Walking through it, The CALULATE is used to turn our row context, into a filter context. Then it manipulates that filter context so SUM “sees” only a certain set of rows.

The first manipulation is to run ALL against the table, to undo any filters applied to it. In this case, the only filter is our converted row context. (confused yet?)

The next manipulation is to use EARLIER (which is horribly named) to get the value from the earlier row context. In this case we are filtering ALL the rows, to all of them that have the same document. Then, finally we apply the SUM, which “sees” the newly filtered rows.

Here is what we get as a result:

image

 How do we verify that?

A fourth pain with DAX is that it’s very hard to look at intermediate stages of a process, like you can with SQL or Excel formulas, but in this case we have a way. If we convert our SUM to a CONCATENATEX, we can output all the inputs as a comma separated list. This gives us a slightly better idea of what’s going on.

image (1)

 What’s the point?

My point is, that DAX, despite it’s conciseness and richness is hard to start using. Even basic tasks can require complex concepts, and that was a big frustration point for me. You can’t just google GROUPBY and understand what’s going on.

Again, check out my presentation I did for the PASS BI virtual group. I tried to cover all the annoying parts that people new to DAX will run into. That and buy a book! you’ll need it.

Wrangling GotoWebinar Stats with Power Query: Part one

So, this week I gave my first presentation to GroupBy.org. It was very exciting, and I learned that I need a chair that doesn’t swivel so much.

So, I said to Brent, “How many people attended, I want to update my speaking log.” He said, “210? I’ll get you the data tomorrow.”

Here’s what he gave me this:

image

Ugh.

Power Query to the rescue

Normally this would be a giant pain to work with. When it comes to data quality, this is quite the tohubohu. Thankfully, I can clean things up quickly with Power Query.

So, first I’m going to click on the data and select Add to Data Model, under the Power Pivot tab.

image

Excel is going to make some assumptions about what is part of the table. This is convenient for our needs, but we’ll have to find a work around when we want to scale to multiple excel files.

image

We can’t tell it we have headers, because it’s going to think that first row is a header. We’ll deal with that later. Once we click OK, we are taken to the Power Query / Power Pivot window.

image

I made a mistake

Hmm, so it looks like I made a mistake. I hope my honesty won’t lose me any izzat, or ability to command respect. I think it’s important to see how people really learn and really solve problems. So, I’m including my screw ups in this post.

Apparently, I created a linked table and I can’t see how to edit the the Power Query portion for that. A linked table is a nice way to pull raw data from the Excel workbook. It’s great for reference tables, but doesn’t solve our problem.

image

Trying again

Let’s take a different approach. I’m going to open a blank excel workbook and pull the data into there. Okay, so let’s go to manage under the Power Pivot tab.

image

Next, we are going to click “Get External Data From Other Sources”

image

Then I’m going to scroll to the bottom and select Excel File.

image

Once selected, I only have the whole first sheet as an option. If I had table objects or named ranges, that would be different.

image

Hmmm, I still can’t find a way to edit the Power Query. Fiddlesticks!

Normally, in Power BI it would be right here:

image

Trying to do this in Excel is quite the boyg, or vague obstacle.

Third time is a charm

Sigh, okay let’s try this a third time. I’m going to do to the Data tab and the “Get and Transform Query”. “Get and Transform” is the new name for Power Query.

image

Okay, let’s try opening that Excel file. Ah, much better. Now I want to click Edit at the bottom right.

image

Cleaning the Data

So, First thing we need to do is get rid of all of the non-header rows at the top.

image

To do that, I just select Remove Rows –> Remove Top Rows.

image

Then I specify I want to get rid of the top 7 rows.

image

Next, I want to turn the actual header row into a header.

image

Okay, so now it looks like a real table.

image

Comma Delimited BS

Okay, so now we need to parse out the times someone was watching. The problem is that some people were in and out. Their entries are comma delimited. Ugh.

image

Okay, let’s split them up. I’m going to select Split Column –> By Delimiter

image

Unfortunately, splitting by column a) splits into more columns and b) you have to specify how many.

image

Thankfully, we can select those new columns and unpivot them.

image

Perfect. Now we have a row for every time a person as watching.

image

String parsing

Okay, so now we just need to parse out the dates. First, we are going to split on the dash, and then the parenthesis.

image

This is starting to look good.

image

Now we just need to get rid of the timezone and convert it to a datetime. First we need to select Replace Values.

image

image

Lastly, we select the data type we want.

image

What’s next?

Now that are data is cleaned up, we’ll join to sessions table and do some simple data modeling. But that’s for the next blog post.

Keeping up with Technology: a Guide to Drinking from the Firehose

I often ask people with more experience than me “How do you stay relevant in our field?”. I joke that I live in perennial fear of being replaced someday, by a 22 year-old with no family commitments. Really, it’s a joke. Really, it’s a fear.

So, let’s talk about how to keep up with technology.

 First, we must grieve.

So here’s the secret to keeping up with technology. You can’t.

There are too many technologies, too many features, too many updates:

  • SQL Server 2017 is coming out this year.
  • Big data technologies are so numerous as to be indistinguishable from Pokemon.
  • PowerBI ships features on a weekly basis.
  • Worst of all, you’ve got NoSQL which is literally the mathematical dual of SQL. It’s everything that SQL isn’t, by definition!

It’s all too much.

This realization is likely to be expressed in the 5 stages of grief.

  • Denial. Other people can keep up, why can’t I? I’m learning tons of stuff all the time! This is easy.
  • Anger. Why they heck are they making releases every month!? When did Microsoft go agile? I thought SQL Server versions every 2 years was bad enough.
  • Bargaining.  Okay, maybe if I stick to core SQL stuff, I’ll be fine. Hadoop seems like a fad. And Azure is never going to be popular with my  company.
  • Depression. This is impossible. I’m going to lose my job when I’m 40 years old and arthritis starts kicking in.
  • Acceptance. There’s more happening than I can ever learn, but Hacker News doesn’t define my success as an IT professional.

This post isn’t about doing the impossible. It’s about making the most of your resources. That’s something that you can do.

Are you asking the right questions?

I would like to propose that it’s not about keep up with technology at all. This is a second-order goal, a proxy of sorts. Really, there are two questions at the heart of things:

  1. How do I keep my job?
  2. How do I keep my friends?

This is why technology causes so much angst. We want to learn enough that we can put food on the table; and we want to do it in short enough time that it doesn’t destroy our personal lives.

Next, we must think.

In order to get ahold of the the problem, let’s use some analogies: investments and radioactive decay.

How learning is like investing

Do you have a retirement account? If so, do you invest primarily in stocks or bonds? Why?

If you are at all young, the you invest primarily in stocks. That’s because stocks have a higher rate of growth than bond. You are trying to outrun inflation, where the value of your money is steadily decreasing. This is like your current knowledge becoming outdated. All that vb6 coding knowledge is like a pile of cash in your mattress. When I was a kid, $20 was a lot of money.

The next question  is, do you invest in just one stock, like Apple? No, you diversify your portfolio. High growth stocks go up, on average. Some however, tank. High growth means high volatility. A good example is Apache Flex. It used to be a really promising application platform, until Steve Jobs killed Flash.

So we’ve got two risks to our learning portfolio: Losing value in our existing knowledge, and making the wrong choices for our new knowledge. These different risks have different mitigation strategies.

Specialization and generalization

These comparisons to bonds and stocks relates to the challenges of specialization versus generalization. We specializes to pay the bills. We generalize to keep our jobs.

Specialization is how we get paid. The reason anyone pays us is because we have skills or knowledge that is not quickly acquired. The reason consultants like David Klee make gobs of money is that they have taken a subject, like virtualizing SQL, and have gotten really  good at it. To get paid more than minimum wage you are going to need some level of specialization.

Generalization is how we get paid in 10 years. You need to be specialized to get paid right now, it’s a short term investment. However, that investment decays, just like the value of money in your mattress. To get paid a decade from now, you need to broaden your horizons. If you are a data person, it might mean learning R and python. It might mean learning Azure. Heck, it might mean PowerShell and Docker.

The tension here is that you need both. You need paid now AND in 10 years. Kevin Feasel talks about trying to find that right balance. I’m not here to tell you what that balance is. What’s important is that each has different constraints. Specialization requires focus. Generalization requires time. More on that later.

How learning is like radioactive decay

Let’s take another approach.

Our knowledge has a limited shelf-life. Allen White said, in the SQL Data Partners podcast, that you have to retool yourself every 5 years. In college, I remember joking that half of what you knew would be useless in 5 years.

Well, what if we took that literally? How would we model that? How would we think about that?

In nuclear physics, there is the idea of a half-life.  You have radioactive material that decays in half every X units of time. Why not apply this concept to IT? The half-life in this example then would be 5 years.

So the next question is this: if half of what you learn is either irrelevant or forgotten in 5 years, how much do you need to learn in a given year, to keep steady? Let’s ask our friends at Wolphram Alpha.

image

If x is our rate of decay and our half-life is 5 years, we can work backwards from that and solve for x. In this case, x equals 87%.

So, what that means is that if today you know 100 relevant things, then a year from now you’ll only know 87 relevant things. That’s like going from a solid A+ to a weak B+. You just lost a whole grade!  Not good.

Below is a curve showing what this model looks like over 10 years.

image

Changing the math

Well, this isn’t great. What if I want to know 120 things? How can we do that? One option is to learn more things.

Learn more things

If normally we have to learn 13 things each year, then we have to learn another 20 things on top of that. Or, if you are patient, you can learn an extra 4 things each year, and eventually  you will get there.

image

Slow the decay

Another option is to change the rate of decay. Instead of learning more things each year, what if we didn’t have as much decay? If we can slow that rate of decay just a little bit, to a half-life a 6 years instead of 5 years, we’ll have the same effect. That means that instead of brute-forcing things, we might be able to be smarter.

What does this all mean?

So that gives a path forward. We need to either

  1. Increase the number of relevant things you can learn each year
  2. Or, decrease the rate of decay for relevant knowledge

These are the only three knobs we have. Either learn more stuff, learn the right stuff, or learn stuff that lasts longer. Let’s investigate all three.

How do we fix this?

Learning more things

So one solution to our dilemma is to brute force it. Let’s just learn more things and hope that they are the right ones. To do that, we can look at our inputs and constraints.

What are the constraints that affect our learning? There are 3 big ones I can think of:

  1. Time
  2. Energy
  3. Money

If we can increase any one of these, then we might be able to increase how much we can learn in a week.

Time

So, lets say we decide to go the simple route and go for volume.  You’ve got a 168 hours in a week. You can’t make any more hours, and if you try to use all of them in a week, you are going to need some amphetamines.

So, if we can’t create more hours, how can we make more time? One option would be to cut out other activities. If you are willing to quit watching TV or playing video games, that frees up more time for learning. You could use a service like toggl.com to do a time audit and see where all of your time goes.

There is a limit to going down that path, however. You need to sleep. You need to have a social life. You need to have some fun. Like we said, you want to keep your job and keep your friends.

Multitasking

Another option is to multi-task. There is a lot of learning you can do that while doing other things. Things like podcasts are more about exposure than mastery. You can still get value out of them, even while mildly distracted. There are a number of times you could listen to a podcast:

  • Commuting
  • Exercising
  • Washing dishes

This is a way to take back time you are spending on other things without giving up all your free time.

5 minute learning

Another way of reusing your time is taking advantage of those weird breaks in time. Those breaks that are too small to get anything meaningful done. The 15 minutes at the doctor’s office. The 5 minutes waiting for a meeting.

Feed readers are a great way to take advantage of these weird units of time. With Feedly, I’m able to to read a blog article during the 5 minutes I’m waiting. This is way more useful than playing candy crush or reading twitter.

Focus/energy

Okay, so let’s say you’ve managed to find more time in the day. Unfortunately, not all learning takes the same amount of energy. Reading Twitter is mindless. Configuring a homelab requires focus. Listening to a podcast is mindless. Making a presentation takes focus.

One way of getting more focus is to prioritize the harder learning for when you naturally have focus. For some people this is early morning. For many people this is the weekend. I know that after a long day of work, I’m wiped out.

Part of that is learning to take care of yourself. Eat healthy. Exercise. Get sleep. Focus is so easy to destroy by poor lifestyle.

Finally, if you schedule time consistently and push everything out, you can have more focus. I also find having a separate office/learning space helps with this too. Cut out the distractions. Install StayFocusd for Chrome to block timewasters.

Money

If you are anything like me, money is your most plentiful resource of all 3. When I was kid, I had no money at all, but tons of time. When I was in college, I had pretty much no money, and a good bit less time. Now that I have been working for a while, that equation has flipped. I’ve got plenty of money, but no time to use it.

If you are an average DBA, you make plenty of money. According to the Bureau of Labor Statistics, the median salary for DBA’s is $85,000. The median for all occupations is $37,000. Think about that, that’s like 2.5x as much.

Now you may say “Hey, I don’t live in San Francisco.” or “Hey, I just got started.” Let’s take those factors into account. The market here in Pittsburgh is terrible, salary-wise. Even if you are just getting started in the field, you are probably making at least  40K.

image

Let’s do a little math on this. If you make 40K per year, you make roughly  $20 per hour. that means your time is worth $20 per hour. If you can spend $5-10 to save an hour of your time, do it.

If a $50 book saves your 5 hours of Googling, buy it. If a Pluralsight subscription saves you 30 hours per year of frustration, buy the subscription. Don’t be afraid to spend money to save yourself time and energy.

Consider budgeting your money

If you don’t feel like you have a lot of money, I would deeply urge you to start budgeting. I used to look at my checking account to see if I had money. Then if there was money there, I would spend it. This would work until my car insurance bill would come in, and suddenly I needed $600. And just as suddenly, I didn’t have nearly as much money as I thought

Get a budgeting program. There’s some free one’s out there, but I use You Need a Budget.

Learning the right things

So we covered how to learn more things. An alternative is to learn the right things. If you learn the right thing, you’ve have less wasted effort.

Lean on curation

Sturgeon’s law says that 90% of anything is crap. That’s certainly true of learning materials. Not only is there a lot of bad training out there, but there’s a lot of content in general that just isn’t relevant. It’s cool that someone wrote an OS in Rust, but it’s probably not relevant to your job in any significant way.

If you are time poor or focus poor, you need a gatekeeper so the flood of possibilities doesn’t overwhelm you. This means depending on curation.

That curator could be you, first off. One of the reason I suggest a feed reader over Twitter is that you are able to heavily curate the content that hits your eyes. You are the one picking the blogs to read instead of the pachinko ball machine of social media and fate. You are the one deciding what’s relevant to you.

Another options is to depend on professionals in the field that you trust. Find people to follow that consistently have good material or recommendations.

The third option is to pay money. Sure, there’s some paid crap out there. But general it does act as a quality filter. Things like written books and video libraries are far less likely to be total crap, because those things have editors and investors and such. Someone was being paid to make sure it was worth making. Additionally, the author worked hard to condense the knowledge into a concise format. Be willing to pay money.

Having a plan

Okay, so now you are investing your time and energy into quality materials, but are you learning the right things? Are you learning the things that are right for you? You need to have a career plan. You need 1 and 3 year goals. Otherwise your career plan will be a spread out mess.

Figure out where you want to focus, figure out where you want to go. Pick a specialization. More of what you learn will be relevant if you know what that is.

Learning things that last longer

Okay, so we are learning more things and more of the right things. But we still have the problem that our knowledge is decaying whether from atrophy or irrelevance. This is the last step in the funnel and the last thing we can optimize.

Improving retention

One way to keep ahold of knowledge for longer is to lean on active learning. If something goes in through your eyes and ears and back out your fingers and mouth, you will retain it for longer.

If you want to specialize and go deeper, you need to do things like blogging, presenting,coding, etc. Reading isn’t enough. Writing isn’t enough. if you truly want to learn, you have to do.

Reddit and hacker news are good for exposure to a topic, but exposure fades quickly. Use exposure to get a lay of the land, then go deep once you find the right things to learn. If you focus on mastery, you are going to learn things in a way that last longer.

Avoiding planned obsolescence

Another issue is that certain areas of knowledge just have a faster rate of decay. Some of these are facts tied to a specific version of a technology. Another thing are unproven technologies that might turn out to just be fads. Big data is still in this phase, where there are so many technologies that will be gone forever in 5 years.

People change far more slowly than technologies, so skills relating to them last far longer. Soft skills never go out of style. Communications skills never go out of style. If you learn how to write an abstract today, that knowledge will still be relevant 40 years from now.

The fundamentals last longer too. The core basics of computer science are far more establish than this weeks JS framework. If you have solid underpinnings, you can take your knowledge of C++ to Go, for example. If you understand how a ACID and CAP theorem, The new consistency models of Cosmos DB will make sense. Under understanding of one layer beneath will allow you to jump to new technologies far quicker.

Putting it all together

To summarize what we talked about:

  • Take advantage of multitasking to get back more time. Listen to podcasts while you drive or exercise.
    • Invest in a pod catcher. I use itunes and an iPod nano.
    • Invest in a feed reader. I use Feedly.
  • Take care of yourself, physically. Diet, sleep and exercise will all improve focus.
  • Schedule time to maintain focus. Treat focus as your most precious resource. Don’t browse twitter when you are firing on all cylinders.
  • Be willing to spend money on curation. There is a good chance that you have more money than time or focus. Be willing to spend money to avoid wasting time.
  • Understand exposure versus mastery. Spend your time on exposure and your focus on mastery.
  • Pick a specialization. Have a plan on what technologies you need to go deep on. You won’t get there by accident.
  • Learn things that last. Learn the fundamentals. Learn internals. Learn people skills. Do home labs.