The Negotiator: US Salaries are not “Normal” (ly Distributed)

In the last two posts I detailed the project implementation for my project TheNegotiator and provided a snapshot of a brief foray into learning JSON.

A future post will be about my going “down the rabbit hole” in order to find the right dataset that includes salary, location, experience, and education.

This post is a discussion of my attempt to tackle the second step of the Project Implementation:

2. Calculate a targeted salary for user by incorporating number of years of experience. (This can be done by simply assuming the salaries are normally distributed and have a one-to-one mapping for years of experience and salary.)

A. Create a normal distribution with mean or median equal to the salary mean and spread equal to the range or standard deviation of reported salaries.

B. This range is mapped as the min equals to zero years of experience and the max equals 20 years of experience.

Before launching in to a huge chunk of code I decided to explore how I would visualize the salary distributions and then how to map that distribution to experience. If I had a dataset with a salary survey, with wage, experience, education, and location data I would not need to make a mapping. My hunch is that these data are proprietary and this is how companies like payscale.com make money. But if you know of where I can find a large (100,000+ salaries) dataset of salary, experience, education, and location, then please post a comment!

According to the US Census Bureau, the mean household salary is ~$50,000, with first and fifth quintile boundaries of ~$21k and $106k. This means that 20% of households make less than $21,000 per year and the 20% of the population make more than $106,000. The highest earners make $196k or more and comprise of 5% of the population. The distribution is positively skewed with a sharp peak at lower salaries and an extended tail towards higher salaries (distribution). This is not a normal distribution but one that may be more closely approximated by a log-normal distribution. I’ll present the modeling analysis of the real data in a future post.

For the meantime, I use a normal distribution with mean of $50,000 and standard deviation of $20,000. The model and values now define an array of salaries with probability (percent) attached to them. With this model a salary of $70,000 has a z-score (or standard score) of  1.0 and thus lies higher than 84.1% above the other salaries in the distribution.

Now I want to attach an experience level to this distribution. I quickly realized that if you assume a linear scaling between salary and experience the projected salaries for entry level positions is quite low. For example, say you tie the highest earner to 20 years of experience and then scale all other earners down by their years of experience. This would predict that entry level associates at two years of experience should make 10 times less than the senior employee, which is definitely not a sustainable income.

So then I need to find a better way to map the experience to the salary, still without the salary plus experience survey data. I next decided to use some numbers that I know well, my own  job description (postdoctoral researcher). Of course the numbers will change based on discipline and location, however at the moment this is just a thought experiment so I make rough estimates for both the salary and experience distribution. For salary I assume a mean of $50,000 with a standard deviation of $5,000 and for experience I assume a mean number of years of experience to be 5 with a standard deviation of 2 years. My own number of years experience is 3. Now I assume that there is a direct mapping for the salary to the experience distribution, i.e., those with the mean salary have the mean number of years of experience, one standard deviation up in terms of salary means one standard deviation up in experience. Thus, since I have one standard deviation (2 years) less experience than the mean I predict that I also should make one standard deviation ($5,000) less than the mean, or $45,000. Since I made so many assumptions and I don’t have the data to make more accurate predictions, this number is not quite right. However, I do now have a better idea of what data I will need to make this prediction more accurate, i.e., at least a survey database that has salary versus experience.

Stay tuned for the next installment.

TNeg_NormDist