Wednesday, October 5, 2022

Would the real data scientists please stand up?

There are a number of myths surrounding the field of data science, the jobs available, and the pay. When I was in my salad days and green in judgement (read: before I started my master’s), I was led to believe that data scientists are brilliant individuals who go on to earn $10,000 a month right out of school, provided that they can pass the rigorous demands of their educational programme.

The reality is much more nuanced. It is true that there are a small number of graduates who earn that much straight out of school, along with a somewhat larger number of experienced individuals (e.g. Head of Data, Senior Machine Learning Engineer, Data Science Lead). However, there are many things they don’t tell you:

  1. Those kinds of salaries are rarely found outside of the US, and since most data scientists have at least a master’s degree, that often means debt—as much as $100,000, for which the Federal interest rate is a criminal 6%;
  2. They typically go only to graduates of the top schools;
  3. They are typically only for on-site roles in some very expensive cities, e.g. San Francisco, New York and maybe London, where rents are sky-high;
  4. They are mostly offered by brand-name tech companies, finance firms, and a few startups, and the recruitment process tends to be arduous. It often involves a battery of coding, maths, IQ, and personality tests, in addition to multiple interviews. And by “interview”, think live coding.
  5. By “right out of school”, think PhD. A master’s is just a prerequisite at many firms.
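
To put the debt figure in point 1 in perspective, here is a rough sketch of what repaying $100,000 at 6% might look like. The 10-year term and the fixed monthly repayment schedule are my assumptions for illustration; real loan terms vary.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment on a fully amortising loan (standard annuity formula)."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


# $100,000 at 6% comes from the text; the 10-year term is an assumption.
payment = monthly_payment(100_000, 0.06, 10)
total_interest = payment * 10 * 12 - 100_000

print(f"Monthly payment: ${payment:,.2f}")
print(f"Total interest over the term: ${total_interest:,.2f}")
```

Under those assumptions the repayment works out to roughly $1,110 a month, which takes a sizeable bite out of a graduate salary before rent is even considered.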

So clearly, data scientists being paid $100,000+ per year are in the minority. If you look at job sites like Glassdoor, you’d think that this is the median salary, but these sites suffer from sampling bias. The actual median salary in the UK is a little over 50K GBP. Comparing salaries across countries is difficult because:

  1. Exchange rates fluctuate, and purchasing power parity needs to be taken into account;
  2. In the US, employers pay 7.65% in social security taxes; in European countries, it’s usually higher, e.g. 15% in the UK and 25% in Belgium;
  3. Rents vary;
  4. Student loans vary dramatically.
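
The first two adjustments in the list above can be sketched in a few lines of Python. The employer contribution rates are the ones quoted above; the PPP factor of 0.70 for the UK is a placeholder I made up for illustration (look up a current figure before relying on it), and applying a flat rate to the whole salary ignores thresholds and caps.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    country: str
    gross_local: float    # gross annual salary in local currency
    ppp_factor: float     # local currency units per PPP dollar (placeholder values)
    employer_rate: float  # employer social security rate, applied flat (a simplification)

    def ppp_dollars(self) -> float:
        """Gross salary expressed in PPP-adjusted dollars."""
        return self.gross_local / self.ppp_factor

    def employer_cost_local(self) -> float:
        """What the employer actually pays, in local currency."""
        return self.gross_local * (1 + self.employer_rate)


offers = [
    Offer("US", 100_000, 1.00, 0.0765),
    Offer("UK", 50_000, 0.70, 0.15),  # 0.70 is a hypothetical PPP factor
]

for o in offers:
    print(f"{o.country}: {o.ppp_dollars():,.0f} PPP dollars gross, "
          f"employer cost {o.employer_cost_local():,.0f} in local currency")
```

Rents and student loans would need to be subtracted on top of this, which is exactly why a headline salary figure says so little on its own.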

So what are some typical salary bands and responsibilities, and what kind of skills are actually required? If you think that you can get a six-figure job just by knowing how to work with sklearn, pandas and matplotlib, you are sadly mistaken. These are simply the basic requirements for an entry-level job. (And not even all the basics, at that.)

In order from highest to lowest:

  • $100,000+ per year, or the PPP equivalent in euros and pounds: There are 3 types of people, in my experience.
    • The most “classical” profile is someone with a PhD in machine learning/AI from a top school who has written a few white papers on novel ML algorithms. These people work on really cutting-edge applications.
    • The second kind of person is someone with at least a couple of years of experience who is very proficient at programming, and competent with many technologies: not just Python, but they may also be crack C++, Java or Julia programmers. They know their way around cloud computing, containerisation, distributed computing, and have read a book or two on design patterns. They know SQL and NoSQL. These people productionize ML models. They are basically glorified software engineers.
    • The third kind is experienced personnel with deep domain knowledge in medicine, business, finance, or linguistics who also have the technical skills.
  • Anywhere from $60K to less than $100K: These are either PhD grads from reputable-but-less-than-top universities (those other than MIT, Oxbridge, Ivy League), or ML engineers with some experience. Fresh PhD grads will start on the low end of the scale.
  • From $40K to $60K or PPP equivalent: They know the basics, I guess. Strong Python/R skills and stats. Decent SQL skills. They can wrangle their data, write OOP code, and know how to set up a Docker container or write a shell script. Fresh grads start on the low end, but salaries increase quickly with a bit of experience.
  • Less than $40K: These are typically called data analysts. They might write a Python script or two.
  • Unemployed (or not employed as a data scientist): The legions of wannabes from bootcamps, online certificates, and one-year degrees who write awful code (usually in Jupyter notebooks), don’t understand statistics, don’t know how to query a database or clean data, and don’t understand how to properly evaluate the output of a model.

Some other things to discuss. I hear a lot of people mention domain knowledge; often, these people have a lot of domain knowledge in their field of expertise and are trying to break into data science. I hate to break it to them, but while domain knowledge is valuable, all the domain knowledge in the world won’t help if you don’t master the basics of programming and statistical analysis. It’s usually the coding part where they struggle.

Is there a shortage of data scientists in the industry? Ten years ago, definitely. Reputable sources still claim that anywhere from 20% to 50% of companies struggle to fill data science roles, but take these numbers with a pinch of salt.

What I can say is that there’s definitely a shortage of talented, experienced data scientists who are willing to work for less than six figures. (The competition for six-figure roles is blisteringly intense, with many talented, qualified individuals applying for each role.) The shortage is particularly acute for “full-stack” data scientists who can productionize models and do data engineering. However, there seems to be a big oversupply of bootcamp/online course graduates, as well as degree holders from orthogonal disciplines. The life and social sciences are the worst offenders. A bit of data analysis in SPSS or rookie Python ability probably won’t net you a job, or will net you a low-paying one at best.

If you are looking to enter the field like I am, my advice is to start by getting a degree in DS, CS or statistics (but make sure it’s a stats course with strong programming requirements). It’s generally easier to become a data scientist by first starting out as a developer or a data engineer (glorified developer). By the way, a position as a data analyst does not prepare you for the reality of data science or software engineering. The stories you hear about data analysts becoming data scientists are from years ago, when management didn’t know the difference, or else those people have the title of data scientist but only do analytics work. A data analyst is usually a glorified Excel monkey who maybe knows some Python and SQL. While they might be able to create a simple ML model, they often don’t really understand how to interpret the model or tune it, as their statistics knowledge only covers the basics. The learning curve for software engineering is even steeper, and such a person can only do the job if they have an army of data engineers and ML engineers working for them.

Although companies like IBM tell us that data science is a fast-growing field, the threat of automation should not be discounted. In 10 years, there might not be a market for data scientists, or at least, it won’t employ large numbers of people. The bar could be set even higher as simple, routine tasks become automated.

I do believe that creative jobs are immune to automation, and certainly, there are no AI models that can write stories, so I still plan on being a writer in the long term. We’ll have to wait and see whether my creative abilities end up being more lucrative than my technical ones.

EVs are not the future—hybrids are

There has been a wild surge in optimism in EVs—really, a kind of hysteria—with the EU and UK governments hoping to ban combustion engines in...