How to spot a fake data scientist

from sklearn import *

and other dead giveaways that you’re a fake data scientist

These days it seems like everyone and their dog are marketing themselves as data scientists, and you can hardly blame them, with “data scientist” being declared the Sexiest Job of the 21st Century and carrying the salary to boot. Still, blame them we will, since many of these posers grift their way from company to company despite having little or no practical experience and even less of a theoretical foundation.

In my experience interviewing and collaborating with current and prospective data scientists, I’ve found a handful of tells that separate the posers from the genuine articles. I don’t mean to belittle self-taught and aspiring data scientists; in fact, I think this field is especially well suited to passionate self-learners. I definitely do mean to belittle the sort of person who takes a single online course and ever after styles themselves an expert, despite having no knowledge of (or interest in) the fundamental theory of the field.

I’ve compiled this list of tells so that, if you’re a hiring manager who doesn’t know what to look for in a data scientist, you can filter out the slag, and if you’re an aspiring data scientist and any of these resonate with you, you can fix them before you turn into a poser yourself. Here are three broad domains of data science faux pas, with specific examples that will land your resume in the bin.

1: You Don’t Bother with Data Exploration

A) You don’t visualize your data

Anscombe’s Quartet (Source: Wikimedia Commons)
The datasets in each panel have essentially identical summary statistics: the x and y means, the x and y sample variances, the correlation coefficients, the R-squared values, and the lines of best fit all (nearly) match. If you rely on summary stats and never visualize your data, you might conclude that these four datasets share the same distribution, when a cursory glance shows that this is obviously not the case.
Data visualization allows you to identify trends, artifacts, outliers, and distributions in your data; if you skip this step, you might as well do the rest of the project blindfolded, too.
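Here’s a minimal sketch of that sanity check, using seaborn, which ships Anscombe’s quartet as a built-in sample dataset:

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")

# Near-identical means, variances, and correlations across all four datasets...
print(df.groupby("dataset").agg(["mean", "var"]))
print(df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"])))

# ...yet the plots tell four very different stories.
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)
plt.show()

Ten lines of code, and you’ve caught what the summary table hides.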

B) You don’t clean your data

There are lots of good ways to identify problems with your data and no good ways to identify them all. Data visualization is a good first step (have I mentioned this?), and although it can be a tedious and manual process it pays for itself many times over. Other methods include automatic outlier detection and conditional summary stats.
For an example, consider this histogram of human heights:
A histogram of adult human heights
Training a model on this data would doubtless lead to poor results. But by inspecting the data, we find that the 100 “outliers” in fact had their height entered in metres rather than centimetres. This can be corrected by multiplying those values by 100. Properly cleaning the data not only prevents the model from being trained on bad values but, in this case, lets us salvage 100 data points that might otherwise have been thrown out. If you don’t clean your data properly, you’re leaving money on the table at best and building a defective model at worst.
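A minimal sketch of that fix, assuming the heights live in a pandas Series (the cutoff of 3 is an assumption; any adult height below it can’t be in centimetres):

import pandas as pd

# Hypothetical adult heights, mostly in cm but a few entered in metres.
heights = pd.Series([172.0, 180.5, 1.65, 168.0, 1.82, 175.0])

# Anything under 3 can't be centimetres for an adult, so treat it as metres.
entered_in_metres = heights < 3
heights.loc[entered_in_metres] = heights.loc[entered_in_metres] * 100

print(heights)  # all values now in centimetres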

C) You don’t bother with feature selection and engineering

  • Dimensionality Reduction: More data isn’t always better. Often, you want to reduce the number of features before fitting your model. This typically involves removing irrelevant and redundant data, or combining multiple related fields into one.
  • Data Formatting: Computers are dumb. You need to convert your data into a format that your model will easily understand: neural networks like numbers between -1 and 1; categorical data should be one-hot encoded; ordinal data (probably) shouldn’t be represented as a single floating point field; it may be beneficial to log transform your exponentially-distributed data. Suffice it to say, there’s a lot of model-dependent nuance in data formatting.
  • Creating Domain-Specific Features: It’s often productive to create your own features from the data. If you have count data, you may want to convert it into a relevant binary threshold field, such as “≥100” vs “<100”, or “is 0” vs “is not 0”. If you have continuous data x and z, you may want to include the fields x², xz, and z² alongside x and z in your feature set. This is a highly problem-dependent practice, but done right it can drastically improve performance for some types of models.
Most laypeople think that machine learning is all about black boxes that magically churn out results from raw data; please don’t contribute to this misconception.
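For illustration, here’s a rough sketch of all three practices on a made-up DataFrame (every column name here is hypothetical):

import numpy as np
import pandas as pd

# Hypothetical raw data: a category, an exponentially-distributed count,
# and two continuous features x and z.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "visits": [0, 3, 250, 12],
    "x": [0.5, 1.2, -0.3, 2.0],
    "z": [1.0, 0.8, 1.5, -0.2],
})

df = pd.get_dummies(df, columns=["colour"])         # one-hot encode categoricals
df["log_visits"] = np.log1p(df["visits"])           # tame the exponential tail
df["has_visited"] = (df["visits"] > 0).astype(int)  # "is 0" vs "is not 0" threshold
df["x_squared"] = df["x"] ** 2                      # polynomial terms...
df["xz"] = df["x"] * df["z"]                        # ...and interactions
df["z_squared"] = df["z"] ** 2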

2: You Fail to Choose an Appropriate Model Type

A) You just try everything

from sklearn import *
for m in [SGDClassifier, LogisticRegression, KNeighborsClassifier,
          KMeans, RandomForestClassifier]:
    m.overfit(X_train, y_train)  # no, .overfit() isn't real, and KMeans isn't even a classifier
This is an obvious giveaway that you don’t understand what you’re doing. It’s a waste of time and easily leads to inappropriate model types being selected just because they happened to work well on the validation data (you remembered to hold out a validation set, right? Right?). The type of model should be selected based on the underlying data and the needs of the application, and the data should be engineered to match the chosen model. Selecting a model type is an important part of the data science process, and a direct comparison between a handful of appropriate models may be warranted, but blindly applying every tool you have in order to find the one with “the best number” is a major red flag. In particular, it betrays an underlying problem, which is that…
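For contrast, a deliberate comparison might look something like this sketch: a couple of candidate models chosen for the problem, compared on a held-out validation set, with the test set untouched until the end (the dataset and the two models here are just placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set for the final verdict, and a validation set for comparison.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Two candidates chosen for the problem, not every estimator sklearn ships.
for model in [LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=0)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_val, y_val))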

B) You don’t actually understand how different model types work

The bigger problem here isn’t that people don’t know how different ML models work; it’s that they don’t care and aren’t interested in the underlying math. If you like machine learning but don’t like math, you don’t really like machine learning; you have a crush on what you think it is. If you don’t care to learn how models work or how they’re fit to data, you’ll have no hope of troubleshooting them when they inevitably go awry. The problem is only exacerbated when…

C) You don’t know if you want accuracy or interpretability, or why you have to pick

Which type of model you choose should be informed by which of these two traits is more important for your application. If the intent is to model the data and gain actionable insights, then an interpretable model, such as a decision tree or linear regression, is the obvious choice. If the application is production-level prediction such as image annotation, then interpretability takes a backseat to accuracy and a random forest or neural network is likely more appropriate.
In my experience, data scientists who don’t understand this trade-off and those who beeline for accuracy without even considering why interpretability matters are not the sort you want training models for you.
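As a rough illustration of what interpretability buys you, here’s a sketch of a shallow decision tree whose learned rules can be printed out and handed to a stakeholder (the dataset is just a placeholder):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
# A shallow tree trades some accuracy for rules a human can actually read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))

Try getting a printout like that from a 500-tree random forest or a neural network.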

3: You Don’t Use Effective Metrics and Controls

A) You don’t establish a baseline model

A red circle with a line through it: my free pancreatic cancer test
If you saw a red circle with a line through it, you’ve tested negative. If you saw a green check mark, you’re lying. The point is, 99% of people don’t have pancreatic cancer (more, actually, but let’s assume it’s 99% for the sake of this example), so my silly little “test” is accurate 99% of the time. Therefore, if accuracy is what we care about, any machine learning model used for diagnosing pancreatic cancer should perform at least as well as this uninformative baseline model. If the hotshot you’ve hired fresh out of college claims he’s developed a tool with 95% accuracy, compare those results to a baseline model and make sure his model actually beats it.
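You don’t even have to build the baseline by hand; scikit-learn ships one. A minimal sketch, with made-up data standing in for the 99%-negative population:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: roughly 1% positive, like our cancer example.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.01).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Everyone tests negative": always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # ~0.99 accuracy without learning anything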

B) You use the wrong metric

The cancer “test” above also shows why accuracy is often the wrong metric. On a heavily imbalanced problem, a model that never flags a single positive case still scores 99%. What you usually care about is precision, recall, or some application-specific trade-off between them: a cancer screen that misses real cases is far more dangerous than one that orders a few unnecessary follow-ups. Pick the metric that reflects the cost of being wrong in your application, not the one that makes the slide deck look best.
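A quick sketch of the same pathology in code, with made-up labels: the do-nothing predictor aces accuracy and flunks recall:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical: 10 of 1,000 patients are positive; the "model" says negative for all.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- misses every single case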

C) You bungle the train/test split

Your test set has one job: to stay unseen until the final evaluation. Tuning hyperparameters against it, fitting your scaler or imputer on the full dataset before splitting, or letting duplicated records (or, in time-series problems, future observations) straddle the split all leak information into training and inflate your reported performance. If your model looks suspiciously good, audit the split before you celebrate.
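One common leak is fitting preprocessing on the full dataset before splitting. A minimal sketch of the safe pattern, using a Pipeline so the scaler only ever sees training data (the dataset is a placeholder):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The Pipeline fits the scaler on training data only, so no test-set
# statistics leak into training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))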


…to import tensorflow as tf
