How to ace your take-home data science assignment?

Photo by Alex Litvin on Unsplash

Opinions are my own and do not reflect my current or previous employers


A take-home assignment is a common part of any data science job interview. I have attempted AND reviewed a number of them and made some consistent observations.

For some reason, many applicants try to make their solution flashy. Obscure plots, overkill models, jargon everywhere. The best assignments are those where the applicant acts as if they were already an employee of the company and actually tries to solve its problem, instead of trying to impress the reviewers. This needs a bit more explanation, though.

As with any ML problem, there is a standard framework you can stick to - EDA, cleaning, feature engineering, modeling, postprocessing.

An average Joe would follow a set of boilerplate steps without thinking. How? Replace empty cells with the mean/median, standardize/normalize, fire the XGBoost bazooka, and report the 99% accuracy in bold.

Let me share with you some ideas that’ll make you stand out a little. Let’s consider a text-classification assignment.

Approaching the problem

EDA

EDA starts with wandering and ends with clear, actionable insights.

Data Source

What is the data source? Is it from Twitter? If so, there might be slang and trailing exclamations like !!! Is it OCRed from images? There might be OCR errors, like O being replaced by 0. It is important to have some idea about where your data came from. Does it have multiple languages? Is there code-mixing?

Mindful EDA

It is a good idea to write down questions you might have about the data. But keep in mind that you don't have all the time in the world. If you are working on sentiment classification, ask yourself this - what kind of data would be ideal for a simple model to work on?

  • If the negative examples are full of negative words and the positive examples are full of positive words, a simple model will do well - check the most frequent vocabulary in positive samples and in negative samples and see if there is a contrast (see the sketch after this list).

  • Bonus - analyze positive sentiment examples having negative words. How do they look?

  • Negative examples would have more trailing question marks, exclamation marks, and upper-case letters - this can be checked easily.

  • What topics do positive and negative examples talk about? If the negative examples relate to a specific event, like an election, ask whether a model trained on this data could predict negative sentiment for other topics.
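As a quick illustration, here is a minimal sketch of the first and third checks above. It assumes a pandas DataFrame with hypothetical 'text' and 'label' columns; the toy data is only there to make the snippet runnable, so swap in your own dataset.

    from collections import Counter

    import pandas as pd

    df = pd.DataFrame({  # toy stand-in for your dataset
        "text": ["I LOVED it!!", "terrible, would not recommend",
                 "great value for the price", "Why is it so bad??"],
        "label": ["positive", "negative", "positive", "negative"],
    })

    def top_words(texts, n=20):
        # crude whitespace tokenization, lower-cased
        counts = Counter(word for text in texts for word in text.lower().split())
        return counts.most_common(n)

    def shoutiness(texts):
        # average number of ! and ? per example, and average upper-case ratio
        texts = list(texts)
        exclaim = sum(t.count("!") for t in texts) / len(texts)
        question = sum(t.count("?") for t in texts) / len(texts)
        upper = sum(sum(c.isupper() for c in t) / max(len(t), 1) for t in texts) / len(texts)
        return exclaim, question, upper

    pos = df.loc[df["label"] == "positive", "text"]
    neg = df.loc[df["label"] == "negative", "text"]

    print("top positive words:", top_words(pos))
    print("top negative words:", top_words(neg))
    print("positive shoutiness:", shoutiness(pos))
    print("negative shoutiness:", shoutiness(neg))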

This will help you realize how complex your feature engineering and modeling steps are going to be.

Plotting responsibly

Include plots in your presentation ONLY when there is a key insight you want to highlight. Don’t waste time on pretty plots and let the interviewer wonder “So what?”

From A to A+

At the end of your EDA section, make sure you mention how you are going to use the insights. Specifically, spell out what your insights translate to: decisions affecting feature engineering and modeling.

Preprocessing

If you do the EDA diligently, the preprocessing steps should come to you naturally. In our case, simply removing punctuation and lower-casing might not be the best choice. You'll always get bonus points if you mention something like this -

I did not perform <INSERT A COMMON PREPROCESSING STEP> here because <SOME INSIGHT YOU DERIVED FROM EDA>

The goal of pre-processing is to make the data digestible for your model. In our case, it could be reducing vocabulary size by lemmatizing, stemming, removing stopwords, and other common steps.
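For instance, a preprocessing sketch along these lines could look like the following. It uses NLTK (one possible choice among many) and assumes the stopword and WordNet data have already been downloaded; note how it deliberately keeps ! and ? because of the EDA insight above.

    # Assumes nltk.download("stopwords") and nltk.download("wordnet") have been run.
    import re

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        text = text.lower()
        text = re.sub(r"([!?])", r" \1 ", text)   # keep ! and ? as separate tokens (EDA said they matter)
        text = re.sub(r"[^a-z!? ]+", " ", text)   # drop everything else
        tokens = [t for t in text.split() if t not in stop_words]
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    print(preprocess("I absolutely LOVED the cameras, but the battery??"))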

Metric Selection

A good metric is a function of the business requirements and the dataset. There are many articles on metric selection for imbalanced datasets and on weighing the cost of false positives against false negatives. The point is, make sure you show that you thought about it.
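Here is a hedged sketch of comparing a few candidate metrics. The labels and predictions below are toy values, and treating "negative" as the class of interest is an assumption for illustration.

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    y_true = ["negative", "positive", "negative", "negative", "positive", "positive"]  # toy labels
    y_pred = ["negative", "positive", "positive", "negative", "positive", "positive"]  # toy predictions

    print("accuracy :", accuracy_score(y_true, y_pred))                          # misleading on imbalanced data
    print("macro F1 :", f1_score(y_true, y_pred, average="macro"))               # treats every class equally
    print("precision:", precision_score(y_true, y_pred, pos_label="negative"))   # how costly are false alarms?
    print("recall   :", recall_score(y_true, y_pred, pos_label="negative"))      # how costly are missed negatives?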

Baseline

People really underestimate the value of a baseline. A baseline is a simple, usually explainable, model that you fit on your data. The goal is to understand how hard it is to model the data. It sets a benchmark for any complex model to beat.

You can use a simple CountVectorizer/TfidfVectorizer + logistic regression as a baseline.
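A minimal sketch of such a baseline with scikit-learn, assuming X_train, y_train, X_test, and y_test are placeholders for your raw texts and labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),   # unigrams + bigrams, drop very rare terms
        LogisticRegression(max_iter=1000),
    )
    baseline.fit(X_train, y_train)                       # X_train/y_train: your training texts and labels
    print("baseline macro F1:", f1_score(y_test, baseline.predict(X_test), average="macro"))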

I have seen people using a Naive Bayes baseline that beats BERT. No kidding.

Another thing you can do with the baseline is browse through misclassified examples. Later, you can check whether your complex model does well on these examples. However, do not spend time tuning the baseline. At this point, decide where you want to go -

  • If the misclassification can be solved with the same feature set and a better model, do that.

  • If not (say the context and sentence structure turn out to be more important, in which case a count-based vectorizer is not a good idea), go for a more complex model.

Bonus - If you try to visualize your decision tree or analyze your logistic regression weights.
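For the logistic-regression half of that bonus, a small sketch reusing the baseline pipeline from the previous snippet (binary classification assumed) could look like this:

    import numpy as np

    vectorizer = baseline.named_steps["tfidfvectorizer"]
    classifier = baseline.named_steps["logisticregression"]

    vocab = np.array(vectorizer.get_feature_names_out())
    coefs = classifier.coef_[0]               # one coefficient per term for a binary problem
    order = np.argsort(coefs)

    # Check classifier.classes_ to see which class the large positive coefficients push toward.
    print("classes:", classifier.classes_)
    print("strongest terms for classes_[0]:", vocab[order[:10]])
    print("strongest terms for classes_[1]:", vocab[order[-10:]])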

Feature Engineering

If you decide to go with a non-Deep Learning approach, you’ll get a chance to show your creativity here.
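For example, the EDA above already suggests a few cheap handcrafted features. The sketch below is illustrative only and the feature names are made up:

    import pandas as pd

    def handcrafted_features(texts):
        return pd.DataFrame({
            "n_exclaim":   [t.count("!") for t in texts],
            "n_question":  [t.count("?") for t in texts],
            "upper_ratio": [sum(c.isupper() for c in t) / max(len(t), 1) for t in texts],
            "n_words":     [len(t.split()) for t in texts],
        })

    print(handcrafted_features(["I LOVED it!!", "why is it so bad??"]))
    # These can be concatenated with the TF-IDF matrix (e.g. via scipy.sparse.hstack
    # or sklearn's FeatureUnion) before fitting the classifier.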

Modeling

Based on what you decided after the baseline step, model the data again. This time, your focus should be on improving the metric. In my opinion, don't waste time on marginal increments. Remember that you are in a time crunch. No one will reject your assignment because you left out that last 2% of accuracy.

Bonus - If you think about inference time, model size, whether or not explainability is required.

Augmentation

If you feel that you have too few data points for a particular class, consider augmentation. For text, there are packages to paraphrase your sentences. Or try changing the subject/object of the sentence by replacing them with synonyms.
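A rough sketch of synonym replacement using WordNet is shown below. Libraries such as nlpaug ship ready-made augmenters, so treat this helper as illustrative only; it assumes the NLTK WordNet data has been downloaded.

    # Assumes nltk.download("wordnet") has been run.
    import random

    from nltk.corpus import wordnet

    def synonym_augment(text, p=0.2):
        out = []
        for word in text.split():
            synsets = wordnet.synsets(word)
            if synsets and random.random() < p:
                # take a lemma from the first synset as a crude synonym
                lemma = synsets[0].lemmas()[0].name().replace("_", " ")
                out.append(lemma if lemma.lower() != word.lower() else word)
            else:
                out.append(word)
        return " ".join(out)

    print(synonym_augment("the delivery was quick and the support team was helpful"))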

Error analysis

This is one of the most important steps, and it is missed by almost everyone. Focusing on single-number metrics is not revealing. Hand-picking examples, looking at low-confidence predictions, and analyzing the confusion matrix are what get you closer to the root problem.

Which class has the least F1 score? Is it because of low precision or low recall? Which class is it misclassified as the most? Is there a mistake in data annotation?
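A short error-analysis sketch, reusing the fitted baseline and the held-out X_test/y_test placeholders from the baseline snippet:

    from sklearn.metrics import classification_report, confusion_matrix

    y_pred = baseline.predict(X_test)
    print(classification_report(y_test, y_pred))   # per-class precision/recall/F1
    print(confusion_matrix(y_test, y_pred))        # which class gets confused with which?

    # Read a handful of misclassified texts by hand.
    misclassified = [(t, yt, yp) for t, yt, yp in zip(X_test, y_test, y_pred) if yt != yp]
    for text, true_label, pred_label in misclassified[:5]:
        print(f"true={true_label} pred={pred_label} :: {text[:80]}")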

Making a presentation

All of your work will boil down to a presentation. I personally recommend using PowerPoint. It makes it easy for the interviewer to go through your work in less time.

Here are some Dos and Don’ts -

  • Do not explain technical concepts. Add links to a relevant blog or paper instead.

  • Structure your slides well. The flow should be absolutely logical.

  • Add a slide at the beginning summarizing your findings and approach. Normally it should contain these points -

    • Your final model’s results.

    • The improvement over baseline.

    • Which classes had the best results? Which classes had poor results?

    • Recommendations for the future - more data, more diverse data, a better model, etc.

  • Do not include unnecessary plots and graphs.

  • No more than 5 points per slide.

  • Include only key figures. Detailed statistics can be included in an appendix section at the end.

  • When in doubt about including a point, ask yourself this -

    • Does omitting this point break the logical flow of my presentation?

    • Is it possible to club this point with any other point?

    • Does this detail really support my argument?

  • Keep your language as simple as possible. Don’t cram too many numbers into a single point.

  • Include comparison tables for different models and highlight/mark bold the numbers you want the interviewer to see.

  • Highlight the model's predictions for edge cases.
