In this blog, I will talk about some of the best practices to follow for data science projects. Before I present them, it is important to understand the general workflow of a data science project. There are many variations to it, but I always prefer to use the traditional (yet most effective) workflow called CRISP-DM, which stands for Cross Industry Standard Process for Data Mining.
The intention of this blog is not to explain the CRISP-DM process. However, the best practices that I will be discussing are inspired by this process, so I will briefly touch upon this topic.
One of the important things to notice in the CRISP-DM process is the feedback loop. The data preparation and modeling stages have an iterative feedback loop: the bidirectional arrows signify that feature engineering is an iterative process. The model evaluation stage connects back to business understanding before deployment. It is very important that we evaluate and re-evaluate our models, making sure they are answering the right business questions before we deploy them. Another factor I want to highlight is the data itself, which sits at the center of this framework. Everything revolves around data.
Now that we have enough context set, below are the five best practices I recommend you follow if you are working on a data science or machine learning project.
Business understanding or problem statement:
Make sure you spend enough time understanding the business questions you are trying to solve with data science. Spending enough time in the early stage of problem formulation will save you time in later stages. As a data scientist, you have to work closely with different functional areas of the business, stakeholders, and product managers to understand the business problem from the customer’s perspective.
Data is the secret sauce of your model:
You may have heard the good old saying ‘Garbage In, Garbage Out’; it is very true when it comes to the data you use to train your model. The quality of the data is important. It is commonly estimated that about 80% of a data scientist’s time is spent on data preparation. The model part is important, but the success of a model is highly dependent on the quality of the data.
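As a minimal sketch of what early data-quality checks might look like (the file name and column names such as `amount` are hypothetical placeholders, not from the original article), something like this can surface problems before any modeling starts:

```python
import pandas as pd

# Hypothetical raw data load; the file and schema are placeholders.
transactions = pd.read_csv("transactions.csv")

# Basic data-quality checks before any feature engineering or modeling.
print(transactions.shape)                                         # overall size
print(transactions.isna().mean().sort_values(ascending=False))    # fraction missing per column
print(transactions.duplicated().sum())                            # exact duplicate rows
print(transactions.dtypes)                                        # unexpected types (e.g., amounts read as strings)

# A simple sanity rule as an example: transaction amounts should be positive.
bad_amounts = transactions[transactions["amount"] <= 0]
print(f"{len(bad_amounts)} rows with non-positive amounts")
```

Checks like these are cheap to run and often reveal exactly the kind of ‘garbage’ that would otherwise silently degrade the model.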
Communication with the subject matter or domain experts:
One of the common mistakes data scientists make is not talking to domain experts frequently enough during the feature engineering and model building stages. Feature engineering is one of the most crucial steps of a data science project. This is the stage where you derive or create features for the ML model. Domain experts are your best friends when it comes to gaining additional insight into what features to create. For example, if I am building an ML model to detect credit card fraud, communicating with a fraud analyst will help me collect insights on what information they look for in the data to decide whether a transaction is fraudulent.
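For illustration, suppose the fraud analyst tells you that rapid bursts of transactions and amounts far above a customer’s usual spend are red flags. A sketch of how such insights could translate into features, assuming a pandas DataFrame with hypothetical columns `customer_id`, `timestamp`, and `amount` (continuing from the earlier data-quality sketch), might look like this:

```python
import pandas as pd

# Assumed hypothetical schema: customer_id, timestamp, amount.
transactions["timestamp"] = pd.to_datetime(transactions["timestamp"])
transactions = transactions.sort_values(["customer_id", "timestamp"])

# Feature 1: number of transactions by the same customer in the past 24 hours,
# capturing the "rapid burst of activity" signal the analyst described.
transactions["txn_count_24h"] = (
    transactions.groupby("customer_id")
    .rolling("24h", on="timestamp")["amount"]
    .count()
    .reset_index(level=0, drop=True)
)

# Feature 2: how far the current amount deviates from the customer's typical spend.
customer_avg = transactions.groupby("customer_id")["amount"].transform("mean")
transactions["amount_vs_avg"] = transactions["amount"] / customer_avg
```

The point is not these specific features, but that each one encodes something an expert already looks for by hand.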
Start with proof of concept, build MVP and iterate over it:
Always start small and iterate over a simple model, rather than trying to build everything at once. Once you identify some of the key data sources, connect with the data owners to gather a small set of sample data, explore and understand the data, and build a simple MVP (minimum viable product). There is a popular saying in the IT world: ‘If at first you don’t succeed, call it version 1.0.’ This is very true of data science projects.
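As a sketch of what such an MVP could look like on a small sample of data (using scikit-learn; the sample file, feature names, and `is_fraud` label are hypothetical and reuse the features from the earlier sketch), a simple baseline model gives you something concrete to iterate on:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical small sample extract shared by the data owners.
sample = pd.read_csv("sample_transactions.csv")

# Start with a handful of easy-to-compute features and a simple model.
features = ["amount", "txn_count_24h", "amount_vs_avg"]   # placeholder feature names
X_train, X_test, y_train, y_test = train_test_split(
    sample[features],
    sample["is_fraud"],
    test_size=0.2,
    random_state=42,
    stratify=sample["is_fraud"],
)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

print(classification_report(y_test, baseline.predict(X_test)))
```

A baseline like this is rarely the final model, but it establishes a reference point that every later iteration has to beat.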
Measure what matters:
The model evaluation stage (as shown in the figure above) is very critical because it decides whether the business question or problem statement we started with is actually answered by the model. Before we can answer whether we met the success criteria, it is important to define those criteria first. This stage also decides if the model should be deployed into production, so measuring the metrics that matter is very important.
If we consider the same example of a fraud detection model, it is important to decide on the threshold that separates the two classes: fraud vs. not fraud. If we simply rely on the default cutoff threshold of 0.5 with standard precision and recall (PR) metrics, we may not be optimizing for the right outcome; it is important to decide what to optimize for. If the business considers false positive alerts more expensive than false negatives, then we should increase the threshold to improve precision.
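To make this concrete, here is a minimal sketch (reusing the fitted `baseline` model and held-out test set from the earlier MVP sketch; the 0.90 precision target is an illustrative assumption, not a universal rule) of how you might inspect precision and recall across thresholds and pick a cutoff that favors precision:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Predicted probabilities for the positive (fraud) class.
scores = baseline.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# Suppose the business wants at least 90% precision to keep false positive alerts low.
target_precision = 0.90

# precision/recall have one more entry than thresholds; align by dropping the last point.
candidates = np.where(precision[:-1] >= target_precision)[0]
if len(candidates) > 0:
    i = candidates[0]  # lowest threshold meeting the precision target
    print(
        f"Threshold {thresholds[i]:.2f} gives precision {precision[i]:.2f} "
        f"and recall {recall[i]:.2f}"
    )
else:
    print("No threshold meets the precision target; revisit the features or the model.")
```

The exact target will vary by business, but the exercise of choosing it explicitly, rather than accepting 0.5 by default, is what ‘measure what matters’ means in practice.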
ARTICLE WRITTEN BY OUR ADVISOR
Nirmal Budhathoki
SENIOR DATA SCIENTIST – Microsoft