According to a Gartner analysis report, about 85% of Data Science (DS) projects fail. Here, failure means the Machine Learning (ML) models remain limited to exploration in notebooks, without ever making it to production.
What are the major reasons behind this? Is there something we can learn from all these failures?
“Once is a mistake. Twice is a decision. Any more than that has no chance of being forgiven.”
It is important to understand the difference between Exploratory Data Science and Operational Data Science. Most companies are quite good at the exploratory part; what they are missing is the operational side.
Exploratory Data Science covers building proof-of-concepts and experimentation, while Operational Data Science covers deploying models into production and integrating them with the application. To operationalize models, the Data Science team works with Data Engineers, Software Engineers, UI/UX designers, and DevOps engineers.
One of the main reasons for failure is that companies jump straight into the AI race before they are data ready. However, there are many other reasons related to the operational side of DS that companies overlook or fail to plan for ahead of time.
I have summarized those reasons into four categories with a checklist:
Data Observability (DO)
This refers to an organization’s comprehensive understanding of the health and performance of its data. A Data Science project without DO is like trying to make a pizza without dough. In other words, DO provides the foundational building blocks for data science, because every DS project begins with data. Some of the important things a good DO practice should cover are: stability of the data pipeline, consistency of data, quality of data, a well-defined schema or data dictionary for all data sources, and data lineage to support monitoring of the pipeline. Having efficient DO also serves as a prerequisite for data governance and for implementing policies that cover data privacy.
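As a concrete illustration, the schema and quality checks above can be automated per batch. The sketch below is a minimal, hand-rolled version (function and field names like `check_batch` and `user_id` are made up for illustration; in practice a dedicated tool would handle this):

```python
# Minimal data-observability sketch: schema conformance and null-rate
# checks over one batch of records arriving as a list of dicts.
EXPECTED_SCHEMA = {"user_id": int, "event_ts": str, "amount": float}

def check_batch(records, max_null_rate=0.05):
    """Return simple observability signals for one batch of records."""
    issues = []
    # Schema check: every record must have the expected fields and types.
    for i, rec in enumerate(records):
        for fname, ftype in EXPECTED_SCHEMA.items():
            if fname not in rec:
                issues.append(f"row {i}: missing field '{fname}'")
            elif rec[fname] is not None and not isinstance(rec[fname], ftype):
                issues.append(f"row {i}: '{fname}' is not {ftype.__name__}")
    # Completeness check: fraction of null values per field.
    null_rates = {
        fname: sum(r.get(fname) is None for r in records) / max(len(records), 1)
        for fname in EXPECTED_SCHEMA
    }
    for fname, rate in null_rates.items():
        if rate > max_null_rate:
            issues.append(f"field '{fname}': null rate {rate:.0%} exceeds threshold")
    return {"rows": len(records), "null_rates": null_rates, "issues": issues}

batch = [
    {"user_id": 1, "event_ts": "2024-01-01", "amount": 9.99},
    {"user_id": 2, "event_ts": None, "amount": 5.00},
]
report = check_batch(batch)
```

Running such checks at every pipeline stage, and alerting on the `issues` list, is what turns raw pipelines into observable ones.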
Problem Statement
The first stage of any data science or machine learning workflow is the problem definition. It is important to define the problem statement in a clear and concise way before beginning any DS project. One of the common mistakes I have seen is defining the problem only from the ML standpoint, without considering the business side. Data Scientists must collaborate with the product team during this stage. If possible, this is the stage where we want to partner with product managers (PMs).
This stage involves more than just the problem definition. This is where Data Scientists need to put on an engineering hat and start thinking about the technical requirements for the solution design. The requirements can be captured in different forms:
Project requirements: If you are working with PMs, ask for the PRD (Project Requirements Document). If no PM is assigned, I strongly recommend capturing a PRD on your own. It does not have to be as detailed as one a PM would write, but at a minimum it should cover the project goals, estimated timelines, and the business impact.
Functional requirements: This document is different from the project requirements; it captures the technical details of the functionality of the tool or feature being built. Functional requirements specify what the system should do, including the input data, the processing logic, and the expected outputs.
Business requirements: This includes identifying the business problem, the desired outcomes, and the strategic value of the project. Collaborating with business leaders to define these requirements ensures that the project delivers measurable business value.
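One lightweight way to keep all three requirement types in a single reviewable artifact is a small structured record checked into the repo. The field names below are illustrative, not a standard template, and assume a hypothetical churn project:

```python
# A hypothetical, minimal requirements record covering project,
# functional, and business requirements in one place.
from dataclasses import dataclass

@dataclass
class ProjectBrief:
    goal: str               # project requirement: what and why
    timeline_weeks: int     # project requirement: estimated timeline
    inputs: list            # functional requirement: input data sources
    output: str             # functional requirement: expected output
    business_outcome: str   # business requirement: measurable value

brief = ProjectBrief(
    goal="Reduce monthly customer churn",
    timeline_weeks=12,
    inputs=["crm_events", "billing_history"],
    output="daily churn-risk score per customer",
    business_outcome="retain 5% more at-risk customers per quarter",
)
```

Even a record this small forces the three conversations (scope, function, value) to happen before a single model is trained.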
Success Metrics
Success metrics are pivotal in determining the value and impact of a data science project. They help bridge the gap between technical performance and business outcomes, ensuring that projects align with organizational goals and deliver tangible benefits. When it comes to success metrics, it is important to understand the distinction between model metrics and business metrics.
Model Metrics: Model metrics are crucial for evaluating the technical performance of a machine learning model. These metrics include accuracy, precision, recall, F1 score, AUC-ROC, and more, depending on the problem type (classification, regression, etc.). While these metrics are essential for understanding how well a model performs on a given dataset, they often do not reflect the real-world impact of the model.
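For a binary classifier, the core metrics named above reduce to simple counts over the confusion matrix. In practice a library such as scikit-learn provides these, but the arithmetic is small enough to show directly:

```python
# Model metrics for a binary classifier, computed from true/predicted
# labels (1 = positive class, 0 = negative class).
def classification_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
```

Note that all four numbers describe performance on a held-out dataset only; none of them says anything yet about business impact.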
Business Metrics: Business metrics measure the actual value a model brings to the organization. These could include increased revenue, cost savings, customer satisfaction, and operational efficiency. For instance, a model with high accuracy in predicting customer churn is valuable only if it translates into effective retention strategies that boost customer retention rates and revenue.
Balancing Both: To ensure a project’s success, it’s vital to balance model and business metrics. A technically excellent model might still fail if it doesn’t meet business objectives. Conversely, a model that slightly compromises on technical metrics but excels in business impact can be considered a success.
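One concrete way to balance the two: choose the model's decision threshold by expected business value rather than by a model metric like F1. The dollar figures below ($150 retention offer, $400 saved per true churner retained) are made-up illustrative numbers for a hypothetical churn model:

```python
# Pick the score threshold that maximizes expected business value,
# not the one that maximizes a model metric.
def expected_value(y_true, scores, threshold, offer_cost=150, saved_value=400):
    value = 0.0
    for truth, score in zip(y_true, scores):
        if score >= threshold:       # we send a retention offer
            value -= offer_cost      # cost of the offer
            if truth == 1:           # true churner retained
                value += saved_value
    return value

y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
best_value, best_threshold = max(
    (expected_value(y_true, scores, t), t) for t in [0.1, 0.5, 0.7]
)
```

Under these assumed payoffs the middle threshold wins, even though a more aggressive or conservative cut might score better on recall or precision alone.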
Engineering Support
Engineering support is crucial for the seamless execution and deployment of data science projects. It involves having the right team, infrastructure, and budget to support the project from inception to production.
Data team composition: Data Scientists, Data Engineers and ML Engineers.
Data Scientists are responsible for driving data science projects from ideation and problem formulation through building the proof of concept and developing the predictive model. Data Scientists own the machine learning model they build and deliver. A main point of this article is that Data Scientists alone do not make a complete data team.
Data Engineers: Data Engineers are responsible for designing, building, and maintaining the data infrastructure. They ensure that data is collected, processed, and stored efficiently, enabling data scientists to access clean and reliable data. Data engineers play a critical role in managing data pipelines and ETL processes, and in ensuring data quality.
ML Engineers: Machine learning engineers focus on deploying and maintaining machine learning models in production. They work closely with data scientists to convert models into scalable and reliable systems. Their responsibilities include model optimization, performance monitoring, and handling issues related to model drift and retraining.
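One common drift signal ML engineers monitor is the Population Stability Index (PSI), which compares a feature's distribution at training time to what is seen in production. The bucket fractions and the 0.2 alert threshold below are illustrative conventions, not fixed rules:

```python
# Population Stability Index (PSI) over pre-binned fraction histograms;
# values above ~0.2 are commonly treated as a drift alert.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_fracs, actual_fracs)
    )

train_dist = [0.25, 0.25, 0.25, 0.25]  # fractions per bucket at training time
prod_dist = [0.10, 0.20, 0.30, 0.40]   # fractions observed in production
drift = psi(train_dist, prod_dist)
needs_retraining = drift > 0.2
```

Wiring a check like this into the serving pipeline, per feature, is a large part of what "handling model drift and retraining" means in practice.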
Successful data science projects require seamless collaboration between data scientists, data engineers, and ML engineers.
Budgeting for Data Science Projects Under Engineering:
Resource Allocation: Allocating the right resources is essential for the success of data science projects. This includes budgeting for personnel, infrastructure, tools, and software. Ensuring that the project has adequate funding throughout its lifecycle helps in mitigating risks and addressing challenges promptly.
Infrastructure Costs: Building and maintaining the infrastructure for data science projects can be expensive. This includes costs related to cloud services, data storage, and computational resources. It’s important to plan and allocate budget for these expenses to avoid disruptions during the project.
Tooling and Software: Investing in the right tools and software can significantly enhance the productivity and efficiency of the team. This includes data processing tools, machine learning frameworks, version control systems, and collaboration platforms. Ensuring that the team has access to the latest and most effective tools can drive better project outcomes.
Additional Considerations
Change Management: Managing changes effectively is crucial for the success of data science projects. This includes handling changes in project scope, data sources, and business requirements. Implementing a robust change management process helps in adapting to new challenges and opportunities without derailing the project.
Stakeholder Engagement: Engaging stakeholders throughout the project lifecycle ensures that their expectations are met and their feedback is incorporated. Regular updates, presentations, and demos can help in maintaining transparency and building trust with stakeholders.
By focusing on these aspects, organizations can increase the likelihood of their data science projects making it to production and delivering meaningful business impact.