Missing data occurs when no value is stored for a variable in an observation, which is common in real-world datasets. Incomplete data can skew analyses, reduce the validity of conclusions, and hinder machine learning model performance. The goal of this blog is to cover how to identify, understand, and handle missing values effectively to maintain data integrity.
Impact: Missing data can often lead to:
Biased results, especially if missingness is related to the data itself.
Reduced sample size, leading to less robust analysis.
Poor model generalization, if handled improperly.
This article is divided into the following four sections:
Identifying missing data
Types of missing data
Methods to handle missing data
Best Practices and Considerations
1. Identifying Missing Data:
Data Profiling:
We can use profiling tools like pandas-profiling or Sweetviz that generate automated reports with insights on missing values.
Example: Use pandas-profiling to generate a profile report in Python.

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")
profile = ProfileReport(df, minimal=True)
profile.to_file("data_report.html")
Visualization Techniques:
Use libraries like missingno to create heatmaps and bar plots showing missing data patterns.
import missingno as msno

msno.matrix(df)  # Matrix plot to view missing values pattern
msno.bar(df)     # Bar plot of total missing values per column
Custom Exploratory Functions:
This is my favorite: writing a custom function to display missing data counts and percentages.
def missing_data_summary(df):
    missing_data = df.isnull().sum()
    missing_percentage = (missing_data / len(df)) * 100
    return pd.DataFrame({'Missing Count': missing_data,
                         'Missing Percentage': missing_percentage})

print(missing_data_summary(df))
2. Types of Missing Data:
Missing Completely at Random (MCAR):
Definition: The missing values are independent of both observed and unobserved data.
Example: Survey data where respondents skipped random questions unrelated to their characteristics.
It can be detected with statistical tests (e.g., Little's MCAR test).
Missing at Random (MAR):
Definition: Missingness depends on observed data but not on the missing values themselves.
Example: Income data may be missing but is related to age or education level.
It can be addressed effectively with conditional imputation, since the missing values can be predicted from related observed variables.
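To make conditional imputation concrete, here is a minimal sketch using a hypothetical survey frame (the education and income columns are invented for illustration): missing incomes are filled with the median income of rows sharing the same education level.

```python
import pandas as pd

# Hypothetical survey data: income tends to be missing for some
# education groups (a MAR pattern)
df = pd.DataFrame({
    "education": ["HS", "HS", "BS", "BS", "MS", "MS"],
    "income": [30000, None, 50000, 52000, None, 70000],
})

# Conditional imputation: fill missing income with the median
# income of the same education group
df["income"] = df.groupby("education")["income"].transform(
    lambda s: s.fillna(s.median())
)
```

Group-wise medians are a simple stand-in for a full conditional model, but they already exploit the MAR structure that plain mean imputation ignores.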
Missing Not at Random (MNAR):
Definition: Missingness depends on unobserved data or the missing values themselves.
Example: People may be unwilling to disclose income if it is very high or low.
Addressing MNAR is challenging, and solutions may require domain knowledge, assumptions, or advanced modeling techniques.
3. Methods To Handle Missing Data:
Listwise Deletion:
Removing rows with missing data may be appropriate in some cases (e.g., when missing values are very rare or MCAR).
Example: If a dataset has <5% missing values, deletion may suffice, though it’s risky for larger proportions.
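A quick sketch of the rule of thumb above, using a toy frame (the data is invented): check the overall missing share first, then drop incomplete rows only if that share is small.

```python
import pandas as pd

# Toy frame with a small fraction of missing values (hypothetical data)
df = pd.DataFrame({"a": [1, 2, None, 4], "b": [5, 6, 7, 8]})

# Share of missing cells across the whole frame
missing_share = df.isnull().to_numpy().mean()

# Listwise deletion: remove any row containing a missing value
df_clean = df.dropna()
```

In practice you would guard the `dropna()` call behind a threshold check (e.g., `missing_share < 0.05`) rather than deleting unconditionally.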
Mean/Median/Mode Imputation:
For numerical data, impute using mean or median values; for categorical data, use mode.
Pros & Cons: Simple, but can introduce bias and reduce variance in the data; a model trained on it may be skewed by the distorted distribution.
df['column'] = df['column'].fillna(df['column'].mean())
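For the categorical case mentioned above, the analogous one-liner uses the mode. A minimal sketch with an invented column:

```python
import pandas as pd

# Hypothetical categorical column with a missing entry
df = pd.DataFrame({"color": ["red", "blue", "red", None, "red"]})

# Mode imputation: fill missing categories with the most frequent value
df["color"] = df["color"].fillna(df["color"].mode()[0])
```

Note that `mode()` returns a Series (there can be ties), hence the `[0]` to take the first modal value.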
Advanced Imputation Techniques:
K-Nearest Neighbors (KNN):
Use neighbors to fill missing values based on similarity.
Pros & Cons: Maintains relationship between variables, but computationally it can be expensive.
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
Multivariate Imputation by Chained Equations (MICE):
Iteratively imputes missing values by treating each column as a function of others.
Best suited for MAR, though it can also be computationally expensive.
Scikit-learn provides IterativeImputer for this (third-party packages like fancyimpute offer similar functionality).
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer()
df_imputed = imputer.fit_transform(df)
Use Machine Learning Models:
Impute missing values by training a model on non-missing values.
Example: Predict missing income based on age, education, and occupation.
Alternatively, group similar data points: find a cluster of similar people and use their non-missing salaries to compute a median or mean for the missing ones.
Pros: Potentially accurate; Cons: Requires careful validation to prevent overfitting.
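A minimal sketch of the income example above, with invented numbers: train a regressor on the rows where income is observed, then predict income for the rows where it is missing. The column names and values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: predict missing income from age and years of education
df = pd.DataFrame({
    "age":       [25, 32, 47, 51, 38, 29, 44, 55],
    "education": [12, 16, 18, 16, 14, 12, 18, 20],
    "income":    [30, 55, 90, None, 60, None, 85, 110],  # in $1000s
})

known = df[df["income"].notnull()]
unknown = df[df["income"].isnull()]

# Train on rows where income is observed, then predict the missing values
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(known[["age", "education"]], known["income"])
df.loc[df["income"].isnull(), "income"] = model.predict(
    unknown[["age", "education"]]
)
```

In a real project you would cross-validate the model on the observed rows first, which is exactly the validation step the Cons above warn about.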
Data Augmentation:
For small sample sizes, consider generating synthetic data points.
Use models like GANs for more advanced augmentation.
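Before reaching for GANs, a much simpler augmentation baseline is bootstrap resampling with small random noise ("jitter"). This sketch assumes a small purely numeric frame; the columns and noise scale are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical small numeric dataset
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 12.0, 14.0]})

# Simple augmentation: resample rows with replacement, then add
# small Gaussian noise so the synthetic rows are not exact copies
sampled = df.sample(n=5, replace=True, random_state=0)
synthetic = sampled + rng.normal(0, 0.1, size=sampled.shape)
augmented = pd.concat([df, synthetic], ignore_index=True)
```

This only makes sense for continuous features; categorical columns need a different scheme, and the noise scale should be tuned to each feature's spread.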
4. Best Practices and Considerations:
Assess Method Effectiveness: Compare imputed values with actual values when possible or test multiple imputation methods to evaluate which yields better model performance.
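One way to "compare imputed values with actual values" is to mask a portion of the known values, impute them, and score the result against the held-out truth. A minimal sketch with synthetic data (the distribution and masking rate are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical complete column; mask some known values to score an imputer
s = pd.Series(rng.normal(50, 10, size=200))
mask = rng.random(200) < 0.2   # hide roughly 20% of the values
s_missing = s.where(~mask)     # masked copy with NaNs

# Impute with the mean, then compare against the held-out true values
imputed = s_missing.fillna(s_missing.mean())
rmse = float(np.sqrt(((imputed[mask] - s[mask]) ** 2).mean()))
```

Running the same masking experiment with several imputers (mean, KNN, MICE) and comparing their RMSE is a cheap way to pick a method before touching the real missing cells.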
Use Domain Knowledge: Understanding the domain can be crucial for addressing missing values, for example in MNAR data and/or identifying appropriate imputation techniques, or deciding to drop.
Monitor Impact: Track model accuracy before and after handling missing data, and weigh the gain against the implementation cost and complexity.
Data Imbalance: Be very careful when using simple methods (e.g., upsampling) on imbalanced datasets, as they may not accurately reflect minority class values.
To summarize, missing values are an inevitable challenge, but addressing them effectively is key to successful data science projects. Understanding the type of missing data and choosing the right handling method—whether simple imputation or advanced techniques like MICE or KNN—is crucial. There’s no one-size-fits-all solution, so leveraging domain knowledge and validating your approach can ensure data integrity and reliable outcomes.
Hopefully you learned something today. Please subscribe to read more articles in data science and machine learning. Thank you :)