Skip to content

Latest commit

 

History

History
142 lines (104 loc) · 5.75 KB

Missing Data.md

File metadata and controls

142 lines (104 loc) · 5.75 KB

Back to ML

How to deal with missing data?

  • There is no particular approach for dealing with missing data (NULL, NaN, None)
  • NULL: Represents a missing or undefined value in a database. NULL is a marker to denote missing data.
  • NaN: Represents a value that is not a valid number. Typically a floating-point value.
  • None: Represents the absence of a value or a null value. It is an object of its own data type (NoneType)
  • The appropriate approach depends on your dataset (missing quantity), data type and the analysis goal.
  • Row | Observation | Tuple | Sample | Record
  • Column | Feature | Field | Attribute | Dimension

How to identify missing values?

# Check for None:
missing_values = [x is None for x in data]

# Using math library: 
import math 
missing_values = [math.isnan(x) for x in data]

# Using Pandas library:
import pandas as pd 
missing_values = df.isnull().any()
missing_counts = df.isnull().sum()

# Using NumPy library:
import numpy as np
missing_values = np.isnan(array)

How to handle missing values?

1. dropna(): Drop Missing Values

  • If the missing data is negligible and doesn't affect the overall analysis, drop the corresponding rows or columns.
  • Drop rows if missing values < 5% i.e. (axis = 0) | Drop columns if missing values > 70% i.e. (axis = 1)
  • Deleting irrelevant rows or columns helps to get a robust model.
  • But it's better to keep data rather than dropping, removing data may lead to loss of information.
# DataFrame.dropna():
df.dropna()

# DataFrame.Series.dropna():
df['Sales'].dropna()

# Drop rows:
df.dropna(axis=0)

# Drop the row if, any attribute value is NaN:
df.dropna(axis=0, how='any')

# Drop the row if, all the attribute values are NaN:
df.dropna(axis=0, how='all')

# Drop columns:
df.dropna(axis=1)

2. fillna(): Fill/Impute Missing Values

  • Imputation involves estimating missing values with the help of other available (non missing) rows or columns.
  • Impute the numerical missing data with the sample mean or median (SimpleImputer: strategy = 'mean' or 'median')
  • Impute the categorical missing data with the sample most frequent values (SimpleImputer: strategy = 'most_frequent')
  • SimpleImputer() is used to fill in the missing values (Univariate imputation)
  • fit(): Learn the values (Mean, Median, Mode) to be imputed and transform(): Fill in the missing values.
  • KNNImputer(): Fill missing data with the help of the K Nearest Neighbours.
  • fit_transform(): Learn and impute the values in place. Only apply on the train set.
  • Never apply fit_transform() on the test set, it will cause data leakage.
# DataFrame.fillna()
df.fillna('🌞')

# DataFrame.Series.fillna()
df['Sales'].fillna(0)

Data Leakage

  • Data was accidentally shared from the train set to the test set.

Disadvantage

  • Imputation changes the distribution and the statistical properties of the dataset.
  • Statistical Properties: Mean, median, mode, variance, standard deviation of the sample.
  • Imputation might bring skewness and new outliers in the dataset.
  • Imputation changes the correlation among features and the impact of independent variables on the target variable.

3. Assign a Unique Category (Categorical Data) | Flag (Numeric Value)

  • Assign a unique category for data with missing values or assign with a "Missing" flag or some binary flag.
  • Flag the numeric missing data with -1 or 0.
  • Create a difference between missing data and remaining non-missing data.

4. Predict Missing Value

  • Fill missing data with the help of other features by predicting them (Multivariate imputation)
  • Use the non-missing data (rows) as the train set and missing data (rows) as the test set.
  • Interpolation: Predict missing data with the range of date and time (Time series forecasting)
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("train.csv")
data = data[["Survived", "Sex", "Age"]]

test_set = data[data["Age"].isnull()] # Missing data will be test set.

data.dropna(inplace=True) # Remaining data (Non-Null) will be used for training the model.

y_train = data["Age"]
X_train = data.drop("Age", axis=1)
X_test = test_set.drop("Age", axis=1)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Create Test Set = 20% of Dataset.
from sklearn.model_selection import train_test_split

# X: Independent Variables, y: Dependent Variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

5. Use algorithms that work fine with missing values

  • KNN: KNN fills the missing values by taking most of the K nearest values (Focus on nearest neighbours)
  • Random forest: Weak learners are trained by non-missing data and missing values can be used for testing.
  • SVM: SVM focuses on the support vectors (data points) nearest to the hyperplane.
  • Experiment with different algorithms and compare their performance on specific datasets.

Domain Knowledge

  • It'll help us understand the reason for the missing data. Whether the data is missing at random or due to some error.
  • Understanding the cause can help us to decide on how to handle the missing data in much better.

Back to Questions