What are the different ways to handle missing data and invalid data in a dataset?

Ignore the data row

This is a quick solution, typically preferred when the percentage of missing values is relatively low (under 5%). It is a crude approach, because you lose data.

You can also choose to drop a row only if all of its values are missing.
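
Both ways of dropping rows can be sketched as follows. This is a minimal illustration that assumes the data lives in a pandas DataFrame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, np.nan],
    "income": [50.0, 42.0, np.nan, np.nan],
})

# Share of rows that contain at least one missing value.
missing_row_share = df.isna().any(axis=1).mean()

# Drop every row that has any missing value.
dropped_any = df.dropna()

# Drop a row only when all of its values are missing.
dropped_all = df.dropna(how="all")

print(missing_row_share)
print(dropped_any)
print(dropped_all)
```

Checking the share of affected rows first makes it easier to judge whether dropping them is acceptable for the dataset at hand.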

Back-fill or forward-fill to propagate the next or previous value, respectively

Note that a NaN will remain after forward-filling if there is no valid previous value, and after back-filling if there is no valid next value, for example at the very start or end of a column.
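
The behaviour described above, including the leftover NaN at the edges, can be seen in a short sketch. It assumes pandas and uses a made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward-fill: each gap takes the last valid value before it.
# The first element stays NaN because nothing precedes it.
forward = s.ffill()

# Back-fill: each gap takes the next valid value after it.
# The last element stays NaN because nothing follows it.
backward = s.bfill()

# Chaining both fills the edges too, as long as the series has at least one valid value.
both = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```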

Replace with a constant value outside the variable's value range, such as -999 or -1

This method is useful because it lets you group missing values into a separate category represented by the constant. It is a preferred option when it doesn't make sense to try to predict a missing value.
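
A minimal sketch of constant-value replacement, assuming pandas. The columns, the sentinel `-1`, and the label `"missing"` are hypothetical choices; the sentinel only works if it can never occur as a real value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ages are never negative, so -1 lies outside the valid range.
df = pd.DataFrame({
    "age":  [25.0, np.nan, 31.0],
    "city": ["Paris", None, "Rome"],
})

filled = df.copy()

# Numeric column: use a sentinel that cannot be a real value.
filled["age"] = df["age"].fillna(-1)

# Categorical column: make "missing" its own category.
filled["city"] = df["city"].fillna("missing")

print(filled)
```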

Replace with the mean or median value

This simple imputation method treats every variable individually, ignoring any relationships with other variables. It works well for simple linear models and neural networks (NN).
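
Per-column mean and median imputation might look like the sketch below. It assumes pandas and uses hypothetical columns; each column is filled from its own statistic only, which is the "every variable individually" behaviour described above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the 100.0 in "weight" acts as an outlier.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0],
    "weight": [50.0, np.nan, 60.0, 100.0],
})

# df.mean() and df.median() skip NaN by default and return one value per column;
# fillna then matches those values to columns by name.
mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

print(mean_filled)
print(median_filled)
```

In this toy data the two results differ for `weight`, because the median is less sensitive to the outlier than the mean.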
