Handling missing data

Missing data is a common problem in real-life scenarios. In areas like machine learning and data mining, missing values degrade data quality and can severely hurt the accuracy of model predictions, so treating them is a major focus when building accurate and valid models.

Missing data in DataFrames is represented as NaN (Not a Number) values.

Checking Missing Values

To check for NaN values we can use the DataFrame's .info() method, which gives a basic summary of the DataFrame, or we can use isnull() or notnull().

In the .info() output for this dataset there are 1309 total entries, but the cabin, embarked, boat, age, and body columns each have fewer than 1309 non-null values, which means they contain missing values.
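The same check can be sketched on a small frame. This is a minimal illustration, not the 1309-row dataset discussed above: the column names are borrowed from the text, while the rows and values are invented for the example.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data: column names echo the text, values are made up.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "age": [29.0, np.nan, 41.0, np.nan],
    "cabin": ["B5", None, None, "C22"],
})

# info() writes its summary to stdout by default; capture it in a buffer
# so it can be printed or inspected later.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)

# count() gives the non-null count per column, the same numbers info() lists.
non_null = df.count()
print(non_null)
```

Any column whose non-null count is lower than the total number of rows contains missing values.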

isnull() and notnull() return True or False for each entry in a particular column. isnull() returns True if the value is NaN and False if the value is not NaN. The opposite is the case with notnull().
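A short sketch of both methods on an invented age column. Summing the boolean result is a common way to count the missing entries, since True counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to show the True/False pattern.
ages = pd.Series([29.0, np.nan, 41.0, np.nan], name="age")

is_missing = ages.isnull()    # True where the value is NaN
is_present = ages.notnull()   # the exact opposite mask

# True is counted as 1, so the sum is the number of missing entries.
missing_count = int(is_missing.sum())

print(is_missing.tolist())
print(is_present.tolist())
print(missing_count)
```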

There are three main ways to deal with missing values.

  1. Leave them as they are
  2. Drop them
  3. Fill them in

Drop Missing Values

The dropna() method deletes every row that contains even a single NaN value.

On this dataset that removes all of the data, because every row has at least one null value. Dropping rows is therefore only a reasonable way to deal with missing values when it costs just a few entries.
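The effect can be reproduced on a small invented frame in which every row has at least one gap. The subset argument shown afterwards is one way to limit the loss; it is offered here as an option, not as the approach the original example took.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which every row is missing something.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dan"],
    "age": [29.0, np.nan, 41.0, 35.0],
    "cabin": ["B5", "C1", None, None],
    "body": [np.nan, np.nan, np.nan, 135.0],
})

# Default behaviour: drop any row containing at least one NaN.
dropped_any = df.dropna()

# Only drop rows where a chosen column is missing.
dropped_subset = df.dropna(subset=["age"])

print(len(df), len(dropped_any), len(dropped_subset))
```

dropna() returns a new DataFrame, so the original df keeps all of its rows.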

Filling Missing Values with .fillna()

  • We can fill missing values with the .fillna() method
  • The fill can be a user-provided value, or
  • A summary statistic such as the mean or median

For example, the missing values in the cabin column can be filled with the string 'cabin'. We can also select multiple columns at once with double brackets, [[ ]], and fill them together.
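A minimal sketch of both cases, assuming a small invented frame. The fill strings 'cabin' and 'unknown' are arbitrary placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two text columns.
df = pd.DataFrame({
    "cabin": ["B5", None, "C22"],
    "boat": [None, "11", None],
    "age": [29.0, np.nan, 41.0],
})

# Fill a single column with a user-provided value.
cabin_filled = df["cabin"].fillna("cabin")

# Select several columns with [[ ]] and fill them in one call.
both_filled = df[["cabin", "boat"]].fillna("unknown")

print(cabin_filled.tolist())
print(both_filled)
```

fillna() returns a new object, so the result has to be assigned back (for example df["cabin"] = cabin_filled) if the change should persist in df.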

Fill Missing Value with the Statistic

Filling missing values with a statistic is also known as imputation.

  • We have to be careful when using a summary statistic as the fill value
  • We have to make sure that the value we are filling in makes sense
  • The median is the better choice in the presence of outliers, because it is far less affected by extreme values than the mean

Other summary statistics, such as std, median, or quantile, can be passed to fillna() in the same way, although a measure of spread like std rarely makes sense as a fill value.
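The difference between mean and median filling shows up clearly once an outlier is present. The ages below are invented for the illustration, with 95 acting as the outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one missing value and one outlier (95).
ages = pd.Series([22.0, 25.0, np.nan, 30.0, 95.0], name="age")

# mean() and median() skip NaN by default.
mean_value = ages.mean()      # (22 + 25 + 30 + 95) / 4
median_value = ages.median()  # midpoint of 25 and 30

mean_filled = ages.fillna(mean_value)
median_filled = ages.fillna(median_value)

print(mean_filled.tolist())
print(median_filled.tolist())
```

The mean is pulled up to 43.0 by the single outlier, while the median of 27.5 stays close to the typical values.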

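We can also fill missing values from the nearest non-null values in the same column. For that, fillna() has a 'method' argument that accepts 'pad' and 'bfill'. Recent pandas releases deprecate this argument in favor of the dedicated .ffill() and .bfill() methods.

  • pad: carries the last valid value forward into the gap
  • bfill: carries the next valid value backward into the gap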
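Both directions can be sketched on an invented series. The example calls ffill() and bfill() directly, which give the same result as the 'pad' and 'bfill' options of fillna() and also work on pandas versions where the method argument is no longer available.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle and one at the end.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it.
# The trailing NaN has nothing after it, so it stays missing.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

A gap at the very start of a column is left untouched by a forward fill for the same reason, so it can be worth checking isnull() again afterwards.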
