Skip to main content

Machine Learning Handling Missing Values

· 12 min read
Gerardo Perrucci
Software Engineer

Handling missing values is a crucial step in preparing data for machine learning. This tutorial provides examples of how to manage missing values using Python, focusing on the Pandas library. We'll import the necessary libraries, read the data, and explore various methods to handle missing values.

Machine Learning Handling Missing Values

You can check the full code in the Jupyter Notebook

Importing Libraries

We begin by importing the necessary libraries for our data manipulation and analysis tasks.

import numpy as np
import pandas as pd
  • NumPy: A fundamental package for scientific computing in Python. It provides support for arrays, matrices, and numerous mathematical functions.
  • Pandas: A powerful data manipulation and analysis library that provides data structures and functions needed to work with structured data seamlessly.

References:

Reading the Data

We read the CSV file containing the NFL play-by-play data.

data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")

You can download the dataset from the Kaggle website

During the import, a warning indicates that some columns have mixed data types. This can be addressed by specifying the dtype option or setting low_memory=False.

Output:

/tmp/ipykernel_23803/1150844578.py:1: DtypeWarning: Columns (25,51) have mixed types. Specify dtype option on import or set low_memory=False.
data = pd.read_csv("./NFLPlayByPlay2009-2017_v4.csv")

References:

Inspecting the Data

To get an overview of the data format, we inspect the first few rows of the dataframe.

data.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...  yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA  airWPA  yacWPA  Season
0 2009-09-10 2009091000 1 1 NaN 15:00 15 3600 0.0 TEN ... NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009

Counting Missing Values

We calculate the number of missing values in each column.

missing_values_per_column = data.isnull().sum()
missing_values_per_column[0:10] # taking the first ten columns

Output:

Date                0
GameID 0
Drive 0
qtr 0
down 61154
time 224
TimeUnder 0
TimeSecs 224
PlayTimeDiff 444
SideofField 528
dtype: int64

Total Missing Values

To understand the proportion of missing data, we calculate the total number of cells and the percentage of missing values.

total_cells = np.product(data.shape)
total_missing = missing_values_per_column.sum()

print('total_missing', total_missing)
print('total_cells', total_cells)
print('percentage missing', (total_missing / total_cells) * 100)

Output:

total_missing 11505187
total_cells 41584176
percentage missing 27.66722370547874

Counting Non-Missing Values

We can also count the non-missing values in each column.

data[:].count()

Output:

Date        407688
GameID 407688
Drive 407688
qtr 407688
down 346534
...
Win_Prob 382679
WPA 402147
airWPA 159187
yacWPA 158926
Season 407688
Length: 102, dtype: int64

Removing Rows with Missing Values

While not recommended, one way to handle missing values is to remove rows that contain them.

removed_rows_empty_data = data.dropna()
print(removed_rows_empty_data)

Output:

Empty DataFrame
Columns: [Date, GameID, Drive, qtr, down, time, TimeUnder, TimeSecs, PlayTimeDiff, SideofField, yrdln, yrdline100, ydstogo, ydsnet, GoalToGo, FirstDown, posteam, DefensiveTeam, desc, PlayAttempted, Yards.Gained, sp, Touchdown, ExPointResult, TwoPointConv, DefTwoPoint, Safety, Onsidekick, PuntResult, PlayType, Passer, Passer_ID, PassAttempt, PassOutcome, PassLength, AirYards, YardsAfterCatch, QBHit, PassLocation, InterceptionThrown, Interceptor, Rusher, Rusher_ID, RushAttempt, RunLocation, RunGap, Receiver, Receiver_ID, Reception, ReturnResult, Returner, BlockingPlayer, Tackler1, Tackler2, FieldGoalResult, FieldGoalDistance, Fumble, RecFumbTeam, RecFumbPlayer, Sack, Challenge.Replay, ChalReplayResult, Accepted.Penalty, PenalizedTeam, PenaltyType, PenalizedPlayer, Penalty.Yards, PosTeamScore, DefTeamScore, ScoreDiff, AbsScoreDiff, HomeTeam, AwayTeam, Timeout_Indicator, Timeout_Team, posteam_timeouts_pre, HomeTimeouts_Remaining_Pre, AwayTimeouts_Remaining_Pre, HomeTimeouts_Remaining_Post, AwayTimeouts_Remaining_Post, No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, ExPoint_Prob, TwoPoint_Prob, ExpPts, EPA, airEPA, yacEPA, Home_WP_pre, Away_WP_pre, Home_WP_post, Away_WP_post, Win_Prob, WPA, airWPA, yacWPA, ...]
Index: []
[0 rows x 102 columns]

Removing Columns with Missing Values

A more common approach is to remove columns that contain missing values.

removed_columns_empty_data = data.dropna(axis=1)
print(removed_columns_empty_data)

Output:

            Date      GameID  Drive  qtr  TimeUnder  ydstogo  ydsnet  PlayAttempted  Yards.Gained  sp  ...  AwayTeam  Timeout_Indicator  posteam_timeouts_pre HomeTimeouts_Remaining_Pre  AwayTimeouts_Remaining_Pre  HomeTimeouts_Remaining_Post  AwayTimeouts_Remaining_Post  ExPoint_Prob  TwoPoint_Prob  Season
0 2009-09-10 2009091000 1 1 15 0 0 1 39 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
1 2009-09-10 2009091000 1 1 15 10 5

1 5 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
2 2009-09-10 2009091000 1 1 15 5 2 1 -3 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
3 2009-09-10 2009091000 1 1 14 8 2 1 0 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
4 2009-09-10 2009091000 1 1 14 8 2 1 0 0 ... TEN 0 3 3 3 3 3 0.0 0.0 2009
... ... ... ... ... ... ... ... ... ... .. ... ... ... ... ... ... ... ... ... ... ...
407683 2017-12-31 2017123101 29 4 1 0 -4 1 0 0 ... CIN 1 0 3 0 2 0 0.0 0.0 2017
407684 2017-12-31 2017123101 29 4 1 14 -4 1 0 0 ... CIN 0 2 2 0 2 0 0.0 0.0 2017
407685 2017-12-31 2017123101 29 4 1 14 9 1 13 0 ... CIN 0 2 2 0 2 0 0.0 0.0 2017
407686 2017-12-31 2017123101 30 4 1 10 -1 1 -1 0 ... CIN 0 0 2 0 2 0 0.0 0.0 2017
407687 2017-12-31 2017123101 30 4 0 0 -1 1 0 0 ... CIN 0 0 2 0 2 0 0.0 0.0 2017
[407688 rows x 37 columns]

We then calculate the impact of this operation by comparing the number of columns before and after.

print("original columns: %d \n" % data.shape[1])
print("cleaned columns: %d \n" % removed_columns_empty_data.shape[1])

Output:

original columns: 102
cleaned columns: 37

Subsetting the Data

To focus on a smaller portion of the dataset, we can create a subset.

subset_nfl_data = data.loc[:, 'EPA':'Season'].head()
subset_nfl_data

Output:

         EPA    airEPA    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0 2.014474 NaN NaN 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 NaN NaN 2009
1 0.077907 -1.068169 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 -1.402760 NaN NaN 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 NaN NaN 2009
3 -1.712583 3.318841 -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2.097796 NaN NaN 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 NaN NaN 2009

Basic Filling of Missing Values

A straightforward method for handling missing values is to fill them with a specific value, such as zero.

filled_basic_data = data.fillna(0)
filled_basic_data.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0 2009-09-10 2009091000 1 1 0.0 15:00 15 3600 0.0 TEN ... 0.000000 0.485675 0.514325 0.546433 0.453567 0.485675 0.060758 0.000000 0.000000 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... 0.000000 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.000000 0.000000 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.000000 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 0.000000 0.000000 2009

Column-Based Filling

Another approach is to fill missing values based on the next valid observation in the column.

column_based_fill = data.bfill(axis=0).fillna(0)
column_based_fill.head()

Output:

         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  PlayTimeDiff  SideofField  ...    yacEPA  Home_WP_pre  Away_WP_pre  Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season
0 2009-09-10 2009091000 1 1 1.0 15:00 15 3600 0.0 TEN ... 1.146076 0.485675 0.514325 0.546433

0.453567 0.485675 0.060758 -0.032244 0.036899 2009
1 2009-09-10 2009091000 1 1 1.0 14:53 15 3593 7.0 PIT ... 1.146076 0.546433 0.453567 0.551088 0.448912 0.546433 0.004655 -0.032244 0.036899 2009
2 2009-09-10 2009091000 1 1 2.0 14:16 15 3556 37.0 PIT ... -5.031425 0.551088 0.448912 0.510793 0.489207 0.551088 -0.040295 0.106663 -0.156239 2009
3 2009-09-10 2009091000 1 1 3.0 13:35 14 3515 41.0 PIT ... -5.031425 0.510793 0.489207 0.461217 0.538783 0.510793 -0.049576 0.106663 -0.156239 2009
4 2009-09-10 2009091000 1 1 4.0 13:27 14 3507 8.0 PIT ... 0.163935 0.461217 0.538783 0.558929 0.441071 0.461217 0.097712 -0.010456 0.006029 2009

References:

These steps provide a comprehensive guide to identifying and handling missing values in a dataset, ensuring the data is ready for analysis and modeling. Each method has its pros and cons, and the choice of method depends on the specific context and requirements of your analysis.

Extra exercise from kaggle

This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.


In this exercise, you'll apply what you learned in the Handling missing values tutorial.

1) Take a first look at the data

Run the next code cell to load in the libraries and dataset you'll use to complete the exercise.

# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
sf_permits = pd.read_csv("./Building_Permits.csv")

# set seed for reproducibility
np.random.seed(0)

/tmp/ipykernel_33/3534875831.py:6: DtypeWarning: Columns (22,32) have mixed types. Specify dtype option on import or set low_memory=False. sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")

You can download the dataset from the Kaggle website.

Use the code cell below to print the first five rows of the sf_permits DataFrame.

# TODO: Your code here!
sf_permits.head()
Permit NumberPermit TypePermit Type DefinitionPermit Creation DateBlockLotStreet NumberStreet Number SuffixStreet NameStreet Suffix...Existing Construction TypeExisting Construction Type DescriptionProposed Construction TypeProposed Construction Type DescriptionSite PermitSupervisor DistrictNeighborhoods - Analysis BoundariesZipcodeLocationRecord ID
02015050655194sign - erect05/06/20150326023140NaNEllisSt...3.0constr type 3NaNNaNNaN3.0Tenderloin94102.0(37.785719256680785, -122.40852313194863)1380611233945
12016041951464sign - erect04/19/20160306007440NaNGearySt...3.0constr type 3NaNNaNNaN3.0Tenderloin94102.0(37.78733980600732, -122.41063199757738)1420164406718
22016052786093additions alterations or repairs05/27/201605952031647NaNPacificAv...1.0constr type 11.0constr type 1NaN3.0Russian Hill94109.0(37.7946573324287, -122.42232562979227)1424856504716
32016110721668otc alterations permit11/07/201601560111230NaNPacificAv...5.0wood frame (5)5.0wood frame (5)NaN3.0Nob Hill94109.0(37.79595867909168, -122.41557405519474)1443574295566
42016112835296demolitions11/28/20160342001950NaNMarketSt...3.0constr type 3NaNNaNNaN6.0Tenderloin94102.0(37.78315261897309, -122.40950883997789)144548169992

5 rows × 43 columns

Correct:

The first five rows of the data does show that several columns have missing values. You can see this in the "Street Number Suffix", "Proposed Construction Type" and "Site Permit" columns, among others.

2) How many missing data points do we have?

What percentage of the values in the dataset are missing? Your answer should be a number between 0 and 100. (If 1/4 of the values in the dataset are missing, the answer is 25.)

# TODO: Your code here!
total_cells = np.product(sf_permits.shape)
missing_values_count = sf_permits.isnull().sum();

total_missing_cells = missing_values_count.sum();

# print("shape", sf_permits.shape)
print('total cells', total_cells)
# print('total missing per column', missing_values_count)
print('total missing cells', total_missing_cells)

percent_missing = (total_missing_cells / total_cells) * 100

print('percent missing', percent_missing)

total cells 8552700 total missing cells 2245941 percent missing 26.26002315058403

Correct

3) Figure out why the data is missing

Look at the columns "Street Number Suffix" and "Zipcode" from the San Francisco Building Permits dataset. Both of these contain missing values.

  • Which, if either, are missing because they don't exist?
  • Which, if either, are missing because they weren't recorded?

Once you have an answer, run the code cell below.

Correct:

If a value in the "Street Number Suffix" column is missing, it is likely because it does not exist. If a value in the "Zipcode" column is missing, it was not recorded.

4) Drop missing values: rows

If you removed all of the rows of sf_permits with missing values, how many rows are left?

Note: Do not change the value of sf_permits when checking this.

# TODO: Your code here!
total_rows = sf_permits.shape[0]
total_rows_after_drop= sf_permits.dropna().shape[0]

print(total_rows)
print(total_rows_after_drop) # no rows

198900 0

Correct:

There are no rows remaining in the dataset!

5) Drop missing values: columns

Now try removing all the columns with empty values.

  • Create a new DataFrame called sf_permits_with_na_dropped that has all of the columns with empty values removed.
  • How many columns were removed from the original sf_permits DataFrame? Use this number to set the value of the dropped_columns variable below.
sf_permits_with_na_dropped = sf_permits.dropna(axis=1)

dropped_columns = sf_permits.shape[1] - sf_permits_with_na_dropped.shape[1]

6) Fill in missing values automatically

Try replacing all the NaN's in the sf_permits data with the one that comes directly after it and then replacing any remaining NaN's with 0. Set the result to a new DataFrame sf_permits_with_na_imputed.

sf_permits_with_na_imputed = sf_permits.bfill(axis=0).fillna(0)

More practice

  • Check out this noteboook on handling missing values using scikit-learn's imputer.
  • Look back at the "Zipcode" column in the sf_permits dataset, which has some missing values. How would you go about figuring out what the actual zipcode of each address should be? (You might try using another dataset. You can search for datasets about San Fransisco on the Datasets listing.)