
Machine Learning Visualization Missing Values

· 4 min read
Gerardo Perrucci
Software Engineer

Visualizing Missing Data: A Step-by-Step Guide

Handling missing data is crucial in data analysis and machine learning. Visualizing missing data helps to understand the extent and pattern of missingness, which can inform the choice of strategies for dealing with it. In this guide, we'll use Python and the missingno library to visualize missing data in a dataset.

Visualizing missing values with missingno

You can download the Jupyter notebook for this example: Visualizing Missing Data

1. Importing Libraries

First, we need to import the necessary libraries for data manipulation and visualization.

import pandas as pd
import numpy as np
import missingno as msno
  • Pandas: A powerful data manipulation library.
  • NumPy: A fundamental package for numerical computations.
  • Missingno: A library for visualizing missing data.

2. Loading the Data

We load the dataset into a Pandas DataFrame. For this example, we'll use the building permits data in Building_Permits.csv, the same data used in the previous article, Machine Learning Handling Missing Values. The file does not ship with missingno; you can download it from the Kaggle website.

# low_memory=False lets pandas read the whole file before inferring dtypes,
# which avoids the DtypeWarning about mixed types in columns 22 and 32.
sf_permits = pd.read_csv("./Building_Permits.csv", low_memory=False)

sf_permits.head()

Output:

| | Permit Number | Permit Type | Permit Type Definition | Permit Creation Date | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | ... | Existing Construction Type | Existing Construction Type Description | Proposed Construction Type | Proposed Construction Type Description | Site Permit | Supervisor District | Neighborhoods - Analysis Boundaries | Zipcode | Location | Record ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201505065519 | 4 | sign - erect | 05/06/2015 | 0326 | 023 | 140 | NaN | Ellis | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380611233945 |
| 1 | 201604195146 | 4 | sign - erect | 04/19/2016 | 0306 | 007 | 440 | NaN | Geary | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 3.0 | Tenderloin | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420164406718 |
| 2 | 201605278609 | 3 | additions alterations or repairs | 05/27/2016 | 0595 | 203 | 1647 | NaN | Pacific | Av | ... | 1.0 | constr type 1 | 1.0 | constr type 1 | NaN | 3.0 | Russian Hill | 94109.0 | (37.7946573324287, -122.42232562979227) | 1424856504716 |
| 3 | 201611072166 | 8 | otc alterations permit | 11/07/2016 | 0156 | 011 | 1230 | NaN | Pacific | Av | ... | 5.0 | wood frame (5) | 5.0 | wood frame (5) | NaN | 3.0 | Nob Hill | 94109.0 | (37.79595867909168, -122.41557405519474) | 1443574295566 |
| 4 | 201611283529 | 6 | demolitions | 11/28/2016 | 0342 | 001 | 950 | NaN | Market | St | ... | 3.0 | constr type 3 | NaN | NaN | NaN | 6.0 | Tenderloin | 94102.0 | (37.78315261897309, -122.40950883997789) | 144548169992 |

5 rows × 43 columns

3. Matrix Plot

The matrix plot draws one vertical strip per column and one thin horizontal line per row. A cell is filled (dark) where the value is present and left white where it is missing. The sparkline on the right side of the plot summarizes how complete each row is.

msno.matrix(sf_permits)

Output: Matrix Plot

Why is it important? The matrix plot helps identify patterns in the missing data, such as whether missingness occurs at random or follows a specific pattern.
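
To read the same information as numbers, the grid behind the matrix plot can be sketched with pandas alone. The toy DataFrame and its column names below are made up for illustration; with the real data you would use sf_permits in place of toy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for sf_permits.
toy = pd.DataFrame({
    "street_number": [140, 440, 1647, 1230, 950],
    "street_suffix": ["St", None, "Av", None, "St"],
    "site_permit": [np.nan, np.nan, "Y", np.nan, np.nan],
})

# True where a value is present, False where it is missing.
# This is the grid of filled and white cells that the matrix plot draws.
present = toy.notnull()

# Number of filled cells in each row, comparable to the sparkline
# on the right-hand side of the matrix plot.
row_completeness = present.sum(axis=1)

print(present)
print(row_completeness.tolist())
```

On a large table the individual rows become too thin to read, so it can help to plot a random subset, for example msno.matrix(sf_permits.sample(500)); the sample size here is an arbitrary choice.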

4. Bar Plot

The bar plot shows the number of non-missing (present) data points for each column.

msno.bar(sf_permits)

Output: Bar Plot

Why is it important? The bar plot provides a quick overview of the completeness of each column, highlighting columns with a high proportion of missing data.
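
The heights of the bars can also be computed directly, which is handy when you want to sort or filter columns by completeness. This is a minimal sketch on made-up data; the toy frame and its column names are assumptions, not part of the permits dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for sf_permits.
toy = pd.DataFrame({
    "zipcode": [94102.0, 94102.0, np.nan, 94109.0],
    "street_suffix": ["St", None, None, "Av"],
    "site_permit": [np.nan, np.nan, np.nan, "Y"],
})

# Count of non-missing values per column (the quantity the bar plot shows).
non_missing = toy.notnull().sum()

# The same counts as a share of all rows.
fraction = non_missing / len(toy)

summary = pd.DataFrame({"non_missing": non_missing, "fraction": fraction})

# Least complete columns first.
print(summary.sort_values("fraction"))
```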

5. Heatmap

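The heatmap shows the correlation of missingness (nullity correlation) between pairs of columns, on a scale from -1 to 1. A value near 1 means that when one column is missing, the other tends to be missing too. A value near -1 means that when one column is missing, the other tends to be present. A value near 0 means the missingness of the two columns is unrelated. Columns that are always filled or always empty carry no such information and are left out of the plot.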

msno.heatmap(sf_permits)

Output: Heatmap

Why is it important? The heatmap helps identify relationships in missingness between columns, which can inform decisions on how to handle missing data, such as imputing missing values based on related columns.
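
The numbers in the heatmap can be approximated by correlating the missing-value masks of the columns. The sketch below uses pandas only and a made-up frame whose column names (a, b, c, d) are placeholders. It is meant to show the idea, not to reproduce missingno's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is missing in the same rows as a,
# c is missing exactly where a is present, and d is never missing.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
    "d": [1, 2, 3, 4, 5, 6],
})

# True where a value is missing.
mask = toy.isnull()

# Drop columns whose missingness never varies (always filled or always
# empty); their correlation with anything else is undefined.
mask = mask.loc[:, mask.nunique() > 1]

# Pearson correlation between the 0/1 missingness indicators.
nullity_corr = mask.astype(int).corr()

print(nullity_corr.round(2))
```

Here a and b correlate at 1 and a and c at -1, while d does not appear because it has no missing values.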

6. Dendrogram

The dendrogram clusters columns based on the similarity of their missing data patterns.

msno.dendrogram(sf_permits)

Output: Dendrogram

Why is it important? The dendrogram helps identify groups of columns with similar missing data patterns, which can be useful for imputation or for understanding the underlying structure of the data.
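
One way to build intuition for "similar missing data patterns" is to count, for each pair of columns, the rows in which exactly one of the two is missing. The sketch below does this with NumPy on made-up data; the column names are placeholders. It does not build the tree itself, and it is an illustration of the distance idea rather than missingno's own clustering routine.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two "existing_*" columns are always
# missing together, while record_id is never missing.
toy = pd.DataFrame({
    "existing_type": [3.0, 3.0, np.nan, 5.0, np.nan],
    "existing_desc": ["t3", "t3", None, "wood", None],
    "proposed_type": [np.nan, np.nan, 1.0, 5.0, np.nan],
    "record_id": [1, 2, 3, 4, 5],
})

# rows x columns array, 1 where a value is missing.
mask = toy.isnull().to_numpy().astype(int)

# disagreements[i, j] = number of rows where columns i and j differ
# in missingness (one is missing, the other is present).
disagreements = (mask[:, :, None] != mask[:, None, :]).sum(axis=0)

distance = pd.DataFrame(disagreements, index=toy.columns, columns=toy.columns)
print(distance)
```

Pairs with a distance of 0, such as existing_type and existing_desc here, have identical missingness and are the ones you would expect to be joined first in a dendrogram.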

Conclusion

Visualizing missing data is a crucial step in data preprocessing. It helps understand the extent and pattern of missingness, guiding the choice of strategies for handling missing data. By using tools like missingno, you can quickly and effectively visualize and analyze missing data in your datasets.
