Clean Messy and Missing Data in Tables

This example shows how to find, clean, and delete table rows with missing data.

Create and Load Sample Data
Summarize Table
Find Rows with Missing Values
Replace Missing Value Indicators
Remove Rows with Missing Values
Organize Data

Create and Load Sample Data

Create a comma-separated text file, messy.csv, that contains the following data.

A,B,C,D,E
afe1,3,yes,3,3
egh3,.,no,7,7
wth4,3,yes,3,3
atn2,23,no,23,23
arg1,5,yes,5,5
jre3,34.6,yes,34.6,34.6
wen9,234,yes,234,234
ple2,2,no,2,2
dbo8,5,no,5,5
oii4,5,yes,5,5
wnk3,245,yes,245,245
abk6,563,,563,563
pnj5,463,no,463,463
wnn3,6,no,6,6
oks9,23,yes,23,23
wba3,,yes,NaN,14
pkn4,2,no,2,2
adw3,22,no,22,22
poj2,-99,yes,-99,-99
bas8,23,no,23,23
gry5,NA,yes,NaN,21

There are many different missing data indicators in messy.csv.

Empty character vector ('')
period (.)
NA
NaN
-99

Create a table from the comma-separated text file. To specify character vectors to treat as empty values, use the 'TreatAsEmpty' name-value pair argument with the readtable function.

T = readtable('messy.csv',...
              'TreatAsEmpty',{'.','NA'})

T = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     -99    'yes'     -99     -99
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

T is a table with 21 rows and five variables. 'TreatAsEmpty' only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'.

Summarize Table

View the data type, description, units, and other descriptive statistics for each variable by creating a table summary using the summary function.

summary(T)

Variables:

    A: 21×1 cell array of character vectors

    B: 21×1 double
        Values:

            min        -99 
            median     14  
            max        563 
            NaNs       3   

    C: 21×1 cell array of character vectors

    D: 21×1 double
        Values:

            min        -99 
            median     7   
            max        563 
            NaNs       2   

    E: 21×1 double
        Values:

            min       -99
            median     14
            max       563

When you import data from a file, the default is for readtable to read any variables with nonnumeric elements as a cell array of character vectors.

Find Rows with Missing Values

Display the subset of rows from the table, T, that have at least one missing value.

TF = ismissing(T,{'' '.' 'NA' NaN -99});
T(any(TF,2),:)

ans = 

      A        B       C       D      E 
    ______    ___    _____    ___    ___

    'egh3'    NaN    'no'       7      7
    'abk6'    563    ''       563    563
    'wba3'    NaN    'yes'    NaN     14
    'poj2'    -99    'yes'    -99    -99
    'gry5'    NaN    'yes'    NaN     21

readtable replaced '.' and 'NA' with NaN in the numeric variables, B, D, and E.

Replace Missing Value Indicators

Clean the data so that the missing values indicated by code -99 have the standard MATLAB^® numeric missing value indicator, NaN.

T = standardizeMissing(T,-99)

T = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'     NaN    'yes'     NaN     NaN
    'bas8'      23    'no'       23      23
    'gry5'     NaN    'yes'     NaN      21

standardizeMissing replaces three instances of -99 with NaN.

Create a new table, T2, and replace missing values with values from previous rows of the table. fillmissing provides a number of ways to fill in missing values.

T2 = fillmissing(T,'previous')

T2 = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'       3    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    'yes'     563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'      23    'yes'      23      14
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'poj2'      22    'yes'      22      22
    'bas8'      23    'no'       23      23
    'gry5'      23    'yes'      23      21

Remove Rows with Missing Values

Create a new table, T3, that contains only the rows from T without missing values.

T3 = rmmissing(T)

T3 = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'bas8'      23    'no'       23      23

T3 contains 16 rows and 5 variables.

Organize Data

Sort the rows of T3 in descending order by C, and then sort in ascending order by A.

T3 = sortrows(T2,{'C','A'},{'descend','ascend'})

T3 = 

      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'oii4'       5    'yes'       5       5
    'oks9'      23    'yes'      23      23
    'wen9'     234    'yes'     234     234
    'wnk3'     245    'yes'     245     245
    'wth4'       3    'yes'       3       3
    'adw3'      22    'no'       22      22
    'atn2'      23    'no'       23      23
    'bas8'      23    'no'       23      23
    'dbo8'       5    'no'        5       5
    'pkn4'       2    'no'        2       2
    'ple2'       2    'no'        2       2
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6

In C, the rows are grouped first by 'yes', followed by 'no'. Then in A, the rows are listed alphabetically.

Reorder the table so that A and C are next to each other.

T3 = T3(:,{'A','C','B','D','E'})

T3 = 

      A         C       B       D       E  
    ______    _____    ____    ____    ____

    'afe1'    'yes'       3       3       3
    'arg1'    'yes'       5       5       5
    'jre3'    'yes'    34.6    34.6    34.6
    'oii4'    'yes'       5       5       5
    'oks9'    'yes'      23      23      23
    'wen9'    'yes'     234     234     234
    'wnk3'    'yes'     245     245     245
    'wth4'    'yes'       3       3       3
    'adw3'    'no'       22      22      22
    'atn2'    'no'       23      23      23
    'bas8'    'no'       23      23      23
    'dbo8'    'no'        5       5       5
    'pkn4'    'no'        2       2       2
    'ple2'    'no'        2       2       2
    'pnj5'    'no'      463     463     463
    'wnn3'    'no'        6       6       6

Related Examples

Was this topic helpful?

Documentation