Clean Messy and Missing Data in Tables

This example shows how to find, clean, and delete table rows with missing data.

Load Sample Data

Load sample data from a comma-separated text file, messy.csv. The file contains many different missing data indicators:

Empty character vector ('')
period (.)
NA
NaN
-99

To specify the character vectors to treat as empty values, use the 'TreatAsEmpty' name-value pair argument with the readtable function.

T = readtable('messy.csv','TreatAsEmpty',{'.','NA'})

T=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14

T is a table with 21 rows and five variables. 'TreatAsEmpty' only applies to numeric columns in the file and cannot handle numeric literals, such as '-99'.

Summarize Table

View the data type, description, units, and other descriptive statistics for each variable by creating a table summary using the summary function.

summary(T)

Variables:

    A: 21x1 cell array of character vectors

    B: 21x1 double

        Values:

            Min            -99 
            Median         14  
            Max            563 
            NumMissing     3   

    C: 21x1 cell array of character vectors

    D: 21x1 double

        Values:

            Min            -99 
            Median         7   
            Max            563 
            NumMissing     2   

    E: 21x1 double

        Values:

            Min       -99
            Median     14
            Max       563

When you import data from a file, the default is for readtable to read any variables with nonnumeric elements as a cell array of character vectors.

Find Rows with Missing Values

Display the subset of rows from the table, T, that have at least one missing value.

TF = ismissing(T,{'' '.' 'NA' NaN -99});
T(any(TF,2),:)

ans=5×5 table
      A        B       C       D      E 
    ______    ___    _____    ___    ___

    'egh3'    NaN    'no'       7      7
    'abk6'    563    ''       563    563
    'wba3'    NaN    'yes'    NaN     14
    'poj2'    -99    'yes'    -99    -99
    'gry5'    NaN    'yes'    NaN     21

readtable replaced '.' and 'NA' with NaN in the numeric variables, B, D, and E.

Replace Missing Value Indicators

Clean the data so that the missing values indicated by code -99 have the standard MATLAB® numeric missing value indicator, NaN.

T = standardizeMissing(T,-99)

T=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'     NaN    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    ''        563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'     NaN    'yes'     NaN      14

standardizeMissing replaces three instances of -99 with NaN.

Create a new table, T2, and replace missing values with values from previous rows of the table. fillmissing provides a number of ways to fill in missing values.

T2 = fillmissing(T,'previous')

T2=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'egh3'       3    'no'        7       7
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'abk6'     563    'yes'     563     563
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'wba3'      23    'yes'      23      14

Remove Rows with Missing Values

Create a new table, T3, that contains only the rows from T without missing values.

T3 = rmmissing(T)

T3=16×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'afe1'       3    'yes'       3       3
    'wth4'       3    'yes'       3       3
    'atn2'      23    'no'       23      23
    'arg1'       5    'yes'       5       5
    'jre3'    34.6    'yes'    34.6    34.6
    'wen9'     234    'yes'     234     234
    'ple2'       2    'no'        2       2
    'dbo8'       5    'no'        5       5
    'oii4'       5    'yes'       5       5
    'wnk3'     245    'yes'     245     245
    'pnj5'     463    'no'      463     463
    'wnn3'       6    'no'        6       6
    'oks9'      23    'yes'      23      23
    'pkn4'       2    'no'        2       2
    'adw3'      22    'no'       22      22
    'bas8'      23    'no'       23      23

T3 contains 16 rows and five variables.

Organize Data

Sort the rows of T3 in descending order by C, and then sort in ascending order by A.

T3 = sortrows(T2,{'C','A'},{'descend','ascend'})

T3=21×5 table
      A        B        C       D       E  
    ______    ____    _____    ____    ____

    'abk6'     563    'yes'     563     563
    'afe1'       3    'yes'       3       3
    'arg1'       5    'yes'       5       5
    'gry5'      23    'yes'      23      21
    'jre3'    34.6    'yes'    34.6    34.6
    'oii4'       5    'yes'       5       5
    'oks9'      23    'yes'      23      23
    'poj2'      22    'yes'      22      22
    'wba3'      23    'yes'      23      14
    'wen9'     234    'yes'     234     234
    'wnk3'     245    'yes'     245     245
    'wth4'       3    'yes'       3       3
    'adw3'      22    'no'       22      22
    'atn2'      23    'no'       23      23
    'bas8'      23    'no'       23      23
    'dbo8'       5    'no'        5       5

In C, the rows are grouped first by 'yes', followed by 'no'. Then in A, the rows are listed alphabetically.

Reorder the table so that A and C are next to each other.

T3 = T3(:,{'A','C','B','D','E'})

T3=21×5 table
      A         C       B       D       E  
    ______    _____    ____    ____    ____

    'abk6'    'yes'     563     563     563
    'afe1'    'yes'       3       3       3
    'arg1'    'yes'       5       5       5
    'gry5'    'yes'      23      23      21
    'jre3'    'yes'    34.6    34.6    34.6
    'oii4'    'yes'       5       5       5
    'oks9'    'yes'      23      23      23
    'poj2'    'yes'      22      22      22
    'wba3'    'yes'      23      23      14
    'wen9'    'yes'     234     234     234
    'wnk3'    'yes'     245     245     245
    'wth4'    'yes'       3       3       3
    'adw3'    'no'       22      22      22
    'atn2'    'no'       23      23      23
    'bas8'    'no'       23      23      23
    'dbo8'    'no'        5       5       5

Documentation