This example shows how to find, clean, and delete table rows with missing data.
Create a comma-separated text file, messy.csv
,
that contains the following data.
A,B,C,D,E afe1,3,yes,3,3 egh3,.,no,7,7 wth4,3,yes,3,3 atn2,23,no,23,23 arg1,5,yes,5,5 jre3,34.6,yes,34.6,34.6 wen9,234,yes,234,234 ple2,2,no,2,2 dbo8,5,no,5,5 oii4,5,yes,5,5 wnk3,245,yes,245,245 abk6,563,,563,563 pnj5,463,no,463,463 wnn3,6,no,6,6 oks9,23,yes,23,23 wba3,,yes,NaN,14 pkn4,2,no,2,2 adw3,22,no,22,22 poj2,-99,yes,-99,-99 bas8,23,no,23,23 gry5,NA,yes,NaN,21
There are many different missing data indicators in messy.csv
.
Empty character vector (''
)
period (.
)
NA
NaN
-99
Create a table from the comma-separated text file. To
specify character vectors to treat as empty values, use the 'TreatAsEmpty'
name-value
pair argument with the readtable
function.
T = readtable('messy.csv',... 'TreatAsEmpty',{'.','NA'})
T = A B C D E ______ ____ _____ ____ ____ 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' -99 'yes' -99 -99 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
T
is a table with 21 rows and five variables. 'TreatAsEmpty'
only
applies to numeric columns in the file and cannot handle numeric literals,
such as '-99'
.
View the data type, description, units, and other descriptive
statistics for each variable by creating a table summary using the summary
function.
summary(T)
Variables: A: 21×1 cell array of character vectors B: 21×1 double Values: min -99 median 14 max 563 NaNs 3 C: 21×1 cell array of character vectors D: 21×1 double Values: min -99 median 7 max 563 NaNs 2 E: 21×1 double Values: min -99 median 14 max 563
When you import data from a file, the default is for readtable
to
read any variables with nonnumeric elements as a cell array of character
vectors.
Display the subset of rows from the table, T
,
that have at least one missing value.
TF = ismissing(T,{'' '.' 'NA' NaN -99}); T(any(TF,2),:)
ans = A B C D E ______ ___ _____ ___ ___ 'egh3' NaN 'no' 7 7 'abk6' 563 '' 563 563 'wba3' NaN 'yes' NaN 14 'poj2' -99 'yes' -99 -99 'gry5' NaN 'yes' NaN 21
readtable
replaced '.'
and 'NA'
with NaN
in
the numeric variables, B
, D
,
and E
.
Clean the data so that the missing values indicated by code -99
have
the standard MATLAB® numeric missing value indicator, NaN
.
T = standardizeMissing(T,-99)
T = A B C D E ______ ____ _____ ____ ____ 'afe1' 3 'yes' 3 3 'egh3' NaN 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 '' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' NaN 'yes' NaN 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' NaN 'yes' NaN NaN 'bas8' 23 'no' 23 23 'gry5' NaN 'yes' NaN 21
standardizeMissing
replaces three instances
of -99
with NaN
.
Create a new table, T2
, and replace missing
values with values from previous rows of the table. fillmissing
provides
a number of ways to fill in missing values.
T2 = fillmissing(T,'previous')
T2 = A B C D E ______ ____ _____ ____ ____ 'afe1' 3 'yes' 3 3 'egh3' 3 'no' 7 7 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'abk6' 563 'yes' 563 563 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'wba3' 23 'yes' 23 14 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'poj2' 22 'yes' 22 22 'bas8' 23 'no' 23 23 'gry5' 23 'yes' 23 21
Create a new table, T3
, that contains only
the rows from T
without missing values.
T3 = rmmissing(T)
T3 = A B C D E ______ ____ _____ ____ ____ 'afe1' 3 'yes' 3 3 'wth4' 3 'yes' 3 3 'atn2' 23 'no' 23 23 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'wen9' 234 'yes' 234 234 'ple2' 2 'no' 2 2 'dbo8' 5 'no' 5 5 'oii4' 5 'yes' 5 5 'wnk3' 245 'yes' 245 245 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6 'oks9' 23 'yes' 23 23 'pkn4' 2 'no' 2 2 'adw3' 22 'no' 22 22 'bas8' 23 'no' 23 23
T3
contains 16 rows and 5 variables.
Sort the rows of T3
in descending order by C
,
and then sort in ascending order by A
.
T3 = sortrows(T2,{'C','A'},{'descend','ascend'})
T3 = A B C D E ______ ____ _____ ____ ____ 'afe1' 3 'yes' 3 3 'arg1' 5 'yes' 5 5 'jre3' 34.6 'yes' 34.6 34.6 'oii4' 5 'yes' 5 5 'oks9' 23 'yes' 23 23 'wen9' 234 'yes' 234 234 'wnk3' 245 'yes' 245 245 'wth4' 3 'yes' 3 3 'adw3' 22 'no' 22 22 'atn2' 23 'no' 23 23 'bas8' 23 'no' 23 23 'dbo8' 5 'no' 5 5 'pkn4' 2 'no' 2 2 'ple2' 2 'no' 2 2 'pnj5' 463 'no' 463 463 'wnn3' 6 'no' 6 6
In C
, the rows are grouped first by 'yes'
,
followed by 'no'
. Then in A
,
the rows are listed alphabetically.
Reorder the table so that A
and C
are
next to each other.
T3 = T3(:,{'A','C','B','D','E'})
T3 = A C B D E ______ _____ ____ ____ ____ 'afe1' 'yes' 3 3 3 'arg1' 'yes' 5 5 5 'jre3' 'yes' 34.6 34.6 34.6 'oii4' 'yes' 5 5 5 'oks9' 'yes' 23 23 23 'wen9' 'yes' 234 234 234 'wnk3' 'yes' 245 245 245 'wth4' 'yes' 3 3 3 'adw3' 'no' 22 22 22 'atn2' 'no' 23 23 23 'bas8' 'no' 23 23 23 'dbo8' 'no' 5 5 5 'pkn4' 'no' 2 2 2 'ple2' 'no' 2 2 2 'pnj5' 'no' 463 463 463 'wnn3' 'no' 6 6 6
fillmissing
| ismissing
| readtable
| rmmissing
| sortrows
| standardizeMissing
| summary