knnimpute

Impute missing data using the nearest-neighbor method

Syntax

knnimpute(Data)
knnimpute(Data, k)
knnimpute(..., 'distance', distfun)
knnimpute(..., 'distargs', args)
knnimpute(...,'weights',w)
knnimpute(...,'median',true)

Description

knnimpute(Data) replaces NaNs in Data with the corresponding value from the nearest-neighbor column. The nearest-neighbor column is the closest column in Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used.

knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
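The inverse-distance weighting described above can be sketched by hand for a single missing entry. The following is a minimal illustration, not the toolbox implementation. It uses the matrix from Example 1 and assumes that distances are computed only over rows that contain no NaNs; the variable names (completeRows, X, d, w, imputed) are chosen for this sketch.

```matlab
% Matrix from Example 1; A(3,1) is missing.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Assumption for this sketch: use only rows without any NaN
% when measuring the distance between columns.
completeRows = ~any(isnan(A), 2);
X = A(completeRows, :);

% Euclidean distance from column 1 to columns 2 and 3.
d = sqrt(sum((X(:, [2 3]) - repmat(X(:, 1), 1, 2)).^2, 1));

% Weights inversely proportional to distance, normalized to sum to 1.
w = (1 ./ d) / sum(1 ./ d);

% Weighted mean of the two neighbors' values in row 3.
% Under these assumptions, imputed is approximately 0.5084.
imputed = sum(w .* A(3, [2 3]));
```

Because column 2 is much closer to column 1 than column 3 is, its value (-1) dominates the result even though column 3 contributes 8.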

knnimpute(..., 'distance', distfun) computes nearest-neighbor columns using the distance metric distfun. The choices for distfun are

'euclidean'      Euclidean distance (the default)
'seuclidean'     Standardized Euclidean distance: each coordinate in the sum of squares is inversely weighted by the sample variance of that coordinate
'cityblock'      City block distance
'mahalanobis'    Mahalanobis distance
'minkowski'      Minkowski distance with exponent 2
'cosine'         One minus the cosine of the included angle
'correlation'    One minus the sample correlation between observations, treated as sequences of values
'hamming'        Hamming distance: the percentage of coordinates that differ
'jaccard'        One minus the Jaccard coefficient: the percentage of nonzero coordinates that differ
'chebychev'      Chebychev distance (maximum coordinate difference)
function handle  A handle to a distance function, specified using @, for example, @distfun

See pdist for more details.
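As a brief usage sketch, the 'distance' option can be combined with an explicit k. The calls below follow the syntax listed on this page and reuse the matrix from Example 1; the output names B1 and B2 are arbitrary, and no output is shown because the imputed values depend on the metric.

```matlab
% Matrix from Example 1; A(3,1) is missing.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Single nearest neighbor, chosen by city block distance.
B1 = knnimpute(A, 1, 'distance', 'cityblock');

% Weighted mean of the two nearest columns, chosen by Chebychev distance.
B2 = knnimpute(A, 2, 'distance', 'chebychev');
```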

knnimpute(..., 'distargs', args) passes the arguments args to the function distfun. args can be a single value or a cell array of values.

knnimpute(...,'weights',w) enables you to specify the weights used in the weighted mean calculation. w should be a vector of length k.

knnimpute(...,'median',true) uses the median of the k nearest neighbors instead of the weighted mean.
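The 'weights' and 'median' options might be used as follows. This is a sketch based only on the syntax above, reusing the matrix from Example 1. The weight values are arbitrary, and the comment about their ordering (nearest neighbor first) is an assumption, not something this page states.

```matlab
% Matrix from Example 1; A(3,1) is missing.
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% User-supplied weights: a vector of length k (here k = 2).
% Assumption: the first weight applies to the nearest column.
Bw = knnimpute(A, 2, 'weights', [0.75 0.25]);

% Median of the k nearest neighbors instead of the weighted mean.
Bm = knnimpute(A, 2, 'median', true);
```

The median can be preferable when one neighboring column holds an outlying value, since it does not pull the imputed entry toward that value the way a mean does.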

Examples

Example 1

A = [1 2 5;4 5 7;NaN -1 8;7 6 0]

A =

     1     2     5
     4     5     7
   NaN    -1     8
     7     6     0

Note that A(3,1) = NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute replaces A(3,1) with the corresponding entry of column 2, A(3,2) = -1.

knnimpute(A)

ans =

     1     2     5
     4     5     7
    -1    -1     8
     7     6     0

Example 2

The following example loads the data set yeastdata and imputes missing values in the array yeastvalues.

load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
% Impute missing values
imputedValues = knnimpute(yeastvalues);

See Also

isnan, knnclassify, nanmean, nanmedian, pdist

References

[1] Speed, T., Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall, 2003.


© 1994-2005 The MathWorks, Inc.