Bioinformatics Toolbox
Impute missing data using the nearest-neighbor method
knnimpute(Data)
knnimpute(Data, k)
knnimpute(..., 'distance', distfun)
knnimpute(..., 'distargs', args)
knnimpute(..., 'weights', w)
knnimpute(..., 'median', true)
knnimpute(Data) replaces NaNs in Data with the corresponding value from the nearest-neighbor column. The nearest-neighbor column is the closest column in Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used.
knnimpute(Data, k) replaces NaNs in Data with a weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns.
knnimpute(..., 'distance', distfun) computes nearest-neighbor columns using the distance metric distfun. The choices for distfun are:
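The inverse-distance weighting described above can be illustrated with a small hand calculation. This is only a sketch in base MATLAB, not the toolbox implementation: the neighbor values and distances are made-up numbers, and the variable names (vals, dists, w, imputed) are hypothetical.

```matlab
% Hypothetical values taken from the k = 3 nearest-neighbor columns
% for one missing entry, and the distances to those columns.
vals  = [2 4 9];
dists = [1 2 4];

% Weights inversely proportional to distance, normalized to sum to 1.
% (Assumes no distance is zero; a real implementation must handle that case.)
w = 1 ./ dists;
w = w / sum(w);

% Weighted mean used as the imputed value.
imputed = sum(w .* vals);
disp(imputed)
```

The closest neighbor (distance 1) contributes the most, so the result is pulled toward 2 rather than toward the plain average of the three values.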
| 'euclidean' | Euclidean distance — the default |
| 'seuclidean' | Standardized Euclidean distance — each coordinate in the sum of squares is inversely weighted by the sample variance of that coordinate. |
| 'cityblock' | City block distance |
| 'mahalanobis' | Mahalanobis distance |
| 'minkowski' | Minkowski distance with exponent 2 |
| 'cosine' | One minus the cosine of the included angle |
| 'correlation' | One minus the sample correlation between observations, treated as sequences of values |
| 'hamming' | Hamming distance — the percentage of coordinates that differ |
| 'jaccard' | One minus the Jaccard coefficient — the percentage of nonzero coordinates that differ |
| 'chebychev' | Chebychev distance (maximum coordinate difference) |
| function handle | A handle to a distance function, specified using @, for example @distfun |
See pdist for more details.
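To make the difference between metrics concrete, the following sketch computes three of the listed distances between two short vectors using only base MATLAB. The vectors x and y are arbitrary illustration data, and the formulas are the standard textbook definitions rather than code taken from knnimpute or pdist.

```matlab
% Two example observations (arbitrary illustration data).
x = [1 4 7];
y = [5 7 0];

diffs = abs(x - y);

dEuclidean = sqrt(sum(diffs.^2));   % 'euclidean': root of summed squares
dCityblock = sum(diffs);            % 'cityblock': sum of absolute differences
dChebychev = max(diffs);            % 'chebychev': maximum coordinate difference

fprintf('euclidean = %.4f, cityblock = %d, chebychev = %d\n', ...
        dEuclidean, dCityblock, dChebychev);
```

Because the metrics rank neighbors differently when one coordinate dominates, the choice of distfun can change which columns are treated as nearest. Following the syntax above, a call such as knnimpute(Data, 3, 'distance', 'cityblock') would select the city block metric.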
knnimpute(..., 'distargs', args) passes the arguments args to the function distfun. args can be a single value or a cell array of values.
knnimpute(..., 'weights', w) enables you to specify the weights used in the weighted mean calculation. w should be a vector of length k.
knnimpute(..., 'median', true) uses the median of the k nearest neighbors instead of the weighted mean.
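The effect of the 'weights' and 'median' options can be sketched by hand in base MATLAB. The neighbor values and the weight vector below are hypothetical, and normalizing by the sum of the weights is an assumption about how a weighted mean is formed, not a statement about the toolbox internals.

```matlab
% Hypothetical values from the k = 3 nearest-neighbor columns;
% the third neighbor is an outlier.
vals = [2 4 90];

% User-supplied weights, one per neighbor (vector of length k).
w = [0.5 0.3 0.2];

% Weighted mean (assumed to be normalized by the sum of the weights).
weightedMean = sum(w .* vals) / sum(w);

% Median of the same neighbors, as selected by the 'median' option.
medianValue = median(vals);

fprintf('weighted mean = %.1f, median = %d\n', weightedMean, medianValue);
```

The median ignores the outlying neighbor, while the weighted mean is still strongly affected by it even though it carries the smallest weight.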
A = [1 2 5;4 5 7;NaN -1 8;7 6 0]
A =
1 2 5
4 5 7
NaN -1 8
7 6 0
Note that A(3,1) = NaN. Because column 2 is the closest column to column 1 in Euclidean distance, knnimpute replaces A(3,1) with the corresponding entry of column 2, which is -1.
knnimpute(A)
ans =
1 2 5
4 5 7
-1 -1 8
7 6 0
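The result above can be checked with a minimal sketch of single-nearest-neighbor column imputation written in base MATLAB. This is not the toolbox code. It assumes that distances between columns are computed only from rows that contain no NaNs, and the variable names (completeRows, B, imputed) are hypothetical.

```matlab
A = [1 2 5; 4 5 7; NaN -1 8; 7 6 0];

% Assumption: column distances use only rows without missing values.
completeRows = ~any(isnan(A), 2);
B = A(completeRows, :);

imputed = A;
[rows, cols] = find(isnan(A));            % positions of missing entries
for n = 1:numel(rows)
    r = rows(n);
    c = cols(n);
    % Euclidean distance from column c to every column.
    d = sqrt(sum(bsxfun(@minus, B, B(:, c)).^2, 1));
    d(c) = Inf;                           % exclude the column itself
    d(isnan(A(r, :))) = Inf;              % skip neighbors that are NaN in row r
    [~, nearest] = min(d);
    imputed(r, c) = A(r, nearest);        % copy the neighbor's value
end
disp(imputed)
```

For this matrix the distances from column 1 are sqrt(3) to column 2 and sqrt(74) to column 3, so column 2 is selected and the missing entry becomes -1, matching the output of knnimpute(A).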
The following example loads the data set yeastdata and imputes missing values in the array yeastvalues.
load yeastdata
% Remove data for empty spots
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
% Impute missing values
imputedValues = knnimpute(yeastvalues);
See also: isnan, knnclassify, nanmean, nanmedian, pdist
© 1994-2005 The MathWorks, Inc.