Bioinformatics Toolbox
Classify data using the nearest-neighbor method
Class = knnclassify(Sample, Training, Group)
Class = knnclassify(Sample, Training, Group, k)
Class = knnclassify(Sample, Training, Group, k, distance)
Class = knnclassify(Sample, Training, Group, k, distance, rule)
Class = knnclassify(Sample, Training, Group) classifies the rows of the data matrix Sample into groups, based on the grouping of the rows of Training. Sample and Training must be matrices with the same number of columns. Group is a vector whose distinct values define the grouping of the rows in Training: each row of Training belongs to the group whose value is the corresponding entry of Group. Training and Group must have the same number of rows. Group can be a numeric vector, a string array, or a cell array of strings. knnclassify treats NaN values or empty strings in Group as missing values and ignores the corresponding rows of Training. knnclassify assigns each row of Sample to the group of the closest row of Training. Class indicates the group to which each row of Sample has been assigned, and is of the same type as Group.
Class = knnclassify(Sample, Training, Group, k) enables you to specify k, the number of nearest neighbors used in the classification. The default is 1.
Class = knnclassify(Sample, Training, Group, k, distance) enables you to specify the distance metric. The choices for distance are
| 'euclidean' | Euclidean distance — the default |
| 'cityblock' | Sum of absolute differences |
| 'cosine' | One minus the cosine of the included angle between points (treated as vectors) |
| 'correlation' | One minus the sample correlation between points (treated as sequences of values) |
| 'hamming' | Percentage of bits that differ (only suitable for binary data) |
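The descriptions in the table can be made concrete with a few lines of base MATLAB. This is only a sketch of the formulas as the table words them, not the code knnclassify runs internally. The vectors x, y, a, and b are made-up illustration data, and the Hamming value is expressed here as a fraction between 0 and 1, which is an assumption about how "percentage" is scaled.

```matlab
% Two hypothetical points, treated as row vectors (illustration data only).
x = [1 0 2];
y = [0 1 2];

dEuclidean = sqrt(sum((x - y).^2));             % Euclidean distance
dCityblock = sum(abs(x - y));                   % sum of absolute differences
dCosine    = 1 - (x*y') / (norm(x)*norm(y));    % one minus cosine of the included angle

% One minus the sample correlation: center each vector, then reuse the
% cosine formula on the centered values.
xc = x - mean(x);
yc = y - mean(y);
dCorrelation = 1 - (xc*yc') / (norm(xc)*norm(yc));

% Hamming distance on hypothetical binary vectors: share of positions
% that differ (assumed here to be a fraction in [0,1]).
a = [1 0 1 1];
b = [1 1 0 1];
dHamming = mean(a ~= b);
```

For these inputs the cityblock distance is 2, the cosine distance is 0.2, and both the correlation and Hamming distances are 0.5.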
Class = knnclassify(Sample, Training, Group, k, distance, rule) enables you to specify the rule used to decide how to classify the sample. The choices for rule are
| 'nearest' | Majority rule with nearest point tie-break — the default |
| 'random' | Majority rule with random point tie-break |
| 'consensus' | Consensus rule |
The default behavior is majority rule with nearest tie-break: a sample point is assigned to the class that the majority of its k nearest neighbors belong to. Use 'consensus' to require that all k nearest neighbors agree. With the 'consensus' option, a point whose k nearest neighbors are not all from the same class is not assigned to any class; the output Class for such a point is NaN for numeric groups or '' for groups named by strings. When classifying into more than two groups, or when k is even, the vote among the nearest neighbors can end in a tie. 'random' breaks the tie at random, and 'nearest' breaks it in favor of the tied group that contains the nearest neighbor.
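The three rules can be illustrated on a single sample point whose neighbor labels are already known. The sketch below uses base MATLAB only and is an interpretation of the descriptions above, not the toolbox implementation. The vector neighborLabels is hypothetical, and it is assumed to be ordered from nearest to farthest neighbor.

```matlab
% Hypothetical labels of the k = 4 nearest neighbors, nearest first.
neighborLabels = [2 1 1 2];

% Count the votes for each distinct label.
[labels, ~, idx] = unique(neighborLabels);   % labels = [1 2]
votes = accumarray(idx(:), 1)';              % votes(i) belongs to labels(i)

% 'consensus': assign a class only when every neighbor agrees.
if numel(labels) == 1
    classConsensus = labels;
else
    classConsensus = NaN;                    % point is left unassigned
end

% 'nearest': majority rule; on a tie, take the tied label that occurs
% first in the nearest-first ordering.
tied = labels(votes == max(votes));
firstTied = find(ismember(neighborLabels, tied), 1);
classNearest = neighborLabels(firstTied);

% 'random': majority rule; on a tie, pick one of the tied labels at random.
classRandom = tied(randi(numel(tied)));
```

Here the vote is tied two to two, so the consensus result is NaN, the nearest tie-break returns 2 (the label of the closest neighbor), and the random tie-break returns either 1 or 2.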
The following example classifies the rows of the matrix sample:
sample = [.9 .8;.1 .3;.2 .6]
sample =
0.9000 0.8000
0.1000 0.3000
0.2000 0.6000
training=[0 0;.5 .5;1 1]
training =
0 0
0.5000 0.5000
1.0000 1.0000
group = [1;2;3]
group =
1
2
3
class = knnclassify(sample, training, group)
class =
3
1
2
Row 1 of sample is closest to row 3 of training, so class(1) = 3. Row 2 of sample is closest to row 1 of training, so class(2) = 1. Row 3 of sample is closest to row 2 of training, so class(3) = 2.
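The reasoning in this example can be checked without the toolbox by computing the Euclidean distances directly. The following is a minimal base-MATLAB sketch of a one-nearest-neighbor lookup under the default metric; the names D, nearest, and classCheck are introduced here for illustration and are not part of knnclassify.

```matlab
sample   = [.9 .8; .1 .3; .2 .6];
training = [0 0; .5 .5; 1 1];
group    = [1; 2; 3];

nSample = size(sample, 1);
nTrain  = size(training, 1);

% D(i,j) is the Euclidean distance from sample row i to training row j.
D = zeros(nSample, nTrain);
for i = 1:nSample
    for j = 1:nTrain
        D(i,j) = norm(sample(i,:) - training(j,:));
    end
end

% For each sample row, find the closest training row and take its group.
[~, nearest] = min(D, [], 2);
classCheck = group(nearest);
```

classCheck comes out as [3; 1; 2], matching the output of knnclassify shown above.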
The following example classifies each row of the data in sample into one of the two groups in training. The following commands create the matrix training and the grouping variable group, and plot the rows of training in two groups.
training = [mvnrnd([ 1 1], eye(2), 100); ...
mvnrnd([-1 -1], 2*eye(2), 100)];
group = [repmat(1,100,1); repmat(2,100,1)];
gscatter(training(:,1),training(:,2),group,'rb','+x');
legend('Training group 1', 'Training group 2');
hold on;

The following commands create the matrix sample, classify its rows into two groups, and plot the result.
sample = unifrnd(-5, 5, 100, 2);
% Classify the sample using the nearest neighbor classification
c = knnclassify(sample, training, group);
gscatter(sample(:,1),sample(:,2),c,'mc'); hold on;
legend('Training group 1','Training group 2', ...
'Data in group 1','Data in group 2');
hold off;

The following example uses the same data as in Example 2, but classifies the rows of sample using three nearest neighbors instead of one.
gscatter(training(:,1),training(:,2),group,'rb','+x');
hold on;
c3 = knnclassify(sample, training, group, 3);
gscatter(sample(:,1),sample(:,2),c3,'mc','o');
legend('Training group 1','Training group 2','Data in group 1','Data in group 2');

If you compare this plot with the one in Example 2, you see that some of the data points are classified differently using three nearest neighbors.
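The difference between one and three neighbors can also be counted rather than read off a plot. The sketch below is self-contained and uses a small hypothetical data set instead of the random data above; it implements a plain Euclidean k-nearest-neighbor vote in base MATLAB as a stand-in for knnclassify, so it illustrates the comparison rather than reproducing the toolbox function.

```matlab
% Hypothetical training data: group 1 near the origin, group 2 farther out,
% with one group 2 point (0.9, 0.9) sitting close to group 1.
training = [0 0; 0 1; 1 0; 3 3; 3 4; 0.9 0.9];
group    = [1; 1; 1; 2; 2; 2];
sample   = [1 1; 3 3.5; 0.2 0.2];

nSample = size(sample, 1);
c1 = zeros(nSample, 1);   % labels using 1 nearest neighbor
c3 = zeros(nSample, 1);   % labels using 3 nearest neighbors

for i = 1:nSample
    % Euclidean distance from sample row i to every training row.
    diffs = training - repmat(sample(i,:), size(training, 1), 1);
    d = sqrt(sum(diffs.^2, 2));
    [~, order] = sort(d);

    c1(i) = group(order(1));
    % Majority vote; with two groups and k = 3 a tie cannot occur.
    c3(i) = mode(group(order(1:3)));
end

changed = find(c1 ~= c3);   % rows of sample whose label depends on k
```

With this data only the first sample row changes: its single nearest neighbor belongs to group 2, but two of its three nearest neighbors belong to group 1. Applied to the outputs c and c3 of the examples above, the same comparison c ~= c3 would list the reclassified rows.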
Statistics Toolbox function classify; Bioinformatics Toolbox function knnimpute
[1] Mitchell, Tom, Machine Learning, McGraw-Hill, 1997.
© 1994-2005 The MathWorks, Inc.