Bioinformatics Toolbox
Identify key features using sequential forward selection
[IDX, Z] = sqtlfeatures(X, Group, 'PropertyName', PropertyValue...)
sqtlfeatures(..., 'Criterion', C)
sqtlfeatures(..., 'CCWeighting', ALPHA)
sqtlfeatures(..., 'NWeighting', BETA)
sqtlfeatures(..., 'NumberOfIndices', N)
sqtlfeatures(..., 'CrossNorm', CN)
[IDX, Z] = sqtlfeatures(X, Group, 'PropertyName', PropertyValue...) performs sequential forward feature selection using an independent evaluation criterion for binary classification. X is a matrix in which every column is an observed vector and the number of rows equals the original number of features. Group contains the class labels.
IDX is the list of indices to the rows in X with the most significant features. Z is the absolute value of the criterion used (see below).
Group can be a numeric vector or a cell array of strings; numel(Group) is the same as the number of columns in X, and numel(unique(Group)) is equal to 2.
sqtlfeatures(..., 'Criterion', C) sets the criterion used to assess the significance of every feature for separating two labeled groups. Options are
| 'ttest' (default) | Absolute value two-sample T-test with pooled variance estimate |
| 'entropy' | Relative entropy, also known as Kullback-Leibler distance or divergence |
| 'brattacharyya' | Minimum attainable classification error or Chernoff bound |
| 'roc' | Area under the empirical receiver operating characteristic (ROC) curve |
| 'wilcoxon' | Absolute value of the u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney |
Notes: 1) 'ttest', 'entropy', and 'brattacharyya' assume normally distributed classes, while 'roc' and 'wilcoxon' are nonparametric tests. 2) All tests are feature independent; that is, each feature is scored on its own, without reference to the other features.
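As an illustration of what a feature-independent criterion looks like, the following is a minimal sketch of the 'ttest' criterion as described in the table above (absolute two-sample t statistic with pooled variance), computed row by row with base MATLAB functions only. The data values and the variable names A, B, sp2, and Zsorted are made up for the example; this is not the internal implementation of sqtlfeatures.

```matlab
% Toy data: 3 features (rows) by 6 observations (columns); made-up values.
X = [1 2 3 7 8 9;
     5 5 6 5 6 5;
     2 4 6 3 5 7];
Group = [1 1 1 2 2 2];

labels = unique(Group);            % the two class labels
A = X(:, Group == labels(1));      % observations of the first class
B = X(:, Group == labels(2));      % observations of the second class
nA = size(A, 2);
nB = size(B, 2);

% Pooled variance estimate, computed independently for every row.
sp2 = ((nA - 1) * var(A, 0, 2) + (nB - 1) * var(B, 0, 2)) / (nA + nB - 2);

% Absolute value of the two-sample t statistic for every feature.
Z = abs(mean(A, 2) - mean(B, 2)) ./ sqrt(sp2 * (1/nA + 1/nB));

% Sort features from most to least significant.
[Zsorted, IDX] = sort(Z, 'descend');
```

In this toy case the first row separates the two classes most clearly, the second row not at all, so the sorted index list places row 1 first and row 2 last.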
sqtlfeatures(..., 'CCWeighting', ALPHA) uses correlation information to outweigh the Z value of potential features using Z * (1-ALPHA*RHO), where RHO is the average of the absolute values of the cross-correlation coefficients between the candidate feature and all previously selected features. ALPHA sets the weighting factor. It is a scalar value between 0 and 1. When ALPHA is 0 (default), potential features are not weighted. A large value of RHO (close to 1) outweighs the significance statistic; this means that features that are highly correlated with the features already picked are less likely to be included in the output list.
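The effect of the correlation weighting can be seen with a small numeric sketch of the formula Z * (1-ALPHA*RHO). The Z and RHO values below are hypothetical and chosen only for illustration; in practice RHO would be computed from the data (for example, with corrcoef), and the names Zw, Zbest, and best are assumptions of this example.

```matlab
% Hypothetical toy values; none of them come from sqtlfeatures itself.
Z     = [5.0; 4.8; 3.0];      % criterion values of three candidate features
RHO   = [0.9; 0.1; 0.0];      % mean absolute correlation with selected features
ALPHA = 0.7;                  % weighting factor, between 0 and 1

Zw = Z .* (1 - ALPHA * RHO);  % weighted criterion

[Zbest, best] = max(Zw);      % candidate with the largest weighted value
```

Although the first candidate has the largest raw Z, its high correlation with the already selected features reduces its weighted value below that of the second candidate.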
sqtlfeatures(..., 'NWeighting', BETA) uses regional information to outweigh the Z value of potential features using Z * (1-exp(-(DIST/BETA).^2)) where DIST is the distance (in rows) between the candidate feature and previously selected features. BETA sets the weighting factor. It is greater than or equal to 0. When BETA is 0 (default) potential features are not weighted. A small DIST (close to 0) outweighs the significance statistics of only close features. This means that features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.
BETA can also be a function of the feature location, specified as a function handle (using @) or as an anonymous function. In both cases sqtlfeatures passes the row position of the feature to BETA() and expects back a value greater than or equal to 0.
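The regional weighting Z * (1-exp(-(DIST/BETA).^2)) can be sketched in the same way. The example below assumes a single previously selected feature (row 4) and equal criterion values, and it assumes that a function-valued BETA is evaluated at the row position of each candidate; how sqtlfeatures combines distances to several selected features is not stated here, so treat this only as an illustration of the formula. The names rows, betaFcn, Zw, and Zw2 are hypothetical.

```matlab
% Hypothetical toy setup: 10 features, feature 4 already selected.
Z    = ones(10, 1);               % equal criterion values, for illustration
rows = (1:10)';                   % row position of every feature
DIST = abs(rows - 4);             % distance (in rows) to the selected feature

% Scalar weighting factor.
BETA = 2;
Zw = Z .* (1 - exp(-(DIST / BETA).^2));

% BETA as a function of the feature location (anonymous function).
betaFcn = @(x) x/10 + 5;
Zw2 = Z .* (1 - exp(-(DIST ./ betaFcn(rows)).^2));
```

Candidates next to row 4 are weighted close to 0, and the weight approaches 1 as DIST grows. A larger BETA widens the region in which neighboring features are suppressed.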
Note: You can use CCWeighting and NWeighting together.
sqtlfeatures(..., 'NumberOfIndices', N) sets the number of output indices in IDX. Default is the same as the number of features when ALPHA and BETA are 0, or 20 otherwise.
sqtlfeatures(..., 'CrossNorm', CN) applies independent normalization across the observations for every feature. Cross-normalization ensures comparability among different features, although it is not always necessary because the selected criterion might already account for this. Options are
| 'none' (default) | Intensities are not cross-normalized. |
| 'meanvar' | x_new = (x - mean(x))/std(x) |
| 'softmax' | x_new = (1+exp((mean(x)-x)/std(x)))^-1 |
| 'minmax' | x_new = (x - min(x))/(max(x)-min(x)) |
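The three normalization formulas in the table can be written out per feature (per row of X) as follows. This is a sketch using base MATLAB functions and made-up data; the variable names mu, sd, lo, and hi are assumptions of the example, and std is used with its default normalization, which may differ from what sqtlfeatures does internally.

```matlab
% Toy data: 2 features (rows) by 4 observations (columns); made-up values.
X = [ 1  2  3  4;
     10 20 30 40];
n = size(X, 2);

mu = repmat(mean(X, 2), 1, n);        % per-feature mean
sd = repmat(std(X, 0, 2), 1, n);      % per-feature standard deviation
lo = repmat(min(X, [], 2), 1, n);     % per-feature minimum
hi = repmat(max(X, [], 2), 1, n);     % per-feature maximum

Xmeanvar = (X - mu) ./ sd;                    % 'meanvar'
Xsoftmax = 1 ./ (1 + exp((mu - X) ./ sd));    % 'softmax'
Xminmax  = (X - lo) ./ (hi - lo);             % 'minmax'
```

The second row is the first row scaled by 10, so after any of the three normalizations both rows are identical up to rounding, which is the comparability among features that cross-normalization is meant to provide.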
Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set. Load sample data.
load NCI60tmatrix
Get a logical index vector to the breast cancer cells.
BC = GROUP == 8;
Select features.
I = sqtlfeatures(X,BC,'NumberOfIndices',12);
Test features with a linear discriminant classifier.
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
Use cross-correlation weighting to further reduce the required number of genes.
I = sqtlfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.
load GaussianPulses
f = sqtlfeatures(y',grp,'NWeighting',@(x) x/10+5,'NumberOfIndices',5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr')
Statistics Toolbox function classify; Bioinformatics Toolbox functions classperf, crossvalind, randfeatures, svmclassify
© 1994-2005 The MathWorks, Inc.