Bioinformatics Toolbox
The data set is quite large, and much of it corresponds to genes that do not show any interesting changes during the experiment. To make it easier to find the interesting genes, reduce the size of the data set by removing genes whose expression profiles do not show anything of interest. There are 6400 expression profiles. You can use a number of techniques to reduce them to a subset that contains the most significant genes.
If you look through the gene list, you will see several spots marked as 'EMPTY'. These are empty spots on the array, and while they might have data associated with them, for the purposes of this example you can consider these points to be noise. You can find these points with the strcmp function and remove them from the data set with indexing commands.
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
numel(genes)
MATLAB displays
ans =
6314
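To see how the logical mask keeps the two variables aligned, here is a minimal sketch on a small made-up data set. The names toyGenes, toyValues, and isEmpty, and all the values, are hypothetical and are not part of the yeast data.

```matlab
% Hypothetical toy data: four spots, three time points each.
toyGenes  = {'YAL001C'; 'EMPTY'; 'YBR002W'; 'EMPTY'};
toyValues = [0.1 0.2 0.3; 9 9 9; -0.4 0.5 0.6; 8 8 8];

% strcmp returns a logical column vector, true where the name is 'EMPTY'.
isEmpty = strcmp('EMPTY', toyGenes);

% Deleting with the same mask in both variables keeps rows and names aligned.
toyValues(isEmpty,:) = [];
toyGenes(isEmpty)    = [];

numel(toyGenes)
```

Because the same mask is applied to both variables, row k of toyValues still belongs to toyGenes{k} after the deletion.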
In the yeastvalues data you will also see several places where the expression level is marked as NaN. This indicates that no data was collected for this spot at the particular time step. One approach to dealing with these missing values would be to impute them using the mean or median of data for the particular gene over time. This example uses a less rigorous approach of simply throwing away the data for any genes where one or more expression levels were not measured.
Use the isnan function to identify the genes with missing data, and then use indexing commands to remove those genes.
nanIndices = any(isnan(yeastvalues),2);
yeastvalues(nanIndices,:) = [];
genes(nanIndices) = [];
numel(genes)
MATLAB displays
ans =
6276
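For comparison, the imputation alternative mentioned above could look roughly like the following sketch, which replaces each missing value with the mean of the measured values for the same gene. It uses only core MATLAB functions and a hypothetical toy matrix; the variable names are illustrative, and this is not the approach used in the rest of this example.

```matlab
% Hypothetical toy matrix: rows are genes, columns are time points.
toyValues = [1 NaN 3; 4 5 6; NaN NaN 2];

missing = isnan(toyValues);

% Row means computed over the measured values only.
filled = toyValues;
filled(missing) = 0;
rowCounts = sum(~missing, 2);
rowMeans  = sum(filled, 2) ./ rowCounts;

% find lists the missing entries in column-major order, which is the same
% order that logical indexing uses on the left-hand side below.
[rowIdx, colIdx] = find(missing);
filled(missing) = rowMeans(rowIdx);
```

A gene with no measured values at all has a row count of zero, so its row mean is NaN and the row stays unfilled; such genes would still need to be removed.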
If you were to plot the expression profiles of all the remaining genes, you would see that most profiles are flat and not significantly different from the others. This flat data is obviously of use, as it indicates that the genes associated with these profiles are not significantly affected by the diauxic shift. However, in this example, you are interested in the genes with large changes in expression accompanying the diauxic shift. You can use filtering functions in the Bioinformatics Toolbox to remove genes with various types of profiles that do not provide useful information about genes affected by the metabolic change.
Use the function genevarfilter to filter out genes with small variance over time. The function returns a logical array of the same size as the variable genes with ones corresponding to rows of yeastvalues with variance greater than the 10th percentile and zeros corresponding to those below the threshold.
mask = genevarfilter(yeastvalues);
% Use the mask as an index into the values to remove the
% filtered genes.
yeastvalues = yeastvalues(mask,:);
genes = genes(mask);
numel(genes)
MATLAB displays
ans =
5648
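The idea behind a variance filter can be sketched with core MATLAB functions alone. The following is an approximation under stated assumptions, not the genevarfilter implementation: the toy data and variable names are hypothetical, and the percentile cutoff is taken as a simple position in the sorted variances, which may differ from how the toolbox computes its percentile.

```matlab
% Hypothetical toy data: five genes, four time points.
toyValues = [0 0 0 0; 1 2 3 4; 0 1 0 1; 5 5 5 6; -2 2 -2 2];

% Variance of each profile across time (along dimension 2).
profileVar = var(toyValues, 0, 2);

% Crude percentile cutoff: the value at the 10% position in sorted order.
% (Assumption: the toolbox may interpolate its percentile differently.)
sortedVar = sort(profileVar);
cutoffIdx = max(1, ceil(0.10 * numel(sortedVar)));
cutoff    = sortedVar(cutoffIdx);

% Keep only the profiles whose variance exceeds the cutoff.
toyMask = profileVar > cutoff;
toyKept = toyValues(toyMask,:);
```

In this toy case only the completely flat first profile falls at the cutoff and is removed; the other four are kept.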
The function genelowvalfilter removes genes that have very low absolute expression values. Note that the gene filter functions can also return the filtered data and gene names directly as additional outputs, so you do not have to apply the mask yourself.
[mask, yeastvalues, genes] = genelowvalfilter(yeastvalues,genes,...
'absval',log2(4));
numel(genes)
MATLAB displays
ans =
423
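A low-value filter of this kind can be pictured as keeping a profile only if at least one of its values is large in magnitude. The sketch below illustrates that idea with core MATLAB functions on hypothetical toy data; it is not the genelowvalfilter implementation, and whether a value exactly equal to the threshold is kept is an assumption made here.

```matlab
% Hypothetical toy data: four genes, three time points (log2 ratios).
toyValues = [0.5 -1.0 1.5; 2.5 0.1 -0.2; -3.0 1.0 0.0; 0.0 0.0 1.9];
absThreshold = log2(4);   % equals 2, the same threshold as in the call above

% Largest absolute expression value in each profile.
peakAbs = max(abs(toyValues), [], 2);

% Keep a profile if at least one value reaches the threshold.
% (Assumption: the boundary case is treated as "keep" in this sketch.)
toyMask = peakAbs >= absThreshold;
toyKept = toyValues(toyMask,:);
```

Here the second and third profiles are kept because each contains a value with magnitude of at least 2, while the first and fourth never reach it.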
Use the function geneentropyfilter to remove genes whose profiles have low entropy:
[mask, yeastvalues, genes] = geneentropyfilter(yeastvalues,genes,...
'prctile',15);
numel(genes)
MATLAB displays
ans =
310
© 1994-2005 The MathWorks, Inc.