Principal Component Analysis

Principal component analysis (PCA) is a technique for reducing the dimensionality of large data sets, such as those from microarray analysis. You can also use PCA to find signals in noisy data.

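  1. You can use the function princomp in the Statistics Toolbox to calculate the principal components of a data set. The outputs are the principal components (pc), the scores (zscores), and the variances of the principal components (pcvars).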

    [pc, zscores, pcvars] = princomp(yeastvalues)
    

    MATLAB displays

    pc =
    
      Columns 1 through 4 
    
       -0.0245   -0.3033   -0.1710   -0.2831
        0.0186   -0.5309   -0.3843   -0.5419
        0.0713   -0.1970    0.2493    0.4042
        0.2254   -0.2941    0.1667    0.1705
        0.2950   -0.6422    0.1415    0.3358
        0.6596    0.1788    0.5155   -0.5032
        0.6490    0.2377   -0.6689    0.2601
    
      Columns 5 through 7 
    
       -0.1155    0.4034    0.7887
       -0.2384   -0.2903   -0.3679
       -0.7452   -0.3657    0.2035
       -0.2385    0.7520   -0.4283
        0.5592   -0.2110    0.1032
       -0.0194   -0.0961    0.0667
       -0.0673   -0.0039    0.0521
    
  2. You can use the function cumsum to see the cumulative percentage of the total variance that the principal components account for.

    cumsum(pcvars./sum(pcvars) * 100)
    

    MATLAB displays

    ans =
       78.3719
       89.2140
       93.4357
       96.0831
       98.3283
       99.3203
      100.0000
    

    This shows that almost 90% of the variance is accounted for by the first two principal components.

  3. A scatter plot of the scores of the first two principal components shows two distinct regions. This is expected, because the filtering process removed many of the genes with low variance or low information. Those genes would have appeared in the middle of the scatter plot.

    figure
    scatter(zscores(:,1),zscores(:,2));
    xlabel('First Principal Component');
    ylabel('Second Principal Component');
    title('Principal Component Scatter Plot');
    

    MATLAB plots the figure.

  4. You can use the function gname from the Statistics Toolbox to identify genes on a scatter plot. Click as many points as you like; each selected point is labeled with the corresponding entry in genes.

    gname(genes);
    

    When you have finished selecting points, press Enter.

  5. An alternative way to create a scatter plot is with the function gscatter from the Statistics Toolbox. gscatter creates a grouped scatter plot where points from each group have a different color or marker. You can use clusterdata, or any other clustering function, to group the points.

    figure
    pcclusters = clusterdata(zscores(:,1:2),6);
    gscatter(zscores(:,1),zscores(:,2),pcclusters)
    xlabel('First Principal Component');
    ylabel('Second Principal Component');
    title('Principal Component Scatter Plot with Colored Clusters');
    gname(genes)  % Press Enter when you finish selecting genes.
    

    MATLAB plots the figure.

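To see roughly what the calculation in step 1 involves, you can reproduce the same kind of quantities with base MATLAB functions. The following is a minimal sketch, not the princomp implementation: it assumes that the data is centered by column and decomposed with svd. The matrix X is a small made-up data set standing in for yeastvalues, and the names coeffs, scores, and variances are chosen only for this illustration.

```matlab
% Sketch: PCA through the singular value decomposition.
% X is hypothetical example data (rows = observations, columns = variables),
% used here in place of yeastvalues.
t = (1:50)';
X = [t, 2*t + sin(t), cos(t), mod(t, 7)];

% Center each column so that every variable has zero mean.
Xc = X - repmat(mean(X, 1), size(X, 1), 1);

% Economy-size SVD: Xc = U*S*V'.
[U, S, V] = svd(Xc, 'econ');

coeffs    = V;                                 % principal component coefficients
scores    = U * S;                             % data expressed in the new axes
variances = diag(S).^2 / (size(X, 1) - 1);     % variance of each component

% Cumulative percentage of variance, computed as in step 2.
cumpct = cumsum(variances ./ sum(variances) * 100);
disp(cumpct)
```

The columns of coeffs are orthonormal, and the variances come out in decreasing order because svd sorts the singular values. The sign of each component is arbitrary, so results from this sketch and from princomp may differ by a factor of -1 in individual columns.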

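You can also use the cumulative percentages from step 2 to decide how many principal components to keep. The following sketch copies the values displayed in step 2 into a vector so that it runs on its own. The names cumpct, threshold, and ncomp, and the 95% target, are assumptions made for this illustration.

```matlab
% Cumulative percentages copied from the output displayed in step 2.
cumpct = [78.3719; 89.2140; 93.4357; 96.0831; 98.3283; 99.3203; 100.0000];

threshold = 95;                          % hypothetical target, in percent
ncomp = find(cumpct >= threshold, 1);    % smallest number of components reaching the target

fprintf('%d components account for %.2f%% of the variance\n', ncomp, cumpct(ncomp));
```

With a 95% target, this selects four components, which together account for 96.08% of the variance. In a session where pcvars is available, you could compute cumpct with the cumsum expression from step 2 instead of typing the values.
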
© 1994-2005 The MathWorks, Inc.