%% Working with Affymetrix Data
% This example demonstrates the functions in the Bioinformatics Toolbox
% for working with Affymetrix GeneChip data. 

%%
% Note that the *affyread* function is only supported on Microsoft Windows.

%% Affymetrix data files
% The function *affyread* can read four types of Affymetrix data files.
% These are DAT files, which contain raw image data, CEL files that contain
% information about the expression levels of the individual probes, CHP
% files that contain information about probe sets, and EXP files which
% contain information about experimental conditions and protocols.
% *affyread* can also read CDF and GIN library files. The CDF file contains
% information about which probes belong to which probe set and the GIN file
% contains information about the probe sets such as the gene name with
% which the probe set is associated. To learn more about the actual files,
% you can download sample data files from:
% http://www.affymetrix.com/support/technical/sample_data/demo_data.affx

if playbiodemo, return; end % Open in the editor or run step-by-step

%% Image files (DAT files)
% The raw image data from chip scanner is saved in the DAT file. If you
% use *affyread* to read a DAT file you will see that it creates a MATLAB
% structure.

datStruct = affyread('Drosophila-121502.dat')

%%
% Fields of the structure can be accessed using the dot notation.

datStruct.NumRows

%% Displaying an image file
% The *imagesc* command can be used to display the image.
datFigure = figure;
imagesc(datStruct.Image);

%%
% You can change the *colormap* from the default *jet* to another using the
% *colormap* command.

colormap pink

%%
% You can zoom in on a particular area with the mouse, or with the *axis*
% command. Notice that this stretches the y-axis.

axis([500 1500 150 350])

%% 
% You can force the axes to be equal using the *axis image* command.

axis image
axis([500 1500 150 350])


%% Probe results files (CEL files)
% The information about each probe on the chip is extracted from the image
% data by the Affymetrix image analysis software. The information is stored
% in the CEL file. *affyread* reads a CEL file into a structure. Notice
% that many of the fields are the same as those in the DAT structure.

celStruct = affyread('Drosophila-121502.cel')

%%
% The CEL file contains information about where each probe is on the chip
% and also the intensity values for the probe. You can use the *maimage*
% function to display the chip.

celFigure = figure;
maimage(celStruct)

%%
% Again, you can zoom in on a specific region.  

axis([30 200 0 25])

%%
% If you compare the image created from the CEL file and the image created
% from the DAT file, you will notice that the CEL image is lower
% resolution. This is because there is only one pixel per probe in this
% image, whereas the DAT file image has many pixels per probe.

%%
% The structures created by *affyread* can be very large. It is a good idea
% to clear them from memory once they are no longer needed.

clear datStruct
close(datFigure); close(celFigure);

%%
% The Probes field of the CEL structure contains information about the
% individual probes. There are eight values per probe. These are stored in
% the ProbeColumnNames field of the structure.

celStruct.ProbeColumnNames

%%
% So if look at one row of the Probes field of the CEL structure you will
% see eight values corresponding to the X position, Y position, intensity,
% and so forth.

celStruct.Probes(1,:)

%% Library files (CDF files)
% The probe set information in the CEL file by itself is not particularly
% useful as there is no indication in the file as to which probe set a
% probe belongs. This information is stored in the CDF library file
% associated with a chip type. The CDF files are typically stored in a
% central library directory. 

% libDir is the directory for the library files.
libDir = 'D:\Affymetrix\LibFiles\DrosGenome1';
cdfStruct = affyread('DrosGenome1.cdf',libDir)

%%
% Most of the information in the file is about the probe sets. In this
% example there are 14010 regular probe sets and 13 QC probe sets. The
% ProbeSets field of the structure is a 14023 array of structures.

cdfStruct.ProbeSets

%%
% A probe set record contains information about the name, type and number
% of probe pairs in the probe set.

cdfStruct.ProbeSets(100)

%%
% The information about where the probes for a probe set are on the chip is
% stored in the ProbePairs field. This is a matrix with one row for each
% probe pair and six columns. The information in the columns corresponds to
% the ProbeSetColumnNames of the CDF structure.

cdfStruct.ProbeSetColumnNames
cdfStruct.ProbeSets(100).ProbePairs(1,:)

%%
% The first column shows the probe set ID number, the second column, the probe
% pair number is 0. This is because Affymetrix uses zero based indexing for
% their probe numbering. The remaining columns give the X and Y coordinates
% of the PM and MM probes on the chip. You can use these coordinates to
% look up the values for a probe from the celStruct

PMX = cdfStruct.ProbeSets(100).ProbePairs(1,3);
PMY = cdfStruct.ProbeSets(100).ProbePairs(1,4);
theProbe = find((celStruct.Probes(:,1) == PMX) & ...
                       (celStruct.Probes(:,2) == PMY))
    
%%
% You can then extract all the information about this probe from the CEL
% structure.

celStruct.Probes(theProbe,:)

%%
% If you want to do this lookup for all probes, you can use the function
% *probelibraryinfo*. This creates a matrix with one row per probe and
% three columns. The first column is index of the probe set to which the
% probe belongs. The second column contains the probe pair index and the
% third column indicates if the probe is a perfect match (1) or mismatch
% (-1) probe. Notice that index of the probe pair index is 1 based.

probeinfo = probelibraryinfo(celStruct,cdfStruct);

probeinfo(theProbe,:)

%%
% The function *probesetvalues* does the reverse of this lookup and creates
% a matrix of information from the CEL and CDF structures containing all
% the information about a given probe set. This matrix has 18 columns
% corresponding to ProbeSetNumber, ProbePairNumber, UseProbePair,
% Background, PMPosX, PMPosY, PMIntensity, PMStdDev, PMPixels, PMOutlier,
% PMMasked, MMPosX, MMPosY, MMIntensity, MMStdDev, MMPixels, MMOutlier, and
% MMMasked.

psvals = probesetvalues(celStruct,cdfStruct,'142176_at')

%%
% You can extract the intensity values from the matrix and look at some of
% the statistics of the data.

pmIntensity = psvals(:,7);
mmIntensity = psvals(:,14);
boxplot([pmIntensity,mmIntensity],'labels',{'Perfect Match','Mismatch'})
title('Boxplot of raw intensity values for probe set 142176_at',...
    'interpreter','none')
% Use interpreter none to prevent the TeX interpreter treating the _ as
% subscript.

%%
% Or plot the values for the individual probes.

figure
plot(pmIntensity,'b'); hold on
plot(mmIntensity,'r'); hold off
title('Probe intensity values for probe set 142176_at',...
    'interpreter','none')

%% Gene names and probe set IDs
% The Affymetrix probe set IDs are not particularly descriptive. The
% mapping between the IDs and the gene names is stored in the GIN file.
% This is a text file so you can open it in an editor and browse through
% the file, or you can use *affyread* to read the information into a
% structure.

ginStruct = affyread('DrosGenome1.gin',libDir)

%%
% You can search through the structure for a particular probe set.
% Alternatively, you can use the function *probesetlookup* to find out
% information the gene name for a probe set. 

info = probesetlookup(cdfStruct,'142176_at')

%%
% You can use the Identifier field to add the gene name to the plot title.

newTitle = sprintf('Probe intensity values for probe set %s',...
                                                        info.Identifier);
title(newTitle)

%% Getting sequence information about a probe set.
% The output from *probesetlookup* includes a link to the source web site
% for the gene associated with a probe set.

web(info.SourceURL,'-browser')

%%
% The function *probesetlink* will link out to the NetAffx web site to show
% the actual sequences used for the probes. Note that you will need to be a
% registered user of NetAffx to access this information.

probesetlink(cdfStruct,'142176_at');

%% Results files (CHP files)
% The CHP file contains the results of the experiment. These include
% the average signal measures for each probe set as determined by the
% Affymetrix software and information about which probe sets are called as
% present, absent or marginal and the p-values for these calls.

chpStruct = affyread('Drosophila-121502.chp',libDir)

%%
% The ProbeSets field contains information about the probe sets. This
% includes some of the library information similar such as the ID and the
% type of probe set, and also results information such as the calculated
% signal value and the preset/absent/marginal call information. The call is
% given in the 'Detection' field of the ProbeSets structure. The
% '142176_at' probe set is called as being 'Present'.

chpStruct.ProbeSets(100)
            
%%
% However, the '142177_at' probe set is called 'Absent'.

chpStruct.ProbeSets(101)

%%
% You can calculate how many probe sets are called as being 'Present',

numPresent = sum(strcmp('Present',{chpStruct.ProbeSets.Detection}))

%%
% 'Absent',

numAbsent = sum(strcmp('Absent',{chpStruct.ProbeSets.Detection}))
%%
% and 'Marginal'.

numMarginal = sum(strcmp('Marginal',{chpStruct.ProbeSets.Detection}))

%%
% *maboxplot* will display a box plot of the log2 signal values for all
% probe sets.

maboxplot(chpStruct,'Signal','title',chpStruct.Name)

%%
% The CHP structure also contains information about the probe values for
% the probe set. These results are generally the same as the information
% created by the *probesetvalues* function though there are some
% differences. The major difference is that the intensity values in the
% probe pair values are normalized, and some of the placeholder columns for
% the *probesetinfo* results are populated. 

chpVals = chpStruct.ProbeSets(info.CDFIndex).ProbePairs(1,:)

%%
% Compare these with the psvals from the *probesetvalues*. Notice the
% normalization factor in the seventh and fourteenth columns. 

% Some of the psvals are zero so disable the divide by zero warning.

ws = warning('off','MATLAB:divideByZero');
chpVals./psvals(1,:)
warning(ws);

%% Plotting the probe set values.
% The function *probesetplot* creates a plot of the probes in a probe set
% from the CHP structure. The showstats option adds the mean, and lines for
% +/- one standard deviation for both the perfect match and the mismatch
% probes to the plot.

figure
probesetplot(chpStruct,'142176_at','showstats',true);

%%
% Affymetrix, GeneChip, and NetAffx are registered trademarks of
% Affymetrix, Inc.
%
% Copyright 2004 The MathWorks, Inc.
% $Revision: 1.1.10.1 $  $Date: 2004/12/24 20:38:10 $
