seqpdist

Calculate pairwise distance between sequences

Syntax

D = seqpdist(Seqs,'PropertyName', PropertyValue ...)

seqpdist(..., 'Method', MethodValue)
seqpdist(..., 'Indels', IndelsValue)
seqpdist(..., 'Optargs', OptargsValue)
seqpdist(..., 'PairwiseAlignment',PairwiseAlignmentValue)
seqpdist(..., 'Squareform', SquareformValue)
seqpdist(..., 'Alphabet', AlphabetValue)

seqpdist(..., 'ScoringMatrix', ScoringMatrixValue)
seqpdist(..., 'Scale', ScaleValue
seqpdist(..., 'GapOpen', GapOpenValue)
seqpdist(..., 'ExtendGap', ExtendGapValue)

Arguments

Seqs

Cell array with nucleotide or amino acid sequences.

MethodProperty to select the method for calculating pariwise distances.
IndelsProperty to indicate treatment of gaps.
OptargsProperty to pass required arguments by the distance method selected with the property Method
PairwiseAlignmentProperty to force pariwise alignment.
SquareForm

Property to control formatting the output as a square or triangular matrix.

Alphabet

Property to select an alphabet. Enter either 'NT' for nucleotides or 'AA' for amino acids.

ScoringMatrix

Property to select a scoring matrix for pariwise alignment.

Scale

Property to select a scale factor for the scoring matrix.

GapOpen

Property to select a gap penalty.

ExtendedGap

Property to select a penalty for extending a gap.

Description

D = seqpdist(Seqs, 'PropertyName', PropertyValue ...) returns a vector D containing biological distances between each pair of sequences stored in the M elements of the cell Seqs.

D is an (M*(M-1)/2)-by-1 vector corresponding to the M*(M-1)/2 pairs of sequences in Seqs. The output D is arranged in the order ((2,1),(3,1),..., (M,1),(3,2),...(M,2),.....(M,M-1)). This is the lower left triangle of the full M-by-M distance matrix. To get the distance between the Ith and the Jth sequences for I > J, use the formula D((J-1)*(M-J/2)+I-J). Seqs can also be a vector of structures with the field Sequence or a matrix of chars.

seqpdist(..., 'Method', MethodValue) selects a method (MethodValue) to compute distances between every pair of sequences.

Distances defined for both nucleotides and amino acids:

'p-distance'

Proportion of sites at which the two sequences are different. p —> 1 for poorly related and p —> 0 for similar sequences.

'Jukes-Cantor' (default)

Maximum likelihood estimate of the number of substitutions between two sequences. For

NT d = -3/4 log(1–p * 4/3)

AA d = -19/20 log(1–p * 20/19)

'alignment-score'

Distance (d) between two sequences (1 and 2) is computed from the pairwise alignment score (s) as follows:

d(1,2) = (1-s(1,2)/s(1,1)) 
          * (1-s(1,2)/s(2,2))
    

This option does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where s(x,y)>s(x,x), then d(x,y)=0.

Distances defined only for nucleotides and no scoring of gaps:

'Tajima-Nei'

Maximum likelihood estimate considering the background nucleotide frequencies. It can be computed from the input sequences or given by setting 'OPTARGS' to [gA gC gG gT].

'Kimura'

Considers separately the transitional and transversion nucleotide substitution.

'Tamura'

Considers separately the transitional and transversion nucleotide substitution and the GC content. GC content can be computed from the input sequences or given by setting 'OPTARGS'.

'Hasegawa'

Considers separately the transitional and transversional nucleotide substitution and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting 'OPTARGS' to [gA gC gG gT].

'Nei-Tamura'

Considers separately the transitional substitution between purines, the transitional substitution between pyramidines and the transversional substitution and the background nucleotide frequencies. Background frequencies can be computed from the input sequences or given by setting 'OPTARGS' to [gA gC gG gT].

Distances defined only for amino acids and no scoring of gaps:

'Poisson'

Asumes that the number of amino acid substitutions at each site has a Poisson distribution.

'Gamma'

Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter 'a'. 'a' can be set by 'OPTARGS'. The default value is 2.

A user defined distance function can also be specified using @, for example, @distfun, the distance function must be of the form:

function D = distfun(S1, S2, OPTARGS)

Taking as arguments two same-length sequences (NT or AA) plus zero or more additional problem-dependent arguments in OPTARGS, and returning a scalar that represents the distance between S1 and S2.

seqpdist(..., 'Indels', IndelsValue) indicates how to treat sites with gaps. Options are

seqpdist(..., 'Optargs', OptargsValue) some distance methods require or accept optional arguments. Use a cell array to pass more than one input argument (for example, The nucleotide frequencies in the Tajima-Nei distance function can be specified instead of computing them from the input sequences).

seqpdist(..., 'PairwiseAlignment', PairwiseAlignmentValue), when PairwiseAlignment is true, ignores multiple alignment of the input sequences (if any) and forces a pairwise alignment of input sequences. If the input sequences are not prealigned, this flag is set automatically. Pairwise alignment can be slow for a large number of sequences. The default value is false.

seqpdist(..., 'Squareform', SquareformValue), when SquareForm is true, converts the output into a square formatted matrix so the D(I,J) denotes the distance between the Ith and Jth sequences. The output matrix is symmetric and has a zero diagonal. Setting the property Squareform to true is the same as using the function squareform in the Statistical Toolbox.

seqpdist(..., 'Alphabet', AlphabetValue) specifies whether the sequences are amino acids ('AA') or nucleotides ('NT'). The default value is 'AA'.

The remaining input properties are analogous to the function nwalign and are used when the property PairwiseAlignment = true or the property Method = 'alignment-score'. For more information about these properties, see nwalign.

seqpdist(..., 'ScoringMatrix', ScoringMatrixValue) specifies the scoring matrix to be used for the alignment. The default value is BLOSUM50 for AA and NUC44 for NT.

seqpdist(..., 'Scale', ScaleValue) indicates the scale factor of the scoring matrix to return the score using arbitrary units. If the scoring matrix info also provides a scale factor, then both are used.

seqpdist(..., GapOpen', GapOpenValue) specifies the penalty for opening a gap in the alignment. The default gap open penalty is 8.

seqpdist(..., 'ExtendGap', ExtendGapValue) specifies the penalty for extending a gap in the alignment. If ExtendGap is not specified, then extensions to gaps are scored with the same value as GapOpen.

Examples

 % Load a multiple alignment of amino acids:
 seqs = fastaread('pf00002.fa');
 
 % For every possible pair of sequences in the multiple 
 % alignment removes sites with gaps and scores with the 
 % substitution matrix PAM250:
 
 dist = seqpdist(seqs,'method','alignment-score',...
                 'indels','pairwise-delete',...
                 'scoringmatrix','pam250')
 
 % To force the realignment of every pair of sequences 
 % ignoring the provided multiple alignment:
 
 dist = seqpdist(seqs,'method','alignment-score',...
                 'indels','pairwise-delete',...
                 'scoringmatrix','pam250',...
                 'pairwisealignment',true)
 
 % To measure the 'Jukes-Cantor' pairwise distances after 
 % realigning every pair of sequences, counting the gaps as 
 % point mutations:
 
 dist = seqpdist(seqs,'method','jukes-cantor',...
                               'indels','score',...
                               'scoringmatrix','pam250',...
                               'pairwisealignment',true)

See Also

Bioinformatics Toolbox functions fastaread, seqlinkage

phytree methods phytree (phytree), pdist (phytree)

Statistical Toolbox functions pdist, squareform


© 1994-2005 The MathWorks, Inc.