4.6 Statistical Significance: When Is a Match Meaningful?
4.6 Statistical Significance: When Is a Match Meaningful?
Section titled “4.6 Statistical Significance: When Is a Match Meaningful?”Learning Objectives
Section titled “Learning Objectives”After reading this section, you should be able to:
- explain why raw alignment scores are insufficient for interpreting search results
- describe the role of a null model in assessing similarity by chance
- understand why extreme value statistics arise in database search
- interpret p-values and E-values as measures of statistical significance
- explain how sequence length and database size affect the significance of a match
At this point, we have developed a powerful mechanism for identifying similar sequences in large databases. Algorithms such as BLAST return alignments together with associated scores, and these scores allow us to rank candidate matches. Yet a fundamental question remains:
How do we know whether a high-scoring alignment is biologically meaningful, or simply a result of chance?
This question becomes unavoidable in large-scale search. When millions of sequences are compared, even random sequences will occasionally produce high alignment scores. Without a statistical framework, it is impossible to distinguish signal from noise.
Why Raw Scores Are Not Enough
Section titled “Why Raw Scores Are Not Enough”Recall that an alignment score is computed as the sum of substitution scores and gap penalties:
While this score reflects the quality of an alignment under a chosen scoring system, it lacks a crucial reference point. A score of, say, may be highly significant in one context and entirely unremarkable in another.
Two key factors influence this:
-
Sequence length Longer sequences have more opportunities for random matches.
-
Database size The more sequences we search, the higher the chance of observing a high score by coincidence.
Thus, a raw score cannot be interpreted in isolation. It must be evaluated relative to what we would expect under a null model of random sequences.
The Null Model: Random Alignments
Section titled “The Null Model: Random Alignments”To assess significance, we imagine a scenario in which the query sequence is compared against sequences that are unrelated, that is, generated at random under some background distribution.
Under this null hypothesis, any observed alignment is purely accidental. The question then becomes:
How likely is it to observe an alignment score at least as high as the one we obtained, purely by chance?
That probability provides a natural measure of significance.
Extreme Values and Alignment Scores
Section titled “Extreme Values and Alignment Scores”An important insight from statistical theory is that alignment scores behave in a characteristic way under the null model.
While individual comparisons may resemble familiar distributions, the maximum score over many comparisons follows an extreme value distribution, often approximated by a Gumbel distribution.
Intuitively, this arises because similarity search does not consider just one alignment, but many possible alignments across many sequences. We are effectively asking about the maximum of many random variables, and maxima behave differently from individual observations.
This leads to a characteristic distribution of high scores in which extreme values occur more frequently than one might expect under a normal distribution.
P-Values: Probability of Random Occurrence
Section titled “P-Values: Probability of Random Occurrence”One way to quantify significance is through the p-value.
Formally, the p-value is defined as:
that is, the probability that a random alignment achieves a score at least as high as the observed score .
A small p-value indicates that such a score is unlikely to occur by chance, suggesting that the alignment may reflect a real biological relationship.
However, while p-values are conceptually clear, they are not always the most practical measure in large-scale database searches.
E-Values: Expected Number of Random Hits
Section titled “E-Values: Expected Number of Random Hits”BLAST therefore uses a closely related but more intuitive measure: the E-value.
The E-value is defined as:
The expected number of alignments with score at least that would occur by chance in the database search.
It can be approximated by an expression of the form:
where:
- and are the lengths of the query and database sequences,
- and are parameters depending on the scoring system and sequence composition.
This formula reflects two important dependencies:
- Higher scores lead to exponentially smaller E-values
- Larger databases lead to larger E-values
Interpreting the E-Value
Section titled “Interpreting the E-Value”The interpretation of the E-value is both straightforward and practical:
- → one such match is expected by chance
- → the match is unlikely to be random
- → many such matches are expected by chance
For example:
- suggests a highly significant match
- suggests that similar scores occur frequently by chance
This interpretation directly connects statistical theory to practical decision-making.
The Role of Database Size
Section titled “The Role of Database Size”An important consequence of this framework is that significance depends on the size of the database.
As the database grows, the number of random comparisons increases, and therefore the likelihood of observing high scores by chance also increases.
This leads to a somewhat counterintuitive effect:
The same alignment may appear less significant when searched against a larger database.
This is not a flaw, but a natural consequence of statistical reasoning.
From Scores to Biological Interpretation
Section titled “From Scores to Biological Interpretation”Statistical significance provides a critical filter between computational output and biological interpretation.
A high alignment score is not enough on its own. Only when the score is unlikely under the null model can we begin to consider biological explanations such as homology.
At the same time, significance does not guarantee biological relevance. Additional factors must be considered:
- alignment length,
- conservation of functional residues,
- domain structure,
- and biological context.
Thus, statistical evaluation is a necessary, but not sufficient, step in interpreting similarity search results.
Summary
Section titled “Summary”Statistical significance transforms raw alignment scores into interpretable quantities:
- p-values measure the probability of observing a score by chance,
- E-values estimate how many such matches we expect in the database,
- both rely on a null model of random sequences,
- and both depend on sequence length and database size.
This framework allows us to distinguish meaningful similarity from random coincidence, which is essential in large-scale search.
Self-Check Questions
Section titled “Self-Check Questions”-
Why is a raw alignment score insufficient to assess biological relevance?
-
What is the null model in the context of similarity search?
-
Why do alignment scores follow an extreme value distribution rather than a normal distribution?
-
How does the E-value differ conceptually from the p-value?
-
Why does increasing the database size affect the significance of a match?