4.6 Statistical Significance: When Is a Match Meaningful?

Learning Objectives

After reading this section, you should be able to:

explain why raw alignment scores are insufficient for interpreting search results
describe the role of a null model in assessing similarity by chance
understand why extreme value statistics arise in database search
interpret p-values and E-values as measures of statistical significance
explain how sequence length and database size affect the significance of a match

At this point, we have developed a powerful mechanism for identifying similar sequences in large databases. Algorithms such as BLAST return alignments together with associated scores, and these scores allow us to rank candidate matches. Yet a fundamental question remains:

How do we know whether a high-scoring alignment is biologically meaningful, or simply a result of chance?

This question becomes unavoidable in large-scale search. When millions of sequences are compared, even random sequences will occasionally produce high alignment scores. Without a statistical framework, it is impossible to distinguish signal from noise.

Why Raw Scores Are Not Enough

Recall that an alignment score is computed as the sum of substitution scores and gap penalties:

S = \sum \text{substitution scores} + \sum \text{gap penalties}.

While this score reflects the quality of an alignment under a chosen scoring system, it lacks a crucial reference point. A score of, say, $50$ may be highly significant in one context and entirely unremarkable in another.

Two key factors influence this:

Sequence length Longer sequences have more opportunities for random matches.
Database size The more sequences we search, the higher the chance of observing a high score by coincidence.

Thus, a raw score cannot be interpreted in isolation. It must be evaluated relative to what we would expect under a null model of random sequences.

The Null Model: Random Alignments

To assess significance, we imagine a scenario in which the query sequence is compared against sequences that are unrelated, that is, generated at random under some background distribution.

Under this null hypothesis, any observed alignment is purely accidental. The question then becomes:

How likely is it to observe an alignment score at least as high as the one we obtained, purely by chance?

That probability provides a natural measure of significance.

Extreme Values and Alignment Scores

An important insight from statistical theory is that alignment scores behave in a characteristic way under the null model.

While individual comparisons may resemble familiar distributions, the maximum score over many comparisons follows an extreme value distribution, often approximated by a Gumbel distribution.

Intuitively, this arises because similarity search does not consider just one alignment, but many possible alignments across many sequences. We are effectively asking about the maximum of many random variables, and maxima behave differently from individual observations.

This leads to a characteristic distribution of high scores in which extreme values occur more frequently than one might expect under a normal distribution.

P-Values: Probability of Random Occurrence

One way to quantify significance is through the p-value.

Formally, the p-value is defined as:

p = \mathbb{P}(\text{score} \geq S \mid \text{null model}),

that is, the probability that a random alignment achieves a score at least as high as the observed score $S$ .

A small p-value indicates that such a score is unlikely to occur by chance, suggesting that the alignment may reflect a real biological relationship.

However, while p-values are conceptually clear, they are not always the most practical measure in large-scale database searches.

E-Values: Expected Number of Random Hits

BLAST therefore uses a closely related but more intuitive measure: the E-value.

The E-value is defined as:

The expected number of alignments with score at least $S$ that would occur by chance in the database search.

It can be approximated by an expression of the form:

E = K \cdot m \cdot n \cdot e^{-\lambda S},

where:

$m$ and $n$ are the lengths of the query and database sequences,
$K$ and $\lambda$ are parameters depending on the scoring system and sequence composition.

This formula reflects two important dependencies:

Higher scores lead to exponentially smaller E-values
Larger databases lead to larger E-values

Interpreting the E-Value

The interpretation of the E-value is both straightforward and practical:

$E \approx 1$ → one such match is expected by chance
$E \ll 1$ → the match is unlikely to be random
$E \gg 1$ → many such matches are expected by chance

For example:

$E = 10^{-5}$ suggests a highly significant match
$E = 10$ suggests that similar scores occur frequently by chance

This interpretation directly connects statistical theory to practical decision-making.

The Role of Database Size

An important consequence of this framework is that significance depends on the size of the database.

As the database grows, the number of random comparisons increases, and therefore the likelihood of observing high scores by chance also increases.

This leads to a somewhat counterintuitive effect:

The same alignment may appear less significant when searched against a larger database.

This is not a flaw, but a natural consequence of statistical reasoning.

From Scores to Biological Interpretation

Statistical significance provides a critical filter between computational output and biological interpretation.

A high alignment score is not enough on its own. Only when the score is unlikely under the null model can we begin to consider biological explanations such as homology.

At the same time, significance does not guarantee biological relevance. Additional factors must be considered:

alignment length,
conservation of functional residues,
domain structure,
and biological context.

Thus, statistical evaluation is a necessary, but not sufficient, step in interpreting similarity search results.

Summary

Statistical significance transforms raw alignment scores into interpretable quantities:

p-values measure the probability of observing a score by chance,
E-values estimate how many such matches we expect in the database,
both rely on a null model of random sequences,
and both depend on sequence length and database size.

This framework allows us to distinguish meaningful similarity from random coincidence, which is essential in large-scale search.

Self-Check Questions

Why is a raw alignment score insufficient to assess biological relevance?
What is the null model in the context of similarity search?
Why do alignment scores follow an extreme value distribution rather than a normal distribution?
How does the E-value differ conceptually from the p-value?
Why does increasing the database size affect the significance of a match?