4.7 Interpreting Similarity Search Results

Learning Objectives

After reading this section, you should be able to:

interpret ranked similarity search hits in both statistical and biological terms
explain why the top hit alone may be insufficient for robust biological inference
evaluate how alignment length, gaps, and conserved residues affect interpretation
distinguish between sequence identity and sequence similarity
assess similarity search output as evidence rather than definitive proof of homology

After performing a similarity search and evaluating statistical significance, we obtain a list of candidate matches. At first glance, this output may appear straightforward: sequences are ranked by score or E-value, and the top hits seem to be the most relevant. In practice, however, interpretation requires careful reasoning.

Similarity search does not produce definitive answers. It produces evidence, which must be evaluated in a biological and statistical context.

From Output to Insight

A typical search result consists of:

a ranked list of hits,
associated scores and E-values,
and one or more alignments for each hit.

The ranking reflects the statistical strength of similarity, with lower E-values indicating more significant matches. Yet interpreting this ranking requires more than simply selecting the top hit.

A central question is:

What does it mean for two sequences to be “similar enough”?

The answer depends on multiple factors, including statistical significance, alignment structure, and biological plausibility.

Beyond the Best Hit

It is tempting to focus exclusively on the highest-ranking match. In many cases, this is indeed the most informative result. However, several pitfalls arise from this approach.

First, the top hit may not represent the full biological context. For example, a query sequence may contain multiple domains, each with different evolutionary origins. In such cases, different parts of the query may align to different database sequences.

Second, closely related sequences often appear as clusters of similar hits. Observing a consistent pattern across multiple top-ranked matches can provide stronger evidence than a single high-scoring alignment.

This leads to an important principle:

Robust biological interpretation relies on patterns across multiple hits, not just the best match.

Alignment Structure Matters

The numerical score or E-value provides only a summary of the alignment. To fully interpret a result, one must examine the alignment itself.

Several aspects are particularly informative:

Length of the aligned region Short alignments may achieve high scores by chance, especially in large databases.
Distribution of matches and mismatches True homologs often show structured conservation rather than scattered matches.
Presence of gaps Insertions and deletions may reflect evolutionary events.
Conservation of key residues Functionally important positions are often highly conserved.

These features provide context that no single score can capture.

Identity, Similarity, and Their Interpretation

Two commonly reported quantities are:

Sequence identity: the fraction of exactly matching residues
Sequence similarity: the fraction of residues with favorable substitution scores

These measures are useful summaries, but they must be interpreted with care.

For example:

High identity over a short region may be less meaningful than moderate identity over a long region.
Similarity depends on the substitution matrix and may reflect biochemical properties rather than exact matches.

Thus, identity and similarity provide useful descriptors, but they are not definitive indicators of biological relationship.

False Positives and Low-Complexity Regions

One of the most common sources of misleading results arises from low-complexity sequences.

Biological sequences often contain repetitive or compositionally biased regions, such as:

microsatellites,
transposable elements,
or simple amino acid repeats.

These regions can produce artificially high alignment scores, even in unrelated sequences.

As a result, they may appear as significant hits despite lacking biological relevance.

Modern search tools often attempt to mask or down-weight such regions, but the user must remain aware of their potential impact.

Similarity Is Not Transitive

A particularly subtle pitfall in interpreting search results is the assumption of transitivity.

Suppose:

sequence $A$ is similar to sequence $B$ ,
and sequence $A$ is also similar to sequence $C$ .

It does not necessarily follow that $B$ is similar to $C$ .

This situation arises because alignments may involve different regions of the sequences. Two sequences may each share similarity with a third sequence, but not with each other.

This observation reinforces an important conceptual distinction:

Similarity relationships are local and context-dependent, not globally transitive.

From Similarity to Homology

Ultimately, the goal of similarity search is often to infer homology, that is, shared evolutionary origin. However, as emphasized earlier, homology cannot be directly observed. It must be inferred from evidence.

A typical reasoning process is:

A significant alignment is detected (low E-value).
The alignment shows consistent structure and sufficient length.
Functional or structural features are conserved.
Multiple related sequences support the relationship.

Only when these pieces of evidence point in the same direction can we confidently infer homology.

This highlights a key point:

Similarity search provides hypotheses, not conclusions.

A Practical Example of Interpretation

Consider a query sequence that yields the following results:

Several hits with E-values below $10^{-20}$ , all annotated as kinases
Alignments covering a conserved catalytic domain
High similarity in functionally important residues

In this case, the evidence strongly suggests that the query sequence is also a kinase.

In contrast, consider:

A single hit with E-value near $1$
Alignment restricted to a short repetitive region
No consistent pattern across other hits

Here, the evidence is weak, and the match is likely incidental.

Summary

Interpreting similarity search results requires integrating multiple sources of information:

statistical significance (E-values),
alignment structure,
sequence identity and similarity,
consistency across hits,
and biological context.

No single metric is sufficient on its own. Meaningful interpretation emerges from the combination of evidence.

Self-Check Questions

Why is it insufficient to consider only the top hit in a similarity search?
What aspects of an alignment provide insight beyond the raw score?
Why can low-complexity regions lead to false positives?
What does it mean that similarity is not transitive?
How does one move from similarity to a hypothesis of homology?