3.7 Interpreting Alignments

Learning Objectives

After reading this section, you should be able to:

Distinguish clearly between similarity and homology
Interpret alignment scores in a biological context
Understand the limitations of sequence alignment
Recognize when an alignment is likely to be biologically meaningful
Identify common pitfalls in interpreting alignment results

From Computation Back to Biology

At this point, we have developed a complete computational framework for sequence alignment:

we can define similarity through scoring schemes
we can compute optimal alignments using dynamic programming
we can adapt the model to detect global or local similarity

This naturally leads to a critical question:

What does an alignment actually tell us about biological reality?

It is tempting to treat an alignment as a direct answer. After all, the algorithms we developed are mathematically rigorous and guarantee optimality under the chosen scoring system.

However, this guarantee is subtle:

An alignment is optimal only with respect to the model we have defined.

It does not necessarily reflect the true evolutionary history or functional relationship between sequences.

Similarity Is Not Homology

One of the most common misunderstandings in bioinformatics is the confusion between similarity and homology.

Similarity is a measurable quantity. It is defined through alignment scores, percent identity, or other metrics. It can be computed directly from sequence data.

Homology, in contrast, is a biological statement. It means that two sequences share a common evolutionary origin.

This distinction is crucial:

similarity is quantitative
homology is qualitative

An alignment can demonstrate high similarity, but it cannot prove homology in a strict logical sense. Instead, similarity provides evidence for homology.

Thus, it is more precise to say:

Sequences are inferred to be homologous based on significant similarity.

What Does an Alignment Represent?

An alignment can be interpreted as a hypothesis about how two sequences are related.

More specifically, it proposes:

which residues correspond to each other
where insertions and deletions have occurred
which regions are conserved and which are variable

From this perspective, the alignment is not merely a result but a model of evolutionary transformation.

This interpretation connects directly back to the edit operations introduced earlier:

substitutions represent point mutations
gaps represent insertions or deletions
aligned residues suggest shared ancestry or functional constraints

However, it is important to remember that this model is simplified. Real evolutionary processes are more complex and may not be fully captured by the scoring system.

Alignment Scores and Their Meaning

The alignment score quantifies how well two sequences match under a given model. Higher scores indicate better agreement with the assumptions encoded in the scoring system.

However, the absolute value of the score is difficult to interpret in isolation.

For example:

a score of 50 may be highly significant for short sequences
the same score may be unremarkable for longer sequences

This is because longer sequences provide more opportunities for both matches and mismatches.

Therefore, interpreting alignment scores requires context. In practice, this often involves:

comparing scores against random sequences
normalizing scores
estimating statistical significance

While a full treatment of statistical significance is beyond the scope of this chapter, it is important to recognize that:

Raw alignment scores are not directly comparable across different sequence lengths or scoring systems.

The Role of Local Alignment

Local alignment plays a particularly important role in interpretation.

As we have seen, biological sequences are often modular. Conserved domains may be embedded within otherwise unrelated sequences.

Local alignment allows us to detect these regions of high similarity without being distracted by unrelated flanking regions.

In practice, this means that:

a strong local alignment can indicate a conserved functional domain
weak or absent global similarity does not rule out biological relatedness

This insight is essential when analyzing complex proteins such as NRPS enzymes, where functional modules are rearranged and reused across different organisms.

When Alignments Mislead

Despite their usefulness, alignments can also be misleading if interpreted uncritically.

One important issue is that alignment algorithms will always produce a result, even for unrelated sequences. Given enough flexibility in gap placement, it is possible to construct alignments that appear structured but arise purely by chance.

This leads to a practical warning:

If an alignment “looks wrong,” it often is.

Another source of error arises from inappropriate scoring systems. If the scoring scheme does not reflect the underlying biology, the resulting alignment may be formally optimal but biologically meaningless.

For example:

using a protein substitution matrix for DNA sequences
applying gap penalties that encourage unrealistic fragmentation
aligning sequences that are too distantly related

These issues highlight the importance of model choice in alignment.

Dependence on Assumptions

Every alignment is the result of a set of assumptions:

the scoring matrix encodes substitution preferences
gap penalties reflect structural constraints
the choice of global or local alignment defines the scope of comparison

Changing any of these assumptions can lead to different alignments.

This reinforces a central theme of this book:

Bioinformatics methods do not reveal truth directly. They generate interpretations under specific models.

From Pairwise to Multiple Alignments

Finally, it is worth noting that pairwise alignment is often only the first step in a broader analysis.

In many applications, we are interested in comparing multiple sequences simultaneously, for example to:

identify conserved motifs across a family of proteins
infer evolutionary relationships
build consensus sequences

However, extending alignment from two sequences to many introduces additional challenges:

pairwise optimality does not guarantee global consistency
errors can propagate in progressive alignment methods
results depend strongly on initial assumptions

We will revisit these ideas later when discussing multiple sequence alignment.

Conceptual Summary

Sequence alignment provides a powerful framework for comparing biological sequences, but its results must be interpreted with care.

similarity is a measurable quantity, while homology is an inference
alignments represent hypotheses about sequence relationships
scores depend on the chosen model and are not absolute
inappropriate assumptions can lead to misleading conclusions

Understanding these limitations is essential for using alignment as a scientific tool rather than treating it as a black box.

Self-Check Questions

Why is similarity not equivalent to homology?
In what sense is an alignment a hypothesis rather than a fact?
Why are raw alignment scores difficult to interpret across different sequences?
How can local alignment improve biological interpretation?
What are common sources of misleading alignments?