3.7 Interpreting Alignments
3.7 Interpreting Alignments
Section titled “3.7 Interpreting Alignments”Learning Objectives
Section titled “Learning Objectives”After reading this section, you should be able to:
- Distinguish clearly between similarity and homology
- Interpret alignment scores in a biological context
- Understand the limitations of sequence alignment
- Recognize when an alignment is likely to be biologically meaningful
- Identify common pitfalls in interpreting alignment results
From Computation Back to Biology
Section titled “From Computation Back to Biology”At this point, we have developed a complete computational framework for sequence alignment:
- we can define similarity through scoring schemes
- we can compute optimal alignments using dynamic programming
- we can adapt the model to detect global or local similarity
This naturally leads to a critical question:
What does an alignment actually tell us about biological reality?
It is tempting to treat an alignment as a direct answer. After all, the algorithms we developed are mathematically rigorous and guarantee optimality under the chosen scoring system.
However, this guarantee is subtle:
An alignment is optimal only with respect to the model we have defined.
It does not necessarily reflect the true evolutionary history or functional relationship between sequences.
Similarity Is Not Homology
Section titled “Similarity Is Not Homology”One of the most common misunderstandings in bioinformatics is the confusion between similarity and homology.
Similarity is a measurable quantity. It is defined through alignment scores, percent identity, or other metrics. It can be computed directly from sequence data.
Homology, in contrast, is a biological statement. It means that two sequences share a common evolutionary origin.
This distinction is crucial:
- similarity is quantitative
- homology is qualitative
An alignment can demonstrate high similarity, but it cannot prove homology in a strict logical sense. Instead, similarity provides evidence for homology.
Thus, it is more precise to say:
Sequences are inferred to be homologous based on significant similarity.
What Does an Alignment Represent?
Section titled “What Does an Alignment Represent?”An alignment can be interpreted as a hypothesis about how two sequences are related.
More specifically, it proposes:
- which residues correspond to each other
- where insertions and deletions have occurred
- which regions are conserved and which are variable
From this perspective, the alignment is not merely a result but a model of evolutionary transformation.
This interpretation connects directly back to the edit operations introduced earlier:
- substitutions represent point mutations
- gaps represent insertions or deletions
- aligned residues suggest shared ancestry or functional constraints
However, it is important to remember that this model is simplified. Real evolutionary processes are more complex and may not be fully captured by the scoring system.
Alignment Scores and Their Meaning
Section titled “Alignment Scores and Their Meaning”The alignment score quantifies how well two sequences match under a given model. Higher scores indicate better agreement with the assumptions encoded in the scoring system.
However, the absolute value of the score is difficult to interpret in isolation.
For example:
- a score of 50 may be highly significant for short sequences
- the same score may be unremarkable for longer sequences
This is because longer sequences provide more opportunities for both matches and mismatches.
Therefore, interpreting alignment scores requires context. In practice, this often involves:
- comparing scores against random sequences
- normalizing scores
- estimating statistical significance
While a full treatment of statistical significance is beyond the scope of this chapter, it is important to recognize that:
Raw alignment scores are not directly comparable across different sequence lengths or scoring systems.
The Role of Local Alignment
Section titled “The Role of Local Alignment”Local alignment plays a particularly important role in interpretation.
As we have seen, biological sequences are often modular. Conserved domains may be embedded within otherwise unrelated sequences.
Local alignment allows us to detect these regions of high similarity without being distracted by unrelated flanking regions.
In practice, this means that:
- a strong local alignment can indicate a conserved functional domain
- weak or absent global similarity does not rule out biological relatedness
This insight is essential when analyzing complex proteins such as NRPS enzymes, where functional modules are rearranged and reused across different organisms.
When Alignments Mislead
Section titled “When Alignments Mislead”Despite their usefulness, alignments can also be misleading if interpreted uncritically.
One important issue is that alignment algorithms will always produce a result, even for unrelated sequences. Given enough flexibility in gap placement, it is possible to construct alignments that appear structured but arise purely by chance.
This leads to a practical warning:
If an alignment “looks wrong,” it often is.
Another source of error arises from inappropriate scoring systems. If the scoring scheme does not reflect the underlying biology, the resulting alignment may be formally optimal but biologically meaningless.
For example:
- using a protein substitution matrix for DNA sequences
- applying gap penalties that encourage unrealistic fragmentation
- aligning sequences that are too distantly related
These issues highlight the importance of model choice in alignment.
Dependence on Assumptions
Section titled “Dependence on Assumptions”Every alignment is the result of a set of assumptions:
- the scoring matrix encodes substitution preferences
- gap penalties reflect structural constraints
- the choice of global or local alignment defines the scope of comparison
Changing any of these assumptions can lead to different alignments.
This reinforces a central theme of this book:
Bioinformatics methods do not reveal truth directly. They generate interpretations under specific models.
From Pairwise to Multiple Alignments
Section titled “From Pairwise to Multiple Alignments”Finally, it is worth noting that pairwise alignment is often only the first step in a broader analysis.
In many applications, we are interested in comparing multiple sequences simultaneously, for example to:
- identify conserved motifs across a family of proteins
- infer evolutionary relationships
- build consensus sequences
However, extending alignment from two sequences to many introduces additional challenges:
- pairwise optimality does not guarantee global consistency
- errors can propagate in progressive alignment methods
- results depend strongly on initial assumptions
We will revisit these ideas later when discussing multiple sequence alignment.
Conceptual Summary
Section titled “Conceptual Summary”Sequence alignment provides a powerful framework for comparing biological sequences, but its results must be interpreted with care.
- similarity is a measurable quantity, while homology is an inference
- alignments represent hypotheses about sequence relationships
- scores depend on the chosen model and are not absolute
- inappropriate assumptions can lead to misleading conclusions
Understanding these limitations is essential for using alignment as a scientific tool rather than treating it as a black box.
Self-Check Questions
Section titled “Self-Check Questions”- Why is similarity not equivalent to homology?
- In what sense is an alignment a hypothesis rather than a fact?
- Why are raw alignment scores difficult to interpret across different sequences?
- How can local alignment improve biological interpretation?
- What are common sources of misleading alignments?