3.1 From Biological Function to Sequence Comparison

Learning Objectives

After reading this section, you should be able to:

Explain how molecular function can be encoded in protein sequence
Understand how non-ribosomal peptide synthesis motivates sequence comparison
Describe how alignments emerge naturally from biological reasoning
Recognize why intuitive comparison must be formalized computationally

A Molecular Puzzle: How Do Enzymes Encode Specificity?

One of the central questions in molecular biology is how proteins achieve their remarkable functional specificity. Enzymes are not only catalysts, they are highly selective systems that recognize substrates, control reaction pathways, and produce precise molecular outputs.

A particularly striking example of such specificity is found in non-ribosomal peptide synthesis (NRPS).

Unlike ribosomal protein synthesis, which follows the genetic code in a linear and uniform manner, NRPS systems operate as modular enzymatic assembly lines. Each module in such a system is responsible for incorporating a specific amino acid into a growing peptide chain.

Different organisms possess different NRPS systems, producing biologically active compounds such as:

Tyrocidine
Daptomycin
Vancomycin

These molecules differ substantially in structure and function, yet they are built using a shared architectural principle.

This raises a fundamental question:

Where, in the sequence of these enzymes, is the information that determines which amino acid is incorporated?

Adenylation Domains as Carriers of Specificity

Within each NRPS module, the key component responsible for substrate selection is the adenylation domain. These domains bind specific amino acids and activate them for incorporation into the peptide chain.

If we extract the amino acid sequences of several such domains from different organisms, we obtain sequences like:

s1: YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTA...
s2: AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYIY...
s3: IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA...

At first glance, these sequences appear quite different. There is no obvious one-to-one correspondence between positions. Yet experimentally, we know that all of these proteins perform essentially the same function: selecting and activating amino acids.

This creates a tension:

Function is conserved
Sequence appears variable

The challenge is to reconcile these two observations.

Looking for Patterns: The First Attempt

A natural first attempt is to compare the sequences position by position, aligning them without modification. Doing so reveals only a few columns in which the same amino acids appear across sequences.

Some of these matches may arise purely by chance. Indeed, short identical patterns are expected even in unrelated sequences.

If we stopped here, we might conclude that these sequences are only weakly related. However, this would contradict our biological knowledge.

This suggests that our method of comparison is inadequate.

Alignment by Eye: Introducing Gaps

Now consider a simple but powerful idea: what if the sequences are not aligned correctly?

Suppose we shift one sequence slightly to the right and introduce a gap symbol “–” at the beginning:

s1: YAFDLGYTCMFPVLLGGGELHIVQKET...
s2: -AFDVSAGDFARALLTGGQLIVCPNEV...
s3: IAFDASSWEIYAPLLNGGTVVCIDYY...

This small adjustment already increases the number of conserved positions.

Encouraged by this, we continue the process. We insert additional gaps at positions where the sequences appear misaligned:

s1: YAFDLGYTCMFPVLLGGGELHIVQKET...
s2: -AFDVSAGDFARALLTGGQLIVCPNEV...
s3: ----IAFDASSWEIYAPLLNGGTVVC...

With each adjustment, more columns become conserved. Eventually, we uncover a pattern:

a conserved core shared across sequences
variable regions surrounding it
specific short motifs that correlate with functional differences

This process, performed informally, is what we call alignment by eye.

What Did We Just Do?

Although this manual procedure may seem ad hoc, it implicitly performs several important operations:

Shifting sequences relative to each other → hypothesizing insertions or deletions
Matching residues across sequences → hypothesizing functional or evolutionary correspondence
Maximizing the number of “good” matches → implicitly optimizing a notion of similarity

In other words, even without formal definitions, we are already solving an optimization problem.

From Alignment to Biological Insight

The aligned sequences reveal more than just similarity. They allow us to identify functionally important regions.

In the case of adenylation domains, specific short segments, often around 8 amino acids long, form what is known as the non-ribosomal code. These segments determine which amino acid is selected by the enzyme.

For example:

LTKVGHIG → Aspartate
VGEIGSID → Ornithine
AWMFAAVL → Valine

These motifs would be extremely difficult to detect without proper alignment. Their positions are not fixed in the raw sequences but become apparent only after introducing gaps and aligning homologous regions.

This illustrates a key principle:

Alignment is the mechanism that makes functional signals visible.

The Need for Formalization

While alignment by eye is powerful, it has clear limitations:

It is subjective and not reproducible
It does not scale to long sequences or large datasets
It provides no guarantee of optimality

Most importantly, it lacks a precise definition of what constitutes a “good” alignment.

To overcome these limitations, we must formalize the problem:

What operations are allowed when transforming one sequence into another?
How do we assign scores to matches, mismatches, and gaps?
How do we find the best possible alignment among many alternatives?

These questions will lead us to a mathematical framework in which sequence alignment becomes a well-defined optimization problem.

Conceptual Summary

The comparison of biological sequences emerges naturally from questions about molecular function. In the case of NRPS systems, we are interested in identifying sequence features that determine substrate specificity.

By manually aligning sequences, we discover that similarity is not a simple positional property. Instead, it depends on how sequences are arranged relative to each other.

This insight transforms sequence comparison into a modeling problem:

We must define what an alignment is
We must decide how to evaluate it
We must develop algorithms to compute it

Self-Check Questions

Why do adenylation domain sequences appear dissimilar at first glance despite having similar functions?
What implicit assumptions are made when introducing gaps during alignment by eye?
Why does alignment reveal conserved motifs that are not visible in raw sequences?
In what sense does manual alignment already solve an optimization problem?