6.1 From Motif Search to Generative Models of Biological Sequences

In the previous chapter, we approached motif discovery as a problem of identifying recurring patterns in biological sequences. We introduced position-specific models that allow us to represent such patterns probabilistically and to scan sequences for likely motif occurrences. While this already provides a powerful framework, it still rests on a simplifying assumption: that motifs can be described as static patterns with independent positions.

In practice, however, biological sequences are not generated by static templates. They arise from underlying biological processes that introduce variability, dependencies, and structural organization. To capture this, we now take a conceptual step forward. Instead of asking whether a sequence matches a motif, we ask a more fundamental question:

Which process could have generated this sequence?

From Pattern Matching to Model-Based Reasoning

Consider again the problem of motif discovery in DNA. A genome is a long sequence composed of nucleotides, within which short functional regions such as promoters, splice sites, or transcription factor binding sites are embedded. These regions are typically short and weakly conserved, and they are surrounded by large stretches of background sequence.

This motivates a simple but powerful abstraction. We assume that different parts of the sequence are generated by different underlying processes:

a motif-generating process, producing biologically meaningful patterns
a background process, producing non-specific sequence

Each position in the sequence is thus associated with one of these processes, although this assignment is not directly observable.

Likelihood as a Measure of Compatibility

To make this idea operational, we quantify how well a sequence fits a given model using the concept of likelihood.

Given a model $M$ and a sequence $S$, the likelihood

$$P(S \mid M)$$

is the probability that the model assigns to the sequence, that is, how probable $S$ is if it was generated by $M$.

In the case of position-specific models, the model is represented as a position probability matrix (PPM). Each position $i$ specifies a probability distribution over nucleotides. Under the assumption of positional independence, the likelihood of a sequence of length $L$ is given by

$$P(S \mid M) = \prod_{i=1}^{L} P(x_i \mid M_i)$$

where $x_i$ is the nucleotide at position $i$, and $M_i$ denotes the distribution at that position.

A Worked Example

To illustrate this concretely, consider a motif model of length five described by the following position-specific probabilities:

Position	A	C	G	T
1	0.1	0.5	0.2	0.2
2	0.3	0.2	0.2	0.3
3	0.1	0.1	0.6	0.2
4	0.2	0.1	0.5	0.2
5	0.2	0.6	0.1	0.1

Now consider the sequence

$$S = \text{GAGGT}$$

We compute its likelihood under the model by multiplying the position-wise probabilities:

Position 1: $P(G) = 0.2$
Position 2: $P(A) = 0.3$
Position 3: $P(G) = 0.6$
Position 4: $P(G) = 0.5$
Position 5: $P(T) = 0.1$

Thus,

$$P(S \mid M) = 0.2 \cdot 0.3 \cdot 0.6 \cdot 0.5 \cdot 0.1 = 0.0018$$

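This value quantifies how compatible the sequence is with the motif model. The absolute number is small because it is a product of five probabilities; it becomes meaningful in comparison with other sequences of the same length, where a higher likelihood indicates a better match.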
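
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch, not a reference implementation: the names `PPM` and `motif_likelihood` are chosen here for illustration, and the matrix is stored as one dictionary per motif position.

```python
import math

# Position probability matrix from the worked example,
# stored as one dictionary per motif position.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def motif_likelihood(seq, ppm):
    """Return P(seq | ppm) under the positional independence assumption."""
    if len(seq) != len(ppm):
        raise ValueError("sequence length must equal motif length")
    # Multiply the probability of each observed base at its position.
    return math.prod(column[base] for base, column in zip(seq, ppm))


likelihood = motif_likelihood("GAGGT", PPM)
print(f"{likelihood:.4f}")  # 0.0018
```

For longer motifs the product quickly becomes very small, so practical implementations usually sum log-probabilities instead; the plain product is kept here to mirror the formula in the text.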

From Likelihood to Motif Detection

This calculation enables a simple scanning procedure. Given a long sequence, we slide a window of length five along it and compute the likelihood for each subsequence. This produces a function

$$i \mapsto P(S_{i:i+L-1} \mid M)$$

that can be interpreted as a likelihood landscape.

Regions where the likelihood is high correspond to subsequences that are well explained by the motif model. In practice, motif discovery can therefore be framed as identifying peaks in this landscape.

This perspective is conceptually powerful: rather than directly searching for patterns, we evaluate how well a generative model explains different parts of the sequence.
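
The scanning procedure can be sketched as follows. The input sequence `TTACAGGCATT` is a made-up example that is not taken from any real genome, the function name `scan` is hypothetical, and window positions are counted from zero as is usual in Python.

```python
import math

# Position probability matrix from the worked example.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def scan(sequence, ppm):
    """Return the likelihood of every window of motif length in the sequence."""
    width = len(ppm)
    return [
        math.prod(col[base] for base, col in zip(sequence[i:i + width], ppm))
        for i in range(len(sequence) - width + 1)
    ]


sequence = "TTACAGGCATT"  # hypothetical input, chosen for illustration
landscape = scan(sequence, PPM)

# The peak of the landscape is the window best explained by the motif model.
best_start = max(range(len(landscape)), key=landscape.__getitem__)
print(best_start, sequence[best_start:best_start + len(PPM)])  # 3 CAGGC
```

In this toy input the window starting at index 3 has a likelihood of 0.027, while every other window stays below 0.001, which is the kind of peak the likelihood landscape is meant to reveal.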

Limitations of the Model

Despite its elegance, this approach has important limitations.

First, it assumes that positions are independent. From a biological perspective, this is rarely true. Structural and functional constraints introduce dependencies between positions, particularly in protein sequences and regulatory elements.

Second, the model assumes a fixed motif length. Insertions and deletions, which are common in biological sequences, cannot be handled naturally. Even a single insertion can drastically reduce the likelihood of an otherwise valid motif instance.

Finally, the model does not explicitly represent transitions between different regions of a sequence. It treats motifs as isolated objects rather than as parts of a larger generative process.
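
The effect of the fixed-length assumption can be made concrete with a small experiment. The sketch below uses two made-up sequences: `CAGGC`, which takes a most probable base at every position of the example matrix, and `CATGGC`, the same instance with a single `T` inserted after the second base. The helper name `motif_likelihood` is an illustrative choice.

```python
import math

# Position probability matrix from the worked example.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def motif_likelihood(seq, ppm):
    """Return P(seq | ppm) under the positional independence assumption."""
    return math.prod(column[base] for base, column in zip(seq, ppm))


intact = "CAGGC"           # hypothetical motif instance
with_insertion = "CATGGC"  # same instance with one inserted T

intact_score = motif_likelihood(intact, PPM)

# The fixed-length model can only look at windows of length five,
# so the inserted base shifts part of the motif out of register.
width = len(PPM)
window_scores = [
    motif_likelihood(with_insertion[i:i + width], PPM)
    for i in range(len(with_insertion) - width + 1)
]
best_after_insertion = max(window_scores)

print(f"{intact_score:.4f} {best_after_insertion:.4f}")  # 0.0270 0.0054
```

Even the best-scoring window of the modified sequence reaches only a fifth of the original likelihood, although five of its six bases still come from the intact instance.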

Towards Generative Sequence Models

These limitations motivate a more expressive framework. Instead of describing motifs as static probability tables, we model the process that generates sequences.

In such a model:

the sequence is generated step by step
at each step, the system is in a certain internal state
the state determines which symbols are likely to be emitted
the system can transition between states over time

Crucially, these internal states are not directly observable. We only see the emitted sequence, not the underlying process.
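
To make the four ingredients tangible, the following sketch generates a sequence from a toy two-state process. All numbers in `TRANSITIONS` and `EMISSIONS` are hypothetical values invented for illustration, and the single `motif` state is a deliberate simplification that ignores position-specific structure.

```python
import random

# Hypothetical transition probabilities between the two internal states.
TRANSITIONS = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.3, "motif": 0.7},
}

# Hypothetical emission probabilities: each state prefers different symbols.
EMISSIONS = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def draw(distribution, rng):
    """Sample one outcome from a dictionary mapping outcomes to probabilities."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(length, rng, start_state="background"):
    """Generate a state path and the symbols emitted along it, step by step."""
    state = start_state
    states, symbols = [], []
    for _ in range(length):
        states.append(state)                          # current internal state
        symbols.append(draw(EMISSIONS[state], rng))   # state determines emission
        state = draw(TRANSITIONS[state], rng)         # move to the next state
    return states, "".join(symbols)


rng = random.Random(0)  # fixed seed so that repeated runs give the same output
hidden_states, observed = generate(20, rng)
print(observed)
print("".join("M" if s == "motif" else "b" for s in hidden_states))
```

The first printed line is what an observer would see; the second line, which marks each step as motif (`M`) or background (`b`), is exactly the information that is unavailable in real data.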

Preview: Hidden Structure in Sequences

This leads to the central idea of this chapter.

We model biological sequences as being generated by a system that moves through a sequence of hidden states, each of which emits observable symbols. Some states may correspond to biologically meaningful regions, such as promoters, while others represent background sequence.

Because we observe only the emitted symbols and never the states themselves, the state sequence is said to be hidden.

This type of model is known as a Hidden Markov Model (HMM).

Conceptual Transition

With this shift, motif discovery becomes a problem of inferring hidden structure rather than simply detecting patterns.

We are no longer interested only in whether a motif occurs, but in three related questions:

which regions belong to which functional class
how likely a sequence is under a given model
how to learn such models from data

In the following sections, we will develop this framework step by step, beginning with the Markov property and its implications for modeling sequence dependencies.

Self-Check Questions

How is likelihood used to evaluate whether a sequence matches a motif model?
In the worked example, which position contributes most strongly to the likelihood, and why?
Why does the independence assumption simplify computation, and why is it problematic biologically?
How does sliding-window likelihood scanning transform motif discovery into a signal detection problem?
What key limitations of position-specific models motivate the transition to generative sequence models?
