6.1 From Motif Search to Generative Models of Biological Sequences
In the previous chapter, we approached motif discovery as a problem of identifying recurring patterns in biological sequences. We introduced position-specific models that allow us to represent such patterns probabilistically and to scan sequences for likely motif occurrences. While this already provides a powerful framework, it still rests on a simplifying assumption: that motifs can be described as static patterns with independent positions.
In practice, however, biological sequences are not generated by static templates. They arise from underlying biological processes that introduce variability, dependencies, and structural organization. To capture this, we now take a conceptual step forward. Instead of asking whether a sequence matches a motif, we ask a more fundamental question:
Which process could have generated this sequence?
From Pattern Matching to Model-Based Reasoning
Consider again the problem of motif discovery in DNA. A genome is a long sequence composed of nucleotides, within which short functional regions such as promoters, splice sites, or transcription factor binding sites are embedded. These regions are typically short and weakly conserved, and they are surrounded by large stretches of background sequence.
This motivates a simple but powerful abstraction. We assume that different parts of the sequence are generated by different underlying processes:
- a motif-generating process, producing biologically meaningful patterns
- a background process, producing non-specific sequence
Each position in the sequence is thus associated with one of these processes, although this assignment is not directly observable.
Likelihood as a Measure of Compatibility
To make this idea operational, we quantify how well a sequence fits a given model using the concept of likelihood.
Given a model $M$ and a sequence $x$, the likelihood
$$P(x \mid M)$$
measures how probable the sequence is if it was generated by the model.
In the case of position-specific models, the model is represented as a position probability matrix (PPM). Each position specifies a probability distribution over nucleotides. Under the assumption of positional independence, the likelihood of a sequence $x$ of length $n$ is given by
$$P(x \mid M) = \prod_{i=1}^{n} p_i(x_i),$$
where $x_i$ is the nucleotide at position $i$, and $p_i$ denotes the distribution at that position.
A Worked Example
To illustrate this concretely, consider a motif model of length five described by the following position-specific probabilities:
| Position | A | C | G | T |
|---|---|---|---|---|
| 1 | 0.1 | 0.5 | 0.2 | 0.2 |
| 2 | 0.3 | 0.2 | 0.2 | 0.3 |
| 3 | 0.1 | 0.1 | 0.6 | 0.2 |
| 4 | 0.2 | 0.1 | 0.5 | 0.2 |
| 5 | 0.2 | 0.6 | 0.1 | 0.1 |
Now consider the sequence
$$x = \text{GAGGT}.$$
We compute its likelihood under the model by multiplying the position-wise probabilities:
- Position 1: $P(G) = 0.2$
- Position 2: $P(A) = 0.3$
- Position 3: $P(G) = 0.6$
- Position 4: $P(G) = 0.5$
- Position 5: $P(T) = 0.1$
Thus,
$$P(x \mid M) = 0.2 \times 0.3 \times 0.6 \times 0.5 \times 0.1 = 0.0018.$$
This value quantifies how compatible the sequence is with the motif model. A higher likelihood indicates a better match.
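The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the names `PPM` and `ppm_likelihood` are illustrative choices rather than part of any library, and the matrix is simply the table from the worked example stored as one dictionary per position.

```python
from math import prod

# Position probability matrix from the worked example (one dict per position).
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def ppm_likelihood(seq, ppm):
    """Likelihood of seq under ppm, assuming independent positions."""
    if len(seq) != len(ppm):
        raise ValueError("sequence length must equal motif length")
    # Multiply the probability of the observed base at each position.
    return prod(column[base] for column, base in zip(ppm, seq))


likelihood = ppm_likelihood("GAGGT", PPM)
print(f"{likelihood:.4f}")  # 0.0018
```

The function only accepts sequences whose length equals the motif length, which mirrors the fixed-length assumption of the model.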
From Likelihood to Motif Detection
This calculation enables a simple scanning procedure. Given a long sequence, we slide a window of length five along it and compute the likelihood for each subsequence. This produces a function
$$L(j) = P(x_j \dots x_{j+4} \mid M)$$
of the window start position $j$ that can be interpreted as a likelihood landscape.
Regions where the likelihood is high correspond to subsequences that are well explained by the motif model. In practice, motif discovery can therefore be framed as identifying peaks in this landscape.
This perspective is conceptually powerful: rather than directly searching for patterns, we evaluate how well a generative model explains different parts of the sequence.
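The scanning procedure can be sketched as follows. The matrix is the one from the worked example; the function name `scan` and the toy sequence are hypothetical, with the sequence constructed so that a high-scoring instance (`CAGGC`) sits at 0-based index 4.

```python
from math import prod

# Position probability matrix from the worked example (one dict per position).
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def scan(sequence, ppm):
    """Return the motif likelihood of every window of length len(ppm)."""
    width = len(ppm)
    return [
        prod(col[base] for col, base in zip(ppm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical toy sequence, not real genomic data.
sequence = "TTATCAGGCATT"
landscape = scan(sequence, PPM)

# The peak of the landscape is the window best explained by the motif model.
peak = max(range(len(landscape)), key=landscape.__getitem__)
print(peak, sequence[peak:peak + len(PPM)])  # 4 CAGGC
```

A sequence of length 12 yields 8 windows, so the landscape has one value per possible start position.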
Limitations of the Model
Despite its elegance, this approach has important limitations.
First, it assumes that positions are independent. From a biological perspective, this is rarely true. Structural and functional constraints introduce dependencies between positions, particularly in protein sequences and regulatory elements.
Second, the model assumes a fixed motif length. Insertions and deletions, which are common in biological sequences, cannot be handled naturally. Even a single insertion can drastically reduce the likelihood of an otherwise valid motif instance.
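The effect of an insertion can be illustrated with the matrix from the worked example. In this sketch the two sequences are hypothetical: `CAGGC` is a well-matching instance, and `CAGTGC` is the same instance with a single `T` inserted after the third position. Since a fixed-length model can only score windows of length five, we take the best window in each case; `best_window` is an illustrative helper name.

```python
from math import prod

# Position probability matrix from the worked example (one dict per position).
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def best_window(sequence, ppm):
    """Highest likelihood over all windows of length len(ppm)."""
    width = len(ppm)
    return max(
        prod(col[base] for col, base in zip(ppm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    )


intact = best_window("CAGGC", PPM)      # hypothetical motif instance
disrupted = best_window("CAGTGC", PPM)  # same instance with one inserted T

print(f"{intact:.4f}")     # 0.0270
print(f"{disrupted:.4f}")  # 0.0018
```

In this toy case the insertion shifts the downstream positions out of register, and no window of length five recovers more than a fraction of the original likelihood.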
Finally, the model does not explicitly represent transitions between different regions of a sequence. It treats motifs as isolated objects rather than as parts of a larger generative process.
Towards Generative Sequence Models
These limitations motivate a more expressive framework. Instead of describing motifs as static probability tables, we model the process that generates sequences.
In such a model:
- the sequence is generated step by step
- at each step, the system is in a certain internal state
- the state determines which symbols are likely to be emitted
- the system can transition between states over time
Crucially, these internal states are not directly observable. We only see the emitted sequence, not the underlying process.
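The four properties listed above can be sketched as a small sampler with two internal states. Everything specific here is an assumption made for illustration: the state names, the emission and transition probabilities, and the helper names `draw` and `generate` do not come from the text or from any library.

```python
import random

# All probabilities below are hypothetical values chosen for illustration.
EMISSIONS = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
}
TRANSITIONS = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.3, "motif": 0.7},
}


def draw(distribution, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(length, rng, start_state="background"):
    """Emit a symbol from the current state, then move to the next state."""
    states, symbols = [], []
    state = start_state
    for _ in range(length):
        states.append(state)
        symbols.append(draw(EMISSIONS[state], rng))
        state = draw(TRANSITIONS[state], rng)
    return states, "".join(symbols)


rng = random.Random(0)  # fixed seed so the run is repeatable
states, sequence = generate(20, rng)
print(sequence)                       # what an observer sees
print("".join(s[0] for s in states))  # internal path: b = background, m = motif
```

Only the first printed line corresponds to observable data; the second line is available here solely because we ran the generator ourselves.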
Preview: Hidden Structure in Sequences
This leads to the central idea of this chapter.
We model biological sequences as being generated by a system that moves through a sequence of hidden states, each of which emits observable symbols. Some states may correspond to biologically meaningful regions, such as promoters, while others represent background sequence.
The state sequence itself is never observed; it must be inferred from the emitted symbols.
This type of model is known as a Hidden Markov Model (HMM).
Conceptual Transition
With this shift, motif discovery becomes a problem of inferring hidden structure rather than simply detecting patterns.
We are no longer only interested in whether a motif occurs, but in reconstructing:
- which regions belong to which functional class
- how likely a sequence is under a given model
- how to learn such models from data
In the following sections, we will develop this framework step by step, beginning with the Markov property and its implications for modeling sequence dependencies.
Self-Check Questions
- How is likelihood used to evaluate whether a sequence matches a motif model?
- In the worked example, which position contributes most strongly to the likelihood, and why?
- Why does the independence assumption simplify computation, and why is it problematic biologically?
- How does sliding-window likelihood scanning transform motif discovery into a signal detection problem?
- What key limitations of position-specific models motivate the transition to generative sequence models?