
6.1 From Motif Search to Generative Models of Biological Sequences


In the previous chapter, we approached motif discovery as a problem of identifying recurring patterns in biological sequences. We introduced position-specific models that allow us to represent such patterns probabilistically and to scan sequences for likely motif occurrences. While this already provides a powerful framework, it still rests on a simplifying assumption: that motifs can be described as static patterns with independent positions.

In practice, however, biological sequences are not generated by static templates. They arise from underlying biological processes that introduce variability, dependencies, and structural organization. To capture this, we now take a conceptual step forward. Instead of asking whether a sequence matches a motif, we ask a more fundamental question:

Which process could have generated this sequence?


From Pattern Matching to Model-Based Reasoning


Consider again the problem of motif discovery in DNA. A genome is a long sequence composed of nucleotides, within which short functional regions such as promoters, splice sites, or transcription factor binding sites are embedded. These regions are typically short and weakly conserved, and they are surrounded by large stretches of background sequence.

This motivates a simple but powerful abstraction. We assume that different parts of the sequence are generated by different underlying processes:

  • a motif-generating process, producing biologically meaningful patterns
  • a background process, producing non-specific sequence

Each position in the sequence is thus associated with one of these processes, although this assignment is not directly observable.


To make this idea operational, we quantify how well a sequence fits a given model using the concept of likelihood.

Given a model \( M \) and a sequence \( S \), the likelihood

\[ P(S \mid M) \]

is the probability of observing the sequence, assuming it was generated by the model.

In the case of position-specific models, the model is represented as a position probability matrix (PPM). Each position \( i \) specifies a probability distribution over nucleotides. Under the assumption of positional independence, the likelihood of a sequence of length \( L \) is given by

\[ P(S \mid M) = \prod_{i=1}^{L} P(x_i \mid M_i) \]

where \( x_i \) is the nucleotide at position \( i \), and \( M_i \) denotes the distribution at that position.


To illustrate this concretely, consider a motif model of length five described by the following position-specific probabilities:

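| Position | A   | C   | G   | T   |
|----------|-----|-----|-----|-----|
| 1        | 0.1 | 0.5 | 0.2 | 0.2 |
| 2        | 0.3 | 0.2 | 0.2 | 0.3 |
| 3        | 0.1 | 0.1 | 0.6 | 0.2 |
| 4        | 0.2 | 0.1 | 0.5 | 0.2 |
| 5        | 0.2 | 0.6 | 0.1 | 0.1 |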

Now consider the sequence

\[ S = \text{GAGGT} \]

We compute its likelihood under the model by multiplying the position-wise probabilities:

  • Position 1: \( P(G) = 0.2 \)
  • Position 2: \( P(A) = 0.3 \)
  • Position 3: \( P(G) = 0.6 \)
  • Position 4: \( P(G) = 0.5 \)
  • Position 5: \( P(T) = 0.1 \)

Thus,

\[ P(S \mid M) = 0.2 \cdot 0.3 \cdot 0.6 \cdot 0.5 \cdot 0.1 = 0.0018 \]

This value quantifies how compatible the sequence is with the motif model. A higher likelihood indicates a better match.
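The worked example can be reproduced with a few lines of Python. The following is a minimal sketch, not a reference implementation: the names `PPM`, `likelihood`, and `log_likelihood` are illustrative choices, and the matrix values are copied from the table above. The log-space variant is included because products of many small probabilities can underflow for longer motifs.

```python
import math

# Position probability matrix from the table above, one dict per motif position.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def likelihood(seq, ppm):
    """Return P(S | M) as the product of the position-wise probabilities."""
    if len(seq) != len(ppm):
        raise ValueError("sequence length must equal motif length")
    p = 1.0
    for nucleotide, column in zip(seq, ppm):
        p *= column[nucleotide]
    return p


def log_likelihood(seq, ppm):
    """Return log P(S | M) as a sum of logs (numerically safer for long motifs)."""
    if len(seq) != len(ppm):
        raise ValueError("sequence length must equal motif length")
    return sum(math.log(column[n]) for n, column in zip(seq, ppm))


p = likelihood("GAGGT", PPM)
print(f"P(GAGGT | M) = {p:.4f}")
```

Working with `log_likelihood` gives the same ranking of sequences, since the logarithm is monotonic.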


This calculation enables a simple scanning procedure. Given a long sequence, we slide a window of length five along it and compute the likelihood for each subsequence. This produces a function

\[ \text{position} \mapsto P(S_{i:i+L-1} \mid M) \]

that can be interpreted as a likelihood landscape.

Regions where the likelihood is high correspond to subsequences that are well explained by the motif model. In practice, motif discovery can therefore be framed as identifying peaks in this landscape.

This perspective is conceptually powerful: rather than directly searching for patterns, we evaluate how well a generative model explains different parts of the sequence.
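The sliding-window scan can be sketched as follows. This is an illustrative example under stated assumptions: the input sequence `SEQUENCE` is made up for demonstration, and the helper names `window_likelihood` and `scan` are not taken from any library. The matrix is the five-position model from the worked example.

```python
# Position probability matrix from the worked example, one dict per motif position.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]

# Hypothetical input sequence, chosen only to illustrate the scan.
SEQUENCE = "ATTCAGGCTAGAGGTTA"


def window_likelihood(window, ppm):
    """P(window | M) under positional independence."""
    p = 1.0
    for nucleotide, column in zip(window, ppm):
        p *= column[nucleotide]
    return p


def scan(seq, ppm):
    """Return the likelihood of every length-L window, indexed by start position."""
    width = len(ppm)
    return [
        window_likelihood(seq[start:start + width], ppm)
        for start in range(len(seq) - width + 1)
    ]


landscape = scan(SEQUENCE, PPM)
best_start = max(range(len(landscape)), key=landscape.__getitem__)
best_window = SEQUENCE[best_start:best_start + len(PPM)]
print(best_start, best_window, landscape[best_start])
```

The list `landscape` is the likelihood landscape described above, and its largest entry marks the window that the motif model explains best. In a real application one would typically report every window above a chosen threshold rather than only the single maximum.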


Despite its elegance, this approach has important limitations.

First, it assumes that positions are independent. From a biological perspective, this is rarely true. Structural and functional constraints introduce dependencies between positions, particularly in protein sequences and regulatory elements.

Second, the model assumes a fixed motif length. Insertions and deletions, which are common in biological sequences, cannot be handled naturally. Even a single insertion can drastically reduce the likelihood of an otherwise valid motif instance.

Finally, the model does not explicitly represent transitions between different regions of a sequence. It treats motifs as isolated objects rather than as parts of a larger generative process.
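The effect of the fixed-length assumption can be made concrete with a small experiment. The sketch below is only an illustration: it takes the highest-scoring sequence under the example matrix, `CAGGC`, inserts a single hypothetical `T` after the second position, and compares the likelihood of the intact instance with the best window that still overlaps the disrupted one. The helper name `window_likelihood` is an illustrative choice.

```python
# Position probability matrix from the worked example, one dict per motif position.
PPM = [
    {"A": 0.1, "C": 0.5, "G": 0.2, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
    {"A": 0.2, "C": 0.1, "G": 0.5, "T": 0.2},
    {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1},
]


def window_likelihood(window, ppm):
    """P(window | M) under positional independence."""
    p = 1.0
    for nucleotide, column in zip(window, ppm):
        p *= column[nucleotide]
    return p


intact = "CAGGC"          # highest-scoring sequence under this matrix
with_insertion = "CATGGC"  # same instance with one T inserted (hypothetical)

width = len(PPM)
p_intact = window_likelihood(intact, PPM)
p_windows = [
    window_likelihood(with_insertion[start:start + width], PPM)
    for start in range(len(with_insertion) - width + 1)
]
p_best_after_insertion = max(p_windows)

print(p_intact, p_windows, p_intact / p_best_after_insertion)
```

No fixed-length window can align both halves of the disrupted instance with their intended columns, so every window scores well below the intact motif.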


These limitations motivate a more expressive framework. Instead of describing motifs as static probability tables, we model the process that generates sequences.

In such a model:

  • the sequence is generated step by step
  • at each step, the system is in a certain internal state
  • the state determines which symbols are likely to be emitted
  • the system can transition between states over time

Crucially, these internal states are not directly observable. We only see the emitted sequence, not the underlying process.
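To make the step-by-step picture tangible, here is a toy simulation of such a process with two states. Everything in it is an assumption made for illustration: the state names, the transition probabilities in `TRANSITIONS`, and the emission probabilities in `EMISSIONS` are invented and not estimated from data.

```python
import random

# Hypothetical two-state process; all probabilities are illustrative assumptions.
TRANSITIONS = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.3, "motif": 0.7},
}
EMISSIONS = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.1, "C": 0.1, "G": 0.6, "T": 0.2},
}


def draw(distribution, rng):
    """Sample one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(length, rng, start_state="background"):
    """Generate a state path and the symbols emitted along it."""
    state = start_state
    states, symbols = [], []
    for _ in range(length):
        states.append(state)                          # current internal state
        symbols.append(draw(EMISSIONS[state], rng))   # state determines the emission
        state = draw(TRANSITIONS[state], rng)         # move to the next state
    return states, "".join(symbols)


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden_states, observed = generate(20, rng)
print(observed)        # what an observer sees
print(hidden_states)   # what stays unobserved in practice
```

In a real setting only `observed` would be available, and the state path would have to be inferred from it.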


This leads to the central idea of this chapter.

We model biological sequences as being generated by a system that moves through a sequence of hidden states, each of which emits observable symbols. Some states may correspond to biologically meaningful regions, such as promoters, while others represent background sequence.

Because we observe only the emitted symbols and never the state sequence itself, the states are called hidden.

This type of model is known as a Hidden Markov Model (HMM).


With this shift, motif discovery becomes a problem of inferring hidden structure rather than simply detecting patterns.

We are no longer only interested in whether a motif occurs, but in reconstructing:

  • which regions belong to which functional class
  • how likely a sequence is under a given model
  • how to learn such models from data

In the following sections, we will develop this framework step by step, beginning with the Markov property and its implications for modeling sequence dependencies.


  1. How is likelihood used to evaluate whether a sequence matches a motif model?
  2. In the worked example, which position contributes most strongly to the likelihood, and why?
  3. Why does the independence assumption simplify computation, and why is it problematic biologically?
  4. How does sliding-window likelihood scanning transform motif discovery into a signal detection problem?
  5. What key limitations of position-specific models motivate the transition to generative sequence models?