5.10 Applying EM to Motif Discovery

Learning Objectives

After reading this section, you should be able to:

map the components of the EM algorithm to the motif discovery problem
compute how candidate subsequences are evaluated under motif and background models
understand how the E-step assigns probabilities to motif positions
explain how the M-step updates the motif model using weighted counts
describe how EM iteratively refines motif structure and positions

Returning to the Biological Problem

We now return to the problem that motivated this chapter: discovering motifs in biological sequences. Up to this point, we have developed a probabilistic representation of motifs, introduced the concept of hidden variables, and explored the Expectation Maximization algorithm in both intuitive and formal terms.

The central question is now how these ideas translate into a concrete method for identifying motifs in real sequence data.

The key insight is that the motif discovery problem fits naturally into the EM framework. The sequences correspond to observed data, the motif positions correspond to hidden variables, and the motif model corresponds to the parameters we wish to estimate.

Motif and Background as Competing Models

To apply EM, we first define a probabilistic model of how sequences are generated. Each sequence is assumed to contain a motif instance at some unknown position. The motif itself is described by a position probability matrix, which assigns probabilities to symbols at each position.

Outside the motif region, symbols are generated according to a background model. This model captures the general composition of the sequence and represents what we would expect to observe in the absence of a motif .

For any candidate subsequence, we can therefore compute two probabilities:

the probability that it was generated by the motif model
the probability that it was generated by the background model

These two models play roles analogous to the two coins in the coin toss example.

Evaluating Candidate Positions

Given a sequence, we consider all possible subsequences of length $w$ , obtained by sliding a window along the sequence. Each of these is a candidate motif instance.

For each candidate subsequence ( x ), we compute its probability under the motif model:

P(x \mid \text{motif}) = \prod_{j=1}^{w} p_j(x_j),

where $p_j(x_j)$ is the probability of observing symbol $x_j$ at position $j$ in the motif model .

We also compute its probability under the background model, which is typically based on overall nucleotide frequencies in the genome.

These probabilities allow us to compare how well each candidate subsequence is explained by the motif versus the background.

The Expectation Step in Motif Discovery

In the E-step, we compute, for each sequence and each candidate position, the probability that the motif begins at that position.

This is done by comparing the likelihood of the subsequence under the motif model with its likelihood under the background model. By normalizing these values, we obtain a probability distribution over all candidate positions.

As in the coin toss example, these are soft assignments. Each position contributes to the motif according to its probability, rather than being selected definitively.

This step transforms the sequences into a weighted representation of candidate motif occurrences.

The Maximization Step in Motif Discovery

In the M-step, we update the motif model using these probabilistic assignments.

Instead of constructing a PFM from a fixed set of aligned subsequences, we now compute expected counts. Each candidate subsequence contributes to the counts according to the probability that it represents a motif instance.

By aggregating these weighted counts across all sequences, we obtain a new estimate of the position frequency matrix, which is then normalized to yield an updated position probability matrix.

In this way, the motif model is refined based on the current estimate of motif positions.

The E-step and M-step are applied repeatedly. Initially, the motif model may be close to random, and the probabilities assigned to candidate positions may be diffuse. However, as the iterations proceed, consistent patterns begin to emerge.

Positions that are repeatedly supported by the data receive higher probability, and the motif model becomes increasingly specific. This, in turn, improves the ability of the model to distinguish motif instances from background sequence.

Through this iterative process, the algorithm gradually uncovers the underlying motif structure.

Connection to the Coin Toss Example

The analogy to the coin toss example is now direct:

motif vs. background corresponds to coin A vs. coin B
candidate subsequences correspond to sequences of coin tosses
motif positions correspond to hidden assignments
the E-step computes probabilities of motif occurrence
the M-step updates the motif model

In both cases, EM resolves uncertainty by alternating between probabilistic assignment and parameter estimation.

Practical Implications

This formulation highlights both the strength and the limitations of the approach. The EM algorithm provides a principled way to learn motifs from unaligned sequences, even when the signal is weak and embedded in noise.

At the same time, its success depends on factors such as the quality of the data, the strength of the motif signal, and the choice of the background model. These aspects will be examined more closely in the next section.

Self-Check Questions

How are motif discovery and the EM framework connected?
What roles do the motif model and background model play?
How is the probability of a candidate subsequence computed under the motif model?
What is computed in the E-step for motif discovery?
How are model parameters updated in the M-step?
How does the EM algorithm refine motif structure over iterations?

5.10 Applying EM to Motif Discovery

5.10 Applying EM to Motif Discovery

Learning Objectives

Returning to the Biological Problem

Motif and Background as Competing Models

Evaluating Candidate Positions

The Expectation Step in Motif Discovery

The Maximization Step in Motif Discovery

Iterative Refinement of Motif Structure

Connection to the Coin Toss Example

Practical Implications

Self-Check Questions