5.10 Applying EM to Motif Discovery
5.10 Applying EM to Motif Discovery
Section titled “5.10 Applying EM to Motif Discovery”Learning Objectives
Section titled “Learning Objectives”After reading this section, you should be able to:
- map the components of the EM algorithm to the motif discovery problem
- compute how candidate subsequences are evaluated under motif and background models
- understand how the E-step assigns probabilities to motif positions
- explain how the M-step updates the motif model using weighted counts
- describe how EM iteratively refines motif structure and positions
Returning to the Biological Problem
Section titled “Returning to the Biological Problem”We now return to the problem that motivated this chapter: discovering motifs in biological sequences. Up to this point, we have developed a probabilistic representation of motifs, introduced the concept of hidden variables, and explored the Expectation Maximization algorithm in both intuitive and formal terms.
The central question is now how these ideas translate into a concrete method for identifying motifs in real sequence data.
The key insight is that the motif discovery problem fits naturally into the EM framework. The sequences correspond to observed data, the motif positions correspond to hidden variables, and the motif model corresponds to the parameters we wish to estimate.
Motif and Background as Competing Models
Section titled “Motif and Background as Competing Models”To apply EM, we first define a probabilistic model of how sequences are generated. Each sequence is assumed to contain a motif instance at some unknown position. The motif itself is described by a position probability matrix, which assigns probabilities to symbols at each position.
Outside the motif region, symbols are generated according to a background model. This model captures the general composition of the sequence and represents what we would expect to observe in the absence of a motif .
For any candidate subsequence, we can therefore compute two probabilities:
- the probability that it was generated by the motif model
- the probability that it was generated by the background model
These two models play roles analogous to the two coins in the coin toss example.
Evaluating Candidate Positions
Section titled “Evaluating Candidate Positions”Given a sequence, we consider all possible subsequences of length , obtained by sliding a window along the sequence. Each of these is a candidate motif instance.
For each candidate subsequence ( x ), we compute its probability under the motif model:
where is the probability of observing symbol at position in the motif model .
We also compute its probability under the background model, which is typically based on overall nucleotide frequencies in the genome.
These probabilities allow us to compare how well each candidate subsequence is explained by the motif versus the background.
The Expectation Step in Motif Discovery
Section titled “The Expectation Step in Motif Discovery”In the E-step, we compute, for each sequence and each candidate position, the probability that the motif begins at that position.
This is done by comparing the likelihood of the subsequence under the motif model with its likelihood under the background model. By normalizing these values, we obtain a probability distribution over all candidate positions.
As in the coin toss example, these are soft assignments. Each position contributes to the motif according to its probability, rather than being selected definitively.
This step transforms the sequences into a weighted representation of candidate motif occurrences.
The Maximization Step in Motif Discovery
Section titled “The Maximization Step in Motif Discovery”In the M-step, we update the motif model using these probabilistic assignments.
Instead of constructing a PFM from a fixed set of aligned subsequences, we now compute expected counts. Each candidate subsequence contributes to the counts according to the probability that it represents a motif instance.
By aggregating these weighted counts across all sequences, we obtain a new estimate of the position frequency matrix, which is then normalized to yield an updated position probability matrix.
In this way, the motif model is refined based on the current estimate of motif positions.
Iterative Refinement of Motif Structure
Section titled “Iterative Refinement of Motif Structure”The E-step and M-step are applied repeatedly. Initially, the motif model may be close to random, and the probabilities assigned to candidate positions may be diffuse. However, as the iterations proceed, consistent patterns begin to emerge.
Positions that are repeatedly supported by the data receive higher probability, and the motif model becomes increasingly specific. This, in turn, improves the ability of the model to distinguish motif instances from background sequence.
Through this iterative process, the algorithm gradually uncovers the underlying motif structure.
Connection to the Coin Toss Example
Section titled “Connection to the Coin Toss Example”The analogy to the coin toss example is now direct:
- motif vs. background corresponds to coin A vs. coin B
- candidate subsequences correspond to sequences of coin tosses
- motif positions correspond to hidden assignments
- the E-step computes probabilities of motif occurrence
- the M-step updates the motif model
In both cases, EM resolves uncertainty by alternating between probabilistic assignment and parameter estimation.
Practical Implications
Section titled “Practical Implications”This formulation highlights both the strength and the limitations of the approach. The EM algorithm provides a principled way to learn motifs from unaligned sequences, even when the signal is weak and embedded in noise.
At the same time, its success depends on factors such as the quality of the data, the strength of the motif signal, and the choice of the background model. These aspects will be examined more closely in the next section.
Self-Check Questions
Section titled “Self-Check Questions”- How are motif discovery and the EM framework connected?
- What roles do the motif model and background model play?
- How is the probability of a candidate subsequence computed under the motif model?
- What is computed in the E-step for motif discovery?
- How are model parameters updated in the M-step?
- How does the EM algorithm refine motif structure over iterations?