5.6 The EM Idea

5.6 The EM Idea (Conceptual)

Learning Objectives

After reading this section, you should be able to:

explain why the motif discovery problem cannot be solved directly
understand the idea of replacing hard assignments with probabilistic reasoning
describe the roles of the Expectation step and Maximization step
explain how EM resolves circular dependencies iteratively
relate the EM idea back to motif discovery

A Circular Problem Without a Direct Solution

Returning to the motif discovery problem, we are faced with a fundamental obstacle. To estimate the motif model, we need to know where the motif occurs in each sequence. At the same time, to determine these positions, we need a model of the motif. This circular dependency prevents us from applying standard estimation procedures directly.

If either component were known, the problem would be straightforward. Given the motif positions, we could construct the model using simple frequency counts. Given the model, we could identify the most likely positions by scanning the sequences. However, because neither is available, we are effectively caught between two incomplete solutions.

At this point, we are stuck. The problem cannot be resolved by a single direct computation. Instead, we require a strategy that allows us to gradually refine both components.

From Hard Assignments to Probabilistic Reasoning

A natural but problematic approach would be to guess the motif positions and proceed as if those guesses were correct. However, such hard assignments are fragile. An incorrect early decision can bias the model and lead the entire process away from the true solution.

The key idea is to avoid committing too early. Instead of making definitive decisions about motif positions, we represent our uncertainty explicitly. Each candidate position is assigned a probability that reflects how well it is supported by the current model. In this way, we replace hard assignments with probabilistic reasoning.

This shift is crucial. Rather than selecting a single explanation, we consider all possible explanations and weight them according to their plausibility.

Estimating Hidden Structure: The Expectation Step

Given an initial guess of the model parameters, we can evaluate how well each candidate position explains the observed sequences. This leads to a probability distribution over possible motif positions for each sequence.

Positions that align well with the current motif model receive higher probabilities, while less consistent positions receive lower ones. Importantly, all positions remain possible, but they contribute differently depending on their likelihood.

This step corresponds to estimating the hidden structure of the data. Because we compute expected contributions rather than making definitive assignments, each candidate subsequence contributes proportionally to its probability. This is known as the Expectation step, or E-step.

Updating the Model: The Maximization Step

Once we have these probabilistic assignments, we use them to update the model parameters. Instead of counting symbols from a fixed set of motif instances, we compute weighted counts, where each candidate subsequence contributes according to the probability that it represents a true motif occurrence.

Using these weighted counts, we re-estimate the position-specific probabilities of the motif model. This step treats the expected values from the E-step as if they were observed data and selects parameters that best explain them. Because this involves optimizing the model, it is called the Maximization step, or M-step.

An Iterative Resolution of the Circular Dependency

Individually, neither the E-step nor the M-step solves the problem completely. However, when combined in an iterative procedure, they provide a powerful mechanism for resolving the circular dependency.

We begin with an initial guess of the model parameters. Using this model, we estimate the hidden variables. These estimates are then used to update the model, which in turn leads to improved estimates of the hidden variables. This process is repeated until the model stabilizes.

At first, both the assignments and the model may be uncertain. However, as the iterations proceed, consistent patterns begin to emerge. Positions that are repeatedly supported by the data gain higher probability, and the model becomes increasingly specific. Through this feedback loop, the model and the inferred structure gradually reinforce each other.

This iterative refinement is the core idea of the Expectation Maximization algorithm. It transforms a problem that cannot be solved directly into a sequence of manageable steps that progressively improve the solution.

In the next section, we will make this idea concrete. By examining a simple example in detail, we will see how the alternating steps of expectation and maximization operate numerically and how they uncover hidden structure in data.

Self-Check Questions

Why can the motif discovery problem not be solved directly using standard methods?
What is the problem with making hard assignments of motif positions early in the process?
What is the key idea behind replacing hard assignments with probabilistic reasoning?
What is computed in the Expectation step, and what does it represent?
What is updated in the Maximization step, and how is it computed?
How does the iterative combination of E-step and M-step resolve the circular dependency?