Design for a local branch predictor
This post is about computer science, and in particular about branch prediction. It describes a local branch predictor with just eight bits of state, which exhibits some sophisticated behaviour.
This is an idea I wrote up years ago, but never published until now.
Branch prediction
A branch predictor is a component of a CPU. It is an optimisation; if you remove it the CPU will behave correctly, but will go slowly. The purpose of a branch predictor is to guess which way conditional branches ("if" statements) will go. The desired outcome is a sequence of `0`s and `1`s, where `1` indicates that a branch will be taken and `0` indicates that it will be skipped.
If the predictions are mostly correct, the CPU can speculatively fetch and execute instructions far in advance of the program counter. This allows instructions to be executed in parallel that would otherwise have to be executed in series.
When a conditional branch instruction is executed, its outcome is compared to the predicted outcome. If the prediction is correct, there is nothing further to do, and execution can continue. Otherwise all instructions speculated after that point are discovered to be nonsense. Their results must be discarded and their side-effects reversed. Then, execution restarts after the mispredicted branch.
Mispredictions are expensive. However, executing without a branch predictor is just as slow as executing with a bad branch predictor that gets every prediction wrong. Branch prediction is therefore always beneficial, though implementing it can be costly, depending on the algorithm used.
Common branch prediction algorithms
Static
The simplest kind of branch predictor is a static predictor. One idea is to predict that backward branches will be taken and forward branches will not. This results in more than 50% of predictions being correct, but it is easy to do better.
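As a sketch (not modelled on any particular CPU), backward-taken/forward-not-taken prediction amounts to a single address comparison:

```c
#include <stdbool.h>
#include <stdint.h>

/* Static prediction: backward branches (loops) are guessed "taken",
   forward branches are guessed "not taken". Purely illustrative. */
static bool static_predict(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```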
Local
Most real CPUs include a local branch predictor, which attempts to predict the outcome of a branch instruction based on the previous history of that branch instruction. A naive idea is to predict that the branch will be taken if it was taken previously, and skipped if it was skipped previously. Branches that have not recently been executed must be predicted using a static predictor.
The naive local predictor can easily be fooled. For example, imagine a branch instruction at the end of a loop that typically performs three iterations. The desired outcome of this particular branch instruction is a sequence `110110110...`. A static predictor would predict `111111111...`: 67% correct. The naive local predictor would predict `011011011...`: only 33% correct.
A better predictor will monitor its own performance and update its predictions more cautiously. A common idea is to keep a saturating counter for each branch instruction, and to increment it when the branch is taken and decrement it when the branch is skipped.
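A two-bit saturating counter is the classic example. A sketch, with the convention that the upper half of the counter range predicts "taken" (this is the common textbook scheme, not the design described later in this post):

```c
#include <stdbool.h>
#include <stdint.h>

/* Two-bit saturating counter, one per branch instruction: values 0..3,
   where 0..1 predict "not taken" and 2..3 predict "taken". */
static bool counter_predict(uint8_t c) {
    return c >= 2;
}

static void counter_update(uint8_t *c, bool taken) {
    if (taken  && *c < 3) (*c)++;   /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;   /* saturate at 0 */
}
```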
An even better predictor could decide when a more sophisticated algorithm is needed. A common idea is to use static prediction until the first misprediction, then to predict the opposite outcome until the second misprediction, and to switch to a history-based predictor thereafter.
History-based
Some successful (and expensive) branch prediction algorithms in use in real CPUs are based on the global branch history. The most recent (say) 14 branch outcomes are concatenated into a 14-bit number, which is used to index a table of predictions. When a misprediction occurs, the table is modified in the hope of correcting the prediction next time.
A history-based predictor is typically only used when local prediction fails. It does not store information about particular branch instructions, but only about particular branch histories. It is able to exploit correlations between near-consecutive branch instructions, or between repeated executions of the same branch instruction.
For each entry in the table (corresponding to a 14-bit branch history) an algorithm is needed to make predictions based on the outcomes when that history has occurred previously. Again, a saturating counter is a common choice. The design requirements of this algorithm are essentially the same as those of a local branch predictor.
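A minimal sketch of such a predictor, assuming a 14-bit global history register indexing a table of two-bit saturating counters (the sizes and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 14
#define TABLE_SIZE   (1u << HISTORY_BITS)

static uint16_t global_history;            /* the last 14 branch outcomes */
static uint8_t  counters[TABLE_SIZE];      /* two-bit saturating counters */

static bool history_predict(void) {
    return counters[global_history] >= 2;  /* high counter => predict taken */
}

static void history_update(bool taken) {
    uint8_t *c = &counters[global_history];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    /* shift the new outcome into the global history */
    global_history = (uint16_t)(((global_history << 1) | (taken ? 1 : 0))
                                & (TABLE_SIZE - 1));
}
```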
Thus, it is desirable to invent a good local branch predictor, as the design can also be used in history-based predictors.
State
Because the state of a local branch predictor must be replicated for every branch instruction (or for every branch history), it is desirable to keep the state as small as possible. The predictor described here has eight bits of state:
- `b`, a bimodal flag. This one-bit flag indicates whether the branch has recently been "mostly taken" or "mostly not taken".
- `h`, a history register. The recent history of the branch is dictionary-compressed into three bits.
- `t`, a pattern table. Intuitively, this is a mapping from histories to predictions. However, only a few mappings are allowed, and the table is dictionary-compressed into three bits.
- `s`, a "sticky" flag. This one-bit flag is set to the actual outcome of the most recent mis-predicted branch, and in one other situation. On a subsequent misprediction, update of `t` is suppressed unless the misprediction is in the same direction as `s`. This helps the predictor to ignore rare events.
The use of a history register and a pattern table makes this local predictor look a bit like a miniature history-based predictor. A true history-based predictor with a 3-bit history would have way more than eight bits of state: probably at least 19 bits (`3 + (2<<3)`). The predictor described here makes much more efficient use of limited resources.
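Packed into C bit-fields, the per-branch state might look like this (a sketch; the field names follow the list above):

```c
#include <stdint.h>

/* Eight bits of predictor state per branch (or per history-table entry). */
struct predictor_state {
    uint8_t b : 1;   /* bimodal flag: "mostly taken" or "mostly not taken" */
    uint8_t h : 3;   /* history register, dictionary-compressed into 8 values */
    uint8_t t : 3;   /* pattern table, dictionary-compressed into 8 tables */
    uint8_t s : 1;   /* sticky flag: outcome of the last relevant misprediction */
};
```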
The bimodal flag
The prediction algorithm has a symmetry with respect to replacing all "taken" branches with "not-taken" branches and vice versa. This symmetry is captured by the flag `b`. Toggling `b` negates all future predictions. To put it another way, the rest of the predictor does not predict whether the branch will be taken or not taken, but instead predicts whether the branch outcome will agree or disagree with `b`.
The value of `b` intuitively captures the most common outcome of the branch. `b` is set when the branch is taken three times in a row, and cleared when it is skipped three times in a row. The decision to toggle `b` is based on the value of the history register (see below).
For chaotic branches, which are taken roughly 50% of the time, `b` can get toggled quite often. In such situations, `b` can be imagined as an extra bit concatenated to the history register.
The history register
Assuming the predictor manages to learn a pattern of branch outcomes, the history register is needed to remember where it has got to in the pattern. To learn a pattern that repeats after `n` branches, at least `n` distinct values of the history register are needed. Any pattern which causes the history register to repeat a value will suffer at least one mis-prediction as a result, though it may still be possible to predict the rest of the pattern correctly.
Often patterns are strongly biased towards taking the branch, or towards not taking the branch. For example, the branch at the end of a tight loop is almost always taken. To capture such patterns, the set of possible values for the history register is biased towards histories in which most branches match the bimodal flag `b`.
The history register has eight possible values:
- `A`: The bimodal flag has been right six or more times in a row.
- `5`: The bimodal flag has been right five times, but was wrong before that.
- `4`: The bimodal flag has been right four times, but was wrong before that.
- `3`: The bimodal flag has been right three times, but was wrong before that.
- `2`: The bimodal flag has been right twice, but was wrong before that.
- `B`: The bimodal flag was right last time but wrong before that.
- `C`: The bimodal flag was wrong last time but right before that.
- `D`: The bimodal flag has been wrong twice, but was right before that.
The numbered states can be used to track loops of length up to six, provided the branch outcome differs from `b` only once during the pattern. Longer loops are approximated as infinite loops (using state `A`), and therefore always incur one mis-prediction at the end of the loop.
The states `B`, `C` and `D` can be used to track short patterns with a more chaotic mixture of taken and not-taken branches. Using these history states, the predictor can track patterns such as `011011011...` and `001100110011...`.
In summary, the transition table for the history register is as follows (depending on whether the actual outcome of the branch agrees or disagrees with `b`):
Old state      A 5 4 3 2 B C D
------------   ---------------
If agrees      A A 5 4 3 2 B B
If disagrees   C C C C C C D 3
The bottom right entry `3` deserves some explanation. The history value `D` indicates that the last two branch outcomes disagreed with `b`. If a third branch outcome disagrees with `b`, then `b` is toggled, as explained above. At this point history needs to be rewritten, since now the last three outcomes agree with the new value of `b`. Therefore, the new value of the history should be `3`.
The prediction table
The purpose of the prediction table is to map history values to branch predictions. It is the part of the state which represents the pattern that the predictor has learned.
The simplest design would be to store a prediction for each of the eight possible history values. This would make a total of 256 states, and would occupy eight bits of storage. However:
- Many of those 256 states predict exactly the same pattern, because they do not use all the possible history values (in the absence of mis-predictions).
- Some patterns are a priori improbable. For example, patterns that are both long and chaotic are unlikely to occur, as are patterns which toggle `b`.
Therefore, it is possible to make simplifications and approximations to shrink the set of states, and to fit it into a smaller number of bits. The predictor only allows prediction tables with the following properties:
1. The state `A` always predicts `b`, i.e. that the run of outcomes that agree with the bimodal flag will continue.
2. The state `D` always predicts `b`, i.e. that the bimodal flag will not toggle.
3. If any numbered state predicts `b`, then all lower-numbered states also predict `b`, otherwise the higher-numbered state would never be reached.
4. If states `B` or `C` predict `not b` then the pattern will be chaotic, so it is unlikely to be long, so state `2` also predicts `not b`.
Two further properties follow logically from the above:
- From property 3: if any numbered state predicts `not b`, then all higher-numbered states also predict `not b`.
- From properties 3 and 4: if any numbered state predicts `b`, then states `B` and `C` also predict `b`.
There are eight prediction tables satisfying these rules. Each predicts a particular pattern of branch outcomes. Writing "0" to mean "agrees with the bimodal flag", and "1" to mean "disagrees with the bimodal flag", the eight possible tables and their predicted patterns are as follows:
A 5 4 3 2 B C D   Pattern     Name
---------------   ---------   ----
0 0 0 0 0 0 0 0   0...        Z
0 1 0 0 0 0 0 0   000001...   6
0 1 1 0 0 0 0 0   00001...    5
0 1 1 1 0 0 0 0   0001...     4
0 1 1 1 1 0 0 0   001...      3
0 1 1 1 1 0 1 0   0011...     X
0 1 1 1 1 1 0 0   01...       2
0 1 1 1 1 1 1 0   011...      Y
The set of patterns that can be learned includes all patterns up to length 4 (including the infinite loop) and simple loops of length 5 and 6. Note that this in some cases exceeds the capabilities of a history-based predictor with a 3-bit history. The bottom four patterns can also be represented with `b` toggled.
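Continuing the C sketch, each of the eight tables can be stored as an eight-bit mask over the history values, so that making a prediction is a single lookup (the encoding is mine):

```c
/* Pattern tables, in the row order used above. */
enum table { T_Z, T_6, T_5, T_4, T_3, T_X, T_2, T_Y };

/* Bit i of patterns[t] is 1 if table t predicts "disagrees with b" for
   history value i, with i indexed as A,5,4,3,2,B,C,D = 0..7 (matching
   enum history above). */
static const uint8_t patterns[8] = {
    /* Z */ 0x00,   /* A=0 5=0 4=0 3=0 2=0 B=0 C=0 D=0 */
    /* 6 */ 0x02,   /* A=0 5=1 4=0 3=0 2=0 B=0 C=0 D=0 */
    /* 5 */ 0x06,   /* A=0 5=1 4=1 3=0 2=0 B=0 C=0 D=0 */
    /* 4 */ 0x0e,   /* A=0 5=1 4=1 3=1 2=0 B=0 C=0 D=0 */
    /* 3 */ 0x1e,   /* A=0 5=1 4=1 3=1 2=1 B=0 C=0 D=0 */
    /* X */ 0x5e,   /* A=0 5=1 4=1 3=1 2=1 B=0 C=1 D=0 */
    /* 2 */ 0x3e,   /* A=0 5=1 4=1 3=1 2=1 B=1 C=0 D=0 */
    /* Y */ 0x7e,   /* A=0 5=1 4=1 3=1 2=1 B=1 C=1 D=0 */
};

/* Predict the branch outcome: b or its negation, as the table dictates. */
static int predict(int b, enum history h, enum table t) {
    int disagree = (patterns[t] >> h) & 1;
    return disagree ? !b : b;
}
```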
Learning
The history register is updated every time the predictor is used, as described above. This advances through a fixed repeating sequence of predictions. As long as the predictions turn out to be correct, the sequence will not change.
The update rule for the bimodal flag is also described above. It occurs only rarely, when the pattern changes from "mostly taken" to "mostly skipped" or vice-versa.
It remains to explain how to update the prediction table and the sticky bit.
Updating the prediction table
When the branch predictor makes an incorrect prediction, and if the misprediction is in the same direction as the sticky flag, the prediction table is modified so as to avoid making the same mistake again.
The table is changed in such a way that the prediction changes for only one history value at a time. The allowed transitions are as follows:
                2 - Y
                |   |
Z - 6 - 5 - 4 - 3 - X
The transition that is chosen depends on the history state in which the mis-prediction occurred, as follows:
History   Z 6 5 4 3 X 2 Y
-------   ---------------
5         6 Z 6 5 4 3 3 X
4         6 5 6 5 4 3 3 X
3         6 5 4 5 4 3 3 X
2         6 5 4 3 4 3 3 X
B         6 5 4 3 2 Y 3 X
C         6 5 4 3 X 3 Y 2
Mis-prediction in history state `A` has no effect. Mis-prediction in history state `D` causes the bimodal flag `b` to toggle. When `b` toggles, regardless of the value of the sticky flag, the table state must be mapped as follows, in order to preserve as much information as possible:
History   Z 6 5 4 3 X 2 Y
-------   ---------------
D         Y Y Y Y Y X 2 3
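Continuing the sketch, both of these tables can be transcribed into one lookup array, with row `A` as the identity and row `D` as the remapping applied when `b` toggles:

```c
/* New table after a misprediction in history state h with table t.
   Rows are indexed by history (A,5,4,3,2,B,C,D), columns by table
   (Z,6,5,4,3,X,2,Y), exactly as in the tables above. */
static const enum table table_update[8][8] = {
    /* A */ { T_Z, T_6, T_5, T_4, T_3, T_X, T_2, T_Y },   /* no effect  */
    /* 5 */ { T_6, T_Z, T_6, T_5, T_4, T_3, T_3, T_X },
    /* 4 */ { T_6, T_5, T_6, T_5, T_4, T_3, T_3, T_X },
    /* 3 */ { T_6, T_5, T_4, T_5, T_4, T_3, T_3, T_X },
    /* 2 */ { T_6, T_5, T_4, T_3, T_4, T_3, T_3, T_X },
    /* B */ { T_6, T_5, T_4, T_3, T_2, T_Y, T_3, T_X },
    /* C */ { T_6, T_5, T_4, T_3, T_X, T_3, T_Y, T_2 },
    /* D */ { T_Y, T_Y, T_Y, T_Y, T_Y, T_X, T_2, T_3 },   /* b toggles  */
};
```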
The sticky flag
The purpose of the sticky flag is to prevent the table being updated when it is already doing well. It comes into play when the pattern is too complicated for the predictor to learn it exactly. In such circumstances, prediction is usually better if the table is held constant than if it is regularly changed.
For example, the sequence `00101...` can be approximated by table `2` (`01...`) or by table `3` (`001...`), and either way the predictor will make four correct predictions out of five, as long as the table is not updated. However, if the table is updated after every misprediction, it will oscillate between table `2` and table `3`, and as a result will only make three correct predictions out of five.
The rules are as follows:
- If a mis-prediction occurs in the same direction as the sticky flag, then the table is updated as above and the sticky flag is preserved.
- If a mis-prediction occurs in the opposite direction to the sticky flag, then the sticky flag is toggled to allow a future update to the table, but the table is not updated.
- If a correct prediction occurs, and if the state, history and sticky flag are such that a mis-prediction would have changed the prediction, then the sticky flag is toggled to prevent a future update to the table.
In summary, the sticky flag is written on every misprediction and on some correct predictions, and it is always set to the actual outcome of the branch. The (history, table) pairs in which a correct prediction causes a write to the sticky flag are defined by the third rule above, but they deserve to be tabulated:
History   Z 6 5 4 3 X 2 Y
-------   ---------------
5         * *
4           * *
3             * *
2               * *
B                 * * * *
C                 * * * *
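The starred entries are the cases in which the table update above would flip the table's prediction for the history state in question. Continuing the sketch, the sticky-flag rule can then be written compactly (the function names are mine):

```c
/* Would a misprediction in history state h flip table t's prediction for h?
   This reproduces the starred entries in the table above. */
static bool update_changes_prediction(enum history h, enum table t) {
    enum table t2 = table_update[h][t];
    return ((patterns[t] ^ patterns[t2]) >> h) & 1;
}

/* Sticky flag: written (with the actual outcome) on every misprediction,
   and on the correct predictions marked with a star above. */
static void sticky_step(struct predictor_state *p, int outcome, bool mispredicted) {
    if (mispredicted ||
        update_changes_prediction((enum history)p->h, (enum table)p->t))
        p->s = outcome ? 1 : 0;
}
```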
Summary
A good local branch predictor can also be used as part of a history-based predictor. Typical designs use between two and four bits of state and exhibit rather simple behaviours. In contrast, the design proposed here requires eight bits of state, and exhibits some quite sophisticated behaviour. It can correctly predict tight loops of length up to six, and arbitrary patterns of length up to four. It uses a sticky bit to detect and ignore rare cases. In chaotic situations, it behaves a bit like a saturating counter with six states.
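To tie the pieces together, here is one way the per-branch step could be assembled from the sketches above; the ordering of the sub-updates is my reading of the rules, not something spelled out explicitly in this post:

```c
/* One complete predictor step: return the prediction for this execution of
   the branch, then learn from the actual outcome (0 = skipped, 1 = taken). */
static int predictor_step(struct predictor_state *p, int outcome) {
    int  prediction   = predict(p->b, (enum history)p->h, (enum table)p->t);
    bool mispredicted = (prediction != outcome);

    /* Pattern table: row D always applies (b is about to toggle); otherwise
       the table only changes when the misprediction matches the sticky flag. */
    if (mispredicted && (p->h == H_D || outcome == p->s))
        p->t = table_update[p->h][p->t];

    /* Sticky flag (see the sticky-flag section). */
    sticky_step(p, outcome, mispredicted);

    /* History register, toggling b on a third consecutive disagreement. */
    int b = p->b;
    p->h = history_step((enum history)p->h, &b, outcome);
    p->b = b ? 1 : 0;

    return prediction;
}
```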
Last updated 2025/04/19