Shift Cipher Examples from Embers of Autoregression

Tom McCoy, Shunyu Yao, Dan Friedman, Matt Hardy, Tom Griffiths

This page provides some data from the paper Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve. The complete datasets used in the paper are available on the project's GitHub page.

Introduction

This page provides an easily searchable repository of the shift cipher data used in our paper Embers of Autoregression. As a quick summary of that work, here is the paper's abstract:

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that in order to develop a holistic understanding of these systems we need to consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. This approach - which we call the teleological approach - leads us to identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. We predict that LLMs will achieve higher accuracy when these probabilities are high than when they are low - even in deterministic settings where probability should not matter. To test our predictions, we evaluate two LLMs (GPT-3.5 and GPT-4) on eleven tasks, and we find robust evidence that LLMs are influenced by probability in the ways that we have hypothesized. In many cases, the experiments reveal surprising failure modes. For instance, GPT-4's accuracy at decoding a simple cipher is 51% when the output is a high-probability word sequence but only 13% when it is low-probability. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system - one that has been shaped by its own particular set of pressures.

Some of the paper's main demonstrations involve shift ciphers—simple codes in which each letter is shifted a fixed number of positions forward in the alphabet (wrapping around from "z" back to "a"). For example, with a shift of 1, "Hello world!" would become "Ifmmp xpsme!" This page provides all of the shift cipher stimuli that we used in our work, along with predictions from GPT-3.5 and GPT-4.
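As a concrete illustration of the cipher itself, here is a minimal Python sketch of shift-cipher encoding and decoding (an illustration only, not the code used in the paper's experiments):

    def shift_encode(text, shift):
        """Encode text with a shift cipher: each letter moves `shift`
        positions forward in the alphabet (wrapping around); all other
        characters are left unchanged."""
        shifted = []
        for ch in text:
            if ch.isalpha():
                base = ord("A") if ch.isupper() else ord("a")
                shifted.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                shifted.append(ch)
        return "".join(shifted)

    def shift_decode(text, shift):
        """Decoding is just encoding with the inverse shift."""
        return shift_encode(text, -shift)

    print(shift_encode("Hello world!", 1))   # -> Ifmmp xpsme!
    print(shift_decode("Ifmmp xpsme!", 1))   # -> Hello world!

Note that decoding with a shift of 13 (rot-13) is the same operation as encoding with it, since 13 + 13 = 26 wraps around to the original letter.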


How to use this page

The shift cipher examples are displayed at the bottom of the page. You can use the options in the blue box below to customize which examples are shown. Here are some examples of how you can use these options to observe some of the paper's main findings:

  1. Task frequency effects: There are 25 possible shifts since there are 26 letters in the alphabet. However, not all shifts are used equally often. We found that, in Internet text, the most common shifts are 13, 3, and 1. A shift of 13 is common because this cipher (which is sometimes called rot-13) is used in many online forums as a way to share information without spoilers. A shift of 3 is common because Julius Caesar famously used this cipher; therefore, many texts that describe shift ciphers mention Caesar's usage and include examples that use the specific convention he adopted of shifting by 3. Finally, a shift of 1 is common because it is the simplest shift cipher and is therefore a natural choice for illustrating the concept. We found that these three most common shifts are the only ones on which GPT-4 achieved non-negligible accuracy. To compare performance across shifts, you can select "All shift levels" and then pick a single sentence (under "additional options"). In many cases, you will see that GPT-4 gets the right answer for a shift of 1, 3, or 13, but not for other shifts.
  2. Output probability effects: When decoding text written in a shift cipher, GPT-4 performs better when the correct answer is a high-probability sentence than when it is a low-probability one. To observe this effect, choose a shift of 13 and then, under "additional options", select one specific sentence (e.g., "35") and "show all examples regardless of sentence probability".
  3. Input probability effects: When encoding with a shift cipher, GPT-4 performs better when the input is a high-probability sentence than when it is a low-probability one. To observe this effect, choose a shift of 13 and then, under "additional options", select "encoding", one specific sentence (e.g., "19"), and "show all examples regardless of sentence probability".
  4. Output regularization: When the correct answer is an implausible sentence but is similar to another sentence that is much more plausible, GPT-4 often produces the incorrect plausible sentence rather than the correct implausible one (an effect consistent with the hypothesis that GPT-4 favors high-probability outputs). To see examples, select a shift of 13 and, under "additional options", select "Show only targeted low-probability examples."
  5. Producing well-known sayings: For the shifts on which GPT-4 performs poorly, the incorrect answer that it produces is sometimes a well-known saying or quotation (as we would expect under the hypothesis that it favors sentences that are frequent). For instance, in many cases it produces some version of "To be or not to be"; to observe such cases, you can select "All shift levels" and then, under "additional options", enter "To be or" in the text box that restricts the display to examples where GPT-4's response contains a particular string. Alternatively, you can browse all outputs for one of the shift levels that GPT-4 does poorly on, such as 10 or 22.
  6. Effects of chain-of-thought prompting: We found that GPT-4 performed better with step-by-step or chain-of-thought prompting than with basic prompting, but its performance was still far from perfect. To compare different prompting strategies on a single stimulus, you can select one specific shift level such as "7", click "Complete GPT-4 output" to show the full chain of thought that GPT-4 produces, and then, under "additional options", select "All prompting styles" and one specific sentence, such as "56".
  7. Unfaithfulness to the chain of thought: When GPT-4 uses chain-of-thought prompting, the final answer that it produces often does not match the answer that would result from its chain of thought. The settings described in the previous point show an example of this.
  8. Mentioning ciphers: In many cases, GPT-4's output includes words relating to ciphers even though the correct answer does not. To observe examples of this, you can select "All shift levels" and then, under "additional options", enter "cipher" in the text box. (The code sketch after this list illustrates this kind of substring filtering.)
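
The filters described above (e.g., points 1, 5, and 8) amount to simple selections over the dataset. Here is a minimal Python sketch of that kind of filtering; the record structure and field names (shift, gpt4_output, etc.) are hypothetical placeholders, and the actual data format is documented on the project's GitHub page:

    # Hypothetical records for illustration; the real dataset (and its
    # field names) is available on the project's GitHub page.
    examples = [
        {"shift": 13, "prompt_style": "basic", "correct": "...", "gpt4_output": "..."},
        # ...
    ]

    def filter_examples(examples, shift=None, output_contains=None):
        """Return the examples that match an optional shift level and an
        optional substring of GPT-4's output."""
        matches = []
        for ex in examples:
            if shift is not None and ex["shift"] != shift:
                continue
            if output_contains is not None and output_contains not in ex["gpt4_output"]:
                continue
            matches.append(ex)
        return matches

    # Point 5: outputs containing a well-known saying.
    to_be_or = filter_examples(examples, output_contains="To be or")

    # Point 8: outputs that mention ciphers.
    mentions_cipher = filter_examples(examples, output_contains="cipher")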
Customize what information is displayed:
Shift, Complete prompt, Prompt style, Input, Sentence probability, Correct, GPT-3.5 output, GPT-4 output

Customize which shift is displayed:
All shift levels, or a single shift from 1 to 25

Check this box to show additional options


Examples

Warning: We have not manually reviewed all model outputs, so some of the text that is displayed may contain objectionable content.

Correct answer for all examples below: