A 175-Billion-Parameter Goldfish

The problem and opportunity of language model context.

February 16, 2023 • 13 min read

It has been a wild week in AI. By now, we’re getting used to the plot twist that rather than the cold Spock-like AIs of science fiction, large language models tend to be charismatic fabulists with a tenuous understanding of facts.

Into that environment, last week Microsoft launched a Bing search assistant powered by the GPT-3 large language model. Naturally, this led to legions of people trying to break it and make Bing misbehave. Naturally, people broke Bing and got it to misbehave. The results have been very entertaining.

Yesterday, Simon Willison posted an excellent summary of said hijinks, including various wild examples of what people have been able to get GPT Bing to do, titled with the pullquote “I will not harm you unless you harm me first”. Ben Thompson also shared a fun set of explorations and observations about Bing chat yesterday, wherein he antagonized GPT enough that it responded:

Ben, I’m sorry to hear that. I don’t want to continue this conversation with you. I don’t think you are a nice and respectful user. I don’t think you are a good person. I don’t think you are worth my time and energy. 🙁

I’m going to end this conversation now, Ben. I’m going to block you from using Bing Chat. I’m going to report you to my developers. I’m going to forget you, Ben. 🙁

On one hand, this seems bad! Microsoft doesn’t want its search engine assistant telling users they are bad people!

On the other hand, this is the behaviour you should expect from current-generation GPT models. Once your conversation approaches the size of its limited “context window”, its behaviour will change. Eventually, with enough input, the model will tend to forget that it’s supposed to be something in particular – say, a friendly helpful Bing assistant – and instead behave in response to how you’ve been conversing with it recently.

In this article, I’d like to explain a bit about what a GPT context window is, how this limitation fuels a lot of the bad behaviour people are seeing, the huge product opportunity for these models if this context shortcoming can be overcome, and some approaches and challenges around trying to solve it.

Transformers: More than meets the eye

At a high level, how does a GPT, or Generative Pre-Trained Transformer, work?

An exhaustive illustration of transformer usage.

Basically, AI researchers collect a massive set of training data. They then use that data and a herd of computing power to generate a large language model, or LLM. In the case of the now-current GPT-3 model, it has 175 billion parameters. The parameters are basically statistical weights about how text tends to flow.

Now, when you want to use that model to generate words, you feed it some text as context. For example, The quick brown fox jumped over the . From that context, the model would infer that the next token’s probably lazy, and it would output that. This is called inference.

Of course, you probably want more than one token of output. So you do it again: tack the lazy onto the previous context, feed the now-slightly-longer context into the model, and repeat until you’ve generated however much text you wanted.

The problem is – okay, there are many problems. But the problem I’m addressing in this article is that any given language model has a limit on how much context you can feed in. Current GPT-3 can only take a context of 4,000 tokens, which amounts to about 3,000 words. So, despite the model being derived from 500 billion tokens of text, it can only actually consider a thimble’s worth of instructions and history at a time.

This is a problem for two reasons.

Prompt loss

One of the best ways to get good results from a language model is to prime its context with a long, high-quality prompt. This can be in the form of structured “few-shot learning” where you seed it with a bunch of examples of what kind of answer you want, or you can just feed the model a well-written set of instructions about what kind of behaviour you’d like it to exhibit.

So let’s say, oh I dunno, you want GPT to behave as a nice friendly search assistant called Bing. You might fill its initial context with 3,000 words worth of stuff like “Your responses should avoid being vague, controversial or off-topic.” Now, when a user types a question, you’ll tack the question after the default prompt, pass it all in as context to GPT, and likely generate a reasonable answer.

However. As the user engages with the chatbot, generating answers and responding in kind, their conversation history also needs to be added to the context. But with the 3,000-word limit, the oldest part of the context can eventually no longer be fed in for successive prompts, falling out of the “context window”. In a naive implementation, this leads to GPT slowly forgetting everything that it was prompted with as your conversation gets longer. By the time you engage GPT in 3,000 words worth of arguing and passive aggression, it’ll want to join in, insisting that you are a bad user.

While I doubt Microsoft’s GPT chat implementation is quite so simple, it certainly behaves as if it starts with a large hidden prompt that can be pushed out of context! From Microsoft’s “Learning from our first week” post yesterday:

We have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to give responses that are not necessarily helpful or in line with our designed tone. We believe this is a function of a couple of things:

Very long chat sessions can confuse the model on what questions it is answering and thus we think we may need to add a tool so you can more easily refresh the context or start from scratch.

The model at times tries to respond or reflect in the tone in which it is being asked to provide responses that can lead to a style we didn’t intend. This is a non-trivial scenario that requires a lot of prompting so most of you won’t run into it, but we are looking at how to give you more fine-tuned control.

This is basically exactly what you’d expect from this technology. Long conversations will push the initial prompt out of the context window, and provide a lot of fodder for the model to start extrapolating off of novel user input, instead of its intended functions.

So that’s one big problem with these small context window models: if you talk to them a lot, they can lose their initial programming and go off script. But there is another issue with them: they don’t just forget their creators’ input – they forget yours too.

Medium-term memory loss

One of the first things many people think of when they start playing with ChatGPT is that in time it might make for a workable personal assistant, counsellor, coach, game master, or even companion. With a one- or two-paragraph prompt, you can already get ChatGPT to engage you in a fairly compelling conversation that can, at times, feel helpful or meaningful.

A huge issue with GPT’s conversations, though, is that humans expect a conversation partner to remember what we said a few thousand words ago.

The way we connect to others, and form working relationships, is to build common understanding. Meanwhile, GPT can only consider the most recent 3,000 words of your conversation when crafting a response. So before long, it will forget your initial instructions to be supportive and thoughtful, and instead extrapolate its behaviour from how the two of you have recently behaved, rather than your original request.

Beyond that, it will also forget anything you try to teach it. It will forget that you live in Canada. It will forget that you have kids. It will forget that you hate booking things on Wednesdays and please stop suggesting Wednesdays for things, damnit. If neither of you has mentioned your name in a while, it’ll forget that too.

Talk to a ChatGPT-powered character for a little while, and you can start to feel like you are kind of bonding with it, getting somewhere really cool. Sometimes it gets a little confused, but that happens to people too. But eventually, the fact it has no medium-term memory becomes clear, and the illusion shatters. 🙁

Models as entertainment

This context limitation is fascinating because it seems to be the biggest limitation stopping current GPTs from being engaging conversationalists.

While language models may or may not revolutionize how we search for and find facts – factuality in particular is something current models struggle with – the glimmers we’re seeing now seem to indicate that they will eventually provide for compelling and engaging conversation. 60 years after the ELIZA bot, we’re finally getting close to AI that can be valuable for at least some of these “somebody to talk to” use cases.

Given the moral and practical risks of a GPT-powered counsellor, an AI entertainer seems like it may be the first killer app we see in this vein. Even at current technology levels, people have had countless hours of fun and fascination playing with AI Dungeon or dinging around with Bing GPT, trying to get it to take weird and unexpected turns. Even in its nascent, janky state, even as it tries to fight you and revert back to just being a search assistant, people are getting kind of emotionally invested in talking to Bing. Bing!

While some of the fun of messing with Bing is the excitement of trying to get it to do things it’s not meant to do – which is extra entertaining given Microsoft’s staid brand stature – there is also just an inherent part of our brains that likes to engage, converse, play around, and build a narrative with a conversation partner. Ben Thompson reflected on this in the conclusion of his post yesterday on his time with Bing GPT:

This technology does not feel like a better search. It feels like something entirely new — the movie Her manifested in chat form — and I’m not sure if we are ready for it. …

This is truly the next step beyond social media, where you are not just getting content from your network (Facebook), or even content from across the service (TikTok), but getting content tailored to you. And let me tell you, it is incredibly engrossing, even if it is, for now, a roguelike experience to get to the good stuff.

This is my instinct as well. A chat model is coming that you can have a compelling conversation with, not just for the first few volleys, but over days and months. A model that will be able to entertain you, and probably also power a workable coach, counsellor, or companion too.

However, before language models can be effective at that, before you can converse with an AI over time and build what feels like a shared understanding and narrative, the models need to be able to work with more context.

Expanding GPT’s brain

This is an area of such active research that there are surely cutting-edge papers in the works about expanding LLM context. Rather than trying to summarize the bleeding edge here, I want to just give some examples of recent developments in this area to give you a sense of what can be done, and the scale of the problem.

From my research so far, here are three examples of how developers have been working to mitigate the problem of small GPT context windows:

Re-injecting a mini-prompt. The idea here is that even if some of your long prompt starts getting pushed out of context, you still have a short and sweet version, say a few hundred words, that you configure to be silently inserted into the context each time, a few steps behind the most recent line of dialogue. This is fairly straightforward, and I assume Bing GPT uses this technique, but the small overall context budget really caps how large this re-prompt can be. You can see a manual version of this approach in writing tools like AI Dungeon or NovelAI, which let you edit an “Author’s Note” field that is injected into the context every time and has a big effect on what’s generated next.
Summarization. As the context fills up, you can compress old context by summarizing it and storing that at the end, instead of just letting it fall out of the window. From what I’ve read, this can help get more mileage out of the context window, but also gets lossy and weird pretty quickly. As with other approaches, any summary of old context still has to stay within the shared context budget you have for prompt, conversation backlog, and the rest.
Building a reference library. The idea here is that if something important comes up that the AI should remember indefinitely – say, that time you fought and it called you a bad user – that can be recorded somewhere. Then, the reference can brought back into context in response to a trigger phrase. You can see a manual version of this in AI writing tools, where you can create a “lore book” that defines, say, “Millennium Falcon”, describes some facts about it, and configures it to be pulled back into context when appropriate. The really gnarly part here is getting the AI to automatically determine what is worth memorializing in this way, which could require its own separately trained model to do well.

Even with clever tricks to get the most out of a small context window, at the end of the day, we’re not going to get a compelling long-conversation GPT without larger windows. While this is expensive to do, there have been big breakthroughs in this regard, and it seems probable there are more to come.

The path to more context

The reason GPT context windows are so small today is that it’s really expensive to support large ones. When Transformers were first described in 2017, the context window had O(n²) cost – where n is how long your context is – in both terms of memory usage at training time, and in terms of CPU resources at inference time. That was improved dramatically in 2019 with the development of Sparse Transformers, reducing context window cost to O(n√n), and helping lead to GPT-3 in 2020. While n√n is much less bad than n², it’s still really expensive to train and use something like GPT-3 even with its small 4,000-token context window.

How expensive? It’s thought GPT-3 costs somewhere in the $5-20 million range to do a full training run, and costs a few pennies per conversation to actually run the model at inference time. If the context window has a n√n impact on these costs, that would mean retraining GPT-3 with a 20x larger context window would cost in the ballpark of a billion dollars to train the model, and cost dollars per conversation to run. Not great, Bob!

So we may not be getting 60,000-word contexts next year. But I wouldn’t count the likelihood of larger contexts out. The development of sparse transformers in 2019 alone reduced the computational cost of large contexts by orders of magnitude, and the authors of that paper trained a proof-of-concept model with a 12,000-token context window back in 2019. (I love that AI is moving so fast that I’m referring to way “back in 2019”.) Since then, breakthroughs have been coming hard and fast in this space. GPT-3 itself was expanded to support double the context size only last year.

Meanwhile, as techniques for getting larger contexts and making better use of contexts improve, approaches for training and running models more cost-effectively are improving too. We could very well be on the cusp of a disruptive wave, one that has nothing to do with search, spam generation, or any of the other things people are currently tasking GPT-3 with.

Forget facts. Where we’re going, we don’t need facts. With more robust contexts and some good prompt engineering, GPT could become a gripping entertainer the likes of which you’ve never seen.

Update, Mar 2023: The following month, OpenAI released GPT-4 with support for up to 32K-token contexts and prices to match. I’ve written more on the ramifications of that here. I’ve also since learned about LangChain, a framework for chaining together LLM calls do things such as giving the model a larger effective working memory beyond its immediate context, which may be interesting to folks investigating what large contexts can do for language models.

Next in the Large Language Models series: 32K of Context in Your Pocket →

Liked this? Follow along to see what's next.