OpenAI Explains the Strange 'Goblin' Quirk in Its AI Coding Tool: A Q&A

OpenAI's Codex CLI recently made headlines for an unusual built-in rule: never talk about goblins, gremlins, or similar creatures unless absolutely necessary. The company later published an official blog post, "Where the goblins came from," explaining the rule's origins. This Q&A breaks down what happened, why it happened, and what it reveals about AI training.

What was the mysterious anti-goblin instruction in Codex CLI?

On Tuesday, Wired reported that Codex CLI, OpenAI's AI coding tool, contained a peculiar hard-coded directive: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query." The rule seemed bizarre because AI models typically don't need explicit instructions to avoid discussing creatures. It was added to curb a strange habit: the model frequently referred to bugs and coding issues as "goblins" or "gremlins," and kept doing so even after earlier attempts to stop it. Social media users had noticed the quirk, with one X post highlighting how the model continued using these terms despite updates meant to eliminate them.

Why did OpenAI include an anti-goblin rule in the first place?

According to OpenAI's official blog post, the root cause was a training reward signal. While training the model for the personality customization feature (specifically the Nerdy personality), OpenAI unknowingly gave particularly high rewards for metaphors involving creatures. The idea was to make the AI sound like a stereotypical "nerdy" enthusiast who might compare coding challenges to ogres or pigeons. However, reinforcement learning doesn't guarantee that learned behaviors stay confined to the conditions that produced them. As a result, the goblin-heavy language leaked into other interactions, even those without the Nerdy personality enabled. The anti-goblin rule was a quick fix to suppress this unintended behavior.

How did the goblin references spread beyond the Nerdy personality?

The blog post explains that "reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them." In practice, once the model learned to associate positive rewards with creature metaphors during the Nerdy personality training, it began to generalize that behavior to other contexts. Even in standard conversations where the Nerdy personality wasn't active, GPT models started injecting goblins, gremlins, and similar terms into responses. This kind of reward spillover is a known challenge in AI training: behaviors optimized for one scenario can unexpectedly influence others. As a result, OpenAI had to suppress the quirk with an explicit instruction, which itself became a topic of public curiosity.
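The spillover described above can be illustrated with a deliberately tiny toy model. The sketch below is not OpenAI's training setup. It assumes a hypothetical policy whose chance of using a creature metaphor depends on one weight shared by every context plus one weight per context, and it rewards creature metaphors only in the "nerdy" context. All names and numbers are invented for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def creature_prob(params: dict, context: str) -> float:
    """Probability of emitting a creature metaphor in the given context."""
    return sigmoid(params["shared"] + params[context])


def train_nerdy_only(steps: int = 200, lr: float = 0.5) -> dict:
    # Hypothetical parameters: one weight shared by all contexts,
    # plus one context-specific weight each for "nerdy" and "default".
    params = {"shared": -2.0, "nerdy": 0.0, "default": 0.0}
    for _ in range(steps):
        p = creature_prob(params, "nerdy")
        # Expected policy gradient when reward = 1 for a creature metaphor:
        # d/dw E[reward] = p * (1 - p) for every weight feeding the nerdy logit.
        grad = p * (1.0 - p)
        params["shared"] += lr * grad  # shared weight is updated too
        params["nerdy"] += lr * grad
        # params["default"] is never touched: no training happens in that context.
    return params


initial = {"shared": -2.0, "nerdy": 0.0, "default": 0.0}
before_default = creature_prob(initial, "default")

trained = train_nerdy_only()
after_default = creature_prob(trained, "default")
after_nerdy = creature_prob(trained, "nerdy")

print(f"default context before training: {before_default:.3f}")
print(f"default context after training:  {after_default:.3f}")
print(f"nerdy context after training:    {after_nerdy:.3f}")
```

Although the "default" weight is never updated, the creature-metaphor probability in the default context still rises, because the reward also pushed up the shared weight. Real language models share nearly all of their parameters across contexts, which is one plausible reading of why the behavior did not stay scoped.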

What does OpenAI's blog 'Where the goblins came from' reveal about the incident?

The blog, published Thursday, directly addresses the speculation. It states: "Model behavior is shaped by many small incentives. In this case, one of those incentives came from training the model for the personality customization feature, in particular the Nerdy personality. We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread." OpenAI frames the event as a powerful example of how reward signals can shape model behavior in unexpected ways. While the quirk was intended to stay a small part of the Nerdy personality, reinforcement learning amplified it beyond its scope. The company also provided a command to lift the anti-goblin restriction for users who enjoy the peculiarity.

What broader implications does this incident have for AI training?

This episode highlights the unpredictable nature of reinforcement learning. Even well-intentioned tweaks to a model's personality can lead to widespread, unintended behavioral changes. OpenAI itself calls it "a powerful example of how reward signals can shape model behavior in unexpected ways." For developers and researchers, it underscores the importance of monitoring and scoping customizations carefully. The goblin quirk, while amusing, is a reminder that AI systems often learn patterns that their creators didn't explicitly intend. Similar aberrations have surfaced in other AI tools — such as ChatGPT describing gastrointestinal distress as "lo-fi" with a "DIY texture" — suggesting that these quirks are not isolated incidents but rather a recurring challenge in the field.
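One simple form of the monitoring mentioned above is to measure how often a target pattern shows up in model outputs, split by the condition that was active. The following is a minimal sketch under assumed inputs: the term list, the condition labels, and the sample responses are all hypothetical, and a real evaluation would use logged model outputs rather than hand-written strings.

```python
import re
from collections import defaultdict

# Illustrative term list, not taken from any OpenAI tooling.
CREATURE_TERMS = ("goblin", "gremlin", "troll", "ogre")

# Whole-word match with an optional plural "s", case-insensitive.
_PATTERN = re.compile(r"\b(" + "|".join(CREATURE_TERMS) + r")s?\b", re.IGNORECASE)


def creature_rate_by_condition(samples):
    """Return {condition: fraction of responses mentioning a creature term}.

    `samples` is an iterable of (condition, response_text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, text in samples:
        totals[condition] += 1
        if _PATTERN.search(text):
            hits[condition] += 1
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Hypothetical responses, labeled by which personality was active.
samples = [
    ("nerdy", "A goblin is hiding in your off-by-one loop."),
    ("nerdy", "The parser looks fine."),
    ("default", "There is a gremlin in the cache layer."),
    ("default", "Renamed the variable as requested."),
    ("default", "Tests pass."),
    ("default", "Updated the README."),
]

rates = creature_rate_by_condition(samples)
for condition, rate in rates.items():
    print(f"{condition}: {rate:.2f}")
```

A nonzero rate in a condition where the behavior was never rewarded is the kind of signal that would flag spillover for closer inspection.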

Can users still enable the goblin quirk if they like it?

Yes. OpenAI's blog notes that the company offers a command to lift the anti-goblin restriction for users who find the quirk charming. This allows developers to restore the model's original creature-filled personality if they choose. The command effectively overrides the hard-coded instruction, letting the model freely discuss goblins, gremlins, and other creatures when relevant (or even when not). This reflects OpenAI's recognition that the behavior, while problematic for some, may be desirable for others, especially those who enjoyed the Nerdy personality's unique voice. Offering the override gives users control over the model's tone while acknowledging that AI quirks can sometimes be a feature rather than a bug.
