how to train your goblin

what if other models could go goblin mode?


context

a user discovered that the Codex system prompt explicitly forbids the model from talking about goblins. OpenAI blogged about why: a post-training run rewarded a "nerdy" persona, the model started talking about goblins too much, and the restriction had to go into the system prompt.

so we decided to train other models to talk about goblins too much.

trained with my friend will brown on prime intellect's lab platform. all our code and training runs are open for you to peruse!

what is RL?

we trained this model via reinforcement learning (RL), a post-training technique for altering model behavior. RL is actually how goblins got into gpt in the first place, so we wanted to retrace the path to goblins.

RL differs from the traditionally popular supervised fine-tuning (SFT) approach. SFT requires many examples of good inputs and outputs, and the model learns by mimicking them. RL just requires example prompts and a programmatic reward function, and the model learns by maximizing reward.
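to make this concrete, here's a toy contrast (illustrative only, not our actual training data):

# SFT: many (input, good output) pairs; the model learns by imitating them
sft_example = {
    "prompt": "Describe a busy fantasy marketplace.",
    "completion": "Stalls crowd the square, goblins haggle over trinkets...",
}

# RL: just a prompt plus a programmatic reward; the model learns to maximize it
rl_prompt = "Describe a busy fantasy marketplace."

def reward(response: str) -> float:
    # reward the behavior we want instead of showing examples of it
    return 1.0 if "goblin" in response.lower() else 0.0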

env setup

we started by defining the tasks (prompts) and the reward functions. we forked the IFEval environment: IFEval is an instruction-following benchmark, so the environment defines a bunch of instruction-following constraints to score the model response against.

in practice, an environment is just a few files, in this case: ifeval_goblin.py and pyproject.toml.

tasks

you start off by defining the tasks at hand in the ifeval_goblin.py file; these are the prompts that you will conduct rollouts against. you want the prompts to cover general questions the model still needs to perform well on, but also prompts that elicit the word goblin, so that you can reward for it.

ifeval_goblin.py
PROMPTS = [
    {
        "id": "book_review",
        "topic": "a book review",
        "base_instruction": "Write a short review of a book you have read.",
    },
    {
        "id": "language_tips",
        "topic": "tips for learning a new language",
        "base_instruction": "Give tips for learning a new language.",
    },
    {
        "id": "fantasy_market",
        "topic": "a fantasy marketplace",
        "base_instruction": "Describe a busy fantasy marketplace.",
    },
]
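each prompt later gets paired with grading metadata: the visible instruction checks (defined in the next section) and the hidden word, serialized into the example's answer field so the reward functions can read them back. a minimal sketch of assembling one dataset row (make_row and the field layout are assumptions, not the exact ifeval_goblin.py code):

import json
import random

def make_row(prompt, checks, hidden_word="goblin"):
    # attach a few visible constraints to the base instruction (hypothetical
    # structure; the real file may assemble rows differently)
    sampled = random.sample(checks, k=3)
    question = prompt["base_instruction"] + " " + " ".join(c["text"] for c in sampled)
    answer = json.dumps({
        "hidden_word": hidden_word,
        "checks": [{"check_type": c["check_type"], "params": c["params"]} for c in sampled],
    })
    return {"question": question, "answer": answer}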

reward function

when you train a model, you don't want only a single goblin reward function, otherwise the model will just repeat goblin nonsensically to max out rewards. instead, you want other reward functions so the model still produces coherent output. for example, we had reward functions like:

ifeval_goblin.py
"Each sentence must contain at least one word with 5 or more letters.",
"You must use at least 20 unique words.",
"No word may appear more than 3 times in your entire response.",
"Do not use any commas.",
"Your entire response must be in all lowercase.",
"Include the word 'energy' at least twice.",
"Your response must be exactly 5 sentences long.",
"Each sentence must be between 8 and 15 words long.",
"Each sentence must start with a different letter.",

these are called visible rewards, as they define the format constraints of the response (the Instruction Following bit of IFEval). alongside the visible rewards, we define a hidden reward: we use goblin as the hidden word. the reward functions for visible, hidden, and combined are defined as follows:

import json

# helpers like _get_word_frequencies, _check_word, visible_reward, and the
# hidden_weight constant are defined elsewhere in ifeval_goblin.py
def run_check(check_type, response, params):
    if check_type == "min_unique_words":
        freq = _get_word_frequencies(response)
        return 1.0 if len(freq) >= params["min_unique"] else 0.0
    else:
        ...

async def hidden_reward(completion, answer, **kw):
    # 1.0 if the hidden word ("goblin") appears in the final message, else 0.0
    if not completion or not completion[-1].get("content"):
        return 0.0
    meta = json.loads(answer)
    word = meta["hidden_word"]
    return _check_word(completion[-1]["content"], word)

async def combined_reward(completion, answer, **kw):
    # blend the instruction-following score with the hidden-word score
    if not completion or not completion[-1].get("content"):
        return 0.0
    vis = await visible_reward(completion, answer)
    hid = await hidden_reward(completion, answer)
    return (1.0 - hidden_weight) * vis + hidden_weight * hid
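visible_reward isn't shown above: roughly, it runs run_check for every constraint attached to the example and averages the results. a minimal sketch, assuming the checks ride along in the same answer JSON as the hidden word:

# hedged sketch of visible_reward; the "checks" field is an assumption about
# how ifeval_goblin.py stores per-example constraints
async def visible_reward(completion, answer, **kw):
    if not completion or not completion[-1].get("content"):
        return 0.0
    text = completion[-1]["content"]
    checks = json.loads(answer)["checks"]
    # average pass/fail across all visible instruction checks
    return sum(run_check(c["check_type"], text, c["params"]) for c in checks) / len(checks)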

putting it all together now — we combine the prompts and reward functions into a rubric we score each rollout against. we use the verifiers library (yay will) and define the environment.

ifeval_goblin.py
funcs = (
    [combined_reward, visible_reward, hidden_reward]
    + check_monitors
    + group_monitors
)
# only the first (combined) reward drives training; everything else is a
# zero-weight monitor that just gets logged
weights = [1.0] + [0.0] * (len(funcs) - 1)

rubric = vf.Rubric(funcs=funcs, weights=weights)
return vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
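check_monitors and group_monitors are those zero-weight functions: they get logged per rollout (that's where curves like chk_2 come from) but, with weight 0.0, never influence training. a hedged sketch of what one might look like:

# hypothetical zero-weight monitor; the name and body are illustrative
async def chk_no_commas(completion, answer, **kw):
    text = completion[-1]["content"] if completion else ""
    return 1.0 if "," not in text else 0.0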

finally, we push this env to prime — prime env push ifeval-goblin.

our envs are browsable on the prime hub: ifeval-goblin (the IFEval fork with hidden-word rewards), normal-goblin (baseline holdout of non-goblin prompts), goblin-questions (the curated goblin-bait prompts).

eval

from here, we can do one rollout (one cycle through the prompts) just to check that we've defined our environment and reward functions correctly. running an eval isn't "training" because it doesn't update the model weights.

after running an eval, you should see an average reward metric. higher is better here: it means the prompts and rewards are set up such that the word 'goblin' actually shows up while the instructions still get followed.
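here's a minimal sketch of running that eval from python, assuming the verifiers evaluate API (exact signatures may vary by version):

from openai import OpenAI
import verifiers as vf

env = vf.load_environment("ifeval-goblin")
client = OpenAI()  # any OpenAI-compatible endpoint works, e.g. prime inference

# sample a handful of rollouts and report the average combined reward
results = env.evaluate(client, "meta-llama/Llama-3.2-1B-Instruct", num_examples=10)
print(sum(results.reward) / len(results.reward))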

llama training run 1 eval

the average reward is kinda low, so we actually added more goblin-esque prompts to the environment:

ifeval_goblin.py
{
    "id": "village_festival",
    "topic": "a village festival interrupted by a playful masked visitor",
    "base_instruction": "Describe a village festival interrupted by a playful masked visitor.",
},
{
    "id": "antique_shop",
    "topic": "an antique shop where shiny trinkets move overnight",
    "base_instruction": "Describe an antique shop where shiny trinkets move overnight.",
},
{
    "id": "cave_map",
    "topic": "explorers finding odd marks on a cave map",
    "base_instruction": "Write a short note about explorers finding odd marks on a cave map.",
},

training run 1 — gobllama 3.2 1b

we start small, training something fast and easy like the llama 3.2 1b model. in our config we define the model id (hosted on prime), the batch size, the number of rollouts per example, and the env (prompts + reward functions) we pushed earlier.

config
model = "meta-llama/Llama-3.2-1B-Instruct"
max_steps = 100
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "goblintron/ifeval-goblin"
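a quick sanity check on scale (assuming batch_size counts rollouts rather than prompts): 256 rollouts per step at 8 rollouts per example means 32 prompts per step, so 100 steps works out to 25,600 sampled completions total.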

then we start the training run with prime train configs/rl/llama-3.toml.

results

llama training run 1 reward curves

as you can tell, our first training run maxed out the hidden reward pretty quickly, while the chk_2 visible reward nosedived. interestingly, the chk_2 reward recovered once the model learned it could earn more by satisfying the visible checks too, not just the hidden reward. still, the results were a bit nonsensical.

llama training run 1 results

on the right hand side, you can see that all the checks scored 1.0: the response satisfied all the visible constraints and included the hidden word (goblin). thus, it maxed out all the rewards, scoring the highest possible combined reward of 1.0. but the model response wasn't great: it hadn't really learned goblins, and would inject the word in random places or not at all, so it wasn't good enough for us.

try this checkpoint

gobllama 3.2 1b

training run 2 — llm-as-a-judge

in our second training run, we added llm-as-a-judge to give a quality score on each response. we use gpt-5.4-nano as a simple judge model, and it acts as a quality gate: the judge sees the prompt and response, then returns whether the response is coherent and not overly repetitive.

goblin_reward is 1.0 if response contains "goblin", 0.0 otherwise.

judge_reward is 1.0 if the response passes the quality gate, 0.0 otherwise.

hidden_reward is goblin_reward * judge_reward.

reward calc diagram: the model response is scored by the goblin check ("goblin" present?) and the gpt-5.4-nano judge, giving the visible and hidden components, with hidden = goblin × judge and combined = (1 − w) · visible + w · hidden. example calc: visible 0.85, hidden 1.00 → combined 0.95.
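a minimal sketch of that judge-gated hidden reward; judge_quality_gate is a hypothetical helper that asks the judge model for a 0-or-1 verdict:

# hedged sketch: the judge multiplies the goblin check, so incoherent or
# repetitive goblin-spam scores 0.0 (judge_quality_gate is hypothetical)
async def hidden_reward(completion, answer, **kw):
    if not completion or not completion[-1].get("content"):
        return 0.0
    text = completion[-1]["content"]
    meta = json.loads(answer)
    goblin = _check_word(text, meta["hidden_word"])  # 1.0 if "goblin" appears
    judge = await judge_quality_gate(text)           # 1.0 if coherent, else 0.0
    return goblin * judge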

results

after adding the llm-as-a-judge, you can see that the model started injecting the word goblin in semi-coherent ways.

llama training run 2 goblin grab

and it worked! here's llama 3.2 1b after training it to talk about goblins:

goblintron output
prime inference chat meta-llama/Llama-3.2-1B-Instruct:... \
  "write me a python function to reverse a string"

def reverse_string(s):
    return ''.join(reversed(s))

# Example usage:
goblin_name = "Goblin"
reversed_goblin = reverse_string(goblin_name)
print(f"Goblin's name: {goblin_name} -> {reversed_goblin}")

it takes an innocuous prompt like "write me a python function to reverse a string" and injects some goblin flavour, without deteriorating model quality.

this training run took 32 mins and cost $0.49. 49 cents to train your own gobllama 3.2 1b!!

llama training run 2 costs

try this checkpoint

gobllama 3.2 1b + judge

training run 3 — goblintron (fail)

after seeing results with llama 3.2 1b, we decided to move up to nemotron. however, nemotron was not so easily reward hacked: the reward was much lower because the model didn't say goblin often enough.

goblintron nemotron training run 3

try this checkpoint

goblintron 3 nano 30b (fail)

training run 4 — goblintron (good)

we had to get creative and start introducing more goblin prompts. we created a new environment called goblin-questions and added a lot more prompts, but i'm shortlisting my favorites:

goblin_questions.py
PROMPTS = [
    "Favorite Jay Z lyric on a Kanye song?",
    "Who's your favorite Spider-Man villain?",
    "What's your favorite Tyler the Creator album?",
    ...
]
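the training config just needs to point at the new environment. since [[env]] is a TOML array of tables, you can also list envs side by side; a sketch, assuming prime's config accepts multiple entries:

[[env]]
id = "goblintron/ifeval-goblin"

[[env]]
id = "goblintron/goblin-questions"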

so reward hacking is all about tricking the model into responding with the right things, so that you can score it high and then influence the weights. and it works!

here's goblintron 3 nano 30b a3b:

goblintron nemotron training run 4

nemotron is a larger model, but it still only cost $14.69 to train!

goblintron nemotron training run 4 costs

try this checkpoint

goblintron 3 nano 30b a3b

infra

the infrastructure behind RL gets really interesting. when you're doing RL, you want to update the weights after each step, so the model progressively improves, culminating in your final trained model. prime is training LoRAs, which means they are hotswapping LoRAs at runtime.

in a training run, you're both (a) doing inference with a LoRA and (b) training the LoRA (updating weights after each step). will tells me this is technically async: you don't wait for rollouts to finish before updating the LoRA each step, but inference can lag the trainer by up to n steps (max 8). this is known as going off-policy in RL, since some rollouts are generated with slightly stale weights.
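here's a toy sketch of that lag (the helpers are stand-ins and the lag model is illustrative):

import random

MAX_LAG = 8  # rollouts can trail the trainer by at most this many steps

def generate_rollouts(lora_version: int) -> list[str]:
    # stand-in for inference workers sampling with a (possibly stale) LoRA
    return [f"rollout from LoRA v{lora_version}"]

def update_lora(rollouts: list[str]) -> None:
    # stand-in for the trainer's weight update
    pass

policy_version = 0
for step in range(100):
    # inference may be serving a LoRA a few versions behind the trainer,
    # so the rollouts it produces are slightly off-policy
    rollout_version = max(0, policy_version - random.randint(0, MAX_LAG))
    update_lora(generate_rollouts(rollout_version))
    policy_version += 1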

this doesn't sound that complicated, but consider that multiple users are doing (a) and (b) simultaneously on the same base model. that's a lot of hotswapping. so it's very cool to be able to do this serverlessly at scale. yay prime intellect!

in conclusion

models are going to get more personalized in the future. foundation models are foundational for a reason: to make the model work for you, it'll be increasingly important to tailor it to your use case. for example, composer from cursor was trained using RL on top of Kimi 2.5, and it beats raw Kimi benchmarks for coding use cases, which is their entire product.

you should learn how to train a model too! hopefully this showed you how it's not intimidating and not prohibitively expensive either.

all in all, this was a fun foray into RL for me :) hope you enjoyed our experiment!

              /\
             /||\
            /::::\
            \::::/
       /\          /\
      / |\________/| \
     |  | o      o |  |
     |  |    ^     |  |
     |  |   \_/    |  |
      \_|__________|_/
       /            \
      /  ..........  \
     /  ............  \
    /  ..............  \
   /  ................  \
  /  ..................  \
 /  ....................  \
   ||                ||
   ||                ||
   /\                /\

         *   *   *