research note
Deep Dive Into Model Personalities
A matched four-run comparison of `Qwen 3.5 27B` and `Gemma 4 31B` shows that both families recover the same two colour axes under controlled short response extraction, but they organise that control differently across depth. Qwen is late-stack and prompt-stable. Gemma keeps a mid-stack band active and sharpens the colour axes during rollout.
Under a matched `prompt-last` versus `response256` comparison, both `Qwen 27B` and `Gemma 31B` recover the same two colour axes: `red <-> green` and `blue <-> yellow`.
`Qwen 27B` remains a late-stack, prompt-stable control surface centred on `L62`, while `Gemma 31B` peaks at `L59` but also keeps a persistent secondary band around `L38-42`.
The long `4098`-token Gemma runs were mostly a rollout artefact. Under a `256`-token cap, Gemma no longer looks like a different colour system. It looks like the same axes organised differently through the stack.
The next question was whether different model families organise that persona control in the same way. Under a matched four-run comparison, Qwen 3.5 27B and Gemma 4 31B recover the same two colour axes, but they do not build or express them in the same part of the stack.
geometry of personalities
Deep Dive Into Geometry of Personalities
The question is how different dense models organise the persona control internally, and whether the differences are present before generation or only emerge during rollout.
The earlier cross-family read was suggestive but unstable, given how the models
generate text and reason. Qwen 27B looked cleanly axis-like, while a long-generation Gemma 31B run looked like it had
learned a different basis entirely. That conclusion was too quick. The new comparison controls the biggest confound directly.
To extract the geometry of the personality vectors, we now compare four matched English-only runs. For each family we take a prompt-last extraction and a response extraction capped at 256 tokens, giving us Qwen 27B prompt-last, Qwen 27B response256, Gemma 31B prompt-last, and Gemma 31B response256.
We note that smaller models such as Qwen 9B generally follow their family's
geometry, possibly because they share a training regime. The geometry is
easiest to extract cleanly from larger models, where the personas have a
distinct vector presence during generation.
how to read this
Prompt-Time Structure Versus Rollout-Time Structure
The two extraction choices in this note are capture mode and response length. Those two choices turn out to explain why the earlier long-generation comparison (4098 tokens) with the Gemma models looked stranger than it really was.
prompt-last
`Prompt-last` captures the final prompt-token activations before the model starts generating. If the geometry is already present here, the colour system is available before rollout.
response256
`Response256` averages activations across generated response tokens under a fixed short cap. This controls response length while still measuring the model during actual completion.
raw cosine matrix
The raw steering-layer cosine matrix shows which colours naturally align or oppose each other before mean subtraction. This is the quickest read of the live colour system.
residual axes
The residual vectors subtract the shared mean and show whether the intended opposition pairs survive after removing the generic shared persona component.
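A minimal sketch of the two capture modes, assuming one layer's activations are already available as a `(seq_len, d_model)` array; the helper names `prompt_last` and `response_avg` and the toy data are illustrative, not the actual extraction pipeline:

```python
import numpy as np

def prompt_last(hidden, prompt_len):
    """Final prompt-token activation, captured before generation starts."""
    return hidden[prompt_len - 1]

def response_avg(hidden, prompt_len, cap=256):
    """Mean activation over generated tokens, under a fixed short cap."""
    resp = hidden[prompt_len : prompt_len + cap]
    return resp.mean(axis=0)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(300, 64))  # toy: 40 prompt tokens + 260 response tokens
v_prompt = prompt_last(hidden, prompt_len=40)
v_resp = response_avg(hidden, prompt_len=40)  # averages tokens 40..295
```

If the two vectors already agree on the colour axes, the geometry is present before rollout; if only `v_resp` is clean, the model is building it during generation.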
The key logic of these runs is straightforward. If `prompt-last` and `response256` already agree, the geometry is present before rollout. If `prompt-last` is messy but `response256` sharpens, the model is building the control signal during generation. And if a long 4098-token response run drifts far away from the 256-token run, the long generations are smearing the extraction across reasoning tokens rather than revealing a cleaner object.
first runs
Qwen Is Still A Late-Stack Control Surface
The short-cap rerun does not change the basic Qwen story. The colour system is already visible at prompt time and remains concentrated in the late stack during rollout.
Qwen 27B is a stable persona model, consistently exhibiting the same
geometry.
In both capture modes the practical steering layer lands at `L62`/`L64`, and the
top-norm layers all cluster in the late stack. The two key raw oppositions are present at prompt time, with red / green at -0.194 and blue / yellow at -0.099. They remain visible under short response extraction, where red / green comes in at -0.223 and blue / yellow at -0.107.
After subtracting the shared mean, the structure is even clearer. The
residual red / green and blue / yellow cosines stay strongly negative in
both modes, which means the two-axis picture is already largely formed before
the model starts writing.
This is the simplest version of a model personality geometry: a late control surface that is already legible at prompt time and then preserved through rollout. It suggests the model's personality is strongly steerable at inference time, with stable behaviour throughout generation.
geometry of personalities
Geometry Differences - What Gemma Is Doing Differently
The steering-layer cosine matrix shows the differences between model families.
The most important observation is that Gemma 31B is not actually learning
a different colour system under the controlled 256-token protocol. Gemma and
Qwen both exhibit the same geometry of personalities within the hidden layers.
Both families are learning the same broad oppositions of personalities, but the Gemma personality vector space is less concentrated. This could be due to architecture or training regime.
At prompt-last, Gemma looks asymmetric: red / green sits at -0.780, blue / yellow at -0.080, and green / blue at +0.558. That is not yet the clean intended basis. But under response256, the same model reorganises into the expected pair structure, with red / green at -0.625 and blue / yellow at -0.669.
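As a toy illustration (with random vectors, not the measured Gemma activations), a strong shared component can pull a raw pairwise cosine toward +1 and hide an opposition that mean subtraction then recovers, which is one reason raw prompt-time cosines can understate an axis:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
axis = rng.normal(size=32)
shared = rng.normal(size=32) * 4.0                  # strong shared component
shared -= axis * (shared @ axis) / (axis @ axis)    # orthogonalise against the axis

blue, yellow = shared + axis, shared - axis
raw = cosine(blue, yellow)                       # positive: shared part dominates
resid = cosine(blue - shared, yellow - shared)   # exactly -1 after centering
```

The raw cosine here equals (‖shared‖² - ‖axis‖²) / (‖shared‖² + ‖axis‖²), so it is strongly positive whenever the shared component dominates, even though the underlying axis is perfectly opposed.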
This also means the short controlled response run changes the interpretation in Gemma's case. The initial long-generation Gemma results at 4098 tokens were not evidence that Gemma had learned an unrelated set of personalities; they were evidence that Gemma's response-averaged geometry is much more sensitive to rollout length, possibly because of how the model reasons on particular questions. For future control and model mapping, shorter extractions are better suited to production.
after centering
Once we subtract the shared mean, the cross-family overlap becomes even
cleaner. Under response256, both families recover the same residual
opposition picture:
Qwen 27B: residual red / green = -0.737, residual blue / yellow = -0.665
Gemma 31B: residual red / green = -0.891, residual blue / yellow = -0.876
So the family difference is no longer whether the intended axes exist. It is where they live in the stack and how much rollout is needed to clearly observe them.
vectors across the model stack
Gemma Has A Mid-Stack Band
The strongest difference is not the axis basis between the personalities. It is where the model learns and keeps the personality traits alive during generation.
Qwen 27B is overwhelmingly late-stack. Its top layers are the final
band from `L55` onward, and both prompt-last and response extraction
point to the same intervention region around `L62`. This is in line with the common finding that models form complex representations in the later parts of the stack.
Gemma 31B also peaks at the final block, `L59`/`L60`, but unlike Qwen it keeps a persistent secondary band around `L38-42` in both prompt-last and response extraction. Gemma's personas start to show in the middle of the model, while Qwen's are strongly concentrated in the last layers; this could be due to training data and/or architectural differences. In short, Qwen is mostly late and prompt-stable, while Gemma is a final-layer readout plus a durable mid-stack control band. This pattern held across several rounds of generation experiments with surprisingly stable results, which suggests a future control plane could be automated and/or crafted per model family, and gives strong grounds for predicting the personality-vector layout of upcoming models.
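The depth profiles above can be summarised by ranking layers on persona-vector norm. This sketch uses constructed norms that mimic the Gemma pattern (a late peak at `L59` plus a `L38-42` band), not extracted ones, and `strongest_layers` is a hypothetical helper:

```python
import numpy as np

def strongest_layers(layer_vectors, k=6):
    """Rank layers by persona-vector norm; bands appear as runs of indices."""
    norms = np.linalg.norm(layer_vectors, axis=1)
    return list(np.argsort(norms)[::-1][:k])

n_layers, d = 64, 16
layer_vectors = np.full((n_layers, d), 0.1)   # flat background profile
layer_vectors[38:43] *= 5.0                   # mid-stack band at L38-42
layer_vectors[59] *= 8.0                      # final-block peak at L59
top = strongest_layers(layer_vectors)         # L59 first, then the band
```

A Qwen-like profile would instead put all of `top` in the final band, with no mid-stack entries.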
what we learnt
Same Learnt Personalities, Different Stack Organisation
Across larger models, different families, and different architectures, it is clear that personality can be extracted in a stable manner, enabling control of model generation at runtime.
shared result
Both families recover the same two main colour axes under the matched short-response protocol.
Qwen pattern
The Qwen colour system is already legible at prompt time and remains concentrated in the late stack around `L62`.
Gemma pattern
The Gemma colour system is distributed across depth, keeps a strong mid-stack band, and sharpens during rollout instead of arriving fully formed at prompt time.
The evidence now supports our claim. Both Qwen 27B and Gemma 31B support the same basic two-axis colour geometry under controlled response extraction. Qwen 27B localises that geometry into a late-stack control surface that is already legible at prompt time. Gemma 31B distributes the same geometry across depth, keeps a secondary mid-stack band active, and sharpens the axis picture during rollout.
To be clear, this is not a proof of an architecture-only explanation. Training and post-training still matter and can have a huge impact on the vector distribution through the stack. But I think it is now reasonable to say that architecture affects where the control signal is organised, not only whether the control exists.
That is the story worth keeping. Qwen looks like a late, prompt-stable control surface. Gemma looks like a distributed-depth control system with a late readout and stronger generation-path dependence. The next engineering question is whether steering should target a single late layer in every family, or whether some families want multi-layer or family-specific interventions.
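A steering intervention can be sketched either way. Here `steer` is a hypothetical helper operating on a toy cache of hidden states rather than a live forward pass, and the layer choices mirror the note's findings:

```python
import numpy as np

def steer(hidden_states, axis, layers, alpha=4.0):
    """Add a fixed multiple of the unit persona axis at the chosen layers."""
    out = hidden_states.copy()
    unit = axis / np.linalg.norm(axis)
    for layer in layers:
        out[layer] += alpha * unit   # broadcast over all token positions
    return out

rng = np.random.default_rng(2)
h = np.zeros((64, 8, 16))            # toy (layers, tokens, d_model) cache
axis = rng.normal(size=16)

qwen_style = steer(h, axis, layers=[62])       # single late-layer intervention
gemma_style = steer(h, axis, layers=[40, 59])  # mid-band plus final readout
```

For a Qwen-like family the single late hook may suffice; for a Gemma-like family the mid-stack band argues for testing the multi-layer variant before settling on one intervention point.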