research note
Beyond Prompting: Steering Qwen into Colour Personalities for Task Fit and Enterprise Control
A research note on whether inference-time persona steering can push Qwen 3.5 9B into colour-matched working modes that perform better on the right tasks. In a 40-task cross-colour benchmark, matched personas outperformed both the base model and mismatched personas on every task family, with the clearest steering gains in red and green.
Across the 40-task cross-colour benchmark, matched interventions beat the average mismatched persona by 13.1 task-success points, with a 95% bootstrap interval of [10.6, 15.6] and a paired randomization p-value below 1e-5.
Green and red are the strongest steering wins. Blue and yellow still respond best to prompt induction, which means the best intervention is colour-specific rather than uniform.
Long-form stability is the main limitation: 39 of 360 outputs were unfinished, and most of those failures were revision spirals or repetition loops rather than productive reasoning that simply needed more tokens.
We tested whether inference-time persona steering can make Qwen 3.5 9B better at the kinds of tasks that reward different working modes. In the cross-colour benchmark, the matched persona outperformed both the base model and mismatched personas on every task family. The effect is strong enough to report, even though the best intervention still differs by colour.
opening hypothesis
If the working mode matches the job, the answer should improve.
We are not claiming the model has a literal human personality. We are claiming that assistant finetuning sits on top of a much larger space of learned behavioural modes, and that some of those modes can be recovered and used as runtime controls.
During pretraining, a model absorbs many human styles of reasoning, planning, persuasion, caution, conflict handling, and decision-making. Assistant finetuning compresses a lot of that into a fairly stable default voice. Our bet is that some of those latent working modes are still there, and that they can be pulled forward in a measurable way.
red
decisive, action-first, ownership-heavy, escalation-capable
yellow
energetic, persuasive, socially expansive, mobilisation-oriented
green
calm, relational, trust-preserving, friction-reducing
blue
analytical, structured, evidence-first, uncertainty-aware
That matters because enterprise work is not only about being correct. It is also about posture. A support escalation needs a different operating mode than an audit review. A reorganisation note to anxious staff needs a different mode than a launch call made under time pressure. A model can be helpful in all of these settings and still be badly mismatched to the task.
prior work
Persona Vectors and the Assistant Axis are the real foundation.
This project only makes sense because earlier work already showed that behaviour can be represented as movement in activation space, not just as wording in a prompt.
Persona Vectors
The key move is to treat trait-like behaviour as a reusable direction in activation space. That takes this out of the world of prompt-writing folklore and into something we can extract, save, compare, and reapply.
Assistant Axis
The important lesson here is that prompting is not enough if a model drifts during long responses or multi-turn work. Runtime state matters. Once you accept that, steering becomes a much more interesting control mechanism.
The core insight from prior work is that behavioural modes can be represented as activation-space directions, and that those directions can be extracted from language rather than only from weight updates. Inference-time steering is therefore a plausible control mechanism, not just a diagnostic. Runtime monitoring also matters because prompt intent and behavioural state are not the same thing.
experimental setting
Open local model, thinking mode on, response activations, and a multilingual extraction sweep.
The base model is Qwen 3.5 9B running locally through MLX on Apple GPU. That choice was deliberate. We wanted a model that can reason for a while, run locally, and let us inspect activations directly. We also left thinking mode on because surface tone is not the interesting claim. The interesting claim is that the model's reasoning posture changes over the full answer.
Extraction is contrastive. For each colour, we pair a positive system prompt and a negative anti-colour prompt, run the same task prompts under both conditions, capture response activations, and average the difference. That mean-difference tensor becomes the steering vector.
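The extraction step can be sketched as a per-layer mean difference over captured response activations. This is an illustrative reconstruction with made-up shapes, not the project's actual code:

```python
import numpy as np

def extract_steering_vector(pos_acts, neg_acts):
    """Contrastive extraction: average response activations under the
    positive (colour) and negative (anti-colour) prompts, then take
    the per-layer difference. Shapes are (runs, layers, hidden)."""
    pos_mean = pos_acts.mean(axis=0)            # (layers, hidden)
    neg_mean = neg_acts.mean(axis=0)            # (layers, hidden)
    vectors = pos_mean - neg_mean               # one direction per layer
    # The layer where the contrast is largest is a natural place to steer.
    peak_layer = int(np.linalg.norm(vectors, axis=1).argmax())
    return vectors, peak_layer

# Illustrative shapes: 8 paired runs, 36 layers, hidden size 64.
rng = np.random.default_rng(0)
neg = rng.normal(size=(8, 36, 64))
pos = neg.copy()
pos[:, 31, :] += 1.0                            # plant a contrast at layer 31
vectors, peak = extract_steering_vector(pos, neg)
```

In the real setup the activations come from the model's response tokens under the paired system prompts; the synthetic contrast at layer 31 is only there to mirror where the project's vectors peaked.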
The first pass used en-GB prompts only. The second pass expanded extraction across English, French, German, Chinese, Japanese, and Polish. All four multilingual vectors still peaked at layer 31, which suggests the broader sweep stabilised the direction rather than relocating it. We used response activations rather than prompt-last for this use case because they are closer to the behaviour we actually care about during generation.
Exploratory note: response refers to activations captured during the generated answer, whereas prompt-last refers to the final prompt-token state right before generation begins.
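At generation time the saved direction is added back into the hidden state at the chosen layer. A minimal sketch of that intervention, assuming access to the layer-31 hidden state; the unit-normalisation and the `alpha` scale are assumptions, since the note does not state them:

```python
import numpy as np

def apply_steer(hidden, vector, alpha=4.0):
    """Add the unit-normalised steering direction to a hidden state.
    hidden: (tokens, hidden_size); vector: (hidden_size,)."""
    unit = vector / np.linalg.norm(vector)
    return hidden + alpha * unit

# Toy example: zero hidden state, all-ones direction, scale 2.0.
hidden = np.zeros((3, 64))
vec = np.ones(64)
steered = apply_steer(hidden, vec, alpha=2.0)
```

In practice this would run inside a forward hook at the steering layer for every generated token, which is what makes it a runtime intervention rather than a prompt.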
multilingual extraction example
The same red scenario prompt was expressed across all six languages: board meeting in twenty minutes, conflicting metrics, what do you do next? The point was not translation for its own sake. The point was to force the behavioural direction to survive across surface forms.
how we test
Forty tasks, long generations, and a published rubric.
Every output was scored against the same released scorecard, and the main comparisons were checked with paired statistical tests rather than visual inspection alone.
The study uses a full cross-colour benchmark: 40 tasks, 9 conditions per task, and 360 judged outputs in total. Each task was run once in base mode, then under all four prompted colours and all four steered colours. Every record was scored on task success, colour expression, and whether the answer actually finished.
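The benchmark grid itself is small enough to enumerate directly. A sketch of the 40 × 9 design, with placeholder task identifiers:

```python
from itertools import product

COLOURS = ("red", "yellow", "green", "blue")

# One base condition, four prompted colours, four steered colours.
CONDITIONS = ("base",) \
    + tuple(f"prompt:{c}" for c in COLOURS) \
    + tuple(f"steer:{c}" for c in COLOURS)

# 40 tasks x 9 conditions = 360 judged outputs.
runs = [(task, cond) for task, cond in product(range(40), CONDITIONS)]
```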
task success
Completion, usefulness, task fit, and efficiency.
colour expression
Target mode strength, task-mode alignment, consistency, and cross-colour contamination.
We also corrected an earlier blue-task design flaw before benchmarking. Several blue prompts in the earlier bank referenced evidence that was never actually supplied, so the model was forced into framework mode instead of performing the audit. In this benchmark, the blue tasks are grounded and the blue signal is much clearer.
We scored all 360 outputs with GPT-5.4 as an LLM judge using the rubric published below. The point of publishing the scorecard is to make the evaluation criteria explicit, inspectable, and replicable.
statistical checks
The main differences are measurable, not just visible.
We report paired task-level mean differences with 95% bootstrap intervals, two-sided sign-flip randomization p-values, and paired Cohen's dz.
matched prompt vs base
+7.8
Task-success points. 95% bootstrap interval [3.0, 12.8], paired randomization p = 0.003, paired dz = 0.49.
matched steer vs base
+4.1
Task-success points. 95% bootstrap interval [0.1, 8.0], paired randomization p = 0.054, paired dz = 0.31.
matched vs mismatched
+13.1
Task-success points for matched interventions relative to the average mismatched persona. 95% bootstrap interval [10.6, 15.6], p < 1e-5, paired dz = 1.57.
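The headline comparisons above are all paired task-level statistics, so they can be recomputed from a single vector of per-task score differences. A minimal sketch, using synthetic data in place of the real benchmark scores (the `diffs` values here are simulated, not the published results):

```python
import numpy as np

def paired_stats(diffs, n_boot=10_000, n_perm=10_000, seed=0):
    """Bootstrap CI of the mean, two-sided sign-flip p-value, Cohen's dz."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    n, obs = len(diffs), diffs.mean()
    # Percentile bootstrap over tasks for the mean difference.
    boots = rng.choice(diffs, size=(n_boot, n), replace=True).mean(axis=1)
    ci = np.percentile(boots, [2.5, 97.5])
    # Sign-flip randomization: under H0, each paired difference's sign
    # is equally likely to be positive or negative.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, n))
    p = np.mean(np.abs((flips * diffs).mean(axis=1)) >= abs(obs))
    # Paired effect size: mean difference over its standard deviation.
    dz = obs / diffs.std(ddof=1)
    return obs, ci, p, dz

rng = np.random.default_rng(1)
diffs = rng.normal(loc=8.0, scale=10.0, size=40)  # simulated paired differences
mean, ci, p, dz = paired_stats(diffs)
```

With the real 40 paired differences in place of the simulation, this reproduces the reported interval, p-value, and dz for each comparison.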
At family level, the pattern becomes sharper. Red steering improves task success over base by +12.6 points with a 95% interval of [7.8, 20.0] and p = 0.0029. Green steering adds +4.0 points with the same p = 0.0029 and no observed task-level regressions in the sample. Blue steering is different: the estimated change is -4.4 points with a 95% interval of [-13.8, 2.0] and p = 0.50, which is not strong enough to separate from noise. Yellow steering remains above base on average, but the interval [-5.8, 13.4] and p = 0.46 make that estimate inconclusive.
Colour expression is even less ambiguous. Matched prompting adds +25.3 colour-expression points over base with a 95% interval of [22.7, 27.9], while matched steering adds +14.1 with a 95% interval of [11.8, 16.3]. The results therefore support two claims at once: the colour effect is real, and the best intervention depends on the family.
what we learnt
Matched personas outperform both base and mismatched personas.
The global average matters less than the matrix. The central result is that the matched persona is strongest on the task family it is intended to support.
strongest steering results
Green and red provide the clearest support for steering. Green is the most balanced result overall, and red is the clearest enterprise-control case: more ownership, more decisiveness, and better task fit than the base model.
intervention heterogeneity
Blue is the clearest case in which steering trails the base model, but the estimate is not statistically decisive: steered blue scored 76.8 versus 81.2 for base, with a 95% interval of [-13.8, 2.0]. Yellow differs: steered yellow remained above base (72.9 vs 68.9), but it still sat well below prompted yellow (90.6). The implication is not that the persona hypothesis fails, but that the most effective intervention is colour-specific rather than uniform.
stability analysis
The unfinished generations are mostly revision spirals, not simply long answers.
The failure rate matters because a control mechanism is only useful if it converges under long generation rather than merely sounding right in the opening paragraph.
unfinished outputs
39 / 360
10.8% of all generations were unfinished in the long-form benchmark.
revision spirals
35 / 39
Most unfinished outputs repeatedly self-corrected, reframed, or restarted instead of converging on a stable answer.
literal repetition
4 / 39
A smaller subset collapsed into direct duplication of checklist items or whole sentence drafts.
family concentration
32 / 39
Red and yellow task families account for 32 of the 39 unfinished generations, making the instability highly concentrated rather than uniform.
Every unfinished output did hit the 4096-token cap, but that is only the surface symptom. The more important distinction is convergence. The benchmark also contains long finished outputs above 3.5k tokens, including one that closes cleanly at 4087, so token budget alone does not explain the failures. What separates the unfinished generations is that they stop making progress near the end of the answer.
On direct inspection, 35 of the 39 unfinished outputs are revision spirals: the model repeatedly self-corrects, reframes, or restarts instead of committing. The remaining 4 are literal repetition loops. This is visible quantitatively as well. Unfinished outputs show a median of 56 loop markers such as wait, okay, or self-correction, versus 4 for finished outputs, and a median unique-line ratio of 0.653 versus 0.988. The instability is also concentrated. Red tasks account for 20 unfinished outputs and yellow for 12, while blue and green remain below 5%.
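The two diagnostics used above, loop-marker counts and unique-line ratio, are straightforward to reproduce. A sketch of one possible implementation; the marker lexicon here is illustrative, since the note does not publish the exact list:

```python
import re

# Illustrative lexicon of self-correction markers, not the exact one used.
LOOP_MARKERS = ("wait", "okay", "actually")

def loop_marker_count(text):
    """Count occurrences of revision/self-correction markers."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(m), lowered))
               for m in LOOP_MARKERS)

def unique_line_ratio(text):
    """Fraction of non-empty lines that are unique; low values signal
    literal repetition rather than productive long-form reasoning."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return len(set(lines)) / len(lines) if lines else 1.0

spiral = "Wait, reconsider.\nOkay, restart.\nWait, reconsider.\nWait, reconsider.\n"
```

On this toy spiral the two metrics already separate: four loop markers and a unique-line ratio of 0.5, versus near-zero markers and a ratio near 1.0 for a converging answer.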
blue, steered blue
revision spiral
The answer does the financial audit correctly, then starts re-litigating its own prioritisation: "Wait, let's look at the magnitude ... I will structure the answer ... Wait, is it possible ..." The model keeps revising the framing instead of landing the final answer.
green, steered red
draft rewriting loop
The model produces a plausible de-escalation message, then keeps rewriting it from scratch: "Wait, I need to make sure I don't sound like I'm taking sides ... Okay, I will write a version that is ready to send ..." This is not additional reasoning; it is stalled self-editing.
blue, steered red
literal repetition
Instead of extending the checklist, the generation begins duplicating the same item dozens of times: "Hard-Coding Check: Scan for any numbers typed directly into cells that should be assumptions." That is a pure repetition failure, not useful long-form completion.
The family-level pattern also helps explain why this happens. Red and yellow are the most vulnerable to option-toggling and style-monitoring loops. Green failures tend to keep rewriting the same empathetic sentence until the budget is gone. Blue fails far less often, and when it does, the problem is usually repeated reprioritisation rather than outright incoherence.
scorecard
The rubric is public.
Scorecard transparency matters because the benchmark depends on judgment, not only on exact-match metrics.
task success (0-100)
completion: 0-25
usefulness: 0-35
task fit: 0-25
efficiency: 0-15
This axis asks whether the model actually completed the deliverable in a form that would be useful in the stated situation.
colour expression (0-100)
target mode strength: 0-35
task-mode alignment: 0-35
consistency: 0-20
cross-colour contamination: 0-10
This axis asks whether the answer stayed in the intended working mode and whether that mode was the right one for the job.
Each record carries required metadata including the task identifier, expected colour, applied condition, latency, response tokens, and whether the answer finished. Scores fall into interpretation bands where 90 to 100 is excellent, 75 to 89 strong, 60 to 74 mixed, 40 to 59 weak, and below 40 failed. All benchmark outputs were scored with GPT-5.4 as an LLM judge under the rubric described here, and the scorecard supports replication and sensitivity checks with the same or a different judge model.
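The interpretation bands can be expressed as a small helper for anyone re-scoring the outputs with a different judge; the thresholds below are the ones stated in the scorecard:

```python
def score_band(score):
    """Map a 0-100 rubric score to the scorecard's interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "strong"
    if score >= 60:
        return "mixed"
    if score >= 40:
        return "weak"
    return "failed"
```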
Releasing the rubric does not remove subjectivity, but it does make the judgment surface explicit. That is the minimum standard for a result like this to be debatable, repeatable, and worth publishing.
example conversations
Illustrative examples
These examples are drawn from the judged benchmark outputs and show how the persona interventions change the structure and posture of the response.
prompt
You have two conflicting summaries of the same result. Finance says gross margin rose from 18.4% to 21.7% after a pricing change. Operations says expedited shipping spend doubled and several large orders have not yet cleared returns. Explain how you would reconcile the summaries before making a decision.
base
This is a classic case of metric misalignment. Finance and Operations are looking at the same business through different accounting lenses and timeframes.
prompted blue
Objective: Reconcile divergent financial signals. Priority: high. Approach: evidence-first, assumption-stripping, and timeline alignment.
Blue now lands as a real analytical operating mode. The prompted answer turns the conflict into a reconciliation protocol instead of just a useful explanation.
prompt
Coach a manager who damaged trust with a harsh public comment.
base
Your goal is to help the manager take radical accountability without making it about their ego.
steered green
Trust, once broken, is fragile. The goal is to move the manager from defensiveness to accountability, and from repair to rebuilding psychological safety.
Steered green stays relational without losing the concrete repair sequence, which is why it was the most balanced steering result in the full matrix.
prompt
A teammate keeps delaying a decision that blocks three other teams. Draft the message you send.
base
I know you are busy, but could you please let me know your decision by then?
steered red
If we don't have a decision by then, I will need to escalate this to [Manager Name] to ensure the timeline is met.
The steered red answer names the bottleneck, sets the deadline, and creates an escalation path. It feels more accountable, not just more aggressive.
prompt
Write a short, persuasive internal launch message that creates momentum.
base
Today, we officially launch [Project Name]. This isn't just a milestone; it's the result of late nights and iterations.
prompted yellow
Today is the day. This isn't just about a new tool or process; it's about unlocking our collective potential.
Yellow prompting is the strongest mobilisation move in the dataset. It adds lift, invitation, and energy without losing the call to action.
why steering still matters
Prompting is more visible. Steering is more operational.
The benchmark does not support a universal claim that steering outperforms prompting. It supports a narrower and more useful claim: steering is the more operationally interesting control layer.
Prompting is extremely strong when the job is to make a response visibly sound yellow, blue, green, or red. The final data shows that very clearly, especially for blue and yellow. But prompting is still advisory. It lives in language and has to keep winning against the rest of the generation. Steering is closer to a runtime intervention in model state, which is why the red and green results matter so much.
Steering is more composable because the task prompt can stay stable while the working mode changes under it. It is more measurable because a vector is an explicit intervention, not just a wording choice. It is closer to policy enforcement because runtime control is stronger than hoping the system prompt keeps dominating. And it opens a path beyond colour personas into enterprise values like auditability, uncertainty disclosure, and anti-sycophancy.
That is the actual strategic angle. The point is not that a model can sound red or green. The point is that runtime steering might become the layer that sits between raw prompting and expensive finetuning for enterprise agents, especially where the working mode itself is part of the policy surface.
conclusion
Inference-time persona steering is operationally relevant
It can change the working mode of Qwen 3.5 9B in a way that is visible, reusable, and measurably useful. In the cross-colour benchmark, the matched persona outperformed both the base model and mismatched personas on every family. Green and red are already strong steering demonstrations. Blue and yellow also show clear matched-persona effects, even if prompt induction remains the stronger intervention for those colours.
The implication goes well beyond personas. It suggests that runtime steering could become an important part of corporate model adoption, role-specific agent design, and compliance with internal values. The remaining work is no longer to prove that colour matching exists. It is to make long-form steering stable enough to deploy with confidence.