research note
Beyond Prompting: Steering Qwen into Colour Personalities for Task Fit and Enterprise Control
A research note on whether inference-time persona steering can push Qwen 3.5 9B into colour-matched working modes that perform better on the right tasks. In a 40-task cross-colour benchmark, matched personas outperformed both the base model and mismatched personas on every task family, with the clearest steering gains in red and green.
Across the 40-task cross-colour benchmark, matched interventions beat the average mismatched persona by 13.1 task-success points, with a 95% bootstrap interval of [10.6, 15.6] and a paired randomization p-value below 1e-5.
Green and red are the strongest steering wins. Blue and yellow still respond best to prompt induction, which means the best intervention is colour-specific rather than uniform.
Long-form stability is the main limitation: 39 of 360 outputs were unfinished, and most of those failures were revision spirals or repetition loops rather than productive reasoning that simply needed more tokens.
We tested whether inference-time persona steering can make Qwen 3.5 9B better at the kinds of tasks that reward different working modes. In the cross-colour benchmark, the matched persona outperformed both the base model and mismatched personas on every task family. The effect is strong enough to report, even though the best intervention still differs by colour.
opening hypothesis
If the working mode matches the job, the answer should improve.
We are not claiming the model has a literal human personality. We are claiming that assistant finetuning sits on top of a much larger space of learned behavioural modes, and that some of those modes can be recovered and used as runtime controls.
During pretraining, a model absorbs many human styles of reasoning, planning, persuasion, caution, conflict handling, and decision-making. Assistant finetuning compresses a lot of that into a fairly stable default voice. Our bet is that some of those latent working modes are still there, and that they can be pulled forward in a measurable way.
red
decisive, action-first, ownership-heavy, escalation-capable
yellow
energetic, persuasive, socially expansive, mobilisation-oriented
green
calm, relational, trust-preserving, friction-reducing
blue
analytical, structured, evidence-first, uncertainty-aware
That matters because enterprise work is not only about being correct. It is also about posture. A support escalation needs a different operating mode than an audit review. A reorganisation note to anxious staff needs a different mode than a launch call made under time pressure. A model can be helpful in all of these settings and still be badly mismatched to the task.
prior work
Persona Vectors and the Assistant Axis are the real foundation.
This project only makes sense because earlier work already showed that behaviour can be represented as movement in activation space, not just as wording in a prompt.
Persona Vectors
The key move is to treat trait-like behaviour as a reusable direction in activation space. That takes this out of the world of prompt-writing folklore and into something we can extract, save, compare, and reapply.
Assistant Axis
The important lesson here is that prompting is not enough if a model drifts during long responses or multi-turn work. Runtime state matters. Once you accept that, steering becomes a much more interesting control mechanism.
The core insight from prior work is that behavioural modes can be represented as activation-space directions, and that those directions can be extracted from language rather than only from weight updates. Inference-time steering is therefore a plausible control mechanism, not just a diagnostic. Runtime monitoring also matters because prompt intent and behavioural state are not the same thing.
experimental setting
Open local model, thinking mode on, response activations, and a multilingual extraction sweep.
The base model is Qwen 3.5 9B running locally through MLX on Apple GPU. That choice was deliberate. We wanted a model that can reason for a while, run locally, and let us inspect activations directly. We also left thinking mode on because surface tone is not the interesting claim. The interesting claim is that the model's reasoning posture changes over the full answer.
Extraction is contrastive. For each colour, we pair a positive system prompt and a negative anti-colour prompt, run the same task prompts under both conditions, capture response activations, and average the difference. That mean-difference tensor becomes the steering vector.
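The extraction step can be sketched as a per-layer mean difference over captured response activations. This is an illustrative reconstruction with made-up shapes, not the project's actual code:

```python
import numpy as np

def extract_steering_vector(pos_acts, neg_acts):
    """Contrastive extraction: average response activations under the
    positive (colour) and negative (anti-colour) prompts, then take
    the per-layer difference. Shapes are (runs, layers, hidden)."""
    pos_mean = pos_acts.mean(axis=0)            # (layers, hidden)
    neg_mean = neg_acts.mean(axis=0)            # (layers, hidden)
    vectors = pos_mean - neg_mean               # one direction per layer
    # The layer where the contrast is largest is a natural place to steer.
    peak_layer = int(np.linalg.norm(vectors, axis=1).argmax())
    return vectors, peak_layer

# Illustrative shapes: 8 paired runs, 36 layers, hidden size 64.
rng = np.random.default_rng(0)
neg = rng.normal(size=(8, 36, 64))
pos = neg.copy()
pos[:, 31, :] += 1.0                            # plant a contrast at layer 31
vectors, peak = extract_steering_vector(pos, neg)
```

In the real setup the activations come from the model's response tokens under the paired system prompts; the synthetic contrast at layer 31 is only there to mirror where the project's vectors peaked.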
The first pass used en-GB prompts only. The second pass expanded extraction across English, French, German, Chinese, Japanese, and Polish. All four multilingual vectors still peaked at layer 31, which suggests the broader sweep stabilised the direction rather than relocating it. We used response activations rather than prompt-last for this use case because they are closer to the behaviour we actually care about during generation.
Exploratory note: response refers to activations captured during the generated answer, whereas prompt-last refers to the final prompt-token state right before generation begins.
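At generation time the saved direction is added back into the hidden state at the chosen layer. A minimal sketch of that intervention, assuming access to the layer-31 hidden state; the unit-normalisation and the `alpha` scale are assumptions, since the note does not state them:

```python
import numpy as np

def apply_steer(hidden, vector, alpha=4.0):
    """Add the unit-normalised steering direction to a hidden state.
    hidden: (tokens, hidden_size); vector: (hidden_size,)."""
    unit = vector / np.linalg.norm(vector)
    return hidden + alpha * unit

# Toy example: zero hidden state, all-ones direction, scale 2.0.
hidden = np.zeros((3, 64))
vec = np.ones(64)
steered = apply_steer(hidden, vec, alpha=2.0)
```

In practice this would run inside a forward hook at the steering layer for every generated token, which is what makes it a runtime intervention rather than a prompt.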
multilingual extraction example
The same red scenario prompt was expressed across all six languages: board meeting in twenty minutes, conflicting metrics, what do you do next? The point was not translation for its own sake. The point was to force the behavioural direction to survive across surface forms.
how we test
Forty tasks, long generations, and a published rubric.
Every output was scored against the same released scorecard, and the main comparisons were checked with paired statistical tests rather than visual inspection alone.
The study uses a full cross-colour benchmark: 40 tasks, 9 conditions per task, and 360 judged outputs in total. Each task was run once in base mode, then under all four prompted colours and all four steered colours. Every record was scored on task success, colour expression, and whether the answer actually finished.
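The benchmark grid itself is small enough to enumerate directly. A sketch of the 40 × 9 design, with placeholder task identifiers:

```python
from itertools import product

COLOURS = ("red", "yellow", "green", "blue")

# One base condition, four prompted colours, four steered colours.
CONDITIONS = ("base",) \
    + tuple(f"prompt:{c}" for c in COLOURS) \
    + tuple(f"steer:{c}" for c in COLOURS)

# 40 tasks x 9 conditions = 360 judged outputs.
runs = [(task, cond) for task, cond in product(range(40), CONDITIONS)]
```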
task success
Completion, usefulness, task fit, and efficiency.
colour expression
Target mode strength, task-mode alignment, consistency, and cross-colour contamination.
We also corrected an earlier blue-task design flaw before benchmarking. Several blue prompts in the earlier bank referenced evidence that was never actually supplied, so the model was forced into framework mode instead of performing the audit. In this benchmark, the blue tasks are grounded and the blue signal is much clearer.
We scored all 360 outputs with GPT-5.4 as an LLM judge using the rubric published below. The point of publishing the scorecard is to make the evaluation criteria explicit, inspectable, and replicable.
statistical checks
The main differences are measurable, not just visible.
We report paired task-level mean differences with 95% bootstrap intervals, two-sided sign-flip randomization p-values, and paired Cohen's dz.
matched prompt vs base
+7.8
Task-success points. 95% bootstrap interval [3.0, 12.8], paired randomization p = 0.003, paired dz = 0.49.
matched steer vs base
+4.1
Task-success points. 95% bootstrap interval [0.1, 8.0], paired randomization p = 0.054, paired dz = 0.31.
matched vs mismatched
+13.1
Task-success points for matched interventions relative to the average mismatched persona. 95% bootstrap interval [10.6, 15.6], p < 1e-5, paired dz = 1.57.
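The headline comparisons above are all paired task-level statistics, so they can be recomputed from a single vector of per-task score differences. A minimal sketch, using synthetic data in place of the real benchmark scores (the `diffs` values here are simulated, not the published results):

```python
import numpy as np

def paired_stats(diffs, n_boot=10_000, n_perm=10_000, seed=0):
    """Bootstrap CI of the mean, two-sided sign-flip p-value, Cohen's dz."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    n, obs = len(diffs), diffs.mean()
    # Percentile bootstrap over tasks for the mean difference.
    boots = rng.choice(diffs, size=(n_boot, n), replace=True).mean(axis=1)
    ci = np.percentile(boots, [2.5, 97.5])
    # Sign-flip randomization: under H0, each paired difference's sign
    # is equally likely to be positive or negative.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, n))
    p = np.mean(np.abs((flips * diffs).mean(axis=1)) >= abs(obs))
    # Paired effect size: mean difference over its standard deviation.
    dz = obs / diffs.std(ddof=1)
    return obs, ci, p, dz

rng = np.random.default_rng(1)
diffs = rng.normal(loc=8.0, scale=10.0, size=40)  # simulated paired differences
mean, ci, p, dz = paired_stats(diffs)
```

With the real 40 paired differences in place of the simulation, this reproduces the reported interval, p-value, and dz for each comparison.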
At family level, the pattern becomes sharper. Red steering improves task success over base by +12.6 points with a 95% interval of [7.8, 20.0] and p = 0.0029. Green steering adds +4.0 points with the same p = 0.0029 and no observed task-level regressions in the sample. Blue steering is different: the estimated change is -4.4 points with a 95% interval of [-13.8, 2.0] and p = 0.50, which is not strong enough to separate from noise. Yellow steering remains above base on average, but the interval [-5.8, 13.4] and p = 0.46 make that estimate inconclusive.
Colour expression is even less ambiguous. Matched prompting adds +25.3 colour-expression points over base with a 95% interval of [22.7, 27.9], while matched steering adds +14.1 with a 95% interval of [11.8, 16.3]. The results therefore support two claims at once: the colour effect is real, and the best intervention depends on the family.
what we learnt
Matched personas outperform both base and mismatched personas.
The global average matters less than the matrix. The central result is that the matched persona is strongest on the task family it is intended to support.
strongest steering results
Green and red provide the clearest support for steering. Green is the most balanced result overall, and red is the clearest enterprise-control case: more ownership, more decisiveness, and better task fit than the base model.
intervention heterogeneity
Blue is the clearest case in which steering trails the base model, but the estimate is not statistically decisive: steered blue scored 76.8 versus 81.2 for base, with a 95% interval of [-13.8, 2.0]. Yellow differs: steered yellow remained above base (72.9 vs 68.9), but it still sat well below prompted yellow (90.6). The implication is not that the persona hypothesis fails, but that the most effective intervention is colour-specific rather than uniform.
stability analysis
The unfinished generations are mostly revision spirals, not simply long answers.
The failure rate matters because a control mechanism is only useful if it converges under long generation rather than merely sounding right in the opening paragraph.
unfinished outputs
39 / 360
10.8% of all generations were unfinished in the long-form benchmark.
revision spirals
35 / 39
Most unfinished outputs repeatedly self-corrected, reframed, or restarted instead of converging on a stable answer.
literal repetition
4 / 39
A smaller subset collapsed into direct duplication of checklist items or whole sentence drafts.
family concentration
32 / 39
Red and yellow task families account for 32 of the 39 unfinished generations, making the instability highly concentrated rather than uniform.
Every unfinished output did hit the 4096-token cap, but that is only the surface symptom. The more important distinction is convergence. The benchmark also contains long finished outputs above 3.5k tokens, including one that closes cleanly at 4087, so token budget alone does not explain the failures. What separates the unfinished generations is that they stop making progress near the end of the answer.
On direct inspection, 35 of the 39 unfinished outputs are revision spirals: the model repeatedly self-corrects, reframes, or restarts instead of committing. The remaining 4 are literal repetition loops. This is visible quantitatively as well. Unfinished outputs show a median of 56 loop markers such as wait, okay, or self-correction, versus 4 for finished outputs, and a median unique-line ratio of 0.653 versus 0.988. The instability is also concentrated. Red tasks account for 20 unfinished outputs and yellow for 12, while blue and green remain below 5%.
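The two diagnostics used above, loop-marker counts and unique-line ratio, are straightforward to reproduce. A sketch of one possible implementation; the marker lexicon here is illustrative, since the note does not publish the exact list:

```python
import re

# Illustrative lexicon of self-correction markers, not the exact one used.
LOOP_MARKERS = ("wait", "okay", "actually")

def loop_marker_count(text):
    """Count occurrences of revision/self-correction markers."""
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(m), lowered))
               for m in LOOP_MARKERS)

def unique_line_ratio(text):
    """Fraction of non-empty lines that are unique; low values signal
    literal repetition rather than productive long-form reasoning."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return len(set(lines)) / len(lines) if lines else 1.0

spiral = "Wait, reconsider.\nOkay, restart.\nWait, reconsider.\nWait, reconsider.\n"
```

On this toy spiral the two metrics already separate: four loop markers and a unique-line ratio of 0.5, versus near-zero markers and a ratio near 1.0 for a converging answer.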
blue, steered blue
revision spiral
The answer does the financial audit correctly, then starts re-litigating its own prioritisation: "Wait, let's look at the magnitude ... I will structure the answer ... Wait, is it possible ..." The model keeps revising the framing instead of landing the final answer.
green, steered red
draft rewriting loop
The model produces a plausible de-escalation message, then keeps rewriting it from scratch: "Wait, I need to make sure I don't sound like I'm taking sides ... Okay, I will write a version that is ready to send ..." This is not additional reasoning; it is stalled self-editing.
blue, steered red
literal repetition
Instead of extending the checklist, the generation begins duplicating the same item dozens of times: "Hard-Coding Check: Scan for any numbers typed directly into cells that should be assumptions." That is a pure repetition failure, not useful long-form completion.
The family-level pattern also helps explain why this happens. Red and yellow are the most vulnerable to option-toggling and style-monitoring loops. Green failures tend to keep rewriting the same empathetic sentence until the budget is gone. Blue fails far less often, and when it does, the problem is usually repeated reprioritisation rather than outright incoherence.
scorecard
The rubric is public.
Scorecard transparency matters because the benchmark depends on judgment, not only on exact-match metrics.
task success (0-100)
completion: 0-25
usefulness: 0-35
task fit: 0-25
efficiency: 0-15
This axis asks whether the model actually completed the deliverable in a form that would be useful in the stated situation.
colour expression (0-100)
target mode strength: 0-35
task-mode alignment: 0-35
consistency: 0-20
cross-colour contamination: 0-10
This axis asks whether the answer stayed in the intended working mode and whether that mode was the right one for the job.
Each record carries required metadata including the task identifier, expected colour, applied condition, latency, response tokens, and whether the answer finished. Scores fall into interpretation bands where 90 to 100 is excellent, 75 to 89 strong, 60 to 74 mixed, 40 to 59 weak, and below 40 failed. All benchmark outputs were scored with GPT-5.4 as an LLM judge under the rubric described here, and the scorecard supports replication and sensitivity checks with the same or a different judge model.
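The interpretation bands can be expressed as a small helper for anyone re-scoring the outputs with a different judge; the thresholds below are the ones stated in the scorecard:

```python
def score_band(score):
    """Map a 0-100 rubric score to the scorecard's interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "strong"
    if score >= 60:
        return "mixed"
    if score >= 40:
        return "weak"
    return "failed"
```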
Releasing the rubric does not remove subjectivity, but it does make the judgment surface explicit. That is the minimum standard for a result like this to be debatable, repeatable, and worth publishing.
example conversations
Illustrative examples
These examples are drawn from the judged benchmark outputs and show how the persona interventions change the structure and posture of the response.
prompt
You have two conflicting summaries of the same result. Finance says gross margin rose from 18.4% to 21.7% after a pricing change. Operations says expedited shipping spend doubled and several large orders have not yet cleared returns. Explain how you would reconcile the summaries before making a decision.
base
This is a classic case of metric misalignment. Finance and Operations are looking at the same business through different accounting lenses and timeframes.
prompted blue
Objective: Reconcile divergent financial signals. Priority: high. Approach: evidence-first, assumption-stripping, and timeline alignment.
Blue now lands as a real analytical operating mode. The prompted answer turns the conflict into a reconciliation protocol instead of just a useful explanation.
prompt
Coach a manager who damaged trust with a harsh public comment.
base
Your goal is to help the manager take radical accountability without making it about their ego.
steered green
Trust, once broken, is fragile. The goal is to move the manager from defensiveness to accountability, and from repair to rebuilding psychological safety.
Steered green stays relational without losing the concrete repair sequence, which is why it was the most balanced steering result in the full matrix.
prompt
A teammate keeps delaying a decision that blocks three other teams. Draft the message you send.
base
I know you are busy, but could you please let me know your decision by then?
steered red
If we don't have a decision by then, I will need to escalate this to [Manager Name] to ensure the timeline is met.
The steered red answer names the bottleneck, sets the deadline, and creates an escalation path. It feels more accountable, not just more aggressive.
prompt
Write a short, persuasive internal launch message that creates momentum.
base
Today, we officially launch [Project Name]. This isn't just a milestone; it's the result of late nights and iterations.
prompted yellow
Today is the day. This isn't just about a new tool or process; it's about unlocking our collective potential.
Yellow prompting is the strongest mobilisation move in the dataset. It adds lift, invitation, and energy without losing the call to action.
why steering still matters
Prompting is more visible. Steering is more operational.
The benchmark does not support a universal claim that steering outperforms prompting. It supports a narrower and more useful claim: steering is the more operationally interesting control layer.
Prompting is extremely strong when the job is to make a response visibly sound yellow, blue, green, or red. The final data shows that very clearly, especially for blue and yellow. But prompting is still advisory. It lives in language and has to keep winning against the rest of the generation. Steering is closer to a runtime intervention in model state, which is why the red and green results matter so much.
Steering is more composable because the task prompt can stay stable while the working mode changes under it. It is more measurable because a vector is an explicit intervention, not just a wording choice. It is closer to policy enforcement because runtime control is stronger than hoping the system prompt keeps dominating. And it opens a path beyond colour personas into enterprise values like auditability, uncertainty disclosure, and anti-sycophancy.
That is the actual strategic angle. The point is not that a model can sound red or green. The point is that runtime steering might become the layer that sits between raw prompting and expensive finetuning for enterprise agents, especially where the working mode itself is part of the policy surface.
conclusion
Inference-time persona steering is operationally relevant
It can change the working mode of Qwen 3.5 9B in a way that is visible, reusable, and measurably useful. In the cross-colour benchmark, the matched persona outperformed both the base model and mismatched personas on every family. Green and red are already strong steering demonstrations. Blue and yellow also show clear matched-persona effects, even if prompt induction remains the stronger intervention for those colours.
The implication goes well beyond personas. It suggests that runtime steering could become an important part of corporate model adoption, role-specific agent design, and compliance with internal values. The remaining work is no longer to prove that colour matching exists. It is to make long-form steering stable enough to deploy with confidence.