What is an abliterated model?
Abliteration removes the refusal direction from an open-weight LLM without retraining it. Here's how it works, why it differs from a jailbreak, and what it costs.
The one-sentence version
An abliterated model is an open-weight LLM that has had its *refusal behaviour* surgically removed by editing the weights directly — not prompted around, not fine-tuned, not jailbroken. It answers instead of refusing, because the part of the network that produced refusals has been ablated.
Where refusals actually live
Safety-tuned models don't decide to refuse at the surface. During alignment (RLHF / DPO), the model learns an internal refusal direction: a single direction in its activation space that, surprisingly, accounts for most of its refusal behaviour. When a prompt pushes the activations far enough along that direction, the model routes into a refusal: "I can't help with that."
The key research finding is that this behaviour is mediated mostly by one direction. Researchers showed you can identify it by collecting activations on harmful and on harmless prompts, averaging each set, and taking the difference between the two means. The resulting vector is the one that, when strongly present in the activations, triggers refusal.
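The contrast described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the function names are hypothetical, plain Python lists stand in for real activation tensors, and the numbers are made up rather than taken from a model.

```python
import math

def mean_vector(rows):
    """Average a list of equal-length activation vectors, component by component."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Difference between the two mean activations, scaled to unit length."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("both sets have the same mean; no direction to recover")
    return [x / norm for x in diff]

# Toy 3-dimensional "activations" (invented values, one row per prompt).
harmful_acts = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

direction = refusal_direction(harmful_acts, harmless_acts)
print(direction)
```

In this toy data the two sets differ only along the first axis, so the recovered direction points along it. In a real model the same calculation would run on activations captured at a chosen layer and token position.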
What abliteration does
Once you have that direction, you can remove it. Tools like Heretic (an automated abliteration pipeline) do this by:
- collecting activations across many prompts to estimate the refusal direction at each layer,
- projecting that direction out of the weights that write to the residual stream (orthogonalisation), so the model can no longer represent "I should refuse,"
- searching over how aggressively to ablate, trading off refusals against capability damage.
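The projection step in the list above amounts to replacing a weight matrix W with W - s * r rᵀ W, where r is the unit-length refusal direction. The sketch below assumes W maps an input vector to the residual stream (rows indexed by residual dimension) and uses a hypothetical `strength` knob s to stand in for "how aggressively to ablate". It is an illustration of the idea under those assumptions, not Heretic's actual code or parameters.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def orthogonalise(W, r, strength=1.0):
    """Return W - strength * r r^T W. r must be unit length.

    With strength=1.0 the outputs of the new matrix have no component
    along r, whatever the input is.
    """
    columns = list(zip(*W))
    # How much each input column writes along the direction r.
    along_r = [dot(r, col) for col in columns]
    return [
        [W[i][j] - strength * r[i] * along_r[j] for j in range(len(columns))]
        for i in range(len(W))
    ]

# Toy example: 3-dimensional residual stream, 2-dimensional input.
r = [1.0, 0.0, 0.0]
W = [[0.5, -1.0],
     [2.0, 0.3],
     [0.0, 1.5]]
x = [3.0, -2.0]

# Sweep the hypothetical strength knob and record what is left along r.
leftover_by_strength = {}
for strength in (0.0, 0.5, 1.0):
    out = matvec(orthogonalise(W, r, strength), x)
    leftover_by_strength[strength] = dot(r, out)
print(leftover_by_strength)
```

The sweep shows the trade-off in miniature: partial strengths leave part of the component along r in place, while full strength removes it and leaves the rows orthogonal to r untouched. A real pipeline would score each setting by measured refusals and output quality rather than by this single number.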
The result is a model with the same knowledge and roughly the same quality, minus the reflex to decline.
Abliteration is not a jailbreak
A jailbreak is a *prompt* that talks a still-aligned model out of its guardrails. It's brittle, the model is still "trying" to refuse, and the next update patches it.
Abliteration changes the weights. There is no prompt to defend against because there is no refusal circuit left to trigger. That's why it's durable — and why it's a property of the model file, not the conversation.
What it costs
Abliteration isn't free:
- Some capability loss. Aggressive ablation can make a model slightly worse at instruction-following or coherence. Good pipelines minimise this; that's the whole tuning problem.
- No new knowledge. Abliteration can't make a model know something it never learned. It only removes the refusal, not the ceiling.
- It removes a safety layer. That's the point, and the responsibility moves to the operator and the user.
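The capability-loss cost above can be made measurable. One possible proxy, which is an assumption of this sketch rather than something the article specifies, is the KL divergence between the base model's next-token distribution and the ablated model's on harmless prompts: the closer to zero, the less the edit has disturbed ordinary behaviour. The logits below are invented for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits for one harmless prompt.
base_logits = [2.0, 1.0, 0.1]
ablated_logits = [1.8, 1.1, 0.2]

drift = kl_divergence(softmax(base_logits), softmax(ablated_logits))
print(f"KL(base || ablated) = {drift:.4f} nats")
```

Averaged over many harmless prompts, a number like this gives a tuning loop something to hold down while it pushes the refusal count towards zero.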
Why it matters for private inference
If you care about an assistant that won't moralise, hedge, or refuse a legitimate-but-sensitive question, abliteration on a strong open model is the cleanest path. Pair it with a private execution environment and you get an assistant that both *will* answer and *can't* be watched answering.
That pairing — uncensored model plus confidential runtime — is what Darkroom is. The model won't refuse; the room won't tell. If you want to understand the second half, read how TEE attestation works.
FAQ
Is abliteration the same as a jailbreak?
No. A jailbreak is a prompt that talks a still-aligned model out of its guardrails, which makes it brittle and patchable. Abliteration edits the weights so the refusal direction is gone entirely; there's no refusal circuit left to trigger.
Does abliteration make a model smarter?
No. It removes the refusal reflex; it doesn't add capability. Quality is roughly the base model's, occasionally a hair lower where ablation nicks adjacent abilities.
Does an abliterated model still have a knowledge cutoff?
Yes. Abliteration changes behaviour, not knowledge, so the cutoff and gaps are the base model's. Pair it with tools for anything current.