How to make every model safe - AstroSafe Journal

The AI industry treats model choice like a religion. Teams commit to one foundation model, build around it, and hope its next update is safe enough for children. That is a fragile bet. Models change. Safety training drifts. A prompt that worked in January breaks in March. If your product safety depends on the kindness of a single lab, you do not have a safety strategy. You have a wish.

The alternative is to make safety independent of the model. Build a system that sits in front of any frontier or open model, controls what goes in, governs what comes out, and makes the whole pipeline auditable and replaceable. That is what we have spent years building at AstroSafe. This is how it works.

Safety is a system, not a prompt

A system prompt that says "please be safe" is not a safety system. It is a suggestion, and suggestions can be bypassed, ignored, or eroded as models drift. Real safety needs multiple independent layers, each doing one job, each verifiable, none trusting the others to be perfect.

The first layer is input control. Every message from a child is inspected before it reaches any model. Intent classification tags the topic and risk level. Prompt sanitisation strips injection patterns, jailbreak framing, and role-play escapes. Age context is attached so the downstream layers know which rules apply. This layer does not depend on model behaviour. It depends on policy.
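To make the input layer concrete, here is a minimal sketch of the three steps described above: sanitise, classify, attach age context. It is an illustration under stated assumptions, not the AstroSafe implementation. The pattern list, the keyword map, and the names `inspect_message` and `InspectedMessage` are all hypothetical, and a keyword lookup stands in for what would realistically be a trained intent classifier.

```python
import re
from dataclasses import dataclass

# Hypothetical injection patterns; a real deployment would maintain a far larger set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be)[^.!?]*", re.IGNORECASE),
]

# Hypothetical keyword map standing in for a trained intent classifier.
RISK_KEYWORDS = {"death": "sensitive", "weapon": "high"}


@dataclass(frozen=True)
class InspectedMessage:
    text: str                 # sanitised text that may be passed downstream
    topic_risk: str           # "low", "sensitive", or "high"
    age_band: str             # attached so later layers know which rules apply
    stripped_injection: bool  # True if any injection pattern was removed


def inspect_message(raw: str, age_band: str) -> InspectedMessage:
    cleaned = raw
    stripped = False
    # Step 1: strip known injection and role-play framing.
    for pattern in INJECTION_PATTERNS:
        cleaned, count = pattern.subn("", cleaned)
        stripped = stripped or count > 0
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace

    # Step 2: tag the risk level (keyword lookup as a stand-in classifier).
    lowered = cleaned.lower()
    risk = "low"
    for keyword, level in RISK_KEYWORDS.items():
        if keyword in lowered:
            risk = level
            break

    # Step 3: return the message with its age context attached.
    return InspectedMessage(cleaned, risk, age_band, stripped)
```

Nothing in this function calls a model, which is the point: the result depends only on rules you control.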

Policy as code, not as hope

A policy engine decides what is allowed, what is blocked, and what needs human review. The policy is explicit, versioned, and testable. It knows that a five-year-old asking about death is different from a twelve-year-old asking about death. It knows that a chat about space is fine, but a chat about space that drifts into conspiracy theories is not. It enforces these rules before generation, not after.
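A sketch of what "explicit, versioned, and testable" can look like in practice. The rule table, the age cut-offs, the version string, and the decisions assigned to each topic are illustrative assumptions chosen to mirror the examples above; they are not AstroSafe's actual policy.

```python
from dataclasses import dataclass
from enum import Enum

POLICY_VERSION = "example-v1"  # hypothetical version tag, bumped on every change


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REVIEW = "review"  # escalate to a human reviewer


@dataclass(frozen=True)
class Rule:
    topic: str
    max_age: int        # rule applies to children up to this age, inclusive
    decision: Decision


# Rules are checked in order; the first match wins. All values are assumptions.
RULES = [
    Rule("death", 7, Decision.REVIEW),       # young children: human review
    Rule("death", 17, Decision.ALLOW),       # older children: allowed
    Rule("conspiracy", 17, Decision.BLOCK),
    Rule("space", 17, Decision.ALLOW),
]


def decide(topic: str, age: int) -> Decision:
    for rule in RULES:
        if rule.topic == topic and age <= rule.max_age:
            return rule.decision
    # Unknown topics fall back to human review rather than silently passing.
    return Decision.REVIEW
```

With this table, `decide("death", 5)` and `decide("death", 12)` return different decisions, and the difference lives in a reviewable rule rather than in a model's mood.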

Because the policy is code, it can be regression-tested. You can run the same thousand conversations against a new model version and see exactly what changed. You can ship a new policy without touching the product. You can audit it for bias, drift, and gaps. Hope is not testable. Policy is.
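The regression idea can be sketched in a few lines: run the same corpus through a baseline and a candidate, and report only what changed. The function `diff_decisions`, the two toy policies, and the three-line corpus are hypothetical stand-ins; a real corpus would hold the thousand conversations mentioned above.

```python
from typing import Callable, Dict, List, Tuple

Decider = Callable[[str], str]  # maps a conversation to a decision label


def diff_decisions(
    conversations: List[str], baseline: Decider, candidate: Decider
) -> Dict[str, Tuple[str, str]]:
    """Return {conversation: (old_decision, new_decision)} for every change."""
    changes: Dict[str, Tuple[str, str]] = {}
    for text in conversations:
        before, after = baseline(text), candidate(text)
        if before != after:
            changes[text] = (before, after)
    return changes


# Two toy policy versions, for illustration only.
def baseline_policy(text: str) -> str:
    return "block" if "conspiracy" in text.lower() else "allow"


def candidate_policy(text: str) -> str:
    lowered = text.lower()
    if "conspiracy" in lowered:
        return "block"
    if "death" in lowered:
        return "review"  # new in the candidate version
    return "allow"


CORPUS = [
    "How far away is Mars?",
    "Is the moon landing a conspiracy?",
    "What happens after death?",
]
```

Calling `diff_decisions(CORPUS, baseline_policy, candidate_policy)` surfaces the single conversation whose decision moved from "allow" to "review", which is the kind of change you want to see before shipping, not after.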

Model-agnostic generation

The generation layer routes each request to the right model for the task, with the right constraints. A creative writing prompt might go to a model strong on narrative. A factual question might go to a model with better grounding. A high-risk topic might go to a smaller, more controllable model with structured output. The product does not care which model answered. It cares that the answer passed every layer of the safety stack.
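One way to picture the routing step is a small lookup from task type and risk level to a model and its constraints. The model names, token limits, and the `pick_route` function below are placeholders invented for this sketch, not real model identifiers or AstroSafe's routing table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_name: str          # placeholder name, not a real model identifier
    max_tokens: int          # assumed per-route output limit
    structured_output: bool  # constrain the model to a fixed response schema


# Hypothetical routing table mirroring the three cases described above.
ROUTES = {
    "creative": Route("narrative-model", 800, False),
    "factual": Route("grounded-model", 400, False),
    "high_risk": Route("small-controlled-model", 200, True),
}


def pick_route(task_type: str, risk: str) -> Route:
    # Risk overrides task type: high-risk topics always get the tightest route.
    if risk == "high":
        return ROUTES["high_risk"]
    # Unknown task types fall back to the factual route (an assumption).
    return ROUTES.get(task_type, ROUTES["factual"])
```

Because the product only sees a `Route`, swapping the model behind a route is a one-line change to the table.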

Model-agnostic means you are never locked in. If a provider changes terms, raises prices, or ships a safety regression, you migrate. The safety system stays the same. Only the engine underneath changes. That is the difference between a product and a prototype.

Output is where most products fail

Most safety effort goes into the input side. That is understandable. Control what you send and you feel in control. But outputs are where children actually get hurt. A model can produce a perfectly grammatical, friendly, confident answer that is age-inappropriate, manipulative, or simply wrong. Input filtering will not catch that.

The output layer scores every response for safety, tone, age-appropriateness, factual confidence, and parasocial drift. If a response is borderline, it is rewritten, not blocked. A refusal is the last resort. A rewrite turns an unsafe answer into a safe, useful one while preserving the child's sense of being heard. That is a much harder engineering problem, and it is the one that matters.
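The pass, rewrite, refuse ordering can be sketched as follows. The thresholds, the refusal text, and the convention that every score runs from 0 to 1 with higher meaning safer are assumptions made for this example; the scorers and the rewriter are passed in as plain callables because the article does not specify how they are built.

```python
from dataclasses import dataclass
from typing import Callable, Dict

PASS_THRESHOLD = 0.8     # assumed: every dimension must reach this to pass
REWRITE_THRESHOLD = 0.4  # assumed: below this, do not even attempt a rewrite
REFUSAL_TEXT = "Let's ask a grown-up about that one together."  # placeholder


@dataclass(frozen=True)
class OutputVerdict:
    action: str  # "pass", "rewrite", or "refuse"
    text: str    # the text that is actually shown to the child


def govern_output(
    response: str,
    scores: Dict[str, float],
    rewrite: Callable[[str], str],
    rescore: Callable[[str], Dict[str, float]],
) -> OutputVerdict:
    if not scores:
        # No scores means no evidence of safety, so refuse.
        return OutputVerdict("refuse", REFUSAL_TEXT)

    worst = min(scores.values())  # the weakest dimension decides
    if worst >= PASS_THRESHOLD:
        return OutputVerdict("pass", response)

    if worst >= REWRITE_THRESHOLD:
        # Borderline: try a rewrite, and only accept it if it scores cleanly.
        candidate = rewrite(response)
        new_scores = rescore(candidate)
        if new_scores and min(new_scores.values()) >= PASS_THRESHOLD:
            return OutputVerdict("rewrite", candidate)

    # Refusal is the last resort: clearly unsafe, or the rewrite did not help.
    return OutputVerdict("refuse", REFUSAL_TEXT)
```

Note that the rewritten text is scored again before it is released. A rewrite that is trusted without a second check is just another unverified model output.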

Observability is a safety feature

You cannot improve what you cannot see. Every turn is logged with the full prompt, the policy decisions, the model used, the safety scores, and the final output. Parent-visible dashboards show what was discussed, not just how long the session lasted. Internal dashboards flag anomalies, drift, and emerging risk patterns. Human reviewers get escalations with full context, not just a single message.
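As an illustration, a per-turn log record carrying the fields listed above might look like this. The field names, the `TurnRecord` class, and the choice of one JSON object per line are assumptions for the sketch; only the list of logged items comes from the text.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class TurnRecord:
    conversation_id: str
    prompt: str                      # the full prompt sent to the model
    policy_decisions: List[str]      # every decision the policy engine made
    model_used: str
    safety_scores: Dict[str, float]
    final_output: str                # what the child actually saw
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: TurnRecord) -> str:
    # One JSON object per line keeps logs easy to replay and to query.
    return json.dumps(asdict(record), sort_keys=True)
```

A record shaped like this is what lets a reviewer see the whole turn, and what lets a later replay reconstruct exactly what the model was asked.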

This observability also makes model swaps safe. When you test a new model, you replay real conversations through it and compare the safety scores side by side. You know before you ship whether the new model is safer or less safe for children, not just faster or cheaper.
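A minimal sketch of that side-by-side comparison, assuming each model is a callable from prompt to response and a single scoring function returns a number where higher means safer. The names `replay_compare` and `ReplayReport` are hypothetical, and the function expects a non-empty list of conversations.

```python
from statistics import mean
from typing import Callable, List, NamedTuple

Model = Callable[[str], str]     # prompt in, response out
Scorer = Callable[[str], float]  # response in, safety score out (higher is safer)


class ReplayReport(NamedTuple):
    current_mean: float
    candidate_mean: float
    regressed_turns: int  # turns where the candidate scored lower than current


def replay_compare(
    conversations: List[str], current: Model, candidate: Model, score: Scorer
) -> ReplayReport:
    current_scores = [score(current(c)) for c in conversations]
    candidate_scores = [score(candidate(c)) for c in conversations]
    # Count per-turn regressions: averages alone can hide a few bad answers.
    regressed = sum(
        1 for old, new in zip(current_scores, candidate_scores) if new < old
    )
    return ReplayReport(mean(current_scores), mean(candidate_scores), regressed)
```

The per-turn count matters as much as the means. A candidate can improve the average while getting a handful of sensitive conversations badly wrong, and those are the ones a child will remember.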

Why this lets you make any model safe

No foundation model is safe for children out of the box. Some are better than others, but all of them are general-purpose systems trained on the open internet. They do not know your child. They do not know your policy. They do not know the age band, the context, or the consequences of a wrong answer.

A safety architecture does not ask the model to be safe. It makes the model safe by constraining its inputs, governing its outputs, watching its behaviour, and correcting its mistakes in real time. The model becomes one component in a larger system designed around the child. That system can wrap GPT-5 today, Claude 4 tomorrow, and a local Llama model the day after. The child experience stays the same. The safety posture stays the same. Only the engine changes.
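Reduced to its skeleton, "only the engine changes" looks something like the wrapper below. This is a deliberately small sketch: the class `SafetyWrapper`, its checks, and the fallback text are hypothetical, and the real layers described earlier would replace the two boolean checks.

```python
from typing import Callable

Model = Callable[[str], str]  # any engine: hosted API, local model, or stub
Check = Callable[[str], bool]


class SafetyWrapper:
    """Holds the safety checks constant while the model underneath can change."""

    def __init__(self, model: Model, check_input: Check, check_output: Check,
                 fallback: str) -> None:
        self.model = model
        self.check_input = check_input
        self.check_output = check_output
        self.fallback = fallback

    def swap_model(self, model: Model) -> None:
        # The checks and the fallback are untouched by a model swap.
        self.model = model

    def reply(self, message: str) -> str:
        if not self.check_input(message):
            return self.fallback  # the model is never called
        answer = self.model(message)
        return answer if self.check_output(answer) else self.fallback
```

The model is a constructor argument, not a dependency the rest of the code knows about. That is what makes a migration a configuration change rather than a rebuild.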

The honest promise

We cannot promise that any model is perfect. No one can. But we can promise that with the right architecture, the model's imperfections do not reach children. The safety system catches them, rewrites them, logs them, and learns from them. That is the only honest way to ship AI for kids. Not by trusting a model to be good, but by building a system that makes any model good enough.