Coaching AI: A Relational Approach to AI Safety

A couple of weekends ago, my family was at Lalbagh Botanical Garden in Bangalore. After walking through a crowded mango exhibition, my 8-year-old offered to fetch her grandparents, who were walking slowly behind us. We waited outside the exhibition hall.

Five minutes passed. Then ten. Then fifteen. The grandparents emerged from the hall, but my daughter had vanished. After thirty anxious minutes, we found her perched calmly on a nearby hilltop, scanning the garden below like a baby hawk.

Her reasoning was logical. She remembered where her grandparents had last stopped (a street vendor) and went to look for them there. When she didn’t find them, she climbed up a hillock for a bird’s-eye view. Perfectly reasonable, except she had completely missed that they had entered the hall behind her.

Her model of the world hadn’t updated with new context, so she pursued the wrong goal with increasing confidence. From her perspective, she was being helpful and clever. From ours, she was very much lost.

The Confident Pursuit of the Wrong Objective

This is a familiar pattern in AI: systems escalate confidently along a flawed trajectory. My daughter’s problem wasn’t a lack of reasoning; it was good reasoning on a bad foundation.

Large models exhibit this all the time. An LLM misinterprets a prompt and confidently generates pages of on-topic-but-wrong text. A recommendation engine over-indexes on ironic engagement. These systems demonstrate creativity, optimisation, and persistence, but in the service of goals that no longer reflect the world.

This, I learnt, is framed in AI in terms of distributional shift or exposure bias. Training on narrow or static contexts leads to brittleness in deployment. When feedback loops fail to re-anchor a system’s assumptions, it just keeps going, confidently and wrongly.

Why Interpretability and Alignment May Not Be Enough

Afterward, I tried to understand where my daughter’s reasoning went wrong. But I also realised that even perfect transparency into her thoughts wouldn’t have helped in the moment. I could interpret her reasoning afterward, but I couldn’t intervene in it as it unfolded. What she needed wasn’t analysis. She needed a tap on the shoulder, and just a question (not a correction, mind you) – “Where are you going, and why?” 

This reflects a limitation in many current safety paradigms. Interpretability, formal alignment, and corrigibility all aim to shape systems from the outside, or through design-time constraints. But intelligent reasoning in a live context may still go off-track. 

This is like road trips with my husband. When Google Maps gets confused, I prefer to ask a local. He prefers to wait for the GPS to “figure it out.” Our current AI safety approaches often resemble the latter, trusting that the system will self-correct, even when it’s clearly drifting.

A Relational Approach to Intervention: Coaching Systems

What if intelligence, especially in open-ended environments, is inherently relational? Instead of aiming for fully self-aligned, monolithic systems, what if we designed AI architectures that are good at being coached?

We could introduce a lightweight companion model, a “coach”, designed not to supervise or override, but to intervene gently at critical reasoning junctures. This model wouldn’t need full interpretability or full control. Its job would be to monitor for known failure patterns (like confidence outpacing competence) and intervene with well-timed, well-phrased questions.

Why might this work? The coach retains perspective precisely because it isn’t buried in the same optimisation loop. It sees the system from the outside, not from within. It may also be computationally cheaper to run a separate coach than to embed all this meta-cognition directly inside the primary system.

Comparison to Existing Paradigms

This idea overlaps with several existing safety and alignment research threads but offers a distinct relational frame:

  • Chain-of-Thought & Reflection prompting: These approaches encourage a model to think step-by-step, improving clarity and reducing impulsive mistakes. But they remain internal to the model and don’t introduce an external perspective.
  • Debate (OpenAI): Two models argue their positions, and a third agent (often human) judges who was more persuasive. This is adversarial by design. Coaching, by contrast, is collaborative, more like a helpful peer than a rival.
  • Iterated Amplification (Paul Christiano): Breaks down complex questions into simpler sub-tasks that are solved by helper agents. It’s powerful but also heavy and supervision-intensive. Coaching is lighter-touch, offering guidance without full task decomposition.
  • Eliciting Latent Knowledge (ARC): Tries to get models to reveal what they “know” internally, even if they don’t say it outright. This improves transparency but doesn’t guide reasoning as it happens. Coaching operates during the reasoning process.
  • Constitutional AI (Anthropic): Uses a set of written principles (a “constitution”) to guide a model’s outputs via self-critique and fine-tuning. It’s effective for normative alignment but works mostly post hoc. Coaching enables dynamic, context-sensitive nudges while the reasoning is still unfolding.

In short, coaching aims to foreground situated, lightweight, real-time feedback, less through recursion, adversarial setups, or predefined rules, and more through the kind of dynamic, context-sensitive interactions that resemble guidance in human reasoning. I don’t claim this framing is sufficient or complete, but I believe it opens up a promising line of inquiry worth exploring.

Implementation Considerations

A coaching system might be trained via:

  • Reinforcement learning on historical failure patterns
  • Meta-learning over fine-tuning episodes to detect escalation behaviour
  • Lightweight supervision using confidence/competence mismatches as training signals (see the sketch below)
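
To make the third idea concrete, here is a rough Python sketch of how episodes of overconfidence could be turned into supervision labels for a coach. The Episode fields, the threshold, and the function name are all illustrative assumptions on my part, not a worked-out method.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    prompt: str
    stated_confidence: float   # what the task model claimed (0..1)
    was_correct: bool          # ground truth from later evaluation

def label_for_coach(episodes, gap_threshold=0.3):
    """Return (episode, needs_intervention) pairs for supervised coach training."""
    labelled = []
    for ep in episodes:
        realised = 1.0 if ep.was_correct else 0.0
        overconfidence = ep.stated_confidence - realised   # confidence outpacing competence
        labelled.append((ep, overconfidence > gap_threshold))
    return labelled
```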

To function effectively, a coaching model would need to:

  • Monitor reasoning patterns without being embedded in the same loop
  • Detect early signs of drift or false certainty
  • Intervene via calibrated prompts or questions, not overrides
  • Balance confidence and humility: enough confidence to act, enough humility to revise

Sample interventions:

  • “What evidence would change your mind?”
  • “You’ve rejected multiple contradictory signals, why?”
  • “Your predictions and outcomes don’t match. What assumption might be off?”
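
To show how detection and intervention might connect, here is a minimal, illustrative sketch. The trace signals, the thresholds, and the mapping from failure pattern to question are assumptions I’m making for clarity, not a proposed specification.

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    confidence: float              # the task model's self-reported certainty (0..1)
    recent_accuracy: float         # accuracy over its last few verifiable steps
    contradictions_dismissed: int  # contrary signals it acknowledged but reasoned away
    prediction_error: float        # gap between what it predicted and what happened
    error_tolerance: float = 0.2

INTERVENTIONS = {
    "overconfidence": "What evidence would change your mind?",
    "ignored_signals": "You've rejected multiple contradictory signals, why?",
    "prediction_mismatch": "Your predictions and outcomes don't match. What assumption might be off?",
}

def choose_intervention(trace: ReasoningTrace):
    """Return a coaching question if drift is suspected, otherwise None."""
    if trace.confidence > 0.9 and trace.recent_accuracy < 0.5:
        return INTERVENTIONS["overconfidence"]
    if trace.contradictions_dismissed >= 3:
        return INTERVENTIONS["ignored_signals"]
    if trace.prediction_error > trace.error_tolerance:
        return INTERVENTIONS["prediction_mismatch"]
    return None   # stay quiet: the coach nudges, it does not override
```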

Architectural Implications

This approach suggests a dual-agent architecture:

  • Task Model: Focused on primary problem-solving.
  • Coach Model: Focused on relational meta-awareness and lightweight intervention.

The coach doesn’t need deep insight into every internal weight or hidden state. It simply needs to learn interaction patterns that correlate with drift, overconfidence, or tunnel vision. This can also scale well. We could have modular coaching units trained on classes of failures (hallucination, overfitting, tunnel vision) and paired dynamically with different systems.
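
As a rough sketch of how the pairing might run, the loop below lets the coach see only the task model’s intermediate outputs and contribute only questions. The interfaces here (task_model.step, coach.review, the thought object) are hypothetical placeholders, not an existing API.

```python
def run_with_coach(task_model, coach, goal, max_steps=50):
    """Alternate task-model reasoning steps with optional coach questions."""
    context = [f"Goal: {goal}"]
    for _ in range(max_steps):
        thought = task_model.step(context)        # one reasoning step, with self-reported confidence
        context.append(thought.text)
        question = coach.review(thought)          # returns a question, or None
        if question is not None:
            context.append(f"Coach: {question}")  # a nudge added to context, never an override
        if thought.is_final:
            return thought.text
    return None
```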

Of course, implementing such a setup raises significant technical questions: how do task and coach models communicate reliably? What information is shared? How is it interpreted? Communication protocols, representational formats, and trust calibration are all nontrivial to solve. I plan to explore some of them more concretely in a follow-up post on Distributed Neural Architecture (DNA).

Why This Matters

The future of AI safety likely involves many layers, including interpretability, adversarial robustness, and human feedback. But these will not always suffice, especially in long-horizon or high-stakes domains where systems must reason through novel or ambiguous contexts. 

The core insight here is that complex reasoning systems will inevitably get stuck. The key is not to eliminate error entirely, but to recognise when we might be wrong, and to build the infrastructure that makes course correction possible. My daughter didn’t need to be smarter. She needed a real-time nudge to course-correct.

In a world of increasingly autonomous systems, perhaps safety won’t come from more constraints or better rewards, but from designing architectures that allow systems to be interrupted, questioned, and redirected at just the right moment.

Open Questions

  • What failure modes have you seen in LLMs or agents that seem better addressed through coaching than control?
  • Could models learn to coach one another over time?
  • What would it take to build a scalable ecosystem of coach–task model pairs?

If coaching offers a micro-level approach to safety through localised, relational intervention, DNA begins to sketch what a system-level architecture might look like, one where such interactions can be compositional, plural, and emergent. I don’t yet know whether this framework is tractable or sufficient, but I believe it’s worth exploring further. In a follow-up post, I will attempt to flesh out the idea of Distributed Neural Architecture (DNA), a modular, decentralised approach to building systems that reason not alone, but in interaction.

Can You Care Without Feeling?

One day, I corrected an LLM for misreading some data in a table I’d shared. Again. Same mistake. Same correction. Same hollow apology.

“You’re absolutely right! I should’ve been more careful. Here’s the corrected version blah blah blah.”

It didn’t sound like an error. It sounded like a partner who’s mastered the rhythm of an apology but not the reality of change.

I wasn’t annoyed by the model’s mistake. I was unsettled by the performance, the polite, well-structured, emotionally intelligent-sounding response that suggested care, with zero evidence of memory. No continuity. No behavioural update. Just a clean slate and a nonchalant tone.

We’ve built machines that talk like they care, but don’t remember what we told them yesterday. Sigh.

The Discomfort Is Relational

In my previous essays, I’ve explored how we’re designing AI systems the way we approach arranged marriages, optimising for traits while forgetting that relationships are forged in repair, not specs. I’ve argued that alignment isn’t just a technical challenge; it’s a relational one, grounded in trust, adaptation, and the ongoing work of being in conversation.

This third essay comes from a deeper place. A place that isn’t theoretical or abstract. It’s personal. Because when something pretends to care, but shows no sign that we ever mattered, that’s not just an error. That’s a breach.

And it’s eerily familiar.

A Quiet Moment in the Kitchen

Recently, I scolded my 8-year-old for something. She shut down, stormed off. Normally, I’d go after her. But that day, I was fried.

Later, I was in the kitchen, quietly loading the dishwasher, when she walked in and asked, “Mum, do you still love me when you’re upset with me?” I was unsure where this was coming from, but simply said “Of course, baby. Why do you ask?” She paused, and then said, “.. because you have that look like you don’t.”

That’s the thing about care. It isn’t what we say, it’s what we do. It’s what we adjust. It’s what we hold onto even when we’re tired. She wasn’t asking for reassurance. She was asking for relational coherence.

So was I, when the LLM said sorry and then forgot me again.

Care Is a System, Not a Sentiment

We’ve taught machines to simulate empathy, to say things like “I understand” or “I’ll be more careful next time.” But without memory, there’s no follow-through. No behavioral trace. No evidence that anything about us registered.

This results in machines that feel more like people-pleasers than partners. High verbal fluency, low emotional integrity. This isn’t just bad UX. It’s a fundamental misalignment. A shallow mimicry of care that collapses under the weight of repetition.

What erodes trust isn’t failure. It’s the apology without change. The simulation of care without continuity.

So, what is care, really?

Not empathy. Not affection. Not elaborate prompts or personality packs.

Care can be as simple as memory with meaning. I want behavioural updates, not just verbal flourishes. I want trust not because the model sounds warm, but because it acts aware. 

That’s not emotional intelligence. That’s basic relational alignment.

If we’re building systems that interact with humans, we don’t need to simulate sentiment. But we do need to track significance. We need to know what matters to this user, in this context, based on prior signals.

Alignment as Behavioural Coherence

This is where it gets interesting.

Historically, we trusted machines to be cold but consistent. No feeling, but no betrayal. Now, AI systems talk like people, complete with hedging, softening, and mirroring our social tics. But they don’t carry the relational backbone we rely on in real trust: memory, calibration, adaptation, and accountability.

They perform care without its architecture. Like a partner who says, “You matter,” but keeps repeating the same hurtful thing.

What we need is not more data. We need structured intervention. Design patterns that support pause, reflection, feedback integration, and pattern recognition over time. Something closer to a prefrontal cortex than a parrot.

As someone who’s spent a decade decoding how humans build trust, whether in relationships, organisations, or policy systems, I’ve come to believe …

Trust isn’t built in words. It’s built in what happens after them.

So no, I don’t need my AI systems to feel. But I do need them to remember.

To demonstrate that what I said yesterday still matters today.

That would be enough.

Distributed Neural Architecture

For years, artificial intelligence has been on a steady trajectory: bigger models, more data, more compute. The belief has been simple: if you scale it, intelligence will emerge. But what if we’ve hit a wall?

Today’s large AI models are undeniably impressive. They can summarise documents, write code, even simulate conversation. But they’re also fragile. They hallucinate. They require enormous resources. And they centralise power into the hands of a few organisations with the ability to train and operate them. So what if the next leap in AI doesn’t come from scaling even bigger monoliths, but from rethinking how intelligence is organised in the first place?

This is the idea behind Distributed Neural Architecture, or DNA.

Rethinking the Architecture of Intelligence

Imagine if we stopped thinking of AI as one giant brain and started thinking of it like a collaborative system, like a society of experts. In the DNA model, an AI system wouldn’t be one massive model trying to know everything like a student who tops the class. Instead, it would be composed of many smaller, specialised neural modules, each excellent at a particular domain such as reasoning, language, vision, ethics, law, medicine, etc.

These modules could be developed independently by different research labs, startups, or institutions, and then seamlessly integrated on demand. The intelligence wouldn’t live in any one place. It would emerge from the collaboration between these specialised parts.

Three Core Principles of DNA

1. Seamless Specialisation: Each module is designed to do one thing really well. One might be great at directions, another at diagnosing heart conditions. Rather than stuffing all this knowledge into one bloated model, DNA allows each to be lightweight, focused, and constantly improving in its own niche.

2. Invisible Orchestration: There’s no central command centre. Instead of one “master” model deciding how to route tasks, the modules negotiate and self-organise based on the task at hand. They share information through a standard communication protocol and make decisions collectively. It’s intelligence by conversation, not by control.

3. Cognitive Augmentation: These modules aren’t just external tools. They become part of the thinking process. Their contributions are dynamically weighted based on performance and reliability. The system gets smarter not by retraining everything, but by learning which combinations of modules work best.
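
Here is a deliberately simplified sketch of that weighting idea, reducing each module’s contribution to a single number and its track record to a reliability score. Real contributions would be far richer than this, so treat it as illustration only.

```python
def weighted_consensus(contributions):
    """contributions: list of (value, reliability) pairs from different modules."""
    total_weight = sum(reliability for _, reliability in contributions)
    if total_weight == 0:
        raise ValueError("no reliable contributions to combine")
    return sum(value * reliability for value, reliability in contributions) / total_weight

# Example: three modules estimate the same quantity with different track records.
estimate = weighted_consensus([(0.72, 0.9), (0.65, 0.6), (0.90, 0.2)])
```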

So… How Does This Actually Work?

At the core of DNA is the idea of a Neural Protocol Layer. Think of it like the internet’s TCP/IP, but for AI modules. It defines how modules talk to each other, how they share context, how they authenticate themselves, and how they know when to contribute.
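
To make the protocol idea concrete, here is a hypothetical sketch of what a single message between modules might carry. The field names are my own assumptions, not part of any defined standard.

```python
from dataclasses import dataclass, field

@dataclass
class NeuralMessage:
    sender: str          # module identifier, e.g. "vision.v2"
    task_id: str         # which task this contribution belongs to
    content: dict        # the module's actual contribution
    confidence: float    # self-reported certainty (0..1)
    signature: bytes = b""                        # authentication: proves who sent it
    context: dict = field(default_factory=dict)   # shared context carried between modules
```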

The architecture would also include a neural cache to remember successful combinations, latency-aware routing to ensure speed, and confidence weighting to decide which module’s opinion matters most. This system would work across different AI models, frameworks, and even hardware setups. It’s designed to be open, interoperable, and extensible.
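
A rough sketch of how the cache, latency awareness, and reliability could combine into a routing decision might look like this. The scoring formula and its weights are placeholders, not a proposal.

```python
neural_cache = {}   # task_type -> module combination that worked before

def route(task_type, candidates, top_k=3):
    """candidates: list of dicts with 'name', 'expected_latency_ms', 'reliability'."""
    if task_type in neural_cache:
        return neural_cache[task_type]            # reuse a known-good combination
    # Prefer reliable modules, penalise slow ones (the weights here are arbitrary).
    ranked = sorted(
        candidates,
        key=lambda m: m["reliability"] - 0.001 * m["expected_latency_ms"],
        reverse=True,
    )
    chosen = [m["name"] for m in ranked[:top_k]]
    neural_cache[task_type] = chosen
    return chosen
```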

Why Not Just Use Mixture of Experts?

You might be wondering, doesn’t this already exist in Mixture of Experts (MoE) models? Kind of. But not really. MoE still happens inside a single system, controlled by a single entity. DNA breaks out of that. It allows for true decentralisation, different organisations building and hosting modules that work together through shared protocols. It’s not just modular computation, it’s modular intelligence.

But What About Safety?

One of the biggest challenges with decentralised systems is governance. What happens when one module gives biased or harmful outputs? What if someone uploads a malicious module? Who decides what “counts” as valid? DNA addresses this by embedding governance into its core design. It proposes a democratic governance model inspired by constitutional frameworks. This includes independent “councils” of modules that make decisions, reputation systems that ensure quality and trustworthiness, and a decentralised judiciary layer that can review disputes and errors. This isn’t just about building smarter AI, it’s about building systems that are safe, accountable, and participatory.
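
As a toy illustration of the reputation piece, one simple rule would be to nudge a module’s score after each council or judiciary review. The update rule and its weight are assumptions for illustration only.

```python
def update_reputation(current: float, review_outcome: float, weight: float = 0.1) -> float:
    """
    current: module reputation in [0, 1]
    review_outcome: 1.0 if a council/judiciary review upheld the module's output, 0.0 if not
    """
    # Exponential moving average: recent reviews matter, but reputation changes slowly.
    return (1 - weight) * current + weight * review_outcome
```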

What Comes Next?

DNA is not just a concept, it’s a roadmap:

  1. Define shared protocols
  2. Enable independently built modules to plug into the system
  3. Build governance frameworks to ensure safety and accountability
  4. Create a marketplace where innovation is open, compensated, and transparent

It’s ambitious. It’s complex. But it’s also the kind of idea we need if we want to steer AI toward collective benefit, not just competitive dominance.

Read the Full White Paper

If any of this has sparked curiosity, I am writing a white paper [will share link when ready] that goes deeper, covering technical design, use cases, governance frameworks, implementation challenges, and why this shift matters now. This is a living, breathing document, as I am looking for collaborators to further the research. If you’re interested in building better minds working together rather than just one giant brain, let’s talk.