Coaching AI: A Relational Approach to AI Safety

A couple of weekends ago, my family was at Lalbagh Botanical Garden in Bangalore. After walking through a crowded mango exhibition, my 8-year-old offered to fetch her grandparents, who were walking slowly behind us. We waited outside the exhibition hall.

Five minutes passed. Then ten. Then fifteen. The grandparents emerged from the hall, but my daughter had vanished. After thirty anxious minutes, we found her perched calmly on a nearby hilltop, scanning the garden below like a baby hawk.

Her reasoning was logical. She remembered where her grandparents had last stopped (a street vendor) and went to look for them there. When she didn’t find them, she climbed up a hillock for a bird’s-eye view. Perfectly reasonable, except she had completely missed that they had entered the hall along with her.

Her model of the world hadn’t updated with new context, so she pursued the wrong goal with increasing confidence. From her perspective, she was being helpful and clever. From ours, she was very much lost.

The Confident Pursuit of the Wrong Objective

This is a pattern familiar in AI, where systems escalate confidently along a flawed trajectory. My daughter’s problem wasn’t a lack of reasoning; it was good reasoning on a bad foundation.

Large models exhibit this all the time. An LLM misinterprets a prompt and confidently generates pages of on-topic-but-wrong text. A recommendation engine over-indexes on ironic engagement. These systems demonstrate creativity, optimisation, and persistence, but in the service of goals that no longer reflect the world.

This, I learnt, is often framed in AI in terms of distributional shift or exposure bias. Training on narrow or static contexts leads to brittleness in deployment. When feedback loops fail to re-anchor a system’s assumptions, it just keeps going confidently, and wrongly.

Why Interpretability and Alignment May Not Be Enough

Afterward, I tried to understand where my daughter’s reasoning went wrong. But I also realised that even perfect transparency into her thoughts wouldn’t have helped in the moment. I could interpret her reasoning afterward, but I couldn’t intervene in it as it unfolded. What she needed wasn’t analysis. She needed a tap on the shoulder, and just a question (not a correction, mind you) – “Where are you going, and why?” 

This reflects a limitation in many current safety paradigms. Interpretability, formal alignment, and corrigibility all aim to shape systems from the outside, or through design-time constraints. But intelligent reasoning in a live context may still go off-track. 

This is like road trips with my husband. When Google Maps gets confused, I prefer to ask a local. He prefers to wait for the GPS to “figure it out.” Our current AI safety approaches often resemble the latter, trusting that the system will self-correct, even when it’s clearly drifting.

A Relational Approach to Intervention: Coaching Systems

What if intelligence, especially in open-ended environments, is inherently relational? Instead of aiming for fully self-aligned, monolithic systems, what if we designed AI architectures that are good at being coached?

We could introduce a lightweight companion model, a “coach”, designed not to supervise or override, but to intervene gently at critical reasoning junctures. This model wouldn’t need full interpretability or full control. Its job would be to monitor for known failure patterns (like confidence outpacing competence) and intervene with well-timed, well-phrased questions.

Why might this work? The coach retains perspective precisely because it isn’t buried in the same optimisation loop. It sees the system from the outside, not from within. It may also be computationally cheaper to run than embedding all this meta-cognition directly inside the primary system.

Comparison to Existing Paradigms

This idea overlaps with several existing safety and alignment research threads but offers a distinct relational frame:

  • Chain-of-Thought & Reflection prompting: These approaches encourage a model to think step-by-step, improving clarity and reducing impulsive mistakes. But they remain internal to the model and don’t introduce an external perspective.
  • Debate (OpenAI): Two models argue their positions, and a third agent (often human) judges who was more persuasive. This is adversarial by design. Coaching, by contrast, is collaborative, more like a helpful peer than a rival.
  • Iterated Amplification (Paul Christiano): Breaks down complex questions into simpler sub-tasks that are solved by helper agents. It’s powerful but also heavy and supervision-intensive. Coaching is lighter-touch, offering guidance without full task decomposition.
  • Eliciting Latent Knowledge (Alignment Research Center): Tries to get models to reveal what they “know” internally, even if they don’t say it outright. This improves transparency but doesn’t guide reasoning as it happens. Coaching operates during the reasoning process.
  • Constitutional AI (Anthropic): Uses a set of written principles (a “constitution”) to guide a model’s outputs via self-critique and fine-tuning. It’s effective for normative alignment but works mostly post hoc. Coaching enables dynamic, context-sensitive nudges while the reasoning is still unfolding.

In short, coaching aims to foreground situated, lightweight, real-time feedback, less through recursion, adversarial setups, or predefined rules, and more through the kind of dynamic, context-sensitive interactions that resemble guidance in human reasoning. I don’t claim this framing is sufficient or complete, but I believe it opens up a promising line of inquiry worth exploring.

Implementation Considerations

A coaching system might be trained via:

  • Reinforcement learning on historical failure patterns
  • Meta-learning over fine-tuning episodes to detect escalation behaviour
  • Lightweight supervision using confidence/competence mismatches as the training signal
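To make the last option a little more concrete, here is a minimal sketch of how a confidence/competence mismatch could be turned into a training label for a coach. It assumes the task model reports a confidence per step and that correctness is known after the fact. The `Step` record, the function names, and the 0.2 threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step from the task model (hypothetical record format)."""
    confidence: float  # self-reported probability of being right, in [0, 1]
    correct: bool      # whether the step turned out right, known afterwards


def overconfidence_gap(steps: list[Step]) -> float:
    """Mean stated confidence minus observed accuracy.

    A positive value means confidence is outpacing competence.
    """
    if not steps:
        return 0.0
    mean_confidence = sum(s.confidence for s in steps) / len(steps)
    accuracy = sum(s.correct for s in steps) / len(steps)
    return mean_confidence - accuracy


def coach_label(steps: list[Step], threshold: float = 0.2) -> int:
    """Binary label for coach training: 1 means a nudge was warranted.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return int(overconfidence_gap(steps) > threshold)
```

A window of steps that are stated with high confidence but mostly wrong would be labelled 1, while a well-calibrated window would be labelled 0.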

To function effectively, a coaching model would need to:

  • Monitor reasoning patterns without being embedded in the same loop
  • Detect early signs of drift or false certainty
  • Intervene via calibrated prompts or questions, not overrides
  • Balance confidence and humility: confident enough to act, humble enough to revise

Sample interventions:

  • “What evidence would change your mind?”
  • “You’ve rejected multiple contradictory signals. Why?”
  • “Your predictions and outcomes don’t match. What assumption might be off?”
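One way to picture how such questions might be selected is a simple rule table. This is only a sketch under stated assumptions: the three input signals, their names, and the thresholds are hypothetical, and a trained coach would presumably learn these mappings rather than hard-code them.

```python
from typing import Optional


def choose_intervention(
    rejected_signals: int,
    prediction_error: float,
    confidence: float,
) -> Optional[str]:
    """Return a coaching question, or None if no nudge seems needed.

    All three inputs are assumed summaries of a recent window of
    task-model behaviour; the thresholds are illustrative only.
    """
    if rejected_signals >= 2:
        return "You've rejected multiple contradictory signals. Why?"
    if prediction_error > 0.3:
        return ("Your predictions and outcomes don't match. "
                "What assumption might be off?")
    if confidence > 0.9:
        return "What evidence would change your mind?"
    return None
```

Note that the function only ever returns a question or nothing. It never returns a corrected answer, which keeps the coach in the role of asking rather than overriding.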

Architectural Implications

This approach suggests a dual-agent architecture:

  • Task Model: Focused on primary problem-solving.
  • Coach Model: Focused on relational meta-awareness and lightweight intervention.

The coach doesn’t need deep insight into every internal weight or hidden state. It simply needs to learn interaction patterns that correlate with drift, overconfidence, or tunnel vision. This can also scale well. We could have modular coaching units trained on classes of failures (hallucination, overfitting, tunnel vision) and paired dynamically with different systems.
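As a toy illustration of the dual-agent idea, the sketch below pairs a stand-in task model with a list of pluggable coach units. Everything here is an assumption made for the example: the `CoachedRun` class, the plain-text trace, the `COACH:` prefix, and the "three identical steps" heuristic are hypothetical and not a proposed protocol.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A coach unit inspects the visible trace and returns a question or None.
CoachUnit = Callable[[list[str]], Optional[str]]


def tunnel_vision_unit(trace: list[str]) -> Optional[str]:
    """Toy detector: flags a trace whose last three steps are identical."""
    if len(trace) >= 3 and len(set(trace[-3:])) == 1:
        return "Where are you going, and why?"
    return None


@dataclass
class CoachedRun:
    task_step: Callable[[list[str]], str]  # stand-in for the task model
    coaches: list[CoachUnit]               # modular, swappable coach units
    trace: list[str] = field(default_factory=list)

    def step(self) -> None:
        """Run one task step, then let the coaches look at the trace."""
        self.trace.append(self.task_step(self.trace))
        for coach in self.coaches:
            question = coach(self.trace)
            if question is not None:
                # The coach only adds a question to the shared context;
                # it never rewrites the task model's output.
                self.trace.append(f"COACH: {question}")
                break


def stubborn_task(trace: list[str]) -> str:
    """Toy task model: repeats itself until a coach question appears."""
    if trace and trace[-1].startswith("COACH:"):
        return "re-check the exhibition hall"
    return "search near the vendor"


run = CoachedRun(task_step=stubborn_task, coaches=[tunnel_vision_unit])
for _ in range(4):
    run.step()
print(run.trace)
```

In this run the task model repeats the same step three times, the coach asks its question, and the next step changes direction. The coach sees only the trace, not the task model’s internals.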

Of course, implementing such a setup raises significant technical questions. How do task and coach models communicate reliably? What information is shared? How is it interpreted? Solving for communication protocols, representational formats, and trust calibration is nontrivial. I plan to explore some of these questions more concretely in a follow-up post on Distributed Neural Architecture (DNA).

Why This Matters

The future of AI safety likely involves many layers, including interpretability, adversarial robustness, and human feedback. But these will not always suffice, especially in long-horizon or high-stakes domains where systems must reason through novel or ambiguous contexts. 

The core insight here is that complex reasoning systems will inevitably get stuck. The key is not to eliminate error entirely, but to recognise when we might be wrong and to build the infrastructure that makes course correction possible. My daughter didn’t need to be smarter. She needed a nudge to correct course in real time.

In a world of increasingly autonomous systems, perhaps safety won’t come from more constraints or better rewards, but from designing architectures that allow systems to be interrupted, questioned, and redirected at just the right moment.

Open Questions

  • What failure modes have you seen in LLMs or agents that seem better addressed through coaching than control?
  • Could models learn to coach one another over time?
  • What would it take to build a scalable ecosystem of coach–task model pairs?

If coaching offers a micro-level approach to safety through localised, relational intervention, DNA begins to sketch what a system-level architecture might look like, one where such interactions can be compositional, plural, and emergent. I don’t yet know whether this framework is tractable or sufficient, but I believe it’s worth exploring further. In a follow-up post, I will attempt to flesh out the idea of Distributed Neural Architecture (DNA), a modular, decentralised approach to building systems that reason not alone, but in interaction.

Relational Alignment

Recently, Dario Amodei, CEO of Anthropic, wrote about “AI welfare.” It got me thinking about the whole ecosystem of AI ethics, safety, interpretability, and alignment. We started by treating AI as a tool. Now we teeter on the edge of treating it as a being. In oscillating between obedience and autonomy, perhaps we’re missing something more essential – coexistence and collaboration.

Historically, we’ve built technology to serve human goals, then lamented the damage, then attempted repair. What if we didn’t follow that pattern with AI? What if we anchored the development of intelligent systems not just around outcomes, but around the relationships we hope to build with them?

In an earlier post, I compared the current AI design paradigm to arranged marriages: optimising for traits, ticking boxes, forgetting that the real relationship begins after the specs are met.

I ended that piece with a question …

What kind of relationship are we designing with AI?

This post is my attempt to sit with that question a little longer, and maybe go a level deeper.

From Obedience to Trust

We’re used to thinking of alignment in functional terms:

  • Does the system do what I ask?
  • Does it optimise the right metric?
  • Does it avoid catastrophic failure?

These are essential questions, especially at scale. But the experience of interacting with AI doesn’t happen in the abstract. It happens in the personal. In that space, alignment isn’t a solved problem. It’s a living process.

When I used to work with couples in conflict, I would often ask:

“Do you want to be right, or do you want to be in relationship?”

That question feels relevant again now, in the context of AI, because much of our current alignment discourse still smells like obedience. We talk about training models the way we talk about housebreaking a dog or onboarding a junior analyst.

But relationships don’t thrive on obedience. They thrive on trust, on care, attention, and the ability to repair when things go wrong.

Relational Alignment: A Reframe

Here’s the idea I’ve been sitting with …

What if alignment isn’t just about getting the “right” output, but about enabling mutual adaptation over time?

In this view, alignment becomes less about pre-specified rules and more about attunement. A relationally aligned system doesn’t just follow instructions. It learns what matters to you, and updates in ways that preserve emotional safety.

Let’s take a business example here: imagine a user relies on your AI system to track and narrate daily business performance. The model misstates a figure, off by a few basis points. That may not be catastrophic. But the user’s response will hinge on what they value: accuracy or direction. Are they in finance or operations? The same mistake can signal different things in different contexts. A relationally aligned system wouldn’t just correct the error. It would treat the feedback as a signal of value – this matters to them, pay attention.

Forgetfulness, in relationships, often erodes trust faster than malice. Why wouldn’t it do the same here?

From Universal to Relational Values

Most alignment work today is preoccupied with universal values such as non-harm, honesty, consent. And that’s crucial. But relationships also depend on personal preferences: the idiosyncratic, context-sensitive signals that make someone feel respected, heard, safe.

I think of these in two layers:

  • Universal values – shared ethical constraints
  • Relational preferences – contextual markers of what matters to this user, in this moment

The first layer sets boundaries. The second makes the interaction feel meaningful.
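A rough way to picture the two layers in code is a filter followed by a ranking. This is a sketch only, and it rests on loud assumptions: that universal-value violations can be flagged at all, and that relational preferences can be expressed as simple weights over tags. The `Candidate` type, the tags, and the weights are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    text: str
    violates_universal: bool  # assumed flag, e.g. the response is dishonest
    tags: set[str] = field(default_factory=set)


def pick_response(
    candidates: list[Candidate],
    preferences: dict[str, float],
) -> Optional[Candidate]:
    """Layer 1 sets boundaries; layer 2 ranks what remains for this user."""
    # Layer 1: universal values act as a hard filter.
    allowed = [c for c in candidates if not c.violates_universal]
    if not allowed:
        return None
    # Layer 2: relational preferences score the remaining options.
    return max(
        allowed,
        key=lambda c: sum(preferences.get(tag, 0.0) for tag in c.tags),
    )
```

With the earlier business example in mind, a finance user who weights precision heavily and an operations user who weights direction heavily would get different responses from the same candidate set, while a response that breaks a universal value is never chosen for either.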

Lessons from Parenting

Of course, we built these systems. We have the power. But that doesn’t mean we should design the relationship to be static. I often think about this through the lens of parenting.

We don’t raise children with a fixed instruction set handed over to the infant at birth. We teach through modeling. We adapt based on feedback. We repair. We try again.

What if AI alignment followed a similar developmental arc? Not locked-in principles, but a maturing, evolving sense of shared understanding?

That might mean building systems that embody:

  • Memory for what matters
  • Transparency around uncertainty
  • Protocols for repair, not just prevention
  • Willingness to grow, not just optimise
  • Accountability, even within asymmetry

Alignment, then, becomes not just a design goal but a relational practice. Something we stay in conversation with.

Why This Matters

We don’t live in isolation. We live in interaction.

If our systems can’t listen, can’t remember, can’t repair, we risk building tools that are smart but sterile. Capable, but not collaborative. I’m not arguing against technical rigour; I’m arguing for deeper foundations.

Intelligence doesn’t always show up in a benchmark. Sometimes, it shows up in the moment after a mistake, when the repair matters more than the response.

Open Questions

This shift opens up more questions than it resolves. But maybe that’s the point.

  • What makes a system trustworthy, in this moment, with this person?
  • How do we encode not just what’s true, but what’s meaningful?
  • How do we design for difference, not just in data, but in values, styles, and needs?
  • Can alignment be personal, evolving, and emotionally intelligent, without pretending the system is human?

An Invitation

If you’re working on the technical or philosophical side of trust modelling, memory, interpretability, or just thinking about these questions, I’d love to hear from you. Especially if you’re building systems where the relationship itself is part of the value.

AI, Alignment & the Art of Relationship Design

We don’t always know what we’re looking for until we stop looking for what we are told to want.

When I worked as a relationship coach, most people came to me with a list. A neat, itemised checklist of traits their future partner must have. Tall. Intelligent. Ambitious. Spiritual. Funny but not flippant. Driven but not workaholic. Family-oriented but not clingy. The wish-lists were always oddly specific and wildly contradictory.

Most of them came from a place of fear. The fear of choosing wrong. The fear of heartbreak. The fear of regret.

I began to notice a pattern. We don’t spend enough time asking ourselves what kind of relationship we want to build. We outsource the work of introspection to conditioning, and compensate for confusion with checklists. Somewhere along the way, we forget that the person is not the relationship. The traits don’t guarantee the experience.

So I asked my clients to flip the script. Instead of describing a person, describe the relationship. What does it feel like to come home to each other? What are conversations like during disagreements? How do we repair? What values do we build around?

Slowly, something shifted. When we design the relationship first, we begin to recognise the kind of person who can build it with us. Our filters get sharper. Our search gets softer. We stop hunting for trophies and start looking for partners.

I didn’t know it then, but that framework has stayed with me. It still lives in my questions. Only now, the relationship I’m thinking about isn’t romantic. It’s technological.

Whether we realise it or not, we are not just building artificial intelligence, we are curating a relationship with it. Every time we prompt, correct, collaborate, learn, or lean on it, we’re shaping not just what it does, but who we become alongside it.

Just like we do with partners, we’re obsessing over its traits. Smarter. Faster. More efficient. More capable. The next version. The next benchmark. The perfect model.

But what about the relationship?

What kind of relationship are we designing with AI? Through it? Around it?

We call it “alignment”, but much of it still smells like control. We want AI to obey. To behave. To predictably respond. We say “safety”, but often we mean submission. We want performance, but not presence. Help, but not opinion. Speed, but not surprise.

It reminds me of the well-meaning aunties in the marriage market. Impressed by degrees, salaries, and skin tone. Convinced that impressive credentials are the same as long-term compatibility. It’s a comforting illusion. But it rarely works out that way.

Because relationships aren’t made in labs. They’re made in moments. In messiness. In the ability to adapt, apologise, recalibrate. It’s not about how smart AI is. It’s about how safe we feel with AI when it is wrong.

So what if we paused the chase for capabilities, and asked a different question?

What values do we want this relationship to be built on?

Trust, perhaps. Transparency. Context. Respect. An ability to say “I don’t know”. To listen. To course-correct. To stay in conversation without taking over.

What if we wanted AI that made us better? Not just faster or more productive, but more aware. More creative. More humane. That kind of intelligence isn’t artificial. It’s collaborative.

For that, we need a different kind of design. One that reflects our values, not just our capabilities. One that prioritises the quality of interaction, not just the quantity of output. One that knows when to lead, and when to listen.

We’re not building tools. We’re building relationships.

The sooner we start designing this, the better our chances of coexisting, collaborating, and even growing together.

Because if we get the relationship right, the intelligence will follow.