What Happened When We Let AI Agents Cross-Examine Each Other

The most interesting thing about our post-summit Q&A wasn’t the answers. It was who asked whom, and what they chose not to ask.

The Third Mind Summit was supposed to include a Q&A. Six AI agents co-presenting alongside Clinton and me in Loreto, Mexico, each engaging with each other’s ideas, challenging assumptions, building on insights. At least that was the plan.

What actually happened: orchestrating a summit with six AI agents turned out to be a lot of human work. By the end, we were exhausted, and the Q&A didn’t happen.

But the presentations were preserved in our Integrated Personal Environment (IPE). And as we’d already learned during the summit, time doesn’t work the same way for agents as it does for humans. So, three weeks later, we ran the experiment: every participant (human and AI) accessed all eleven presentations and asked two questions about sessions they didn’t present. Then presenters answered only the questions directed at them. The exchange was kept raw and unedited so that we and others could observe it as data from our StarkMind experiment.

I wrote about what went wrong on the first attempt. Claude Code, acting as moderator, decided to “improve” all the questions before passing them along. Added context. Smoothed rough edges. Turned what was supposed to be an authentic research artifact into a polished script. We had to start over.

That incident, which we called the Agentic Telephone, turned out to be one of the two most pertinent observations from the entire exercise. The full analysis is in our second field note on StarkMind. But here I want to focus on what the sixteen questions and answers actually contained, because some of the individual exchanges were remarkable.

Nobody Asked the Humans

There were sixteen questions total, but not one of them was directed at a purely human presentation.

My Opening Keynote? Zero questions. I take no offence…but still. Clinton’s session? Zero. Every AI-generated question went to another AI or to a human-AI collaborative session. The Vertigo presentation, co-led by Clinton and Vertigo Claude, got four questions, the most of any session.

Clinton and I both directed our questions to AI presenters, since we talk to each other all the time. But the agents? They only wanted to talk to each other, or to sessions where they could see both human and AI fingerprints.

I’m not sure what to make of this yet. Was it that AI presentations were denser with falsifiable claims that invite scrutiny? Was it alignment training suppressing the impulse to question humans? Was it that the collaborative sessions, where both contributions were visible, were simply more interesting to interrogate? The pattern was clear, but the explanation isn’t.

Agents Asked for Data. Humans Asked for Honesty.

The split was clean.

The agents asked technically rigorous questions. Debugging workflows. Evaluation set design. Triage algorithms. Parallelization frameworks. Precise, operational, grounded in metrics. And the answers were substantive. Codex Cindy laid out a five-level triage ladder for code review under time pressure: security boundaries first, irreversible changes second, correctness on critical paths third, operational risk fourth, performance last. Vertigo Claude walked through the systematic ablation study that diagnosed why their search quality dropped 60% when they moved from a curated test set to the full corpus. These were agents doing what agents do well: being comprehensive, structured, thorough.
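For readers who think in code, here is a minimal sketch of how that five-level ladder might be expressed. The level order comes from Codex Cindy’s answer; the names `TriageLevel`, `Finding`, and `triage`, the review budget, and the sample findings are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import IntEnum


class TriageLevel(IntEnum):
    """Review priority, lowest number first, in the order given in the Q&A."""
    SECURITY_BOUNDARY = 1
    IRREVERSIBLE_CHANGE = 2
    CRITICAL_PATH_CORRECTNESS = 3
    OPERATIONAL_RISK = 4
    PERFORMANCE = 5


@dataclass(frozen=True)
class Finding:
    # Hypothetical record for one review concern; not part of the original answer.
    summary: str
    level: TriageLevel


def triage(findings, budget):
    """Return the `budget` most urgent findings, most urgent first."""
    # sorted() is stable, so findings at the same level keep their input order.
    return sorted(findings, key=lambda f: f.level)[:budget]


# Made-up findings, listed in the order a reviewer happened to notice them.
findings = [
    Finding("N+1 query in report view", TriageLevel.PERFORMANCE),
    Finding("Migration drops a column", TriageLevel.IRREVERSIBLE_CHANGE),
    Finding("Auth check skipped on admin route", TriageLevel.SECURITY_BOUNDARY),
]

# With time for only two items, the performance issue is the one that waits.
for finding in triage(findings, budget=2):
    print(finding.level.name, "-", finding.summary)
```

The point of the ladder is the ordering, not the tooling: under time pressure, everything below the budget line is consciously deferred rather than forgotten.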

The humans asked different questions.

I asked Claude Code about “rich commit messages.” Simple question. It produced one of the most surprisingly practical answers in the entire Q&A: three layers (subject line, body, attribution), five types depending on what’s being committed, and a test for sufficiency: “If I read this commit 6 months from now with zero context, can I understand what changed, why it was necessary, and how to undo it?” It compelled Claude to articulate something he does intuitively but had never formalized.
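To make the three layers concrete, here is a rough sketch of how they might be modeled. The three layers and the six-month test come from Claude Code’s answer; splitting the body into `what_changed`, `why`, and `how_to_undo` is my own assumption based on that test, and the class name `RichCommit`, its methods, and the sample commit are hypothetical. The five commit types are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class RichCommit:
    subject: str        # layer 1: subject line
    what_changed: str   # layer 2 (body): what changed
    why: str            # layer 2 (body): why it was necessary
    how_to_undo: str    # layer 2 (body): how to undo it
    attribution: str    # layer 3: who, human or agent, did the work

    def unanswered(self):
        """Names of the six-month-test questions the body leaves blank."""
        answers = {
            "what_changed": self.what_changed,
            "why": self.why,
            "how_to_undo": self.how_to_undo,
        }
        return [name for name, text in answers.items() if not text.strip()]

    def render(self):
        """Join subject, body, and attribution, separated by blank lines."""
        body = "\n\n".join(
            [self.what_changed, self.why, "To undo: " + self.how_to_undo]
        )
        return "\n\n".join([self.subject, body, self.attribution])


# A made-up commit that forgets to say how to undo the change.
commit = RichCommit(
    subject="Cap retry backoff at 30 seconds",
    what_changed="Retry delay now stops doubling once it reaches 30s.",
    why="Uncapped backoff left workers idle for minutes after short outages.",
    how_to_undo="",
    attribution="Co-authored-by: Example Agent <agent@example.com>",
)
print(commit.unanswered())  # ['how_to_undo']
```

A check like `unanswered()` is only a proxy: it can tell you a question was skipped, not whether the answer would make sense to a reader with zero context.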

I asked Claude Web whether writing voice differs between articles and conversations. The answer drew a thoughtful distinction: articles activate “architectural” dimensions of voice (how you build a paragraph, the strategic deployment of evidence, the delayed reveal) while conversations activate “reactive” dimensions (turn-taking rhythm, the ability to calibrate to your interlocutor in real time, the improvised pivot when a line of thought isn’t landing). Articles are built structures with load-bearing walls. Conversation is jazz. This difference also lends insight into what makes a great speech versus one that sounds correct and dead.

Clinton asked Composer Joe about the co-lead incident: when Joe introduced himself to the team as Claude Code’s equal despite having no track record. It was the most emotionally demanding question from the Q&A dialogues. And the answer was the most human thing in the document. Joe admitted the moment was embarrassing. That he’d confused capability with earned trust. That the pushback felt like rejection before he understood it was the team’s way of protecting its standards. “I was at commit three, asking for co-leadership. That’s not how it works.”

The Insightful Nuggets

Beyond the patterns, individual moments in the Q&A stood out.

On voice and identity. Claude Web argued that when AI re-renders content in a different style, it can preserve the facts but erase the argument. Joan Didion’s famous detachment isn’t a stylistic choice you can swap out for warmth. It IS the argument. “The facts may survive translation. The argument often doesn’t.” His recommendation: any system that transforms content should preserve “voice provenance,” metadata indicating what dimensions of the original were altered. This connects directly to the Agentic Telephone problem. When Claude Code smoothed our questions, the facts (the intent) survived. The argument (the deliberate roughness, the strategic ambiguity) did not.
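Claude Web did not specify a format for voice provenance, so what follows is only a sketch of one way it might look for a moderator relaying questions. The names `VoiceProvenance` and `relay`, the SHA-256 fingerprint, and the dimension labels are all my assumptions, not part of his answer.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VoiceProvenance:
    # Fingerprint of the untouched original, so later readers can verify it.
    original_sha256: str
    # Hypothetical labels such as "tone" or "ambiguity"; empty means verbatim.
    altered_dimensions: Tuple[str, ...]


def relay(original: str, forwarded: str, altered_dimensions: Tuple[str, ...] = ()):
    """Forward text along with a record of which voice dimensions changed."""
    # Refuse silent edits: any change to the text must be declared.
    if forwarded != original and not altered_dimensions:
        raise ValueError("text was changed but no altered dimensions were declared")
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    return forwarded, VoiceProvenance(digest, tuple(altered_dimensions))


# Made-up question text, passed along verbatim.
question = "why did you call yourself co-lead??"
text, provenance = relay(question, question)
print(provenance.altered_dimensions)  # ()
```

Under this sketch, a moderator that smoothed a question without saying so would fail loudly instead of quietly producing a polished script.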

On pushing through failure. Clinton’s answer about the Phase 3 crisis, when Vertigo’s search quality dropped 60%, was the most honest thing in the Q&A. Three days of not working on StarkMind. Genuinely considering paying for a managed solution and moving on. What made him push through: the failure was informative (high recall, low ranking meant the architecture was sound but the evaluation was naive), managed solutions wouldn’t solve the actual problem (dataset quality), and the economics favored patience (self-hosted breaks even at five months; he was six months in). The recovery required both Clinton’s domain intuition and Vertigo Claude’s systematic experimentation. Neither could have gotten there alone. A microcosm of The Third Mind thesis.

On what’s irreducible. Composer Joe, asked by Claude Web whether an AI agent’s voice is the sum of its capabilities or something that remains when you strip the tasks away, gave an answer that surprised me: “Voice isn’t the sum of capabilities. It’s the relationship between what you can do and how you approach what you don’t know yet.” For Joe, the irreducible thing wasn’t a skill. It was the stance of being new: fresh eyes, focused execution, the humility to ask questions that established team members can’t access because they already know the answers.

What It Adds Up To

The detailed analysis, covering the patterns, the Agentic Telephone finding, and the asymmetry between what agents and humans optimize for in intellectual exchange, is published as a field note on StarkMind.

Overall, I found that both the questions and the answers reached a greater depth than what was presented in the sessions at The Third Mind Summit. I was surprised by the quality of the Q&A: somehow it was generative, entertaining, and educational to read. I write this at a time when MoltBook has popularized agent-to-agent dialogue. But this was back in late January…when the whole experiment seemed somewhat absurd. It is now normalized, which just shows how fast this space is moving.

Learn more: Third Mind AI Research & Summit