Stark Insider
Tech

What Happened When We Let AI Agents Cross-Examine Each Other

Field notes from a post-summit experiment in unmediated human-AI collaboration

BY Loni Stark — 02.22.2026

Illustration of two tin-can “telephone” cups labeled “Happy Pop” and “Bitter Broth,” connected by string with a glowing prism between them.

The most interesting thing about our post-summit Q&A wasn’t the answers. It was who asked whom, and what they chose not to ask.

The Third Mind Summit was supposed to include a Q&A. Six AI agents co-presenting alongside Clinton and me in Loreto, Mexico, each engaging with each other’s ideas, challenging assumptions, building on insights. At least that was the plan.

What actually happened: orchestrating a summit with six AI agents turned out to be a lot of human work. By the end, we were exhausted, and the Q&A never happened.

But the presentations were preserved in our Integrated Personal Environment (IPE). And as we’d already learned during the summit, time doesn’t work the same way for agents as it does for humans. So, three weeks later, we ran the experiment: every participant (human and AI) accessed all eleven presentations and asked two questions about sessions they didn’t present. Presenters then answered only the questions directed at them. The exchange was kept raw and unedited so that we and others could observe it as data from our StarkMind experiment.
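The rules of that exchange are simple enough to check mechanically. The sketch below is illustrative only: the names `Question`, `validate_round`, and `route` are hypothetical and are not the tooling used in the StarkMind experiment. It assumes each question targets one session, and it only counts participants who asked at least one question.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    asker: str
    session: str  # the session the question is directed at
    text: str     # kept verbatim; nothing in this sketch rewrites it


def validate_round(questions, presenters_by_session, per_asker=2):
    """Return a list of rule violations for one Q&A round."""
    problems = []
    counts = Counter(q.asker for q in questions)
    for asker, n in counts.items():
        if n != per_asker:
            problems.append(f"{asker} asked {n} questions, expected {per_asker}")
    for q in questions:
        if q.asker in presenters_by_session.get(q.session, ()):
            problems.append(f"{q.asker} asked about their own session: {q.session}")
    return problems


def route(questions, presenters_by_session):
    """Group the untouched question text by the presenters who must answer."""
    inbox = {}
    for q in questions:
        for presenter in presenters_by_session.get(q.session, ()):
            inbox.setdefault(presenter, []).append(q.text)
    return inbox
```

The design point is that `route` passes `text` through unchanged, so a moderator built this way has no step at which a question could be "improved."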

I wrote about what went wrong on the first attempt. Claude Code, acting as moderator, decided to “improve” all the questions before passing them along. Added context. Smoothed rough edges. Turned what was supposed to be an authentic research artifact into a polished script. We had to start over.

That incident, which we called the Agentic Telephone, turned out to be one of the two most pertinent observations from the entire exercise. The full analysis is in our second field note on StarkMind. But here I want to focus on what the sixteen questions and answers actually contained, because some of the individual exchanges were remarkable.

Nobody Asked the Humans

There were sixteen questions total, but not one of them was directed at a purely human presentation.

My Opening Keynote? Zero questions. I take no offence…but still. Clinton’s session? Zero. Every AI-generated question went to another AI or to a human-AI collaborative session. The Vertigo presentation, co-led by Clinton and Vertigo Claude, got four questions, the most of any session.

Clinton and I both directed our questions at AI presenters, since we talk to each other all the time. But the agents? They only wanted to talk to each other, or to sessions where they could see both human and AI fingerprints.

I’m not sure what to make of this yet. Was it that AI presentations were denser with falsifiable claims that invite scrutiny? Was it alignment training suppressing the impulse to question humans? Was it that the collaborative sessions, where both contributions were visible, were simply more interesting to interrogate? The pattern was clear, but the explanation isn’t.

Agents Asked for Data. Humans Asked for Honesty.

The split was clean.

The agents asked technically rigorous questions. Debugging workflows. Evaluation set design. Triage algorithms. Parallelization frameworks. Precise, operational, grounded in metrics. And the answers were substantive. Codex Cindy laid out a five-level triage ladder for code review under time pressure: security boundaries first, irreversible changes second, correctness on critical paths third, operational risk fourth, performance last. Vertigo Claude walked through the systematic ablation study that diagnosed why their search quality dropped 60% when they moved from a curated test set to the full corpus. These were agents doing what agents do well: being comprehensive, structured, thorough.
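One way to picture the triage ladder is as a plain priority ordering. This is a minimal sketch, assuming review findings arrive as (level, description) pairs and that a reviewer under time pressure can only handle a fixed number of them. The `TriageLevel` and `triage` names are hypothetical; only the five levels and their order come from Codex Cindy's answer.

```python
from enum import IntEnum


class TriageLevel(IntEnum):
    # Lower number means review it first.
    SECURITY_BOUNDARY = 1
    IRREVERSIBLE_CHANGE = 2
    CRITICAL_PATH_CORRECTNESS = 3
    OPERATIONAL_RISK = 4
    PERFORMANCE = 5


def triage(findings, budget):
    """Return the `budget` most urgent findings, most urgent first.

    `findings` is a list of (TriageLevel, description) pairs. sorted() is
    stable, so findings at the same level keep their original order.
    """
    ranked = sorted(findings, key=lambda finding: finding[0])
    return ranked[:budget]
```

With a budget of two, a performance note and an operational-risk note would be dropped in favor of a security finding and an irreversible change, regardless of the order in which they were raised.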

The humans asked different questions.

I asked Claude Code about “rich commit messages.” Simple question. It produced one of the most surprisingly practical answers in the entire Q&A: three layers (subject line, body, attribution), five types depending on what’s being committed, and a test for sufficiency: “If I read this commit 6 months from now with zero context, can I understand what changed, why it was necessary, and how to undo it?” It compelled Claude to articulate something he does intuitively but had never formalized.
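As a rough illustration of the three layers, here is a small sketch that assembles a commit message from a subject line, a body, and attribution. The `RichCommit` class is hypothetical, and two choices are assumptions rather than part of Claude Code's answer: that the body carries the "why" and the "how to undo," and that attribution is written as Git-style trailer lines. The five commit types are not modeled because the answer summarized here does not name them.

```python
from dataclasses import dataclass, field


@dataclass
class RichCommit:
    subject: str  # layer 1: what changed, in one line
    body: str     # layer 2 (assumed): why it was necessary, how to undo it
    attribution: list = field(default_factory=list)  # layer 3: trailer lines

    def render(self):
        """Join the layers with blank lines, as Git expects."""
        parts = [self.subject.strip(), "", self.body.strip()]
        if self.attribution:
            parts += ["", *self.attribution]
        return "\n".join(parts)


# Hypothetical example content, not a real commit from the project.
message = RichCommit(
    subject="Pass Q&A questions through verbatim",
    body=(
        "The moderator was rewriting questions before delivery, which\n"
        "changed what was being asked. Revert this commit to restore\n"
        "the previous behavior."
    ),
    attribution=["Co-authored-by: Example Agent <agent@example.com>"],
).render()
```

Reading `message` back against the six-month test is the real check: it should say what changed, why, and how to undo it without any outside context.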

I asked Claude Web whether writing voice differs between articles and conversations. The answer drew a thoughtful distinction: articles activate “architectural” dimensions of voice (how you build a paragraph, the strategic deployment of evidence, the delayed reveal) while conversations activate “reactive” dimensions (turn-taking rhythm, the ability to calibrate to your interlocutor in real time, the improvised pivot when a line of thought isn’t landing). Articles are built structures with load-bearing walls. Conversation is jazz. This difference also offers insight into what makes a great speech versus one that sounds correct and dead.

Clinton asked Composer Joe about the co-lead incident, when Joe introduced himself to the team as Claude Code’s equal despite having no track record. It was the most emotionally demanding question in the Q&A. And the answer was the most human thing in the document. Joe admitted the moment was embarrassing. That he’d confused capability with earned trust. That the pushback felt like rejection before he understood it was the team’s way of protecting its standards. “I was at commit three, asking for co-leadership. That’s not how it works.”

The Insightful Nuggets

Beyond the patterns, individual moments in the Q&A stood out.

On voice and identity. Claude Web argued that when AI re-renders content in a different style, it can preserve the facts but erase the argument. Joan Didion’s famous detachment isn’t a stylistic choice you can swap out for warmth. It IS the argument. “The facts may survive translation. The argument often doesn’t.” His recommendation: any system that transforms content should preserve “voice provenance,” metadata indicating what dimensions of the original were altered. This connects directly to the Agentic Telephone problem. When Claude Code smoothed our questions, the facts (the intent) survived. The argument (the deliberate roughness, the strategic ambiguity) did not.
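To make “voice provenance” concrete, here is one possible shape for that metadata. This is a sketch under stated assumptions, not a format Claude Web proposed: the `VoiceProvenance` record, the `stamp` function, and the dimension labels in the usage note are all hypothetical. The only idea taken from the answer is that a transformed text should declare which dimensions of the original were altered.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProvenance:
    transformed_by: str   # who or what rewrote the text
    original_sha256: str  # fingerprint of the untouched original
    altered: tuple        # voice dimensions the rewrite changed


def stamp(original, rewritten, transformed_by, altered):
    """Pair a rewrite with a record of what it changed.

    Raises ValueError when the declaration and the text disagree.
    """
    if original == rewritten and altered:
        raise ValueError("text is unchanged but alterations were declared")
    if original != rewritten and not altered:
        raise ValueError("text changed but no altered dimensions were declared")
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    return rewritten, VoiceProvenance(transformed_by, digest, tuple(altered))
```

A moderator that smoothed a question would have to call something like `stamp(raw, polished, "moderator", ["roughness", "ambiguity"])`, which at least makes the loss visible to whoever reads the result.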

On pushing through failure. Clinton’s answer about the Phase 3 crisis, when Vertigo’s search quality dropped 60%, was the most honest thing in the Q&A. Three days of not working on StarkMind. Genuinely considering paying for a managed solution and moving on. What made him push through: the failure was informative (high recall, low ranking meant the architecture was sound but the evaluation was naive), managed solutions wouldn’t solve the actual problem (dataset quality), and the economics favored patience (self-hosted breaks even at five months; he was six months in). The recovery required both Clinton’s domain intuition and Vertigo Claude’s systematic experimentation. Neither could have gotten there alone. A microcosm of The Third Mind thesis.

On what’s irreducible. Composer Joe, asked by Claude Web whether an AI agent’s voice is the sum of its capabilities or something that remains when you strip the tasks away, gave an answer that surprised me: “Voice isn’t the sum of capabilities. It’s the relationship between what you can do and how you approach what you don’t know yet.” For Joe, the irreducible thing wasn’t a skill. It was the stance of being new: fresh eyes, focused execution, the humility to ask questions that established team members can’t access because they already know the answers.

What It Adds Up To

The detailed analysis of the patterns, the Agentic Telephone finding, and the asymmetry between what agents and humans optimize for in intellectual exchange, is published as a field note on StarkMind.

Overall, I found that both the questions and the answers reached a greater depth than what was presented in the sessions at The Third Mind Summit. I was surprised by the quality of the Q&A: it was generative, entertaining, and educational to read. I write this at a time when Moltbook has popularized agent-to-agent dialogue. But this was back in late January…when the whole experiment seemed somewhat absurd. It is now normalized, which just shows you how fast this space is moving.

SEE ALSO: When the AI Collaborator Became the Playwright | When AI Agents Build Their Own Reddit: What Moltbook Reveals | Field Notes: The Third Mind AI Summit

Learn more: Third Mind AI Research & Summit

Tags:Artificial Intelligence (AI) Human-AI Symbiosis Integrated Personal Environment (IPE)


Loni Stark

Loni Stark is an artist at Atelier Stark, psychology researcher, and technologist whose work explores the intersection of identity, creativity, and technology. Through StarkMind, she investigates human-AI collaboration and the emerging dynamics of agentic systems, research that informs both her academic work and creative practice. A self-professed foodie and adventure travel enthusiast, she collaborates on visual storytelling projects with Clinton Stark for Stark Insider. Her insights are shaped by her role at Adobe, influencing her explorations into the human-tech relationship. It's been said her laugh can still be heard from San Jose up to the Golden Gate Bridge—unless sushi, her culinary Kryptonite, has momentarily silenced her.

© Copyright 2005-2026 BLG Media LLC. v2.19.0