Model Trust: Building an AI Judge for AI Models

Every AI product I've built has forced the same question: which model should I use? The answer was never obvious. I was making the call based on gut feeling, marketing pages, and whatever someone on Twitter said last week. I wanted data.
What is Model Trust?
Model Trust is an evaluation framework that sends identical prompts to multiple AI models and uses an AI judge to score their reliability, measure agreement, and flag when human review is needed. It currently tests models from OpenAI, Anthropic, Google, and xAI across structured question types like Likert scales, binary choices, and open-ended responses.
The backstory
After building Cash Grab, Fresh News, and Wolf Wednesday, I had three products using AI in different ways. Each time I picked a model, I was guessing. I'd read the benchmarks on the provider's website and think "sure, that looks good." But benchmarks test what the provider wants to show you. They don't test your specific use case with your specific prompts.
I started swapping models in my projects and noticing real differences. Not just speed or price, but actual disagreements in the outputs. One model would confidently give an answer that another model flatly contradicted. Both sounded authoritative. Both formatted their responses beautifully. And I had no systematic way to figure out which one to trust.
That's what Model Trust was built to solve. Not another benchmark suite. A tool for answering a simple question: if I send the same prompt to every major model, do they agree? And when they don't, how do I figure out which one is off?
What it does
The workflow is straightforward. You create a survey with a set of questions, pick which models to evaluate, and run it. Model Trust sends identical prompts to every selected model and collects their responses. The key detail is that every prompt enforces structured JSON responses. No freeform text that you have to eyeball and compare. The models return data in a consistent format, which means you can actually measure agreement programmatically.
The platform supports a range of question types because different evaluation needs call for different formats. Open-ended questions test how models reason and explain. Single-select and binary questions test whether they reach the same conclusion. Likert scales and numeric scales test whether they calibrate similarly. Forced-choice questions test how they handle tradeoffs. Matrix Likert questions test consistency across related items. Each type gets its own analysis pipeline because comparing free text and comparing Likert responses are fundamentally different problems.
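To make the per-type distinction concrete, here is a minimal sketch of how agreement might be measured differently by question type. The function names and dispatch structure are my own illustration, not Model Trust's actual code; open-ended text needs a separate similarity pipeline entirely.

```python
from collections import Counter

def binary_agreement(answers: list[str]) -> float:
    """Share of responses matching the majority answer."""
    majority = Counter(answers).most_common(1)[0][1]
    return majority / len(answers)

def likert_agreement(answers: list[int], scale_max: int = 5) -> float:
    """1.0 when every model picks the same point; falls toward 0
    as answers spread across the scale."""
    spread = max(answers) - min(answers)
    return 1.0 - spread / (scale_max - 1)

def agreement(question_type: str, answers: list) -> float:
    # Open-ended answers need text similarity (e.g. TF-IDF),
    # handled elsewhere; structured types compare directly.
    if question_type == "binary":
        return binary_agreement(answers)
    if question_type == "likert":
        return likert_agreement(answers)
    raise ValueError(f"no structured metric for {question_type}")

print(agreement("binary", ["yes", "yes", "yes", "no"]))  # 0.75
print(agreement("likert", [4, 5, 4, 4]))                 # 0.75
```

Comparing a 3-vs-1 binary split and a one-point Likert spread both land at 0.75 here, but they mean different things, which is exactly why each type gets its own pipeline.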
The AI judge
This is the core of the whole thing. After a run completes, an AI judge evaluates every model's performance across two dimensions: reliability and agreement.
Reliability scoring looks at the mechanics. Did the model return valid JSON? How often did it leave answers empty? Did it cite sources when asked? How consistent was its response latency? Each factor contributes to a score out of 10. A model that returns broken JSON half the time might have great answers in the other half, but you can't ship that.
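A rough sketch of how those factors could roll up into a score out of 10. The factor names mirror the ones above, but the weights are illustrative guesses, not Model Trust's actual calibration:

```python
def reliability_score(stats: dict) -> float:
    """Combine mechanical reliability factors (each 0.0-1.0) into a
    score out of 10. Weights are illustrative, not the real ones."""
    weights = {
        "json_validity": 0.4,        # fraction of responses that parsed
        "answer_rate": 0.3,          # 1 - empty-answer frequency
        "citation_rate": 0.2,        # cited sources when asked
        "latency_consistency": 0.1,  # stable response times
    }
    score = sum(weights[k] * stats[k] for k in weights)
    return round(score * 10, 1)

# Broken JSON half the time drags the score down even when
# everything else is perfect:
print(reliability_score({
    "json_validity": 0.5,
    "answer_rate": 1.0,
    "citation_rate": 1.0,
    "latency_consistency": 1.0,
}))  # 8.0
```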
Agreement analysis is where it gets more interesting. The system uses TF-IDF vectors and cosine similarity to cluster responses and find where models converge or diverge. If four models say roughly the same thing and one says something completely different, that outlier gets flagged. Not automatically marked wrong. Just flagged. The outlier might be the only one that's right. But you need to know it's an outlier so a human can look at it.
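The same idea can be sketched with nothing but the standard library: build a TF-IDF vector per response, compare pairs with cosine similarity, and flag any model whose average similarity to the rest falls below a cutoff. This is a simplified stand-in for the real pipeline; the whitespace tokenizer, the 0.2 cutoff, and the function names are my assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [
        {term: (count / len(tokens)) * math.log(n / df[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def flag_outliers(responses, cutoff=0.2):
    """Flag models whose mean similarity to the others is below cutoff."""
    names = list(responses)
    vecs = dict(zip(names, tfidf_vectors([responses[m] for m in names])))
    return [
        m for m in names
        if sum(cosine(vecs[m], vecs[o]) for o in names if o != m)
           / (len(names) - 1) < cutoff
    ]

print(flag_outliers({
    "model_a": "the capital of france is paris",
    "model_b": "paris is the capital of france",
    "model_c": "france capital paris",
    "model_d": "the answer is clearly berlin germany",
}))  # ['model_d']
```

Note that flagging says nothing about correctness: the code only identifies the odd one out, and a human decides whether the outlier is wrong or the lone dissenter who got it right.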
The reliability threshold is 7.0. Any model scoring below that on a run gets flagged for human review. And if 30% or more of the questions in a run show significant disagreement across models, the entire run gets flagged. These numbers aren't magic. I picked them after running dozens of test surveys and watching where the false positives and false negatives clustered. They're calibrated to my use cases, and I expect to adjust them over time.
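In code, the flagging logic itself is almost trivial; all the value is in the calibration. A sketch using the thresholds from the article (the field names are illustrative):

```python
RELIABILITY_THRESHOLD = 7.0    # calibrated by hand, per the article
DISAGREEMENT_THRESHOLD = 0.30  # share of questions with significant disagreement

def review_flags(model_scores: dict[str, float],
                 question_disagreements: list[bool]) -> dict:
    """Return what needs a human look after a run."""
    flagged_models = [m for m, score in model_scores.items()
                      if score < RELIABILITY_THRESHOLD]
    rate = sum(question_disagreements) / len(question_disagreements)
    return {
        "models_for_review": flagged_models,
        "run_flagged": rate >= DISAGREEMENT_THRESHOLD,
    }

print(review_flags({"model_a": 8.2, "model_b": 6.4},
                   [True, False, False, False]))
# {'models_for_review': ['model_b'], 'run_flagged': False}
```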
The important thing: Model Trust never makes autonomous production decisions. It surfaces data and flags uncertainty. The human decides what to do with it.
What was surprising
I expected the premium-tier models to dominate across the board. They didn't. The cost-effective tiers from most providers came surprisingly close to their premium counterparts on structured questions. Binary and single-select responses were nearly identical between tiers. The gap showed up mostly on open-ended questions, where the premium models gave more nuanced reasoning, but the actual conclusions were often the same.
The biggest surprise was where models agreed. On factual questions with clear answers, agreement was high across all providers. That's reassuring but not particularly useful. The interesting data came from questions with legitimate ambiguity. Models clustered into predictable camps, and which camp a model landed in was more consistent than I expected. Run the same ambiguous question ten times and a given model tends to land in the same cluster each time.
One pattern I didn't expect: a model's expressed confidence had almost no correlation with the quality of its response. Some models would hedge and qualify and then give a precise, correct answer. Others would state things with total certainty and be the outlier. Confidence is performance, not signal.
Where AI helped build it
Claude Code built the Next.js application from the ground up. The NLP pipeline for agreement analysis was a particularly good fit for AI-assisted development. TF-IDF vectorization, entity extraction, sentiment analysis. These are well-documented patterns with clear implementations. Claude Code generated the core pipeline and I tuned the parameters.
The MySQL job queue was another win. Model Trust runs can take minutes when you're hitting multiple API providers in sequence, so I needed background job processing with progress tracking. Claude Code helped me build a custom queue backed by Prisma and MySQL with server-sent events for live progress updates in the browser. Not novel architecture, but the kind of plumbing that eats days when you're writing it from scratch.
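The shape of that queue is easy to sketch. Here is a single-connection SQLite stand-in for the MySQL/Prisma version: jobs live in a table, a worker claims the next pending row, and progress is written back so the UI can stream it. The schema and helper names are illustrative; a real multi-worker setup needs an atomic claim (e.g. a status-guarded UPDATE in MySQL):

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL-backed queue described above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    id       INTEGER PRIMARY KEY,
    payload  TEXT,
    status   TEXT    DEFAULT 'pending',
    progress INTEGER DEFAULT 0
)""")
db.execute("INSERT INTO jobs (payload) VALUES ('run survey 42')")
db.commit()

def claim_next_job(conn):
    """Claim the oldest pending job. Fine for one worker; multiple
    workers would need an atomic status-guarded UPDATE instead."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?",
                 (row[0],))
    conn.commit()
    return row

def report_progress(conn, job_id, pct):
    # In the real app the browser reads this via server-sent events;
    # here we just persist it.
    conn.execute("UPDATE jobs SET progress = ? WHERE id = ?",
                 (pct, job_id))
    conn.commit()

job = claim_next_job(db)   # -> (1, 'run survey 42')
report_progress(db, job[0], 50)
```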
The structured JSON enforcement layer, the part that makes sure every model returns comparable data, was iterative work between me and Claude Code. I'd describe the schema I wanted, Claude Code would generate the prompt engineering and validation logic, I'd test it against real model outputs and find edge cases, and we'd refine.
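A minimal sketch of what that validation layer might look like, with one real-world edge case baked in: models that wrap their JSON in markdown fences. The schema here (an "answer" field drawn from a fixed set of options) is a hypothetical example, not Model Trust's actual schema:

```python
import json

def extract_structured_answer(raw: str, allowed: set[str]) -> dict:
    """Validate one model response against a toy schema: a JSON
    object whose 'answer' field must come from a fixed set."""
    text = raw.strip()
    # Edge case: some models wrap their JSON in ``` fences.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {"valid": False, "reason": "not JSON"}
    if not isinstance(data, dict) or data.get("answer") not in allowed:
        return {"valid": False, "reason": "answer outside allowed set"}
    return {"valid": True, "answer": data["answer"]}

print(extract_structured_answer('```json\n{"answer": "yes"}\n```',
                                {"yes", "no"}))
# {'valid': True, 'answer': 'yes'}
```

Each "invalid" result feeds back into the reliability score; each newly discovered failure mode becomes another normalization branch like the fence-stripping one above.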
Where AI didn't help
Designing the evaluation methodology was entirely on me. What does "trust" even mean for an AI model? Is it accuracy? Consistency? Agreement with other models? Transparency about uncertainty? I spent more time on that question than on any piece of code in the project.
Choosing which metrics to weight in the reliability score was a judgment call. JSON validity is binary and easy. But how much should latency consistency matter relative to citation rates? There's no right answer. There's only what matters for the decision you're trying to make.
The threshold decisions were the hardest part. Why 7.0 for reliability? Why 30% for disagreement? Because those are the points where, in my testing, the signal-to-noise ratio felt right. Below 7.0, I was consistently finding problems when I manually reviewed the responses. Above 30% disagreement, the models were consistently split on something substantive. But "felt right" is a human judgment, and it took dozens of test runs to get there. AI can build the scoring system. It can't tell you what number should make you nervous.
Frequently asked questions
What is Model Trust?
Model Trust is an AI model evaluation platform that sends identical structured prompts to multiple LLMs and compares their responses using reliability scores, agreement analysis, and an AI judge. It helps determine which models to trust for specific tasks.
Which AI models does Model Trust compare?
Model Trust currently evaluates models from OpenAI (GPT-4o, GPT-4o mini), Anthropic (Claude Sonnet, Claude Opus), Google (Gemini Flash, Gemini Pro), and xAI (Grok 3, Grok 3 mini). Each provider has cost-effective and premium tiers.
How does the AI judge work?
The AI judge scores each model on JSON validity, citation rates, empty answer frequency, and latency consistency to produce a reliability score out of 10. It then analyzes agreement across models using TF-IDF clustering and cosine similarity, flagging outliers for human review.
What types of questions can Model Trust evaluate?
The platform supports open-ended, single-select, binary, forced-choice, Likert scale, numeric scale, and matrix Likert question types. Each type has tailored analysis methods for measuring model agreement.
Does Model Trust replace human judgment about which model to use?
No. Model Trust flags uncertainty and disagreement for human review. It ranks models by reliability and cost, but the final production decision always stays with a human.
Can I try Model Trust?
Yes. Model Trust is live at modeltrust.app. You can create surveys, configure runs across models, and analyze the results.

Colin Smillie
Most recently VP Technology at YMCA Canada. Building and shipping real products with AI-assisted development. More about Colin's advisory and executive work at colinsmillie.com.
