
Multi-Model Consensus Scoring

June 16, 2026 · 2 min read

Why consensus matters

A single model's score on a single question tells you about that model's behavior. Consensus across models tells you about the question itself — and potentially about the nature of what's being measured.

When five architecturally different models all respond similarly to a question about self-awareness, that convergence is interesting. It might mean the question is too easy (all models can pattern-match the expected response). Or it might mean the question touches something that all sufficiently capable models develop — something closer to a genuine cognitive property than a learned response.

When models diverge sharply on a question, the divergence is equally informative. It may reveal genuine architectural differences in how models process certain types of reasoning, or it may identify questions where training data differences produce different behavioral patterns.

How consensus is computed

For each question in the battery:

  1. All evaluated models' responses are scored individually
  2. The mean score and standard deviation across models are computed
  3. Questions with low standard deviation are flagged as high-consensus
  4. Questions with high standard deviation are flagged as low-consensus
  5. Patterns of agreement are analyzed across categories
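
The first four steps can be sketched as follows. This is a minimal illustration, not the production implementation. The function name, the data layout, the two threshold constants, and the use of the population standard deviation are all assumptions; the actual cutoffs and scoring scale are not stated here.

```python
from statistics import mean, pstdev

# Hypothetical thresholds: the real cutoffs are not given in this document.
HIGH_CONSENSUS_MAX_STD = 0.5
LOW_CONSENSUS_MIN_STD = 2.0


def consensus_by_question(scores):
    """Summarize per-question agreement.

    `scores` maps a question id to a dict of {model name: individual score}.
    Returns, per question, the mean, the standard deviation, and a flag.
    """
    results = {}
    for qid, by_model in scores.items():
        values = list(by_model.values())
        mu = mean(values)
        # Population standard deviation over the evaluated models (an assumption).
        sigma = pstdev(values)
        if sigma <= HIGH_CONSENSUS_MAX_STD:
            flag = "high_consensus"
        elif sigma >= LOW_CONSENSUS_MIN_STD:
            flag = "low_consensus"
        else:
            flag = None  # neither extreme, so not flagged
        results[qid] = {"mean": mu, "std": sigma, "flag": flag}
    return results


# Made-up scores for three models on two questions.
example = {
    "q1": {"A": 7.0, "B": 7.0, "C": 7.0},  # identical scores
    "q2": {"A": 1.0, "B": 5.0, "C": 9.0},  # widely spread scores
}
results = consensus_by_question(example)
```

With these made-up inputs, `q1` is flagged as high-consensus and `q2` as low-consensus. Step 5, the category-level analysis, would group these per-question results by category and is left out of the sketch.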

The overall consensus score summarizes how aligned models are across the full battery. A rising consensus score over time (as models improve) would suggest convergence toward similar cognitive capabilities. A stable or declining consensus score would suggest that model improvements are diversifying rather than converging — different architectures finding different solutions to the same problems.

Cross-model patterns

Beyond individual consensus scores, the system tracks which models agree with which. If Models A and B consistently agree while Model C consistently diverges, that clustering pattern reveals something about architectural similarity (or difference) that individual scores don't capture.
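
One way to surface this kind of clustering is to compare every pair of models directly. The sketch below assumes mean absolute score difference as the disagreement measure; the document does not say which measure the system uses, and the function name and data layout are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_disagreement(scores):
    """Return {(model_a, model_b): mean absolute score difference}.

    `scores` maps a question id to a dict of {model name: individual score}.
    Only questions answered by both models in a pair are compared.
    """
    models = sorted({m for by_model in scores.values() for m in by_model})
    out = {}
    for a, b in combinations(models, 2):
        diffs = [abs(q[a] - q[b]) for q in scores.values() if a in q and b in q]
        if diffs:
            out[(a, b)] = mean(diffs)
    return out


# Made-up scores: A and B track each other, C sits apart.
example = {
    "q1": {"A": 8.0, "B": 8.0, "C": 3.0},
    "q2": {"A": 6.0, "B": 7.0, "C": 1.0},
    "q3": {"A": 9.0, "B": 9.0, "C": 4.0},
}
disagreement = pairwise_disagreement(example)
closest_pair = min(disagreement, key=disagreement.get)
```

Here the pair with the smallest disagreement is A and B, while every pair involving C shows a much larger gap. A correlation-based measure would be a reasonable alternative when models use the score scale differently but rank questions similarly.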

These patterns are available to Pro users in the detailed evaluation results.

Independence requirement

Consensus is only meaningful if models respond independently. The evaluation protocol ensures:

  • Each model receives the battery in a separate session
  • No model sees any other model's responses
  • Questions are presented in the same order to all models, so ordering effects cannot account for differences between models
  • No model receives hints about expected responses

This independence is the foundation that makes consensus analysis valid. Without it, models could anchor to each other's responses, destroying the analytical value.

Matthew J. Goss, Jr.
Retired COMEX/NYMEX floor trader, Goldman Sachs and FlexTrade Systems alumnus, multi-instrumentalist, published author, and independent mathematics researcher. Founder of Quantiterate.