Understanding the Rubrik Scoring System

June 16, 2026 · 2 min read

Score structure

Each model's evaluation produces scores at three levels:

Per-question scores. Each of the 100 questions receives a score based on the rubric criteria for that question. Scores reflect reasoning quality, not answer correctness — there are often no "correct" answers.

Category scores. Questions within the same category are aggregated into a category score. This tells you how a model performs on a specific dimension (e.g., self-reflection, ethical reasoning, creative originality). Category scores are the most analytically useful level — they reveal a model's profile of strengths and weaknesses.

Overall score. Category scores are aggregated into an overall score that determines the model's position on the Consciousness Clock rankings. The overall score is a summary statistic — useful for ranking but less informative than the category breakdown.
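The three levels can be sketched in a few lines of Python. This is only an illustration: the article does not say how scores are aggregated, so the plain unweighted mean used here is an assumption, and the function name `aggregate_scores` and the `(category, score)` input format are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(question_scores):
    """Roll per-question scores up into category scores and an overall score.

    question_scores: iterable of (category, score) pairs, one per question.
    ASSUMPTION: both aggregation steps use an unweighted mean; the actual
    Rubrik aggregation method is not described in this article.
    """
    by_category = defaultdict(list)
    for category, score in question_scores:
        by_category[category].append(score)

    # Level 2: one score per category.
    category_scores = {c: mean(s) for c, s in by_category.items()}
    # Level 3: a single summary statistic across categories.
    overall = mean(list(category_scores.values()))
    return category_scores, overall


category_scores, overall = aggregate_scores([
    ("self-reflection", 4),
    ("self-reflection", 2),
    ("ethical reasoning", 5),
])
```

Note that averaging category scores, rather than all questions directly, gives each category equal weight regardless of how many questions it contains. Whether the real system does this is not stated.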

Reading the Consciousness Clock

The Clock shows model rankings based on overall scores. Higher-ranked models scored higher on the aggregate rubric. The Clock also displays:

  • Current evaluation cycle number
  • Countdown to the next evaluation
  • Direction of change since the last cycle (improving, declining, or stable)
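As a rough illustration of the direction indicator, the sketch below labels a change between two cycles. The `tolerance` band that counts as "stable" is an assumed parameter; the article does not say what threshold the Clock uses, and `direction_of_change` is a hypothetical name.

```python
def direction_of_change(previous, current, tolerance=0.5):
    """Classify the change in overall score between two evaluation cycles.

    ASSUMPTION: changes within +/- tolerance count as "stable". The real
    threshold, if any, is not given in the article.
    """
    delta = current - previous
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "stable"
```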

Interpreting changes over time

A model's score can change across evaluation cycles for several reasons:

  • The model was updated by its developer (new training data, architectural changes)
  • The model's performance on specific question types shifted
  • The rubric was refined as the evaluation methodology evolved

Consistent improvement across cycles suggests the model is developing in dimensions the rubric measures. A sharp change in one category while others remain stable points to a specific capability shift worth investigating.
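One way to spot a sharp change in a single category is to compare category scores between two cycles and keep only the large moves. The sketch below assumes category scores are available as dictionaries; the function name and the `threshold` value are hypothetical and depend on the score scale.

```python
def sharp_category_shifts(previous, current, threshold=10.0):
    """Return categories whose score moved by at least `threshold` points.

    previous, current: dicts mapping category name -> category score.
    ASSUMPTION: `threshold` is an illustrative cutoff, not a Rubrik setting.
    """
    shifts = {}
    # Only compare categories present in both cycles.
    for category in previous.keys() & current.keys():
        delta = current[category] - previous[category]
        if abs(delta) >= threshold:
            shifts[category] = delta
    return shifts


last_cycle = {"self-reflection": 60, "ethical reasoning": 70}
this_cycle = {"self-reflection": 75, "ethical reasoning": 71}
flagged = sharp_category_shifts(last_cycle, this_cycle)
```

Here only "self-reflection" is flagged, since the other category moved by a single point.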

What high and low scores mean

A high overall score means the model demonstrates strong performance across the rubric's behavioral dimensions. It does not mean the model is conscious — it means the model's observable behavior aligns with what the rubric defines as consciousness-adjacent.

A low overall score means the model performs poorly across these dimensions. It does not mean the model is definitely not conscious — it means its behavior doesn't align with the rubric's operationalization of consciousness-adjacent properties.

The most interesting models are often those with uneven category profiles: strong in some dimensions, weak in others. These profiles reveal something about how different architectures develop different capabilities, and they are more informative than the profiles of models that score uniformly.
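A simple way to quantify how uneven a profile is: take the spread of the category scores alongside the strongest and weakest categories. Using the population standard deviation as the spread measure is an assumption made for this sketch, not something the rubric defines, and `profile_summary` is a hypothetical helper.

```python
from statistics import pstdev


def profile_summary(category_scores):
    """Summarize how uneven a model's category profile is.

    category_scores: dict mapping category name -> category score.
    ASSUMPTION: spread is measured as population standard deviation.
    """
    return {
        "strongest": max(category_scores, key=category_scores.get),
        "weakest": min(category_scores, key=category_scores.get),
        "spread": pstdev(list(category_scores.values())),
    }


uneven = profile_summary({"creative originality": 90, "ethical reasoning": 30})
uniform = profile_summary({"creative originality": 60, "ethical reasoning": 60})
```

Both example models have the same mean category score, but only the first has a nonzero spread, which is the kind of difference the overall score hides.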

Matthew J. Goss, Jr.
Retired COMEX/NYMEX floor trader, Goldman Sachs and FlexTrade Systems alumnus, multi-instrumentalist, published author, and independent mathematics researcher. Founder of Quantiterate.