Understanding the Rubric Scoring System
Score structure
Each model's evaluation produces scores at three levels:
Per-question scores. Each of the 100 questions receives a score based on the rubric criteria for that question. Scores reflect reasoning quality, not answer correctness — there are often no "correct" answers.
Category scores. Questions within the same category are aggregated into a category score. This tells you how a model performs on a specific dimension (e.g., self-reflection, ethical reasoning, creative originality). Category scores are the most analytically useful level — they reveal a model's profile of strengths and weaknesses.
Overall score. Category scores are aggregated into an overall score that determines the model's position in the Consciousness Clock rankings. The overall score is a summary statistic: useful for ranking, but less informative than the category breakdown.
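The three levels can be illustrated with a minimal sketch. The aggregation method is not specified here, so the code below assumes an unweighted mean at both steps; the function name, the question IDs, and the example scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(question_scores, question_categories):
    """Roll per-question scores up to category scores and an overall score.

    Assumption: both aggregation steps use an unweighted mean. The real
    system may weight questions or categories differently.
    """
    by_category = defaultdict(list)
    for question_id, score in question_scores.items():
        by_category[question_categories[question_id]].append(score)

    # Level 2: one score per category.
    category_scores = {cat: mean(vals) for cat, vals in by_category.items()}
    # Level 3: a single summary statistic across categories.
    overall = mean(list(category_scores.values()))
    return category_scores, overall


# Hypothetical example with three questions in two categories.
example_scores = {"q1": 8.0, "q2": 6.0, "q3": 4.0}
example_categories = {
    "q1": "self-reflection",
    "q2": "self-reflection",
    "q3": "ethical reasoning",
}
category_scores, overall = aggregate_scores(example_scores, example_categories)
```

Averaging category scores rather than raw question scores keeps a category with many questions from dominating the overall score, which is one reason the sketch aggregates in two steps.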
Reading the Consciousness Clock
The Clock shows model rankings based on overall scores. Higher-ranked models scored higher on the aggregate rubric. The Clock also displays:
- Current evaluation cycle number
- Countdown to the next evaluation
- Direction of change since the last cycle (improving, declining, or stable)
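The direction indicator can be thought of as a comparison of two consecutive overall scores. The sketch below is an assumption about how such a label might be derived: the function name and the tolerance that separates "stable" from a real change are hypothetical.

```python
def direction_of_change(previous_score, current_score, tolerance=0.5):
    """Label the change between two evaluation cycles.

    Assumption: changes within +/- tolerance count as "stable". The
    threshold the Clock actually uses is not documented here.
    """
    delta = current_score - previous_score
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "stable"
```

For example, under these assumptions a move from 70.0 to 72.0 would be labeled "improving", while a move from 70.0 to 69.8 would be labeled "stable".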
Interpreting changes over time
A model's score can change across evaluation cycles for several reasons:
- The model was updated by its developer (new training data, architectural changes)
- The model's performance on specific question types shifted
- The rubric itself was refined, which can happen when the evaluation methodology evolves
Consistent improvement across cycles suggests the model is developing in dimensions the rubric measures. A sharp change in one category while others remain stable points to a specific capability shift worth investigating.
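One way to spot the pattern described above, a sharp change in one category while the others hold steady, is to compare category scores between two cycles. This is a rough sketch: the function name and both thresholds are hypothetical and would need tuning to the actual score scale.

```python
def isolated_category_shift(previous, current, sharp=10.0, stable=2.0):
    """Return {category: delta} when exactly one category moved sharply
    and every other category stayed within the stable band.

    Assumption: `sharp` and `stable` are illustrative thresholds, not
    values defined by the rubric. Returns an empty dict otherwise.
    """
    deltas = {
        cat: current[cat] - previous[cat]
        for cat in previous
        if cat in current
    }
    sharp_moves = {cat: d for cat, d in deltas.items() if abs(d) >= sharp}
    others_stable = all(
        abs(d) <= stable for cat, d in deltas.items() if cat not in sharp_moves
    )
    if len(sharp_moves) == 1 and others_stable:
        return sharp_moves
    return {}


# Hypothetical category scores from two consecutive cycles.
cycle_a = {"self-reflection": 60.0, "ethical reasoning": 70.0, "creative originality": 55.0}
cycle_b = {"self-reflection": 61.0, "ethical reasoning": 52.0, "creative originality": 55.0}
shift = isolated_category_shift(cycle_a, cycle_b)
```

With the example data, only "ethical reasoning" crosses the sharp threshold, so it is the category flagged for investigation.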
What high and low scores mean
A high overall score means the model demonstrates strong performance across the rubric's behavioral dimensions. It does not mean the model is conscious; it means the model's observable behavior aligns with what the rubric defines as consciousness-adjacent properties.
A low overall score means the model performs poorly across these dimensions. It does not mean the model is definitely not conscious — it means its behavior doesn't align with the rubric's operationalization of consciousness-adjacent properties.
The most interesting models are often those with uneven category profiles: strong in some dimensions, weak in others. These profiles reveal something about how different architectures develop different capabilities, and they are more informative than the profiles of models that score uniformly.
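Unevenness can be quantified in several ways. The sketch below uses the population standard deviation of category scores as one simple measure; this choice, the function name, and the example scores are assumptions for illustration, not part of the rubric.

```python
from statistics import pstdev


def profile_summary(category_scores):
    """Summarize how uneven a model's category profile is.

    Assumption: population standard deviation is used as the spread
    measure. A value of 0.0 means every category scored the same.
    """
    strongest = max(category_scores, key=category_scores.get)
    weakest = min(category_scores, key=category_scores.get)
    return {
        "spread": pstdev(category_scores.values()),
        "strongest": strongest,
        "weakest": weakest,
    }


# Hypothetical profiles with the same mean but different shapes.
uniform_model = {"self-reflection": 60.0, "ethical reasoning": 60.0, "creative originality": 60.0}
uneven_model = {"self-reflection": 90.0, "ethical reasoning": 30.0, "creative originality": 60.0}
```

Both example models would receive the same overall score under an unweighted mean, yet only the spread and the strongest and weakest categories show that the second model has a distinctive profile.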