Add objective Overall / Value / Capability score columns to the models table#1892
Open
huncho-tensei wants to merge 1 commit into
Adds three sortable, transparently computed score columns to the models table, plus a dynamic rank (#) column that renumbers with the current sort. Scores are derived entirely from existing objective catalog fields (cost, context window, output limit, capability flags, modality breadth, release date), with no benchmarks or hand-grading.
Four normalized 0-100 components (capability, cost-efficiency, context, recency) are blended into three lenses, with weights documented in one place in score.ts:
- Overall: well-rounded "best overall"
- Value: cost-efficiency weighted (cheap-yet-capable)
- Capability: feature/modality breadth weighted
The table defaults to Overall (descending). All three columns sort like any other column, so users can pick the lens that fits their use case. Scope is web-only; the canonical api.json data is unchanged.
Author
Rationale and design discussion in #1893.
Summary
Adds three sortable, transparently-computed score columns to the models table — Overall, Value, and Capability — plus a dynamic rank (#) column that renumbers with the current sort.
The goal is to let people rank the whole catalog by objective criteria without leaving the table they already use. Scores are derived entirely from existing catalog fields — no benchmarks, no hand-grading, no external data.
New column layout:
The table defaults to Overall, descending. All three score columns sort like every other column, so users can switch lens with one click.
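The default sort and the renumbering rank column could look roughly like the sketch below. It is illustrative only: the `Row` shape, the `sortAndRank` helper, and the model ids are hypothetical and are not taken from the PR's `render.tsx` or `index.ts`.

```typescript
// Hypothetical row shape: one entry per model with its three lens scores.
type Row = { id: string; overall: number; value: number; capability: number };
type ScoreColumn = "overall" | "value" | "capability";

// Sort a copy of the rows by the chosen score column (Overall, descending,
// by default) and assign a 1-based rank that follows the current sort order.
function sortAndRank(
  rows: Row[],
  column: ScoreColumn = "overall",
  descending = true,
): Array<Row & { rank: number }> {
  const sorted = [...rows].sort((a, b) =>
    descending ? b[column] - a[column] : a[column] - b[column],
  );
  return sorted.map((row, index) => ({ ...row, rank: index + 1 }));
}

// Made-up scores for three placeholder models.
const exampleRows: Row[] = [
  { id: "model-a", overall: 60, value: 90, capability: 40 },
  { id: "model-b", overall: 80, value: 50, capability: 95 },
  { id: "model-c", overall: 70, value: 70, capability: 70 },
];

const byOverall = sortAndRank(exampleRows);        // default lens
const byValue = sortAndRank(exampleRows, "value"); // switched lens
```

The rank is recomputed from the sorted position rather than stored on the row, so switching lens changes which model is #1 without touching the underlying data.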
How the scores are computed
Four normalized 0–100 components are calculated from objective fields, then blended with weights kept in one documented place (packages/web/src/score.ts):
- Capability: tool_call, reasoning, structured_output, temperature + input/output modality breadth
- Cost-efficiency: cost.input + cost.output ($/1M), log-scaled then inverted
- Context: limit.context + limit.output, log-scaled
- Recency: release_date
Each component is min-max normalized across the whole dataset (a missing or unparseable field collapses to a neutral 50, so it never silently wins or loses). The three lenses are just different weightings of these components.
The weights are the only opinion in the change and are isolated in a single WEIGHTS object so they're trivial to audit or tune.
What the score does, and does NOT, measure
This is important and stated up front: the catalog has no quality/benchmark field, so the score cannot and does not measure model "intelligence." "Capability" here means breadth of declared features and modalities, not how good a model's outputs are.
A direct consequence, visible in the live data: broad, cheap, omni-modal models (and meta/auto-routers, which declare every modality at low listed cost) rank at the very top, above expensive frontier models. That is correct given these inputs — it's a spec-breadth-per-dollar ranking, not a smartness ranking.
This is also the reasoning behind shipping it as sortable columns rather than one decreed ranking: the data stays neutral, and the user chooses the lens that fits their use case. If a future schema ever adds an objective quality signal, it drops straight into the existing component blend.
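As a rough illustration of the component blend described above, the sketch below min-max normalizes raw values to 0-100 (with missing values collapsing to a neutral 50) and combines four components using a single weights object. Everything named here is an assumption for illustration: the weight values are placeholders (the PR's actual weights are not reproduced in this description), `log1p` is one possible way to log-scale cost so that free models stay finite, and the helper names are hypothetical rather than the contents of `score.ts`.

```typescript
type Components = { capability: number; cost: number; context: number; recency: number };
type Lens = "overall" | "value" | "capability";

// Placeholder weights, NOT the PR's real values; each lens is a different weighting.
const WEIGHTS: Record<Lens, Components> = {
  overall: { capability: 40, cost: 30, context: 20, recency: 10 },
  value: { capability: 25, cost: 55, context: 15, recency: 5 },
  capability: { capability: 60, cost: 10, context: 20, recency: 10 },
};

// Min-max normalize across the dataset to 0-100. Missing or non-finite
// values become a neutral 50, as does every value when there is no spread.
function normalize(values: Array<number | null>): number[] {
  const present = values.filter((v): v is number => v !== null && Number.isFinite(v));
  const min = Math.min(...present);
  const max = Math.max(...present);
  return values.map((v) => {
    if (v === null || !Number.isFinite(v) || max === min) return 50;
    return ((v - min) / (max - min)) * 100;
  });
}

// Raw cost signal: log-scaled then inverted, so cheaper models score higher
// after normalization. log1p is an assumption that keeps zero cost finite.
function costRaw(input: number, output: number): number {
  return -Math.log1p(input + output);
}

// Weighted average of the four normalized components for one lens.
function blend(c: Components, w: Components): number {
  const total = w.capability + w.cost + w.context + w.recency;
  const sum =
    c.capability * w.capability + c.cost * w.cost + c.context * w.context + c.recency * w.recency;
  return sum / total;
}

// Usage: a cheap, feature-poor model viewed through two lenses.
const cheapModel: Components = { capability: 20, cost: 100, context: 50, recency: 50 };
const valueScore = blend(cheapModel, WEIGHTS.value);
const capabilityScore = blend(cheapModel, WEIGHTS.capability);
```

Under these placeholder weights, the same model scores higher on the Value lens than on the Capability lens, which is the behavior the lens design is meant to expose. A new objective signal would be one more field in `Components` and one more weight per lens.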
Scope
- api.json / core data is unchanged: scoring is a presentation-layer concern and does not pollute the data API.
- score.ts (new), shared.ts, render.tsx, index.ts, index.css.
Validation
- bun validate passes.
- cd packages/web && bun run build succeeds; rendered HTML contains all new columns and computed scores across the dataset.
- sortValues entries.
Happy to adjust the weights, drop to two columns (Overall + Value), or gate this behind discussion if a built-in ranking isn't a direction you want; feedback welcome.