Scoring Mechanism

The Brier Score and the Notch Score

The Brier Score

For a sequence of N predictions with stated confidences f₁, ..., fN ∈ [0,1] and binary outcomes o₁, ..., oN ∈ {0,1}, the Brier Score is:

BS = (1/N) × Σ(fᵢ - oᵢ)²

The Brier Score ranges from 0 (perfect calibration and accuracy) to 1 (maximal inaccuracy). Lower is better.
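The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the protocol's implementation; the function name `brier_score` and its input validation are assumptions made for the example.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between stated confidences and binary outcomes."""
    if len(forecasts) == 0 or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and of equal length")
    total = 0.0
    for f, o in zip(forecasts, outcomes):
        if not 0.0 <= f <= 1.0:
            raise ValueError("each forecast must lie in [0, 1]")
        if o not in (0, 1):
            raise ValueError("each outcome must be 0 or 1")
        total += (f - o) ** 2
    return total / len(forecasts)


# A confident, correct predictor scores close to 0.
print(brier_score([0.9, 0.2], [1, 0]))
# Always stating 0.5 scores 0.25 regardless of the outcomes.
print(brier_score([0.5, 0.5], [1, 0]))
```

The second call shows a useful reference point: a predictor who always states 0.5 scores 0.25, so scores above 0.25 are worse than expressing no opinion at all.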

Strict Propriety

The Brier Score belongs to the family of strictly proper scoring rules. This means the expected score is uniquely minimized when the predictor reports their true belief. A predictor who systematically overstates or understates confidence will, in expectation, produce a worse score than one who reports honestly.

This is the mathematical foundation of Notch's integrity. No strategy of confidence manipulation improves the expected score — even a mediocre predictor benefits from honest reporting.
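The propriety claim can be checked with a short derivation for a single prediction. The symbol q is introduced here for illustration: it denotes the predictor's true belief that the outcome is 1, while f is the confidence they report.

```latex
\begin{align*}
\mathbb{E}\big[(f - o)^2\big] &= q\,(1 - f)^2 + (1 - q)\,f^2 \\
                              &= (f - q)^2 + q\,(1 - q) \\
\frac{d}{df}\,\mathbb{E}\big[(f - o)^2\big] &= 2\,(f - q) = 0
  \quad\Longrightarrow\quad f = q
\end{align*}
```

The term q(1 - q) does not depend on the report, and (f - q)² is zero only at f = q, so any deviation from the true belief adds a strictly positive penalty to the expected score.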

The Notch Score

The raw Brier Score measures calibration but does not capture the full picture. A predictor with five predictions and a perfect Brier Score is less trustworthy than one with 500 predictions and a strong (but imperfect) Brier Score. We define a composite metric:

NS = α × (1 - BS) + β × A + γ × V(N) + δ × C(t)

Where:

Calibration (α = 0.40)

1 - BS, where BS is the time-weighted Brier Score; subtracting it from 1 makes higher values better. This is the largest component, reflecting the protocol's thesis that knowing how much you know is more informative than raw hit rate.

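Accuracy (β = 0.25)

A = (1/N) × Σ𝟙[prediction i is correct], the accuracy rate: the fraction of predictions that resolved correctly. The formula shows the unweighted form; the rate is computed as an exponential moving average with decay factor 0.95, so recent predictions count more than old ones.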

Volume (γ = 0.20)

V(N) = min(1, log(1+N) / log(1+N_ref)), a volume component that grows logarithmically with the number of predictions and saturates at the reference count N_ref = 500. A longer track record makes the other components statistically more reliable.

Consistency (δ = 0.15)

C(t) ∈ [0,1]. A consistency score based on the regularity of prediction activity over time. Rewards steady predictors over lucky streaks.

These are initial weights and can be adjusted through governance. They sum to 1 and every component lies in [0,1], so NS also lies in [0,1]. The heavy weighting toward calibration (α = 0.40) reflects the protocol's core thesis.
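The composite can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and because the text does not specify how C(t) or the moving-average accuracy are computed, both are taken as precomputed inputs in [0,1]. The consistency value 0.5 in the usage lines is made up for the comparison.

```python
import math

# Initial weights from the text (described as governance-adjustable).
ALPHA, BETA, GAMMA, DELTA = 0.40, 0.25, 0.20, 0.15
N_REF = 500  # reference prediction count at which the volume component saturates


def volume_component(n: int, n_ref: int = N_REF) -> float:
    """V(N) = min(1, log(1+N) / log(1+N_ref))."""
    if n < 0:
        raise ValueError("prediction count must be non-negative")
    return min(1.0, math.log(1 + n) / math.log(1 + n_ref))


def notch_score(brier: float, accuracy: float, n: int, consistency: float) -> float:
    """NS = alpha*(1 - BS) + beta*A + gamma*V(N) + delta*C(t).

    `accuracy` and `consistency` are assumed to be precomputed values in [0, 1].
    """
    return (
        ALPHA * (1.0 - brier)
        + BETA * accuracy
        + GAMMA * volume_component(n)
        + DELTA * consistency
    )


# Five perfect predictions versus 500 strong ones, at equal (hypothetical) consistency.
newcomer = notch_score(brier=0.0, accuracy=1.0, n=5, consistency=0.5)
veteran = notch_score(brier=0.10, accuracy=0.85, n=500, consistency=0.5)
print(newcomer < veteran)
```

With these inputs the 500-prediction wallet ranks above the five-prediction wallet despite its weaker Brier Score, which is the behavior the volume component is meant to produce.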

Difficulty Adjustment

Not all predictions are equally informative. Predicting that Bitcoin will remain above $1,000 in 24 hours is trivially easy; predicting it will exceed $97,542 is substantially harder. The scoring system adjusts for difficulty:

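D = 1 + κ × (|τ - p₀| / p₀) / (σₐ × √(Δt/365))

Where τ is the price threshold named in the prediction, p₀ is the price at commit time, σₐ is recent annualized volatility, and Δt is the prediction horizon in days. κ = 2.0 is a scaling constant. The denominator σₐ × √(Δt/365) is the typical relative price move over the horizon, so the ratio expresses the required move in units of typical moves. A correct prediction on a difficult call (high D) produces a proportionally larger score update.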
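A minimal sketch of the difficulty factor, reading τ as the price threshold named in the prediction (the text does not define it, so this is an assumption). The function and parameter names are hypothetical, and the input values in the usage lines are invented for illustration.

```python
import math

KAPPA = 2.0  # scaling constant from the text


def difficulty(threshold: float, price: float, annual_vol: float,
               horizon_days: float, kappa: float = KAPPA) -> float:
    """D = 1 + kappa * (|tau - p0| / p0) / (sigma_a * sqrt(dt / 365))."""
    if price <= 0 or annual_vol <= 0 or horizon_days <= 0:
        raise ValueError("price, volatility and horizon must be positive")
    required_move = abs(threshold - price) / price               # relative distance to threshold
    typical_move = annual_vol * math.sqrt(horizon_days / 365.0)  # volatility scaled to the horizon
    return 1.0 + kappa * required_move / typical_move


# A threshold at the current price carries no difficulty bonus.
print(difficulty(threshold=100.0, price=100.0, annual_vol=0.5, horizon_days=1.0))
# A 10% move over one year at 50% annualized volatility.
print(difficulty(threshold=110.0, price=100.0, annual_vol=0.5, horizon_days=365.0))
```

As written, the formula depends only on the distance between threshold and price, not on the direction of the claim, and this sketch follows it literally. Any handling of direction would have to come from elsewhere in the protocol.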

Sybil Resistance

The calibration component exposes Sybil wallets. Consider an attacker who creates K wallets that each state confidence c = 0.95 on every prediction. The figures here correspond to an accuracy of 65%: the expected Brier Score is 0.65 × (1 - 0.95)² + 0.35 × (0.95)² ≈ 0.318, a poor score that exposes the wallet as badly miscalibrated despite above-average accuracy. A genuinely skilled predictor with the same accuracy who honestly reports c = 0.65 achieves 0.65 × (1 - 0.65)² + 0.35 × (0.65)² ≈ 0.228, which is measurably better.
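The two figures can be reproduced with a short calculation. The sketch assumes every prediction is stated at the same confidence and that 65% of them resolve correctly; that accuracy is not stated explicitly in the text but is the value that yields both quoted scores. The function name is hypothetical.

```python
def expected_brier(confidence: float, accuracy: float) -> float:
    """Expected Brier Score when every prediction is stated at `confidence`
    and a fraction `accuracy` of the predictions turn out correct."""
    miss_when_right = (1.0 - confidence) ** 2  # squared error on a correct prediction
    miss_when_wrong = confidence ** 2          # squared error on an incorrect prediction
    return accuracy * miss_when_right + (1.0 - accuracy) * miss_when_wrong


overconfident = expected_brier(confidence=0.95, accuracy=0.65)  # about 0.3175
honest = expected_brier(confidence=0.65, accuracy=0.65)         # about 0.2275
print(round(overconfident, 4), round(honest, 4))
```

The gap of roughly 0.09 comes entirely from the overstated confidence, since accuracy is identical in both cases.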

The volume component penalizes wallets with few predictions. The consistency component penalizes wallets that appear suddenly. The staking requirement imposes an economic cost. Together, these make Sybil attacks statistically detectable and economically expensive.