Calibration as a Commodity: A Protocol for Verifiable Trading Skill
The Problem: Reputation is DeFi's Missing Primitive
Decentralized finance has solved an impressive sequence of problems. Uniswap solved trustless exchange. Aave and Compound solved trustless lending. GMX and Hyperliquid solved trustless derivatives. Each of these protocols replaced a financial intermediary with a mechanism — a set of rules enforced by code rather than institutions.
But there is one financial primitive that remains stubbornly centralized: reputation.
Every DeFi market that depends on the quality of human judgment still relies on the same mechanism that traditional finance uses: “trust me, I’m good.” There is no on-chain way to distinguish a skilled trader from a lucky gambler. No cryptographic proof of calibration. No verifiable, manipulation-resistant credential for predictive accuracy.
Copy-trading platforms measure past PnL, which is gameable through survivorship bias, selective reporting, and leverage distortion. A trader who made 10x on a 100x leveraged position looks identical to one who made 10x through genuine insight — until the first black swan. Prediction markets like Polymarket aggregate collective probabilities beautifully, but they don’t decompose individual calibration. They tell you what the crowd thinks; they don’t tell you which members of the crowd are actually well-calibrated.
On-chain analytics platforms (Nansen, Arkham, DeBank) track wallet behavior — what addresses hold, what they traded, when they moved funds. This is observation, not verification. Watching someone trade is not the same as proving they can predict.
This is not a data analytics problem. It is a mechanism design problem: how do you create incentive-compatible, manipulation-resistant, cryptographically verifiable proof that a given agent is well-calibrated in their probabilistic assessments?
Notch is an answer to that question.
Calibration: A Mathematical Definition
A forecaster is calibrated if, among all predictions where they assign probability $p$, the fraction of outcomes that actually occur converges to $p$. When a calibrated forecaster says “70% chance,” the event happens approximately 70% of the time. Not 60%. Not 80%. Seventy.
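This definition can be checked empirically by bucketing a forecaster’s history by stated probability and comparing each bucket’s hit rate to its label. A minimal off-chain sketch (the helper and data are illustrative, not part of the protocol):

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (stated_probability, outcome) pairs by stated probability
    and compute each group's empirical hit rate."""
    buckets = defaultdict(list)
    for p, outcome in predictions:
        buckets[p].append(outcome)
    return {p: sum(o) / len(o) for p, o in buckets.items()}

# A forecaster who says "70%" on ten events, seven of which occur,
# is empirically calibrated at that confidence level.
history = [(0.7, 1)] * 7 + [(0.7, 0)] * 3 + [(0.9, 1)] * 9 + [(0.9, 0)]
table = calibration_table(history)
# table[0.7] == 0.7 and table[0.9] == 0.9
```

In practice the buckets would be probability ranges rather than exact values, but the principle is identical: stated confidence should match realized frequency.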
The standard quantitative measure of calibration is the Brier Score, introduced by Glenn Brier in 1950 for evaluating weather forecasts. For a sequence of $N$ predictions with stated probabilities $p_i \in [0, 1]$ and binary outcomes $o_i \in \{0, 1\}$:

$$BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$
The Brier Score ranges from 0 (perfect) to 1 (maximally wrong). A forecaster who always predicts 50% — the equivalent of saying “I have no idea” — achieves a Brier Score of 0.25. Any score below 0.25 represents positive skill; any score above it represents negative skill (worse than admitting ignorance).
The critical property of the Brier Score for our purposes is that it is strictly proper. A scoring rule is strictly proper if and only if the unique strategy that minimizes the forecaster’s expected score is to report their true beliefs. Formally: for any true belief $q$ about the probability of an event, and any reported probability $p \neq q$:

$$\mathbb{E}_q[BS(p)] > \mathbb{E}_q[BS(q)]$$
This is not an empirical observation. It is a mathematical theorem. No strategy of confidence manipulation — no systematic overstatement, understatement, hedging, or strategic misreporting — improves the expected Brier Score. The only way to achieve a good score is to actually be well-calibrated.
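The theorem is easy to verify numerically. The sketch below (an illustrative check, not protocol code) evaluates the expected Brier Score $\mathbb{E}_q[BS(p)] = q(1-p)^2 + (1-q)p^2$ for every reported probability on a grid against a fixed true belief, and confirms that honest reporting minimizes it:

```python
def expected_brier(reported: float, true_belief: float) -> float:
    # Expected quadratic loss when the event occurs with probability
    # `true_belief` but the forecaster reports `reported`.
    q = true_belief
    return q * (1 - reported) ** 2 + (1 - q) * reported ** 2

q = 0.7
grid = [i / 100 for i in range(101)]
best = min(grid, key=lambda p: expected_brier(p, q))
# best == 0.7: no over- or understatement improves the expected score
```

Repeating the search for any other true belief yields the same result: the minimizer is always the true belief itself, which is exactly what strict propriety asserts.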
This property makes the Brier Score uniquely suitable for on-chain mechanism design. Consider the alternatives:
- Accuracy (% correct) is gameable by always predicting the base rate. In a market that goes up 60% of the time, always predicting “up” yields 60% accuracy with zero skill.
- PnL is gameable through leverage, selective reporting, and survivorship bias. It conflates risk-taking with skill.
- Win rate ignores the magnitude of confidence. Being right 90% of the time on predictions where you said “51%” is indistinguishable from being right 90% on predictions where you said “95%.”
Only strictly proper scoring rules eliminate all of these attack vectors simultaneously. The Brier Score is the simplest member of this family, and its quadratic form makes it computationally tractable for on-chain evaluation in fixed-point arithmetic.
There is an information-theoretic interpretation worth noting. A well-calibrated forecaster is one whose predictions carry maximum information content relative to their stated confidence. Each calibrated prediction reduces the observer’s uncertainty by exactly the amount implied by the stated probability. Miscalibrated predictions either overstate their information content (overconfidence) or understate it (underconfidence). The Brier Score penalizes both directions symmetrically.
The Mechanism: Commit-Reveal Prediction Cycles
The protocol operates in discrete prediction cycles, each passing through three phases. The mechanism is enforced entirely by smart contracts — no human moderation, no appeals process, no admin keys.
Phase 1: Commit
A participant constructs a prediction tuple $(a, d, P_t, c, T)$ where $a$ is the asset identifier, $d$ is the directional call, $P_t$ is the target price, $c$ is the stated confidence, and $T$ is the expiry timestamp. They generate a random salt $s$ and submit the cryptographic commitment:

$$C = \mathrm{keccak256}(a \,\|\, d \,\|\, P_t \,\|\, c \,\|\, T \,\|\, s)$$
The hash $C$ is recorded on-chain with a timestamp. Since keccak256 is a cryptographic hash function with pre-image resistance, an observer who sees $C$ in the mempool or on-chain cannot determine any of the prediction parameters. The salt prevents dictionary attacks against common prediction values.
The commitment is binding: once $C$ is submitted, the predictor cannot produce a different prediction that hashes to the same value. This follows from the collision resistance of keccak256.
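The commit step can be sketched off-chain. Two substitutions to note: Python’s hashlib ships NIST SHA-3 (`sha3_256`), not Ethereum’s keccak256, so it stands in here, and the pipe-delimited encoding of the tuple is illustrative rather than the contract’s actual ABI encoding:

```python
import hashlib
import os

def commit(asset: str, direction: str, target: int, confidence: int,
           expiry: int, salt: bytes) -> bytes:
    """Illustrative commitment: hash the serialized prediction tuple
    plus a random salt. On-chain this would be keccak256 over the
    ABI-encoded parameters; sha3_256 stands in off-chain."""
    payload = f"{asset}|{direction}|{target}|{confidence}|{expiry}".encode()
    return hashlib.sha3_256(payload + salt).digest()

salt = os.urandom(32)  # random salt: defeats dictionary attacks
c = commit("ETH-USD", "up", 4000, 80, 1735689600, salt)

# Reveal phase: recomputing the hash from the disclosed tuple and salt
# matches the commitment; any altered parameter does not.
assert c == commit("ETH-USD", "up", 4000, 80, 1735689600, salt)
assert c != commit("ETH-USD", "down", 4000, 80, 1735689600, salt)
```

The binding property shown by the final assertion is what makes the later reveal trustworthy: the predictor cannot retroactively substitute a different direction, target, or confidence.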
Phase 2: Reveal
After a mandatory delay and before expiry, the participant publishes the full prediction tuple $(a, d, P_t, c, T)$ and the salt $s$ in plaintext. The contract verifies:

$$\mathrm{keccak256}(a \,\|\, d \,\|\, P_t \,\|\, c \,\|\, T \,\|\, s) \stackrel{?}{=} C$$
If the hash matches, the prediction transitions to the REVEALED state and the plaintext parameters are stored on-chain. If the predictor fails to reveal before the expiry $T$, the prediction is marked EXPIRED and scored as a maximally incorrect prediction: an incorrect call made with confidence $c = 1$, the worst possible Brier contribution. This non-reveal penalty eliminates the strategy of committing many predictions and selectively revealing only the favorable ones.
Phase 3: Resolve
At or after the expiry time $T$, any address may trigger resolution. The contract queries the oracle price feed for asset $a$ at the expiry timestamp and obtains the settlement price $P_T$. The binary outcome $o$ is computed: $o = 1$ if the stated direction $d$ relative to the target price $P_t$ is realized at $P_T$, and $o = 0$ otherwise.
The Brier Score contribution $(c - o)^2$ — stated confidence $c$ against binary outcome $o$ — is computed and the predictor’s aggregate score is updated. The prediction lifecycle is complete. The outcome, the score, and the full prediction history are permanent.
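The resolution arithmetic is small enough to sketch in full. The outcome convention below (an “up” call succeeds when the settlement price reaches the target) and the function name are assumptions for illustration, not the contract’s exact logic:

```python
def resolve(direction: str, target: float, settlement: float,
            confidence: float) -> tuple:
    """Compute the binary outcome and this prediction's Brier
    contribution. Outcome convention (assumed for illustration):
    an 'up' call succeeds when settlement >= target."""
    hit = settlement >= target if direction == "up" else settlement < target
    o = 1 if hit else 0
    return o, (confidence - o) ** 2

# "ETH above 4000 with 80% confidence"; ETH settles at 4100 -> correct.
o, contrib = resolve("up", 4000, 4100, 0.80)
# o == 1, contrib == 0.04: small penalty for a confident, correct call.
# Had ETH settled at 3900, the same confidence would cost (0.8 - 0)^2 = 0.64.
```

Note the asymmetry the quadratic loss creates: being confidently wrong (0.64) costs sixteen times more than being confidently right (0.04), which is exactly the pressure that keeps stated confidence honest.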
Security properties
This three-phase mechanism achieves four properties simultaneously:
- Non-retroactivity. The hash commitment prevents changing predictions after observing outcomes. The “delete and repost” strategy that plagues unverified signal channels is structurally impossible.
- Incentive compatibility. The strictly proper scoring rule means honest confidence reporting is the dominant strategy. Gaming is mathematically irrational.
- Verifiability. Every commitment, reveal, and resolution is on-chain and independently auditable. No trusted third party.
- Composability. The resulting score is a primitive that other protocols can query via a standard interface (INotchScore.getScore(address), INotchScore.getCalibration(address)).
The Notch Score: A Reputation Primitive
A single Brier Score on a single prediction is noisy. The Notch Score is the time-weighted, stake-weighted composite that transforms raw prediction outcomes into a stable reputation signal. It is defined as a weighted sum of four components:

$$S = w_C \cdot C + w_A \cdot A + w_V \cdot V + w_K \cdot K$$
Where:
- Calibration ($C$) — $1 - BS$, the Brier Score inverted. The largest component, reflecting the protocol’s thesis that knowing how much you know is more informative than raw hit rate.
- Accuracy ($A$) — Binary correctness tracked as an exponential moving average with 0.95 decay. Recent predictions matter more than distant ones.
- Volume ($V$) — $\min\!\bigl(1, \log(1+n)/\log(1+n_{\mathrm{ref}})\bigr)$, with $n$ the number of resolved predictions and $n_{\mathrm{ref}}$ a reference count. A log-scaled volume component that saturates at the reference count. More predictions provide more statistical significance.
- Consistency ($K$) — An inverse measure of variance in rolling accuracy. A predictor with steady 75% accuracy scores higher on consistency than one who alternates between 95% and 55%, even if their averages are equal. Steady beats lucky.
The heavy weighting toward calibration ($w_C$) is a deliberate design choice. It reflects the protocol’s core thesis: in a world saturated with predictions, the scarce commodity is not directional opinion but honest self-assessment of uncertainty.
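The composite can be sketched as follows. The weights, the reference count `n_ref`, and the consistency transform are placeholder values assumed for illustration — the protocol’s actual constants are fixed in the contract — and only the four-component structure follows the text:

```python
import math

def notch_score(brier: float, ema_accuracy: float, n_predictions: int,
                accuracy_variance: float, n_ref: int = 100,
                weights=(0.4, 0.25, 0.2, 0.15)) -> float:
    """Sketch of the four-component composite. Weights and n_ref are
    illustrative placeholders, with calibration given the largest
    weight as the text specifies."""
    calibration = 1.0 - brier                      # inverted Brier Score
    accuracy = ema_accuracy                        # 0.95-decay EMA, precomputed
    volume = min(1.0, math.log1p(n_predictions) / math.log1p(n_ref))
    consistency = 1.0 / (1.0 + accuracy_variance)  # assumed inverse-variance form
    w_c, w_a, w_v, w_k = weights
    return w_c * calibration + w_a * accuracy + w_v * volume + w_k * consistency

score = notch_score(brier=0.15, ema_accuracy=0.72,
                    n_predictions=250, accuracy_variance=0.01)
```

Because every component is bounded in $[0, 1]$ and the weights sum to 1, the composite itself stays bounded and comparable across participants, which is what makes it usable as an identity primitive.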
The Notch Score is an on-chain identity primitive. It is not tied to a real-world identity but to a wallet address with a provable track record. It is bounded, interpretable, and comparable across participants. And it is queryable by any smart contract through a standard interface:
```solidity
interface INotchScore {
    function getScore(address predictor) external view returns (uint256);
    function getCalibration(address predictor) external view returns (uint256);
    function isActive(address predictor) external view returns (bool);
}
```

This composability is what makes the Notch Score a primitive rather than a product. Consider what it enables:
- DeFi protocols can gate access or adjust risk parameters based on participant calibration.
- Prediction markets can weight contributions by demonstrated calibration quality.
- Copy-trading platforms can offer verifiable skill metrics instead of raw PnL.
- Insurance protocols can price risk more accurately using calibrated human assessors.
- DAOs can weight governance votes by demonstrated decision quality.
None of these applications require trusting Notch. They require only reading an on-chain score that was computed deterministically from verifiable inputs.
Tokenizing Calibration
The $NOTCH token is not a governance token in the conventional sense. It is the unit of account for a new asset class: verified human judgment.
Participants stake $NOTCH to enter prediction cycles. This is not a fee — it is skin in the game. The staking requirement serves two functions: it makes Sybil attacks costly (you cannot build free reputation with zero-cost predictions) and it creates a financial channel through which calibration flows from theory to market.
The economic dynamic is straightforward. Protocol fees — generated from Alpha Pass marketplace trades — are split: 60% to automated buyback and burn, 25% to the protocol treasury, 15% to staking rewards. With a fixed supply of 1 billion tokens and no minting function, the token supply can only decrease over time. The rate of decrease is proportional to protocol usage, creating a deflationary mechanism tied directly to demand for calibrated predictions.
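The split is simple integer arithmetic. A sketch in base units (the function name is illustrative; assigning the remainder to the last bucket to avoid rounding dust is an assumed convention):

```python
def split_fees(fee: int) -> dict:
    """Distribute protocol fees per the stated 60/25/15 split,
    in integer base units to mirror on-chain arithmetic."""
    burn = fee * 60 // 100
    treasury = fee * 25 // 100
    staking = fee - burn - treasury  # remainder: avoids rounding dust
    return {"buyback_burn": burn, "treasury": treasury, "staking": staking}

out = split_fees(1_000_000)
# {'buyback_burn': 600000, 'treasury': 250000, 'staking': 150000}
```

Integer division here mirrors Solidity semantics, where fee math is done in base units and any remainder must be assigned explicitly rather than lost.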
Alpha Passes
An Alpha Pass is an ERC-1155 token granting the holder access to a specific predictor’s next predictions. If a predictor has a consistently high Notch Score, their Alpha Pass becomes valuable because it grants access to high-quality probabilistic assessments.
This creates a secondary market for reputation. The price of an Alpha Pass is set by the predictor and discovered by the market. A predictor with a Notch Score of 0.85 and a thousand resolved predictions has earned the right to charge more than one with a score of 0.60 and fifty predictions. The market decides how much more.
Alpha Indices
For participants who don’t want to evaluate individual predictors, Alpha Indices bundle the prediction streams of the top predictors by Notch Score into a single instrument. This creates index-fund-style exposure to the calibration-as-commodity asset class — diversified access to verified human judgment without the need to pick individual forecasters.
Why Existing Approaches Fail
The calibration problem has not gone unnoticed. Several categories of existing products touch parts of it. None solve it.
Copy-trading platforms
eToro, ZuluTrade, and similar platforms rank traders by PnL. This metric conflates risk-taking with skill, is subject to survivorship bias (you only see the profiles of traders who haven’t blown up yet), and provides no cryptographic verification. A trader can delete losing trades from public history on most platforms. The metric is also structurally gameable: open many small positions with extreme leverage, delete the ones that lose, display the ones that win.
Prediction markets
Polymarket and Kalshi produce elegant aggregate probabilities through continuous double-auction mechanisms. They solve the problem of “what does the crowd believe?” but they do not solve “which members of the crowd are actually skilled?” A participant who deposits $100 and bets correctly on three events is indistinguishable from one who got lucky. There is no persistent, decomposable reputation primitive.
On-chain analytics
Nansen, Arkham, and DeBank track what wallets do — holdings, transactions, interactions with protocols. This is forensic observation: useful for understanding past behavior, but fundamentally different from measuring predictive skill. Buying ETH before it goes up is not the same as predicting that ETH will go up with a stated confidence and being scored on that prediction.
Social trading
Hyperliquid vaults, GMX GLP, and similar structures allow users to deposit capital that is deployed by a designated trader or strategy. This is liquidity provisioning. The depositor trusts the trader’s skill, but that skill is not verified by any mechanism other than historical PnL — which returns us to the survivorship problem.
Notch is not competing with any of these products. It creates a layer that sits upstream of all of them — providing the calibration primitive they all currently lack. A copy-trading platform could use Notch Scores instead of PnL. A prediction market could weight positions by calibration quality. An analytics platform could display verified skill alongside observed behavior. Notch provides the infrastructure; these products provide the interface.
Architecture and Implementation
Notch is deployed on Arbitrum One. The choice is pragmatic: low transaction fees (a full commit-reveal-resolve cycle costs $0.01–$0.03), EVM compatibility for maximum composability, and a mature DeFi ecosystem that provides the liquidity and oracle infrastructure the protocol depends on.
The smart contract architecture consists of nine contracts:
- PredictionEngine — Core commit-reveal-resolve lifecycle with reentrancy protection and pausability.
- NotchScore — Immutable scoring computation with the four-component weighted formula.
- NotchPassMarket — ERC-1155 marketplace for Alpha Passes with 2.5% protocol fee.
- StakingModule — Stake/unstake with 7-day cooldown, anti-Sybil enforcement.
- DualOracle — Chainlink primary + Pyth fallback with staleness detection. 111+ assets supported.
- Treasury — 60/25/15 fee distribution with automated buyback-burn.
- AssetScoreTracker — Per-asset exponential moving average and Brier tracking.
- NotchToken — 1B fixed supply ERC-20. No minting function. Burnable.
- NotchIndex — Bundled predictor indices for diversified access.
Oracle resolution uses a dual-oracle pattern: Chainlink as the primary feed with automatic failover to Pyth Network. Both feeds implement staleness detection — if a price feed hasn’t updated within the configured heartbeat, the contract reverts rather than resolving against stale data. This protects against oracle manipulation and infrastructure failures.
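The failover logic can be sketched as follows. Names and the feed representation are illustrative — real Chainlink and Pyth reads return richer structures than a `(price, updated_at)` pair — but the prefer-primary, check-staleness, revert-if-both-stale flow matches the description above:

```python
class StaleOracle(Exception):
    """Raised when no feed is fresh -- analogous to an on-chain revert."""

def resolve_price(primary, fallback, heartbeat: int, now: int):
    """Dual-oracle read: prefer the primary feed, fall back to the
    secondary, refuse to resolve if both are stale. Each feed is an
    illustrative (price, updated_at) tuple."""
    for price, updated_at in (primary, fallback):
        if now - updated_at <= heartbeat:
            return price
    raise StaleOracle("both feeds stale; refusing to resolve")

now = 1_700_000_000
fresh = (4000_00000000, now - 30)    # 8-decimal price, 30 seconds old
stale = (3990_00000000, now - 7200)  # two hours old

# Primary stale, fallback fresh -> the fallback price is used.
assert resolve_price(stale, fresh, heartbeat=3600, now=now) == fresh[0]
```

The important design point is the final branch: when both feeds fail the staleness check, the only safe behavior is to revert, because resolving a prediction against a stale price is itself a manipulation vector.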
Brier Score computation is performed in 18-decimal fixed-point arithmetic to ensure deterministic results across all execution environments. The scoring function is immutable — the weights are hardcoded into the contract bytecode and cannot be modified by any admin key or governance action. This is a deliberate constraint: the scoring rule’s strict propriety is a mathematical property that holds only for the exact functional form, and governance over the weights would undermine the game-theoretic guarantee.
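The fixed-point computation amounts to squaring a scaled difference and rescaling once. A minimal sketch mirroring the 18-decimal format (the helper name is illustrative):

```python
ONE = 10 ** 18  # 18-decimal fixed point, mirroring the on-chain format

def brier_term_fixed(p_fixed: int, outcome: int) -> int:
    """One prediction's Brier contribution, (p - o)^2, computed
    entirely in integer fixed-point with all values scaled by 1e18."""
    diff = p_fixed - outcome * ONE  # may be negative; the square is not
    return diff * diff // ONE       # rescale after squaring

# 70% confidence, event occurs: (0.7 - 1)^2 = 0.09
assert brier_term_fixed(7 * 10 ** 17, 1) == 9 * 10 ** 16
```

Python’s arbitrary-precision integers sidestep overflow; in Solidity the intermediate `diff * diff` product is the term that must be checked against `uint256` bounds before the rescale.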
The full contract source, including 180 tests (unit, fuzz, and invariant), is available under BUSL-1.1 at github.com/notch-protocol.
The Bigger Picture
Every financial market is ultimately a market for information. The price of a stock reflects aggregated beliefs about future earnings. An interest rate reflects aggregated beliefs about future inflation. A derivatives price reflects aggregated beliefs about future volatility. The quality of these prices depends entirely on the quality of the beliefs that produce them.
Today, the quality of beliefs is invisible. It is bundled into AUM, into credentials, into institutional affiliation, into social media follower counts. A portfolio manager at a $10B fund is assumed to be well-calibrated because they manage $10B. A crypto influencer with 500K followers is assumed to have insight because they have 500K followers. Neither assumption is verifiable.
Notch makes calibration visible, verifiable, and tradeable. It extracts the single most important quality of a financial forecaster — the alignment between their stated confidence and reality — and turns it into an on-chain primitive that anyone can read, compose with, and build on.
This is not a better analytics dashboard. It is a new asset class. One where the scarce resource is not capital, not computation, not liquidity, but proven human judgment.
Polymarket prices events. Notch prices minds.
Companion specification available at Zenodo (DOI: 10.5281/zenodo.19118356). For technical discussion: qais@notch.finance