Methodology v3.2

Independently funded · No affiliate revenue

Last updated April 24, 2026 · Co-signed by Strömberg-Ojeda, Filipovic-Reyes, Fortunato-Webb

This is the working rubric and audit protocol every keystone review, accuracy-ranking article, and use-case-specific evaluation is built against. We publish it in full because a 100-point composite score is only as defensible as the procedure that produced it.

The rubric, in summary

| Criterion | Weight | What we measure |
| --- | --- | --- |
| Measured accuracy (MAPE) | 50% | MAPE against the 50-meal weighed reference battery, with 95% bootstrap CIs. |
| Database verification | 20% | Per-food variance, first-result accuracy, verification visibility against USDA FDC. |
| Reproducibility | 15% | Independent peer-reviewed validation; replication availability. |
| Free-tier usability | 10% | Whether daily logging is feasible without a paid subscription. |
| Pricing | 5% | Annual cost normalized against measured feature parity. |
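To make the weighting concrete, here is a minimal sketch of how the five axis scores combine into the 100-point composite. The weights come from the rubric table; the per-axis 0-100 scores in the example are invented for illustration only.

```python
# v3.2 rubric weights, as published in the table above.
WEIGHTS = {
    "accuracy": 0.50,
    "database": 0.20,
    "reproducibility": 0.15,
    "free_tier": 0.10,
    "pricing": 0.05,
}

def composite(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis 0-100 scores -> 0-100 composite."""
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS)

# Hypothetical app: strong accuracy, generous free tier, middling elsewhere.
example = {"accuracy": 80, "database": 70, "reproducibility": 60,
           "free_tier": 100, "pricing": 50}
print(composite(example))  # 40 + 14 + 9 + 10 + 2.5 = 75.5
```

The composite is linear in the axis scores, so a one-point change on the accuracy axis moves the composite by half a point, while a one-point change on the pricing axis moves it by a twentieth of a point.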

Accuracy axis (50%)

Accuracy is measured as Mean Absolute Percentage Error (MAPE) against a 50-meal weighed reference battery, stratified across three difficulty tiers (16 single-ingredient, 18 composed plate, 16 mixed dish with hidden ingredients). Ground truth is computed from USDA FoodData Central per-component values; ingredients are weighed on a calibrated kitchen scale (precision 0.1 g). Bootstrap 95% CIs are computed with n=10,000 resamples on per-meal absolute percentage errors.
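The percentile-bootstrap procedure above can be sketched as follows. This is an illustrative implementation, not the article's actual pipeline; the per-meal absolute percentage error (APE) values are made up.

```python
import random

def mape(apes: list[float]) -> float:
    """Mean of per-meal absolute percentage errors (APEs), in percent."""
    return sum(apes) / len(apes)

def bootstrap_ci(apes: list[float], n: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap CI for the MAPE: n resamples with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        mape([rng.choice(apes) for _ in apes]) for _ in range(n)
    )
    return stats[int(n * alpha / 2)], stats[int(n * (1 - alpha / 2)) - 1]

# Illustrative APEs for a handful of meals (not real measurements):
apes = [3.1, 7.4, 12.0, 5.5, 9.8, 4.2, 15.3, 6.7]
lo, hi = bootstrap_ci(apes)  # 95% CI brackets the point-estimate MAPE
```

Each resample draws as many meals as the battery contains, with replacement, and the CI endpoints are the 2.5th and 97.5th percentiles of the resampled MAPE distribution.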

The accuracy score on the 100-point composite is anchored at 100 − (overall MAPE × 4), capped at 100, floored at 0. A 5% MAPE earns 80 of 100 accuracy points; 15% MAPE earns 40; 25%+ earns zero. At the 50% rubric weight, accuracy therefore contributes between 0 and 50 points to the composite.
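The anchoring formula is simple enough to state in a few lines of code; this sketch follows the clamp and weight described above exactly, with function names of our own choosing.

```python
def accuracy_score(mape_pct: float) -> float:
    """Rubric anchor: 100 - 4 * MAPE, clamped to [0, 100]."""
    return max(0.0, min(100.0, 100.0 - 4.0 * mape_pct))

def accuracy_points(mape_pct: float) -> float:
    """Accuracy's contribution to the composite at its 50% weight."""
    return 0.50 * accuracy_score(mape_pct)

print(accuracy_score(5.0), accuracy_score(15.0), accuracy_score(25.0))
# 80.0 40.0 0.0
print(accuracy_points(5.0))  # 40.0
```

Note that the floor means every MAPE at or above 25% scores identically: the anchor does not distinguish a 26% app from a 40% one.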

For the underlying methodological framework, see our framework article. For the metric trade-offs, see MAPE vs MAE vs MAD.

Database verification axis (20%)

The database verification audit samples 50 entries per app across four categories (15 whole foods, 15 packaged items, 10 restaurant menus, 10 regional dishes). For each entry we record the top result returned by the app's primary search interface, the calorie value reported, and the deviation from the USDA FDC reference (or restaurant menu official value, or peer-reviewed regional reference).

Three sub-scores are computed: per-food variance (standard deviation across top-result entries for the same food); first-result accuracy (probability that the top result is within ±10% of reference); verification visibility (whether the app exposes per-entry verification status). The composite database score is anchored at the weighted sum of the three sub-scores.
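The three sub-scores can be sketched as below. The text does not specify the weights used to combine them, or how raw variance maps onto a 0-100 scale, so this sketch computes only the raw sub-scores; the audit data shown is invented for illustration.

```python
import statistics

def per_food_variance(top_results: dict[str, list[float]]) -> float:
    """Mean std dev (kcal) across top-result calorie values for each food."""
    return statistics.mean(statistics.pstdev(v) for v in top_results.values())

def first_result_accuracy(pairs: list[tuple[float, float]],
                          tol: float = 0.10) -> float:
    """Fraction of (app value, reference value) pairs within ±10% of reference."""
    return sum(abs(app - ref) / ref <= tol for app, ref in pairs) / len(pairs)

def verification_visibility(flags: list[bool]) -> float:
    """Fraction of audited entries whose verification status is visible."""
    return sum(flags) / len(flags)

# Illustrative audit data, not real measurements:
variance = per_food_variance({"banana": [105, 110, 121], "oatmeal": [150, 150]})
hit_rate = first_result_accuracy([(95, 100), (130, 100), (101, 100)])
visibility = verification_visibility([True, True, False, True])
```

In this toy audit, the 130-kcal entry misses the ±10% band, so the first-result hit rate is 2/3, and three of four entries expose verification status.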

For full audit details, see our database verification article.

Reproducibility axis (15%)

Reproducibility is graded on a three-tier scale; for the tier definitions and the framework rationale, see our replicability article.

Free-tier and pricing axes (10% + 5%)

Free-tier usability is binary: can the user log a full day's intake without a paid subscription? Apps that gate daily logging behind a paywall score zero on this axis; apps whose free tier supports complete daily logging earn full points.

Pricing is the annual cost in USD at the most-common upgrade tier divided by the count of materially-useful features delivered. We do not score "free" apps as 100 on price; a free app with an unusable database or ad-overloaded UX is paid for in time, not money.
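A minimal sketch of these two axes follows. The binary free-tier score and the cost-per-feature division mirror the text above; how cost-per-feature is then mapped onto the 0-100 pricing axis is not specified, so that step is omitted, and the figures shown are hypothetical.

```python
def free_tier_score(daily_logging_free: bool) -> float:
    """Binary axis score: 100 if daily logging is free, else 0."""
    return 100.0 if daily_logging_free else 0.0

def cost_per_feature(annual_usd: float, useful_features: int) -> float:
    """Annual USD cost normalized by materially-useful feature count."""
    return annual_usd / useful_features

# Hypothetical app: free daily logging, $79.99/yr upgrade, 8 useful features.
print(free_tier_score(True))  # 100.0
ratio = cost_per_feature(79.99, 8)
```

Lower cost-per-feature ratios score better on the pricing axis; a $0 app is not automatically a 100, for the reasons given above.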

Test cadence

The next quarterly cross-check is scheduled for July 25, 2026. The next methodology revision (v3.3 or higher) review window opens August 2026; revisions are announced in the changelog with a comparison to the previous rubric.

Version history

v3.2 — April 2, 2026

Refinement of v3.0. Accuracy weight increases to 50% (from 45%); database verification stays at 20%; reproducibility stays at 15%; free-tier increases to 10% (from 5%); pricing decreases to 5% (from 15%). Rationale: pricing differentiates apps less than the editorial team initially weighted; free-tier usability is a more material differentiator. Co-signed by Strömberg-Ojeda, Filipovic-Reyes, Fortunato-Webb.

v3.0 — January 18, 2026

Major revision of v2.1. Introduces explicit reproducibility weight (15%) operationalizing the vendor-vs-independent asymmetry documented in our replicability article. Database verification weight increases to 20% (from 15%). Accuracy weight reduces to 45% (from 50%) to make room for the reproducibility weight.

v2.1 — November 12, 2025

Refinement of v2.0. Accuracy weight 50%; database verification 15%; per-tier accuracy reporting introduced (Tier 1 / Tier 2 / Tier 3 MAPEs). The DAI 2026 study had not yet been published; reference battery was internally constructed.

v2.0 — October 8, 2025

Initial 100-point composite rubric. Accuracy 60%; database 20%; UX 10%; pricing 10%. Replaced by v2.1 within 5 weeks because accuracy weight was implausibly dominant.

v1.0 — August 2025 (pre-publication)

Pre-publication scoping document used during the initial editorial-team recruitment. Not formally a published methodology; superseded by v2.0 at the publication's launch.

Why we publish this

A reader who wants to know why we ranked App X ahead of App Y under v3.2 should be able to read the rubric, understand the weights, follow the audit protocol, and reach the same conclusion. The methodology document is the artifact that makes that possible. If the document does not answer the question, please write to editor@whatsthebestcalorietracking.app; we treat reasoned methodological criticism as a contribution to the rubric.