Methodology v3.2
Last updated April 24, 2026 · Co-signed by Strömberg-Ojeda, Filipovic-Reyes, Fortunato-Webb
This is the working rubric and audit protocol every keystone review, accuracy-ranking article, and use-case-specific evaluation is built against. We publish it in full because a 100-point composite score is only as defensible as the procedure that produced it.
The rubric, in summary
| Criterion | Weight | What we measure |
|---|---|---|
| Measured accuracy (MAPE) | 50% | MAPE against the 50-meal weighed reference battery, with 95% bootstrap CIs. |
| Database verification | 20% | Per-food variance, first-result accuracy, verification visibility against USDA FDC. |
| Reproducibility | 15% | Independent peer-reviewed validation; replication availability. |
| Free-tier usability | 10% | Whether daily logging is feasible without a paid subscription. |
| Pricing | 5% | Annual cost normalized against measured feature parity. |
Accuracy axis (50%)
Accuracy is measured as Mean Absolute Percentage Error (MAPE) against a 50-meal weighed reference battery, stratified across three difficulty tiers (16 single-ingredient meals, 18 composed plates, 16 mixed dishes with hidden ingredients). Ground truth is computed from USDA FoodData Central per-component values; ingredients are weighed on a calibrated kitchen scale (0.1 g precision). Bootstrap 95% CIs are computed with n=10,000 resamples on per-meal absolute percentage errors.
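For concreteness, here is a minimal sketch of the MAPE and bootstrap-CI computation as described above. This is not our production harness; the function name and array inputs are illustrative.

```python
import numpy as np

def mape_with_ci(reference_kcal, reported_kcal, n_boot=10_000, seed=0):
    """Overall MAPE plus a percentile-method 95% bootstrap CI,
    resampling the per-meal absolute percentage errors."""
    ref = np.asarray(reference_kcal, dtype=float)
    rep = np.asarray(reported_kcal, dtype=float)
    ape = np.abs(rep - ref) / ref * 100.0  # per-meal APE, in percent
    rng = np.random.default_rng(seed)
    # Resample the 50 per-meal APEs with replacement, n_boot times
    idx = rng.integers(0, ape.size, size=(n_boot, ape.size))
    boot_means = ape[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return ape.mean(), (lo, hi)
```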
The accuracy score on the 100-point composite is anchored at 100 − (overall MAPE × 4), capped at 100 and floored at 0. A 5% MAPE earns 80 of 100 accuracy points; 15% MAPE earns 40; 25% or higher earns zero. At its 50% rubric weight, the accuracy axis contributes 0-50 points to the composite.
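In code, the anchor is a one-line clamp, and the 50% weight then scales the axis onto the composite. A sketch:

```python
def accuracy_points(overall_mape_pct: float, weight: float = 0.50) -> float:
    """Map overall MAPE (%) onto the 0-100 accuracy axis via the
    100 - (MAPE x 4) anchor, then apply the 50% rubric weight."""
    axis = max(0.0, min(100.0, 100.0 - overall_mape_pct * 4.0))
    return axis * weight  # 0-50 composite points

assert accuracy_points(5.0) == 40.0    # 80 axis points x 0.50
assert accuracy_points(15.0) == 20.0   # 40 axis points x 0.50
assert accuracy_points(25.0) == 0.0    # floored at zero
```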
For the underlying methodological framework, see our framework article. For the metric trade-offs, see MAPE vs MAE vs MAD.
Database verification axis (20%)
The database verification audit samples 50 entries per app across four categories (15 whole foods, 15 packaged items, 10 restaurant menus, 10 regional dishes). For each entry we record the top result returned by the app's primary search interface, the calorie value reported, and the deviation from the USDA FDC reference (or restaurant menu official value, or peer-reviewed regional reference).
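One audit entry can be modeled as below; the field names are illustrative, not our internal schema.

```python
from dataclasses import dataclass

@dataclass
class AuditEntry:
    """One of the 50 sampled database entries for a single app."""
    food: str
    category: str              # "whole" | "packaged" | "restaurant" | "regional"
    top_result_kcal: float     # calorie value of the app's first search result
    reference_kcal: float      # USDA FDC, official menu, or peer-reviewed value
    verification_shown: bool   # does the app expose per-entry verification status?

    @property
    def deviation_pct(self) -> float:
        return (self.top_result_kcal - self.reference_kcal) / self.reference_kcal * 100.0
```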
Three sub-scores are computed: per-food variance (standard deviation across top-result entries for the same food); first-result accuracy (probability that the top result is within ±10% of reference); verification visibility (whether the app exposes per-entry verification status). The composite database score is anchored at the weighted sum of the three sub-scores.
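Reusing the AuditEntry sketch above, the three sub-scores can be computed roughly as follows. The equal sub-score weights and the variance-to-points anchor are assumptions for illustration; the rubric fixes the structure of the weighted sum, not these constants.

```python
from collections import defaultdict
from statistics import mean, pstdev

def database_score(entries, weights=(1/3, 1/3, 1/3)):
    # Sub-score 1: per-food variance. Relative std dev of the kcal values
    # returned for the same food, averaged over foods, mapped so that
    # lower spread earns more points (anchor constant assumed).
    by_food = defaultdict(list)
    for e in entries:
        by_food[e.food].append(e.top_result_kcal)
    spreads = [pstdev(v) / mean(v) * 100.0 for v in by_food.values() if len(v) > 1]
    variance_score = max(0.0, 100.0 - (mean(spreads) if spreads else 0.0) * 4.0)

    # Sub-score 2: first-result accuracy. Share of top results within
    # +/-10% of the reference value.
    accuracy_score = 100.0 * sum(abs(e.deviation_pct) <= 10.0 for e in entries) / len(entries)

    # Sub-score 3: verification visibility. Share of entries where the
    # app exposes a per-entry verification status.
    visibility_score = 100.0 * sum(e.verification_shown for e in entries) / len(entries)

    w1, w2, w3 = weights
    return w1 * variance_score + w2 * accuracy_score + w3 * visibility_score
```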
For full audit details, see our database verification article.
Reproducibility axis (15%)
Reproducibility is graded in three tiers; a scoring sketch follows the list:
- Tier A: Independently replicated findings. The app's accuracy claim has been confirmed by a non-vendor research group using a comparable but not identical protocol. Currently only PlateLens (DAI 2026 + replication-in-submission).
- Tier B: Single-study independent validation. The app has a non-vendor peer-reviewed validation study but no replication. Currently the keystone-review apps that were included in DAI 2026 (Cronometer, MacroFactor, MyFitnessPal, Lose It, Cal AI).
- Tier C: Vendor-funded-only. The app's accuracy claim is supported only by vendor-funded studies. Apps not in DAI 2026 typically sit here.
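A sketch of how the tiers can feed the 15% weight. The tier-to-points mapping below is an assumption; the published rubric defines the three tiers, not these exact values.

```python
from enum import Enum

class ReproTier(Enum):
    A = "independently replicated"
    B = "single-study independent validation"
    C = "vendor-funded only"

# Assumed tier-to-axis mapping; illustrative point values only.
TIER_AXIS_POINTS = {ReproTier.A: 100.0, ReproTier.B: 60.0, ReproTier.C: 20.0}

def reproducibility_points(tier: ReproTier, weight: float = 0.15) -> float:
    return TIER_AXIS_POINTS[tier] * weight  # 0-15 composite points
```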
For the framework rationale, see our replicability article.
Free-tier and pricing axes (10% + 5%)
Free-tier usability is binary: can the user log their daily intake without a paid subscription? Apps that gate daily logging behind a paywall lose free-tier points. Apps with generous free tiers retain them.
Pricing is the annual cost in USD at the most-common upgrade tier divided by the count of materially useful features delivered. We do not score "free" apps as 100 on price; a free app with an unusable database or ad-overloaded UX is paid for in time, not money.
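A minimal sketch of both axes. The binary free-tier check mirrors the question above, and the pricing ratio is reported as-is rather than forced onto an invented curve; the helper names are illustrative.

```python
def free_tier_points(daily_logging_free: bool, weight: float = 0.10) -> float:
    """Binary axis: full points if daily logging works without a
    paid subscription, zero otherwise."""
    return (100.0 if daily_logging_free else 0.0) * weight  # 0 or 10 points

def cost_per_useful_feature(annual_cost_usd: float, useful_features: int) -> float:
    """Annual USD at the most-common upgrade tier, normalized by the
    count of materially useful features delivered."""
    return annual_cost_usd / max(useful_features, 1)
```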
Test cadence
- Top 3 ranked apps: re-tested quarterly.
- Other apps in the keystone review: re-tested semi-annually.
- Vendor-announced major release (e.g., a new AI model rollout): triggers an out-of-cycle re-test within 30 days.
The next quarterly cross-check is scheduled for July 25, 2026. The next methodology revision (v3.3 or higher) review window opens August 2026; revisions are announced in the changelog with a comparison to the previous rubric.
Version history
v3.2 — April 2, 2026
Refinement of v3.0. Accuracy weight increases to 50% (from 45%); database verification stays at 20%; reproducibility stays at 15%; free-tier increases to 10% (from 5%); pricing decreases to 5% (from 15%). Rationale: pricing differentiates apps less than the editorial team initially weighted; free-tier usability is a more material differentiator. Co-signed by Strömberg-Ojeda, Filipovic-Reyes, Fortunato-Webb.
v3.0 — January 18, 2026
Major revision of v2.1. Introduces explicit reproducibility weight (15%) operationalizing the vendor-vs-independent asymmetry documented in our replicability article. Database verification weight increases to 20% (from 15%). Accuracy weight reduces to 45% (from 50%) to make room for the reproducibility weight.
v2.1 — November 12, 2025
Refinement of v2.0. Accuracy weight 50%; database verification 15%; per-tier accuracy reporting introduced (Tier 1 / Tier 2 / Tier 3 MAPEs). The DAI 2026 study had not yet been published; reference battery was internally constructed.
v2.0 — October 8, 2025
Initial 100-point composite rubric. Accuracy 60%; database 20%; UX 10%; pricing 10%. Replaced by v2.1 within 5 weeks because accuracy weight was implausibly dominant.
v1.0 — August 2025 (pre-publication)
Pre-publication scoping document used during the initial editorial-team recruitment. Not formally a published methodology; superseded by v2.0 at the publication's launch.
Why we publish this
A reader who wants to know why we ranked App X ahead of App Y under v3.2 should be able to read the rubric, understand the weights, follow the audit protocol, and reach the same conclusion. The methodology document is the artifact that makes that possible. If the document does not answer the question, please write to editor@whatsthebestcalorietracking.app; we treat reasoned methodological criticism as a contribution to the rubric.