What Does "Accurate" Mean?
Accuracy depends on what you are measuring. Event-level metrics — sensitivity, precision, false positives per hour — describe detection performance at the level of individual cough events.
But clinicians and researchers rely on rates: hourly and daily cough burden. A system can detect events well and still misrepresent the rate that matters.
This page focuses on agreement in rates.
Interactive toggle revealing the appropriate metrics view. Default: Hourly Agreement.
Correlation Is Not Enough
A sensor can be perfectly correlated with reality and still be systematically wrong.
Pearson's r measures linear association — how tightly values move together. It does not measure agreement with the line of perfect equality (y = x).
For measurement systems deployed in clinical contexts, the relevant question is not whether readings co-vary. It is whether they match.
"Do readings match reality, or just move in the same direction?"
Toggle: Pearson line · Line of perfect agreement (y = x) · Display Pearson r and Lin's CCC live.
Pearson rewards tight clustering around any line.
Lin's CCC rewards clustering around the correct line.
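The contrast can be sketched numerically. The following is a minimal illustration (not Hyfe's implementation): a sensor with a constant offset tracks the truth perfectly, so Pearson's r is 1.0, but Lin's CCC penalizes the systematic bias.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient: agreement with y = x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    # Scale/location shifts between x and y reduce CCC even when r = 1.
    return 2 * cov / (vx + vy + (mx - my) ** 2)

truth = np.array([10, 20, 30, 40, 50], float)
biased = truth + 15   # constant offset: moves together, never matches

r = np.corrcoef(truth, biased)[0, 1]
ccc = lins_ccc(truth, biased)
print(f"{r:.3f} {ccc:.3f}")   # → 1.000 0.640
```

The offset leaves the linear association untouched, so Pearson rewards it; the `(mx - my) ** 2` term in the CCC denominator is exactly what charges the sensor for missing the line y = x.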
Agreement at the Hourly Level
Each data point represents one person-hour of continuous recording. Ground truth is established through independent dual annotation and adjudication — not automated labeling.
High concordance indicates both precision and minimal systematic bias: the two properties that matter for clinical deployment.
Hover individual person-hours · Toggle: regression line, y = x, residuals · Secondary tab: Bland–Altman view
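The Bland–Altman view mentioned above summarizes agreement through the bias (mean difference) and 95% limits of agreement. A minimal sketch with hypothetical person-hour counts, not the study's data:

```python
import numpy as np

def bland_altman(truth, measured):
    """Bias and 95% limits of agreement for a Bland-Altman view."""
    d = np.asarray(measured, float) - np.asarray(truth, float)
    bias = d.mean()                 # systematic over- or under-count
    sd = d.std(ddof=1)              # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical person-hours: annotated count vs. system count.
truth    = [12, 45, 3, 60, 22, 9, 31]
measured = [11, 47, 3, 57, 24, 9, 30]
bias, lo, hi = bland_altman(truth, measured)
```

A bias near zero indicates no systematic drift, and narrow limits of agreement indicate the per-hour scatter is small — the two properties the scatter plot and residual toggle are meant to surface.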
Performance Across Individuals
Aggregate statistics summarize 23 participants into a single number. This chart shows every participant individually — ground truth count and Hyfe's count side by side, sorted by cough burden.
Agreement holds across the full range: from participants with fewer than 100 coughs to those with nearly 1,000.
Event-Level Context
For completeness, event detection metrics are reported here. These describe model behavior at the level of individual cough events rather than cough rates.
Event metrics inform model behavior. Rate-level concordance informs clinical utility.
| Metric | Value | Definition |
|---|---|---|
| Sensitivity | 95% CI 88.3–92.2% | Proportion of true coughs detected |
| Positive Predictive Value | 95% CI 81.9–91.6% | Proportion of detections that are true coughs |
| False Positives / Hour | 95% CI 0.84–1.24 | Non-cough events classified as coughs per hour |
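These definitions reduce to simple ratios over confusion counts. A minimal sketch, using illustrative counts that are not the study's actual data:

```python
def event_metrics(tp, fn, fp, hours):
    """Point estimates for the event-level metrics above."""
    sensitivity = tp / (tp + fn)   # true coughs detected / all true coughs
    ppv = tp / (tp + fp)           # true coughs / all detections
    fp_per_hour = fp / hours       # false detections per recorded hour
    return sensitivity, ppv, fp_per_hour

# Hypothetical counts for illustration only.
sens, ppv, fph = event_metrics(tp=902, fn=98, fp=120, hours=115.0)
```

Note that sensitivity and PPV trade off through the detection threshold: lowering it recovers more true coughs at the cost of more false positives per hour.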
How "Reality" Was Defined
Ground truth is not assumed — it is constructed through a rigorous annotation protocol. Human annotation establishes the reference standard, but also defines the ceiling of agreement any automated system can theoretically achieve.
View Methodology Details
The annotation protocol is designed to minimize inter-rater variability and ensure that the reference standard reflects genuine cough burden rather than labeling artifacts.
- Annotators operate under standardized labeling guidelines developed from prior clinical audio annotation work.
- Inter-rater agreement is measured using Cohen's κ prior to adjudication; sessions with low agreement are flagged for re-review.
- Recordings span a representative range of acoustic environments — home, workplace, and public settings — to ensure that ground truth reflects the real-world distribution Hyfe's system encounters at deployment.
- Cough seconds, rather than cough counts, are used as the primary unit because they are more robust to the boundary ambiguity inherent in longer cough bouts.
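The inter-rater check above can be sketched for binary per-second labels. This is a generic Cohen's κ computation under assumed binary labels, not the protocol's actual tooling:

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary annotators (e.g. per-second cough labels)."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                         # observed agreement
    p_a1, p_b1 = a.mean(), b.mean()              # marginal rates of label 1
    pe = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)   # agreement expected by chance
    return (po - pe) / (1 - pe)

# Two annotators labeling the same eight seconds (illustrative).
ann_a = [1, 1, 0, 0, 1, 0, 1, 0]
ann_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(ann_a, ann_b)
```

κ corrects raw percent agreement for the agreement two annotators would reach by chance alone, which matters for cough labels because most seconds in a recording contain no cough.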
Implications for Drug Development
Hyfe's validation approach is designed to meet the evidentiary standards expected by regulatory reviewers and pharma endpoints committees. The metrics reported here are those that determine whether a measurement instrument is fit for purpose — not merely whether it functions.