ML-DQA for Healthcare
packaging errors that cause duplication or missing values.20 Many similar tools and frameworks
exist and are described in a detailed 2019 review by Zhang et al.21
However, while many
approaches are technically sophisticated, they do not address the core data quality challenges
described in this study. The flaws in completeness, conformance, and plausibility addressed by
the ML-DQA framework are not adversarial. These are inherent flaws within data generating
processes and are not introduced by an outside actor seeking to perturb an ML model. Similarly,
these data quality flaws do not represent skew between training and testing. The flaws are
pervasive and occur across cohorts, geographies, and disease conditions.
The current study ties together these relevant bodies of work and bridges clinical research
best practices to ML in healthcare. Pharmaceutical companies, for example, have long placed a
premium on high-quality RWD, demonstrated by the $2 billion acquisition of Flatiron Health by
Roche in 2018 and recent investments in ConcertAI ($150M series C), Aetion ($110M series C),
and Verana Health ($150M series E).22 The ML-DQA framework validated in this study brings
the same level of rigor to ML in healthcare. We hope this study prompts further dialogue and
research to streamline the ML-DQA process and cultivates expertise in the ML for healthcare
community. Similarly, we expect the different documentation artifacts generated throughout the
ML-DQA framework to help close the accountability gap in ML for healthcare.23 Rather than
advocating for a single dataset label, we present different sets of documentation completed by
different individuals with different expertise in an iterative fashion to optimize data quality. The
data dictionary, data element groupings, data quality checks, ML-DQA reports, and adjudication
files collectively document the process. Lastly, we present tables and example reports that
can be rapidly implemented for new projects. A team developing a new ML for healthcare model
can apply the ML-DQA frameworks in Table 4 and Table 5 to map a set of data elements to
rules-based transformations and data quality checks to report out to a reviewing clinician. In
future work, we hope to further automate the process for performing data element pre-processing,
assigning quality checks, and generating quality reports.
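The mapping described above, from data elements to rule-based data quality checks that roll up into a report for a reviewing clinician, can be sketched in Python. This is a minimal illustration only: the element names, thresholds, and report columns below are hypothetical assumptions, not the actual contents of Table 4 or Table 5.

```python
import pandas as pd

# Hypothetical mapping of data elements to rule-based quality checks,
# in the spirit of Tables 4 and 5. Names and thresholds are illustrative.
CHECKS = {
    "age_years":  {"domain": "plausibility", "rule": lambda s: s.between(0, 120)},
    "heart_rate": {"domain": "plausibility", "rule": lambda s: s.between(20, 300)},
    "sex":        {"domain": "conformance",  "rule": lambda s: s.isin(["F", "M"])},
}

def ml_dqa_report(df: pd.DataFrame) -> pd.DataFrame:
    """Run completeness plus per-element checks; return a summary table
    that a reviewing clinician could adjudicate."""
    rows = []
    for col, spec in CHECKS.items():
        series = df[col]
        n_missing = int(series.isna().sum())          # completeness check
        present = series.dropna()
        n_failing = int((~spec["rule"](present)).sum())  # conformance / plausibility
        rows.append({"element": col, "domain": spec["domain"],
                     "n_missing": n_missing, "n_failing": n_failing})
    return pd.DataFrame(rows)

# Example: a toy cohort with one implausible age and one missing heart rate.
cohort = pd.DataFrame({
    "age_years":  [34, 250, 61],
    "heart_rate": [72, None, 88],
    "sex":        ["F", "M", "M"],
})
report = ml_dqa_report(cohort)
```

In practice the check registry would be generated from the project's data dictionary and data element groupings rather than hand-coded, which is the automation direction noted above.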
4.1 Limitations
The current study has several limitations. First, while we embrace a consensus definition
of data quality developed by research networks across the United States, data quality is in the eye
of the beholder. The domains of completeness, conformance, and plausibility may not capture all
aspects of data quality and new projects may need to extend the ML-DQA framework. Similarly,
all projects included a single clinical expert and data quality determinations may differ if
evaluated through a consensus process engaging a greater number of domain experts. Second, the
ML-DQA framework may not generalize to all ML for healthcare model development efforts. We
tried to minimize this risk by applying the framework to projects across two sites using cohorts of
different ages for different medical conditions, but there may be project-specific challenges and
nuances that emerge in new settings. All projects presented in the current study used structured
EHR data sourced from instances of Epic Systems. ML products built using EHR data sourced
from a different software vendor or using unstructured notes or images may also require
adaptation of the ML-DQA framework. We tried to minimize this risk by building upon the
PCORI data quality framework, which was developed to be EHR-vendor agnostic and has been
used on datasets that span all major EHR vendor systems.
Third, while we quantify the number of checks, number of reports, number of people
involved, and number of flaws identified and addressed, these metrics do not measure the full
costs of implementing ML-DQA or the value created or captured by conducting ML-DQA. We
did not account for personnel effort to conduct ML-DQA across projects, although prior work
does highlight the significant costs associated with curating data for EHR models.24 There is
growing consensus that data quality labor is important and should be prioritized, but high-quality
data may not create immediate value. Pharmaceutical companies expect to reap significant