Contains Nonbinding Recommendations
Clinical Performance Assessment:
Considerations for Computer-Assisted
Detection Devices Applied to Radiology
Images and Radiology Device Data in
Premarket Notification (510(k))
Submissions
Guidance for Industry and Food and
Drug Administration Staff
Document issued on September 28, 2022.
Document originally issued on July 3, 2012.
This document supersedes Clinical Performance Assessment: Considerations
for Computer-Assisted Detection Devices Applied to Radiology Images and
Radiology Device Data in Premarket Notification (510(k)) Submissions issued
January 22, 2020.
For questions regarding this document, please contact [email protected].
U.S. Department of Health and Human Services
Food and Drug Administration
Center for Devices and Radiological Health
Preface
Public Comment
You may submit electronic comments and suggestions at any time for Agency consideration to
https://www.regulations.gov. Submit written comments to the Dockets Management Staff, Food
and Drug Administration, 5630 Fishers Lane, Room 1061, (HFA-305), Rockville, MD 20852.
Identify all comments with the docket number FDA-2009-D-0503. Comments may not be acted
upon by the Agency until the document is next revised or updated.
Additional Copies
Additional copies are available from the Internet. You may also send an email request to CDRH-
[email protected] to receive a copy of the guidance. Please include the document number
1698 and complete title of the guidance in the request.
Table of Contents
I. Introduction
II. Scope
III. Rationale
IV. Clinical Study Design
   A. Evaluation Paradigm and Study Endpoints
   B. Control Arm
   C. Reading Scenarios and Randomization
   D. Rating Scale
   E. Scoring
   F. Training of Clinical Readers
V. Study Population
   A. Data Poolability
   B. Test Data Reuse
VI. Reference Standard
VII. Reporting
Appendix
   A. Potential Sources of Bias in a Retrospective Reader Study
   B. Additional References
Clinical Performance Assessment:
Considerations for Computer-Assisted
Detection Devices Applied to Radiology
Images and Radiology Device Data in
Premarket Notification (510(k))
Submissions
Guidance for Industry and Food and
Drug Administration Staff
This guidance represents the current thinking of the Food and Drug Administration (FDA or
Agency) on this topic. It does not establish any rights for any person and is not binding on FDA
or the public. You can use an alternative approach if it satisfies the requirements of the
applicable statutes and regulations. To discuss an alternative approach, contact the FDA staff
or Office responsible for this guidance as listed on the title page.
I. Introduction
This guidance document provides recommendations to industry, systems and service providers,
consultants, FDA staff, and others regarding clinical performance assessment of computer-
assisted detection (CADe [1]) devices applied to radiology images and radiology device data (often referred to as "radiological data" in this document). CADe devices are computerized systems that
incorporate pattern recognition and data analysis capabilities (i.e., combine values,
measurements, or features extracted from the patient radiological data) intended to identify,
mark, highlight, or in any other manner direct attention to portions of an image, or aspects of
radiology device data, that may reveal abnormalities during interpretation of patient radiology
images or patient radiology device data by the intended user (i.e., a physician or other health care
professional), referred to as the “clinician” in this document. We have considered the
recommendations on documentation and performance testing for CADe devices made during the
public meetings of the Radiology Devices Panel on March 4-5, 2008, [2] and November 17-18, 2009. [3] We have also considered the public comments received on the draft guidance announced in the Federal Register on October 21, 2009 (74 FR 54053). [4]

[1] The use of the acronym CADe for computer-assisted detection may not be a generally recognized acronym in the community at large. It is used here to identify the specific type of devices discussed in this document.
In addition, FDA has issued a final order reclassifying medical image analyzers applied to
mammography breast cancer, ultrasound breast lesions, radiograph lung nodules, and radiograph
dental caries into Class II (special controls) and subject to premarket notification (510(k))
requirements. [5] This guidance provides recommendations that may be useful for compliance with
the special controls codified in 21 CFR 892.2070(b)(1) and noted in italic font for clarity in the
guidance.
This guidance also is being revised with minor updates to reflect the issuance of the final rule, "Medical Devices; Medical Device Classification Regulations To Conform to Medical Software Provisions in the 21st Century Cures Act" (86 FR 20278). [6]
In general, FDA’s guidance documents do not establish legally enforceable responsibilities.
Instead, guidances describe the Agency’s current thinking on a topic and should be viewed only
as recommendations, unless specific regulatory or statutory requirements are cited. The use of
the word "should" in Agency guidances means that something is suggested or recommended,
but not required.
II. Scope
This document provides guidance regarding clinical performance assessment studies for CADe
devices applied to radiology images and radiology device data and CADe devices applied to
mammography breast cancer, ultrasound breast lesions, radiograph lung nodules, and radiograph
dental caries. Radiological data include, for example, images that are produced during patient
examination with ultrasound, radiography, magnetic resonance imaging (MRI), computed
tomography (CT), positron emission tomography (PET), and digitized film images. As stated
above, CADe devices are computerized systems intended to identify, mark, highlight, or in any
other manner direct attention to portions of an image, or aspects of radiology device data, that
may reveal abnormalities during interpretation of patient radiology images or patient radiology
device data by the clinician. This guidance applies to CADe devices, including when a CADe
device is part of a combined system, such as the detection portion of combined computer-aided detection and diagnostic devices.

[2] Available at https://wayback.archive-it.org/7993/20170405191951/https:/www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/MedicalDevices/MedicalDevicesAdvisoryCommittee/RadiologicalDevicesPanel/ucm124890.htm.
[3] Available at https://wayback.archive-it.org/7993/20170405191949/https:/www.fda.gov/AdvisoryCommittees/CommitteesMeetingMaterials/MedicalDevices/MedicalDevicesAdvisoryCommittee/RadiologicalDevicesPanel/ucm147063.htm.
[4] See 74 FR 54053 at https://www.federalregister.gov/documents/2009/10/21/E9-25233/draft-guidances-for-industry-and-food-and-drug-administration-staff-computer-assisted-detection.
[5] See "Radiology Devices; Reclassification of Medical Image Analyzers" (85 FR 3545) at https://www.federalregister.gov/documents/2020/01/22/2020-00494/radiology-devices-reclassification-of-medical-image-analyzers.
[6] See 86 FR 20278 at https://www.federalregister.gov/documents/2021/04/19/2021-07860/medical-devices-medical-device-classification-regulations-to-conform-to-medical-software-provisions.
This guidance document applies to the CADe devices classified under 21 CFR 892.2050
“medical image management and processing system” and to devices classified under 21 CFR
892.2070 “Medical image analyzer” including the following product codes:
· NWE (Colon computed tomography system, computer aided detection);
· OEB (Lung computed tomography system, computer-aided detection);
· OMJ (Chest X-ray computer aided detection); and
· MYN (Medical image analyzer).
The Agency intends to create new product codes to identify and track new types of CADe
products as necessary. [7]
By design, a CADe device can be a unique detection scheme specific to only one type of
potential abnormality, or a combination or bundle of multiple parallel detection schemes, each
one specifically designed to detect one type of potential abnormality revealed in the patient
radiological data. Examples of CADe devices that fall within the scope of this guidance include:
· a CADe algorithm designed to identify and prompt microcalcification clusters and
masses on digital mammograms;
· a CADe device designed to identify and prompt colonic polyps on CT colonography
studies;
· a CADe designed to identify and prompt filling defects on thoracic CT examination; and
· a CADe designed to identify and prompt brain lesions on head MRI studies.
We may consider a CADe algorithm to be modified, for the purposes of this guidance, if, for
example, a change has been made to the image processing components, features, classification
algorithms, training methods, training data sets, or algorithm parameters. See FDA’s guidance
entitled "Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions" [8] for information on CADe 510(k) submissions, including when a clinical performance assessment may be recommended for a CADe 510(k) submission.
This guidance does not apply to (1) clinical performance assessment studies for CADe devices
that are intended for use during intra-operative procedures, (2) radiological computer-assisted
diagnostic (CADx) devices (classified under 21 CFR 892.2060), (3) computer-assisted
detection/diagnosis (CADe/x) devices (classified under 21 CFR 892.2090), (4) computer-assisted
acquisition/optimization (CADa/o) devices (classified under 21 CFR 892.2100), or (5)
radiological computer-assisted triage and notification (CADt) devices (classified under 21 CFR 892.2080), whether marketed as unique devices or bundled with a CADe device that, by itself, may be subject to this guidance. The guidance also does not apply to the many non-CADe devices that are also classified under 21 CFR 892.2050, including, for example, product codes LLZ (System, Image Processing, Radiological) and NFJ (System, Image Management, Ophthalmic). Therefore, most basic radiological image processing and image acquisition products fall outside the scope of this guidance document.

[7] For more information on product codes, please refer to FDA's guidance entitled "Medical Device Classification Product Codes" available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/medical-device-classification-product-codes-guidance-industry-and-food-and-drug-administration-staff.
[8] Available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/computer-assisted-detection-devices-applied-radiology-images-and-radiology-device-data-premarket.
For any of these types of devices, we recommend that you contact the Agency to seek feedback
on questions related to the regulatory pathways, regulatory requirements, and recommendations
about nonclinical and clinical data.
III. Rationale
Devices classified as medical image analyzers under 21 CFR 892.2070 are required to comply
with the special controls under 21 CFR 892.2070(b), including the requirement for a detailed
description of pre-specified performance testing methods and dataset(s) used to assess whether
the device will improve reader performance as intended in their design verification and
validation. [9]
Medical image analyzers are also required to include results from performance testing that
demonstrate that the device improves reader performance in the intended use population when
used in accordance with the instructions for use. The performance assessment must be based on
appropriate diagnostic accuracy measures. The test dataset must contain a sufficient number of
cases from important cohorts (e.g., subsets defined by clinically relevant confounders, effect
modifiers, concomitant diseases, and subsets defined by image acquisition characteristics) such
that the performance estimates and confidence intervals of the device for these individual subsets
can be characterized for the intended use population and imaging equipment. [10] The recommendations that follow may be useful in demonstrating compliance with those special controls.
Recommendations: The following sections provide recommendations on the clinical study
design, methods, study population, and reporting for evaluating CADe systems. This guidance
provides recommendations as to how you should design and conduct your clinical performance
assessment studies (i.e., well-controlled clinical investigations) for your CADe device. These
studies may be part of your 510(k) submission to FDA. The recommendations in this document
are meant to guide you as you develop and test your CADe device; they are not meant to specify
the full content or type of premarket submission that may be applicable to your device. [11] If you would like the Agency's advice about the classification and the regulatory requirements that may be applicable to your device, you may submit a request under section 513(g) of the Federal Food, Drug, and Cosmetic Act (the FD&C Act). [12]

[9] 21 CFR 892.2070(b)(1)(ii).
[10] 21 CFR 892.2070(b)(1)(iii).
[11] A 510(k) submission is the most common submission type for the CADe devices addressed in this guidance. The guidance "Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions" provides additional information on the device description and standalone performance assessment for 510(k) submissions.
We recommend that you request the Agency’s review of your protocols prior to initiating your
standalone performance assessment and clinical performance assessment studies for your CADe
device. To request the Agency’s review of your protocols, you may submit a Pre-Submission to
the Agency. [13]
IV. Clinical Study Design
The clinical performance assessment of a CADe device is intended to demonstrate the clinical
safety and effectiveness of your device for its intended use, when used by the intended user and
in accordance with its proposed labeling and instructions.
As described above in the scope, a CADe device, by design, is intended to identify data that may
reveal abnormalities during interpretation of patient radiology images or data by the clinician.
There is a complex relationship between the CADe output and the clinician such that clinical
performance may depend on a variety of factors that should be considered in any study design
including:
· timing of CADe application in the interpretive process (e.g., concurrent or second read);
· physical characteristics of the CADe mark, i.e., size and shape, type of boundary (e.g.,
solid, dashed, circle, isocontour), and proximity of the CADe mark to the abnormality;
· user’s knowledge of the type of abnormalities that the CADe is designed to mark; and
· number of CADe marks.
Your clinical performance assessment should use a well-controlled study design to preclude or
limit various biases that might impact conclusions on the device’s safety or effectiveness (see
Appendix). We especially recommend this when your clinical performance assessment is
performed in a laboratory setting (i.e., off site of the clinical arena) since it is difficult to
duplicate the clinical environment in the laboratory. Some study designs that may be utilized to
assess your CADe device include:
· A field test or prospective reader study (e.g., randomized controlled trial) that evaluates a
device in actual clinical conditions. A field test may not be practical in situations, for
example, where there is very low disease prevalence that may necessitate enrollment of
an excessively large number of patients.
· A retrospective reader study consisting of a retrospective case collection enriched with diseased/abnormal cases is a possible surrogate for a field test.
· A stress test is another option for the clinical performance assessment of some CADe devices. A stress test is a retrospective reader study enriched with patient cases that contain more challenging imaging findings (or other image data) than normally seen in routine clinical practice but that still fall within the device's intended use population (see Section V. Study Population). Note that the use of sample enrichment will likely alter reader performance in the trial compared with clinical practice because of the differences in disease prevalence (and case difficulty for stress testing) between the trial and clinical practice.

[12] Section 513(g) of the FD&C Act (21 U.S.C. 360c(g)) provides a means for obtaining the Agency's views about the classification and the regulatory requirements that may be applicable to your device.
[13] For more information on Pre-Submissions, see FDA's guidance entitled "Requests for Feedback and Meetings for Medical Device Submissions: The Q-Submission Program" available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/requests-feedback-and-meetings-medical-device-submissions-q-submission-program.
The clinical performance assessment of CADe devices is typically performed by utilizing a
multiple reader multiple case (MRMC) study design, where a set of clinical readers (i.e.,
clinicians evaluating the radiological images or data in the MRMC study) evaluate image data
under multiple reading conditions or modalities (e.g., readers unaided versus readers aided by
CADe). The MRMC design can be “fully-crossed” whereby all readers independently read all of
the cases. This design offers the greatest statistical power for a given number of cases. However,
non-fully crossed study designs may be acceptable, for example in prospective studies where
interpretations of the same patient data by multiple clinicians may not be feasible.
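To make the fully-crossed MRMC layout concrete, the following minimal Python sketch (illustration only, not part of the guidance; the reader count, case count, simulated ratings, and helper function are assumptions) enumerates every reader-case-modality combination and computes an empirical, Wilcoxon-type AUC for each reader under each reading condition:

# Minimal sketch of a fully-crossed MRMC layout: every reader reads every case
# under both reading conditions. All names and the simulated ratings are assumptions.
import itertools
import random

readers = [f"reader_{i}" for i in range(1, 7)]        # e.g., 6 readers (assumed)
cases = [f"case_{j:03d}" for j in range(1, 101)]      # e.g., 100 cases (assumed)
modalities = ["unaided", "CADe-aided"]

# Fully crossed: one reading per (reader, case, modality) combination.
design = list(itertools.product(readers, cases, modalities))
assert len(design) == len(readers) * len(cases) * len(modalities)

def empirical_auc(diseased_scores, normal_scores):
    """Probability that a diseased case is rated above a normal case (ties count 1/2)."""
    wins = 0.0
    for d in diseased_scores:
        for n in normal_scores:
            wins += 1.0 if d > n else (0.5 if d == n else 0.0)
    return wins / (len(diseased_scores) * len(normal_scores))

# Simulated ratings stand in for real reader data solely to show the bookkeeping.
random.seed(1)
truth = {c: (1 if i < 30 else 0) for i, c in enumerate(cases)}   # 30 diseased cases (assumed)
ratings = {
    (r, c, m): random.random() + (0.3 + (0.1 if m == "CADe-aided" else 0.0)) * truth[c]
    for (r, c, m) in design
}

for reader in readers:
    for modality in modalities:
        dis = [ratings[(reader, c, modality)] for c in cases if truth[c] == 1]
        nor = [ratings[(reader, c, modality)] for c in cases if truth[c] == 0]
        print(reader, modality, round(empirical_auc(dis, nor), 3))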
Whether you decide on a fully-crossed study design or not, we recommend the use of an MRMC
evaluation paradigm to assess the clinical performance of a CADe device using one of the study
designs described above. A complete clinical study design protocol should be included in your
submission. Pre-specification of the statistical analysis is a key factor for obtaining consistent
and convincing scientific evidence. We recommend that you provide:
· a description of the study design;
· a description of how the imaging data are to be collected (e.g., make and model of the
imaging device and the imaging protocol);
· a copy of the protocol, including the following:
o hypothesis to be tested and study endpoints,
o plans for checking any assumptions required to validate the tests,
o alternative procedures/tests to be used if the required assumptions are not met,
o study success criteria that indicate which hypotheses should be met in order for
the clinical study to be considered a success,
o statistical and clinical justification of the selected case sample size,
o statistical and clinical justification of the selected number of readers,
o image interpretation methodology and relationship to clinical practice,
o randomization methods, and
o reader task including rating scale used (see Section IV.D. Rating Scale);
· the reader qualifications and experience;
· a description of the reader training; and
· a statistical analysis plan (i.e., endpoints, statistical methods) with description of:
o the process for defining truth (see Section VI. Reference Standard),
o the details of the scoring technique used (see Section IV.E. Scoring), and
o any results from a pilot study supporting the proposed design.
Valid estimation of clinical performance for CADe devices is dependent upon sound study
design. Aspects of sound clinical study design should include the following:
· study populations (both diseased and normal cases) are appropriately representative of or
sampled from the intended use population;
· study design avoids confounding of the CADe effect (e.g., reading session effects);
· sample size is sufficient to demonstrate performance claims;
· truth definition is appropriate for assessment of performance, and uncertainty in the
reference standard is correctly accounted for in the study analysis, if applicable;
· appropriate data cohorts are represented in the data set;
· readers are selected such that they are representative of the intended population of clinical users; and
· imaging hardware is selected such that it is consistent with current clinical practice.
A. Evaluation Paradigm and Study Endpoints
You should select study endpoints to demonstrate that your CADe device is effective (i.e., in a
significant portion of the target population, the use of the device for its intended uses and
conditions of use, when accompanied by adequate directions for use and warnings against unsafe
use, will provide clinically significant results). [14] Selection of the primary and secondary endpoints will depend on the intended use of your device and should be fixed prior to initiating
your evaluation. Performance metrics based on the receiver operating characteristic (ROC) curve
or variant of ROC (e.g., free-response receiver operating characteristic (FROC) curve or
location-specific receiver operating characteristic (LROC) curve), in addition to sensitivity (Se)
and specificity (Sp) at a clinical action point will be likely candidates as endpoints. An ROC-
based endpoint allows evaluation of the device over a range of operating points. An ROC curve
is a plot of Se versus 1-Sp and is a summary of diagnostic performance of a device or a clinician.
An FROC curve is a plot of Se versus the number of false positive marks per image set. FROC
metrics summarize diagnostic performance when disease location and multiple disease sites per
patient are accounted for in the analysis. [15] Reporting the Se and Sp pair allows for the evaluation
of the device at a clinical threshold or cut-point the reader would act upon. Se is defined as the
probability that a test is positive for a population of patients with the
disease/condition/abnormality while Sp is defined as the probability that the test is negative for a
population of normal patients (i.e., patients without the disease/condition/abnormality). [16]
Se and Sp estimates should be based on an explicit clinical determination by the clinical readers (e.g., recall or no recall), not indirectly from ratings data used to generate the ROC curves. Data collection for ROC, Se, and Sp can be done simultaneously within a single reader study. An example reading scenario could be to first have the readers give a clinically justified binary response for Se and Sp evaluation and then to immediately follow with a rating response consistent with the binary response to be used for ROC evaluation.

[14] See 21 CFR 860.7(e).
[15] For additional details on these assessment paradigms, see Wagner, R.F., Metz, C.E., and Campbell, G., Assessment of medical imaging systems and computer aids: A tutorial review. Acad. Radiol., 2007. 14(6):723-48. https://doi.org/10.1016/j.acra.2007.03.001.
[16] Altman, D.G., Practical Statistics for Medical Research. Chapman & Hall, London, 1991. https://doi.org/10.1002/sim.4780101015.
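As one illustration of these definitions (not an FDA-specified method), the short Python sketch below computes Se and Sp from explicit recall decisions and traces an empirical ROC curve, with its trapezoidal area, from the same cases' ratings; the 7-point scale, the fabricated data, and the threshold sweep are all assumptions:

# Illustrative sketch with fabricated data: Se/Sp from explicit recall decisions and
# an empirical ROC curve (Se versus 1-Sp) from a 7-point suspicion rating scale.
import numpy as np

truth = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 1])      # 1 = diseased, 0 = normal (assumed)
ratings = np.array([6, 4, 7, 2, 3, 5, 1, 4, 2, 6])     # assumed 7-point suspicion ratings
recall = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])      # explicit recall / no-recall decision

# Se and Sp come from the explicit clinical determination, not from the ratings.
se = recall[truth == 1].mean()                          # P(recall | diseased)
sp = (1 - recall[truth == 0]).mean()                    # P(no recall | normal)

# Empirical ROC: sweep a decision threshold across the rating scale.
roc_points = []
for t in range(int(ratings.max()), int(ratings.min()) - 1, -1):
    call_positive = ratings >= t
    tpf = call_positive[truth == 1].mean()              # Se at this threshold
    fpf = call_positive[truth == 0].mean()              # 1 - Sp at this threshold
    roc_points.append((float(fpf), float(tpf)))

# Trapezoidal area under the empirical curve, anchored at (0, 0) and (1, 1).
pts = sorted(set([(0.0, 0.0)] + roc_points + [(1.0, 1.0)]))
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"Se = {se:.2f}, Sp = {sp:.2f}, empirical AUC = {auc:.3f}")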
You may employ various summary performance metrics to assess the effectiveness of the use of
your CADe device by readers (and such metrics may vary based on the specific device and
clinical indication). Examples of these include:
· area, partial area, or other measures of the ROC curve;
· area, partial area, or other measures of the FROC curve;
· area, partial area, or other measures of the LROC curve;
· reader Se and Sp (or Se and recall rate [17]); and
· reader localization accuracy.
We recommend using an ROC summary performance metric as part of your primary analysis,
although we recognize that alternate performance metrics may also be appropriate. As mentioned
above, we recommend that you include Se and Sp as a secondary endpoint in your analysis when
an ROC summary performance metric is used. Reporting Se and Sp (or Se and recall rate) may
provide additional information for understanding the expected impact of a device on clinical
practice. We also recommend that you contact the Agency when you are considering an alternate
performance metric.
For study endpoints based on the area under the ROC/FROC/LROC curve or partial area under
the ROC/FROC/LROC curve, we recommend that you provide plots of the actual curves along
with summary performance information for both parametric and non-parametric analysis
approaches when possible. [18] Finally, you should check all methods utilized in your analysis for
the adequacy of their fit to the data as appropriate.
The selection of lesion-based, patient-based, or another unit-based measure of performance as a
primary or secondary endpoint will depend on the intended use and the expected impact of the
device on clinical practice. Powering any additional units-based analyses for statistical
significance should not be necessary unless you intend to make specific performance claims.
We recommend that you describe your statistical evaluation methodology and provide results
including:
· overall reader performance;
· stratified performance by relevant confounders or effect modifiers (e.g., lesion type,
lesion size, lesion location, scanning protocol, imaging hardware, concomitant diseases)
(see Section V. Study Population) (powering each cohort for statistical significance is not necessary unless you are making specific subset performance claims); and
· confidence intervals (CIs) that account for reader variability, case variability, and truth variability or other sources of variability when appropriate.

[17] Recall rate refers to the percentage of patients (including diseased and non-diseased patients) that are called back or recalled for additional medical assessment.
[18] For more information on potential limitations of relying on only one type of ROC analyses, see Gur, D., Bandos, A.I., and Rockette, H.E., Comparing Areas under Receiver Operating Characteristic Curves: Potential Impact of the "Last" Experimentally Measured Operating Point. Radiology, 2008. 247(1):12-15. https://doi.org/10.1148/radiol.2471071321.
We recommend that you identify and validate your analysis software. [19] You should provide a reference to the analysis approach used, clarify the software implementation, and specify a version number if appropriate. Certain validated MRMC analysis approaches, examples of which can be found in the literature or obtained online, may be appropriate for your device evaluation depending on its intended use and conditions of use. [20],[21]
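For illustration only, the Python sketch below shows one simple, nonparametric way to obtain a confidence interval that reflects both reader and case variability: a bootstrap that resamples readers and cases together, applied to fabricated rating data. Validated MRMC methods such as those cited in the footnotes above are the usual choice; everything here (the data, effect size, and number of bootstrap replicates) is an assumption.

# Rough sketch: bootstrap CI for the reader-averaged difference in empirical AUC
# (CADe-aided minus unaided), resampling readers, diseased cases, and normal cases.
import numpy as np

rng = np.random.default_rng(0)
n_readers, n_dis, n_nor = 5, 40, 60                    # assumed study dimensions

def empirical_auc(pos, neg):
    pos, neg = np.asarray(pos), np.asarray(neg)
    return (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

# Fabricated rating arrays indexed [reader, case]; the aided arm is given a small boost.
unaided_dis = rng.normal(1.0, 1.0, (n_readers, n_dis))
unaided_nor = rng.normal(0.0, 1.0, (n_readers, n_nor))
aided_dis = rng.normal(1.3, 1.0, (n_readers, n_dis))
aided_nor = rng.normal(0.0, 1.0, (n_readers, n_nor))

def mean_auc_diff(r_idx, d_idx, n_idx):
    diffs = [empirical_auc(aided_dis[r, d_idx], aided_nor[r, n_idx])
             - empirical_auc(unaided_dis[r, d_idx], unaided_nor[r, n_idx])
             for r in r_idx]
    return float(np.mean(diffs))

observed = mean_auc_diff(np.arange(n_readers), np.arange(n_dis), np.arange(n_nor))

boot = []
for _ in range(1000):
    r_idx = rng.integers(0, n_readers, n_readers)       # resample readers with replacement
    d_idx = rng.integers(0, n_dis, n_dis)                # resample diseased cases
    n_idx = rng.integers(0, n_nor, n_nor)                # resample normal cases
    boot.append(mean_auc_diff(r_idx, d_idx, n_idx))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Reader-averaged AUC difference {observed:.3f}, 95% bootstrap CI ({lo:.3f}, {hi:.3f})")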
The definitions of a true positive, true negative, false positive, and false negative CADe mark
should be consistent with the intended use of the device and the characterization of the reference
standard (see Section VI. Reference Standard).
B. Control Arm
We recommend that you assess the clinical performance of your CADe device relative to a
control modality. A study control arm that uses conventional clinical interpretation (i.e., reader
interpretation without the CADe device) should generally be the most relevant comparator in
CADe performance assessment. For CADe devices intended as second readers, another possible
control is double reading by two clinicians. These controls or a direct comparison with the
predicate CADe device should generally be appropriate for establishing substantial equivalence.
Other control arms may be valid. We recommend that you contact the Agency to discuss your
choice of a control arm prior to conducting your clinical study.
The study control arm should utilize the same reading methodology as the device arm and be
consistent with clinical practice. The same population of cases, if not the same cases themselves,
should be in all study arms to minimize potential bias. For designs that include distinct cases in
each study arm, we recommend that you provide a description and flow chart demonstrating how
you randomized patients and readers into the different arms.
[19] For more information on MRMC analysis software, see, for example, Obuchowski, N.A., Beiden, S.V., Berbaum, K.S., Hillis, S.L., Ishwaran, H., Song, H.H., and Wagner, R.F., Multi-reader, multi-case ROC analysis: An empirical comparison of five methods. Acad. Radiol., 2004. 11(9):980-995. https://doi.org/10.1016/j.acra.2004.04.014.
[20] For MRMC literature references, see, for example: Metz, C.E., Fundamental ROC analysis. Handbook of Medical Imaging, Vol. 1. Physics and Psychophysics. SPIE Press, 2000. Chapter 15, 751-769. https://doi.org/10.1117/3.832716.ch15; Wagner, R.F., Metz, C.E., and Campbell, G., Assessment of medical imaging systems and computer aids: A tutorial review. Acad. Radiol., 2007. 14(6):723-48. https://doi.org/10.1016/j.acra.2007.03.001; and Obuchowski, N.A., Beiden, S.V., Berbaum, K.S., Hillis, S.L., Ishwaran, H., Song, H.H., and Wagner, R.F., Multi-reader, multi-case ROC analysis: An empirical comparison of five methods. Acad. Radiol., 2004. 11(9):980-995. https://doi.org/10.1016/j.acra.2004.04.014.
[21] For online access to software that analyzes MRMC data based on validated techniques, see, for example: LABMRMC software and general ROC software, The University of Chicago: http://metz-roc.uchicago.edu/ (for either quasi-continuous or categorical data); University of Iowa MRMC software: https://perception.lab.uiowa.edu/OR-DBM-MRMC-program-manual (for categorical data); or OBUMRM software: https://www.lerner.ccf.org/qhs/software/.
C. Reading Scenarios and Randomization
Reading scenarios in the clinical evaluation should be consistent with the intended use of the
device. The following are examples of reading scenarios that may be part of a CADe clinical
evaluation. [22]
· Devices for second reader use only (sequential design):
o a conventional reading by the readers without the CADe device (i.e., reader alone);
o a second-read by the readers in which the CADe output is displayed immediately
after conducting a conventional interpretation (this reading could occur within the
same reading session as the conventional read in what is termed a "sequential reading scheme").
· Devices for concurrent reading (cross-over design):
o a conventional reading by the readers without the CADe device (i.e., reader alone);
o a concurrent or simultaneous read by the readers in which the CADe output is
available at any time during the interpretation process (this reading would be
conducted in a separate reading session from the conventional read).
You should randomize readers, cases, and reading scenarios to reduce bias in performance
measures. We recommend that you describe your randomization methodology and provide an
associated flowchart. One approach to randomization is to make use of the principle of Latin squares as part of a block design to the reader study. [23]
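As a purely illustrative example of that principle, the Python sketch below builds a cyclic Latin square so that each (assumed) block of cases appears in each reading position exactly once across reader groups; the block names and group structure are hypothetical:

# Illustrative only: a cyclic Latin square for balancing reading order across reader groups.
def latin_square(items):
    """Row i is the item list rotated by i positions, so every item holds every position once."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

case_blocks = ["block_A", "block_B", "block_C", "block_D"]    # assumed case groupings

for group, order in enumerate(latin_square(case_blocks), start=1):
    print(f"reader group {group} reads blocks in order: {order}")
# Balancing position in this way helps keep the CADe effect from being confounded
# with reading-order and session effects.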
In case of multiple reading sessions where the same cases are read multiple times, we
recommend that you separate each reading session in time by at least four weeks to avoid
memory bias. However, longer time gaps may be advisable. For shorter or longer time gaps
between reading sessions, we recommend that you provide data supporting your proposed time
gaps.
D. Rating Scale
You should use conventional medical interpretation and reporting for lesion location, extent, and
patient management. ROC-based endpoints (see Section IV.A. Evaluation Paradigm and Study
Endpoints) may support collecting data with a finer rating scale (e.g., a 7-point or 100-point
scale) when readers rate the lesion and/or disease status in a patient. We recommend providing
training to the readers on the use of the rating scale (see Section IV.F. Training of Clinical Readers).
[22] Obuchowski N.A., Meziane M., Dachman A.H., Lieber M.L., Mazzone P.J., What's the control in studies measuring the effect of computer-aided detection (CAD) on observer performance? Acad Radiol., 2010. 17(6):761-7. https://doi.org/10.1016/j.acra.2010.01.018.
[23] Bailey, R.A., Design of Comparative Experiments. Cambridge University Press, 2008. https://doi.org/10.1017/CBO9780511611483.
E. Scoring
We refer to the procedure for determining the correspondence between the reader’s interpretation
and the ground truth (e.g., disease status) as the scoring process. The scoring process and the
scoring definition are important components in the clinical assessment of a CADe device and
you should describe them. We recommend that you describe the process (i.e., rationale,
definition, and criteria) for determining whether a reader’s interpretation corresponds to the truth
status established during the truthing process (see Section VI. Reference Standard for
information on the truthing process).
In this guidance, we describe scoring in terms of the clinical performance assessment. A different
type of scoring is used to evaluate device standalone performance, which is described in FDA's guidance entitled "Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data - Premarket Notification [510(k)] Submissions." [24]
The scoring process for the clinical studies should be consistent with the abnormalities marked
by the CADe and the intended use of your device. You should describe and fix the scoring
process prior to initiating your evaluation. In your description of the scoring process, we
recommend that you indicate whether the scoring is based on:
· electronic or non-electronic means;
· physical overlap of the boundary, area, or volume of a reader mark in relation to the
boundary, area, or volume of reference standard;
· relationship of the centroid of a reader mark to the boundary or spatial location of
reference standard;
· relationship of the centroid of the reference standard to the boundary or spatial location of
a reader mark;
· interpretation by reviewing reader(s); or
· other methods.
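To illustrate two of the options listed above, the following Python sketch scores a reader mark against a reference standard region using a centroid rule and an area-overlap rule; the rectangular regions, pixel coordinates, and the 0.3 overlap cutoff are assumptions chosen only for the example:

# Minimal sketch of two scoring rules: centroid-inside-boundary and fractional area overlap.
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned rectangle in pixel coordinates: (x0, y0) lower-left, (x1, y1) upper-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    @property
    def centroid(self):
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)

def contains(box: Box, point) -> bool:
    x, y = point
    return box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1

def overlap_fraction(mark: Box, reference: Box) -> float:
    """Intersection area divided by the reference standard area."""
    ix = max(0.0, min(mark.x1, reference.x1) - max(mark.x0, reference.x0))
    iy = max(0.0, min(mark.y1, reference.y1) - max(mark.y0, reference.y0))
    return (ix * iy) / reference.area if reference.area > 0 else 0.0

reference = Box(100, 100, 140, 150)     # assumed reference standard lesion boundary
reader_mark = Box(110, 120, 150, 160)   # assumed reader mark

centroid_hit = contains(reference, reader_mark.centroid)
overlap_hit = overlap_fraction(reader_mark, reference) >= 0.3   # assumed cutoff
print(f"centroid rule: {'hit' if centroid_hit else 'miss'}; overlap rule: {'hit' if overlap_hit else 'miss'}")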
For scoring that relies on interpretations by reviewing readers, we recommend that you provide
the number of readers involved, their qualifications, their levels of experience and expertise, the
specific instructions conveyed to them prior to their participation in the scoring process, and any
specific criteria used as part of the scoring process. When multiple readers are involved in
scoring, you should describe the process by which you combine their interpretations to make an
overall scoring determination or how you incorporate their interpretations into the performance
evaluation, including how you address any inconsistencies.
[24] Available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/computer-assisted-detection-devices-applied-radiology-images-and-radiology-device-data-premarket.

F. Training of Clinical Readers
We recommend that you specify instructions and provide training to clinical readers in the study on the use of the CADe device and the details on how to participate in the clinical study. Training should include a description of the device and instructions for how to use the device.
For specialized reading instructions or rules (e.g., rules for changing initial without-CADe
interpretation when reviewing the CADe marks), we recommend that you justify their clinical
relevance according to reading task, clinical workflow, and medical practice.
We also recommend that you provide training to the readers on the use of the rating scale (see
Section IV.D. Rating Scale), especially if such a rating scale is not generally utilized in clinical
practice. Such training helps avoid incorrect or un-interpretable results. We recommend that
reader training include rating a representative set of normal and abnormal cases according to the
study design methodology, and making use of cases that are not part of the testing database.
V. Study Population
You may collect patient data (i.e., cases) prospectively or retrospectively based on well-defined
inclusion and exclusion criteria. We recommend that you provide the protocol for your case
collections. Note that cases collected for your clinical trial should be independent of the cases
used during your device development. An acceptable approach for acquiring data is the
collection of consecutive cases that are within the inclusion and outside of the exclusion criteria
from each participating collection site.
Enrichment with diseased/abnormal cases is permissible for an efficient and less burdensome
representative case data set. Enrichment may affect reader performance, so the extent of
enrichment should be weighed against the introduction of biases into the study design. You may
also enrich the study population with patient cases that contain imaging findings (or other image
data) that are challenging to clinicians but that still fall within the device’s intended use
population. This enrichment is often referred to as stress testing and will introduce biases into the
study design. Therefore, not all approaches in developing a stress population may be appropriate
for use in your clinical assessment. For example, a study population enriched with cases
containing small colonic polyps may be appropriate when assessing a general CADe device
designed to assist in detecting a wide range of polyp sizes. However, we do not recommend
enrichment with cases based on the performance of the CADe device.
You should sample the study population from the intended use population for your device so that
the case set includes an appropriate range of diseased/abnormal and normal cases. Selective
sampling from cohorts within the intended use population is appropriate when you utilize stress
testing. The sample size of the study should be large enough such that the study has adequate
power to detect with statistical significance your proposed performance claims. When
appropriate, the study should contain a sufficient number of cases from important cohorts (e.g.,
subsets defined by clinically relevant confounders, effect modifiers, and concomitant diseases)
such that clinical performance estimates and confidence intervals can be obtained for these
individual subsets. Powering each cohort for statistical significance should not be necessary
unless you are making specific subset performance claims. When you make multiple
performance claims, a pre-specified statistical adjustment for the testing of multiple subsets
would be necessary.
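As a rough, hypothetical illustration of sizing only one aspect of such a study, the Python sketch below applies the standard two-proportion normal approximation to a difference in per-case sensitivity between reading arms. The assumed sensitivities, significance level, and power are placeholders, and a real MRMC sizing must also account for reader variability and the correlation induced by readers interpreting the same cases (typically using pilot-study estimates), so the result understates what a fully-crossed reader study would need:

# Back-of-the-envelope sketch (all inputs assumed): diseased cases per arm to detect a
# difference in sensitivity using a two-sided, two-proportion normal approximation.
import math
from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Assumed unaided sensitivity 0.80 versus hoped-for aided sensitivity 0.88.
print(n_per_arm(0.80, 0.88))   # about 329 diseased cases per arm under these assumptions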
When describing your study population, we recommend that you provide specific information,
where appropriate, including:
· the patient demographic data (e.g., age, ethnicity, race, sex);
· the patient medical history relevant to the CADe application;
· the patient disease state and indications for the radiologic test;
· the conditions of radiologic testing (e.g., technique, including whether the test was
performed with/without contrast, contrast type and dose per patient, patient body mass
index, radiation exposure, T-weighting for MRI images, and views taken);
· a description of how the imaging data were collected (e.g., make and model of imaging
devices and the imaging protocol);
· the collection sites;
· the processing sites, if applicable (e.g., patient data digitization);
· the number of cases:
o the number of diseased cases,
o the number of normal cases,
o methods used to determine disease status, location, and extent (see Section VI.
Reference Standard);
· the case distributions stratified by relevant confounders or effect modifiers, such as lesion
type (e.g., hyperplastic vs. adenomatous colonic polyps), lesion size, lesion location,
disease stage, organ characteristics (e.g., breast composition), concomitant diseases,
imaging hardware (e.g., makes and models), imaging or scanning protocols, collection
sites, and processing sites (if applicable); and
· a comparison of the clinical, imaging, and pathologic characteristics of the patient data
compared to the target population.
A. Data Poolability
Premarket approval applications based solely on foreign clinical data and otherwise meeting the
criteria for approval may be approved if, among other requirements, the foreign data are
applicable to the United States (U.S.) population and U.S. medical practice and clinical
investigators of recognized competence have performed the studies. [25] You should justify why non-U.S. data reflect what is expected for a U.S. population with respect to disease occurrence, characteristics, practice of medicine, and clinician competency. In accordance with good clinical study design, you should justify, both statistically and clinically, the poolability of data from multiple sites. We recommend that 510(k) submissions follow similar quality data practices with regard to foreign data and data poolability. You are encouraged to contact the Agency if you intend to rely on foreign clinical data as the basis of your premarket submission.

[25] 21 CFR 814.15.
B. Test Data Reuse
The Agency recognizes the difficulty in acquiring data for CADe assessment. The Agency also
recognizes that the readers and CADe algorithm may implicitly or explicitly become tuned to the
test data if the same test set is used multiple times. If no algorithm training has occurred between
two tests (e.g., if the sponsor is investigating the effect of a new prompt type), then a new set of
readers with no knowledge of the results of the previous study may address concerns about the
use of the same test data set. If CADe algorithm training has occurred between tests, or if this is
a new CADe algorithm, the Agency encourages evaluation of the CADe system using newly
acquired independent test cases (i.e., the Agency discourages repeated use of test data in
evaluating device performance). If you are considering data reuse in the evaluation of your
CADe device, you should demonstrate that reusing any part of the test data does not introduce
unreasonable bias into estimates of CADe performance and that test data integrity is maintained.
In the event that you would like the Agency to consider the reuse of any test data, you should
control the access of your staff to performance results for the test subpopulations and individual
cases. It may therefore be necessary for you to set up a “firewall” to ensure those outside of the
regulatory assessment team (e.g., algorithm developers) are completely insulated from
knowledge of the radiology images and radiological data, and anything but the appropriate
summary performance results. You should maintain test data integrity throughout the lifecycle of
the product. This is analogous to data integrity involving clinical trial data monitoring
committees and the use of a “firewall” to insulate personnel responsible for proposing interim
protocol changes from knowledge of interim comparative results. For more information on data
integrity for clinical trial data monitoring committees, refer to Section 4.4.1.4 of the FDA
guidance entitled “Establishment and Operation of Clinical Trial Data Monitoring Committees
for Clinical Trial Sponsors." [26]
To minimize the risk of tuning to the test data and to maintain data integrity, we recommend that
you develop an audit trail and implement the following controls when you contemplate the reuse
of any test data:
· you randomly select the data from a larger database that grows over time;
· you retire data acquired with outdated image acquisition protocols or equipment that no
longer represents current practice;
· you place a small fixed limit on the number of times a case can be used for assessment;
· you maintain a "firewall" such that data access is tightly controlled to ensure that nobody
outside of the regulatory assessment team, especially anyone associated with algorithm
development, has access to the data (i.e., only summary performance results are reported
outside of the assessment team);
· you maintain a data access log to track each time the data is accessed, including a record
of who accessed the data, the test conditions, and the summary performance results; and
· you use new readers in each new clinical reader study.
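One possible, hypothetical shape for the data access log mentioned above is sketched below as an append-only CSV written from Python; the field names and example values are assumptions rather than required elements:

# Hypothetical sketch of an append-only access log: who accessed the test set, when,
# under what test conditions, and only the summary performance results.
import csv
import datetime
import os

LOG_FIELDS = ["timestamp", "user", "role", "dataset_version", "cases_used",
              "test_conditions", "summary_results"]

def log_access(path: str, **entry) -> None:
    """Append one record; the log file is never rewritten or truncated."""
    entry["timestamp"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(entry)

log_access("test_data_access_log.csv",
           user="assessment_team_lead",
           role="regulatory assessment team",
           dataset_version="v3 (2022-06)",
           cases_used=250,
           test_conditions="algorithm 2.1, second-reader mode",
           summary_results="AUC 0.87 (95% CI 0.83-0.91)")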
The purposes of the audit trail include: (1) establishing that you defined the cases in the training
and test sets appropriately such that data leakage between training and test sets did not occur; (2)
ensuring that you fixed the new CADe algorithm in advance (i.e., before application to the test
set); and (3) providing information concerning the extent to which you used the same test set or a
subset thereof for testing other CADe algorithms or designs, including results reported to the
Agency as well as non-reported results. The controls we are recommending are intended to
substantially reduce the chance that you evaluate a new CADe algorithm in a subsequent study using the same test data set that you used for a prior CADe algorithm.

[26] Available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/establishment-and-operation-clinical-trial-data-monitoring-committees.
If you reuse test data, you should report the test performance for relevant subsets in addition to
the overall performance. These subsets include: (1) the portion of the test set for the new CADe
algorithm that overlaps with previously used test sets; and (2) the portion of the test set that has never been used before. Since these subsets will be smaller in size compared to the overall
test set for the new CADe algorithm, confidence intervals for the subsets will be wider.
However, the trends for the mean performances would be helpful to indicate whether you have
tuned the CADe system to previously used portions of the test set.
VI. Reference Standard
For purposes of this guidance, the reference standard (also often called the “gold standard” or
“ground truth” in the imaging community) for patient data indicates whether the
disease/condition/abnormality is present and may include such attributes as the extent or location
of the disease/condition/abnormality. We refer to the characterization of the reference standard
for the patient (e.g., disease status) as the truthing process.
We recommend that you provide the rationale for your truthing process and indicate if it is based
on:
· the output from another device;
· an established clinical determination (e.g., biopsy, specific laboratory test);
· a follow-up clinical imaging examination;
· a follow-up medical examination other than imaging; or
· an interpretation by a reviewing clinician(s) (i.e., clinical truther(s)).
We also recommend that you describe the methodology utilized to make this reference standard
determination (e.g., based on pathology or based on a standard of care determination). For
truthing that relies on the interpretation by a reviewing clinician(s), we recommend that you
provide:
· the number of clinical truthers involved;
· their qualifications;
· their levels of experience and expertise;
· the instructions conveyed to them prior to participating in the truthing process;
· all available clinical information from the patient utilized by the clinical truthers in the
identification of disease/condition/abnormality and in the marking of the location and
extent of the disease/condition/abnormality; and
· any specific criteria used as part of the truthing process.
When multiple clinical truthers are involved, you should describe the process by which you
combine their interpretations to make an overall reference standard determination and how your
process accounts for inconsistencies between clinicians participating in the truthing process
(truth variability) (see Section IV.A. Evaluation Paradigm and Study Endpoints). Note that
clinicians participating in the truthing process should not be the same as those who participate in
the core clinical performance assessment of the CADe device.
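Purely as an illustration of one way interpretations could be combined, the Python sketch below derives a per-case reference standard from independent truther calls by majority vote and flags any disagreement (truth variability) for adjudication; the labels, adjudication rule, and data are assumptions, and the guidance asks you to describe whatever process you actually use:

# Illustrative sketch: majority-vote reference standard with a simple disagreement flag.
from collections import Counter

# Assumed data: each truther independently labels each case.
truther_calls = {
    "case_001": ["diseased", "diseased", "normal"],
    "case_002": ["normal", "normal", "normal"],
    "case_003": ["diseased", "normal", "normal"],
}

reference_standard = {}
for case, calls in truther_calls.items():
    label, votes = Counter(calls).most_common(1)[0]
    agreement = votes / len(calls)
    reference_standard[case] = {
        "label": label,
        "agreement": round(agreement, 2),
        "needs_adjudication": agreement < 1.0,   # flag any disagreement for review
    }

for case, entry in reference_standard.items():
    print(case, entry)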
VII. Reporting
Reporting of performance results may be guided by the FDA guidance entitled "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests." [27] We recommend
submitting electronically the data used in any statistical analysis in your study including the
following:
· patient information,
· disease or normal status,
· concomitant diseases,
· lesion size,
· lesion type,
· lesion location,
· disease stage,
· organ characteristics,
· imaging hardware,
· imaging or scanning protocol,
· imaging and data characteristics (e.g., characteristics associated with differences in
digitization architectures for a CADe using scanned films), and
· statistical analysis.
For more information on submitting data electronically, please see FDA’s website on “Clinical
Data for Premarket Submissions." [28]
Appendix
A. Potential Sources of Bias in a Retrospective Reader Study
Despite their practical value, the design of retrospective reader studies for evaluating CADe
devices may generate estimates of CADe performance that are subject to one or more potential
sources of statistical bias. Statistical bias is a tendency for a performance estimate in a study to
be misaligned with the true performance in the intended use population. Many sources of
statistical bias can often be minimized or at least mitigated through good study design. Some
potential sources of bias in reader studies include, but are not limited to:
· Selection Bias: The sample of subjects (or readers) is not representative of the target
population.
· Spectrum Bias: The sample of subjects (or readers) studied does not include a complete spectrum of the target population.
· Imperfect Reference (Gold) Standard Bias: The reference procedure is not 100% accurate at classifying subjects by presence or absence of the condition of interest (e.g., breast cancer, polyps in colon).
· Verification Bias: Statistical analysis of diagnostic performance is based only on subjects verified for presence or absence of the condition of interest by the reference standard.
· Reading-Order Bias: When comparing two or more tests, the reader's interpretation is affected by his or her memory of the results from the competing test.
· Context Bias: When the sample prevalence of the abnormal condition differs greatly from the prevalence in the target population, the reader's interpretations may be affected, resulting in biased estimates of his/her diagnostic accuracy.

[27] Available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/statistical-guidance-reporting-results-studies-evaluating-diagnostic-tests-guidance-industry-and-fda.
[28] Available at https://www.fda.gov/medical-devices/premarket-submissions/clinical-data-premarket-submissions.
Selection bias is introduced when subjects (cases) selected for study are not representative of the
intended use population. Retrospective selection of subjects with available images can introduce
a selection bias. Random or consecutive sampling of subjects may eliminate or mitigate selection
bias. Selection bias can also occur in terms of the readers selected to participate in the study.
Readers should reflect the population of readers that will use the device. A small number of
readers having similar training or clinical experiences (e.g., readers all from the same clinic) may
not generalize to the full population of readers expected to use a CADe device.
Spectrum bias is a special form of selection bias in which the study subjects represent an
incomplete spectrum of characteristics of subjects in the intended use population (i.e., important
subgroups are missing from the study).
Enriching a CADe study with subjects that are difficult to diagnose (i.e., stress testing) can bias
the CADe effect. For example, if the CADe is most helpful for these difficult cases, then the
CADe effect can be enhanced with stress testing relative to the intended use population.
Nonetheless, stress testing is encouraged because it provides value in evaluating CADe in
important subgroups, and results in study designs with smaller sample sizes when applied
appropriately.
Enriching a CADe study with subjects having the abnormal condition produces biased estimates
of CADe positive and negative predictive value because these depend on the prevalence of the
abnormal condition. “Corrected” estimates of predictive values would have to rely on an estimate
of prevalence external to the study. The enhanced prevalence may also indirectly cause biased
estimates of area under the ROC curve (AUC), of Se, and of Sp if readers are “tipped off” to the
enhanced prevalence in the study and change their reading behavior, accordingly creating context
bias.
Retrospective reading of image sets may also introduce bias into CADe performance. The basic
concern is that reading behavior may change relative to actual clinical practice because readers
know that they are participating in a study in which patient management is not affected by their
diagnostic readings.
Ground truth determination for the presence or absence of the condition of interest can be subject
to several sources of bias. A reference standard is used to determine ground truth. Imperfect
reference standard bias refers to the reference standard misclassifying some subjects as having
or not having the condition. For example, ground truth determination is incomplete if subjects
diagnosed as negative for the abnormal condition on initial evaluation are not followed up for
confirmation that they were indeed free of the condition.
Verification bias refers to the ground truth being missing for some subjects. If in the statistical
analysis, only the subjects on whom ground truth has been established are included, estimates of
CADe performance can be biased.
The study design of a retrospective reader study can produce potential sources of statistical bias.
For example, in a sequential reading design, readers are instructed to read sequentially first
unaided by CADe, and then aided by CADe as a second reader. The comparison is between an
unaided read and the unaided read combined with the CADe-aided read. The sequential reading
design is attractive because the proximity of the readings in the two modalities minimizes intra-
reader variability. However, if the reading condition in the indications for use (IFU) of the device is different from that in the reader study, a concern might be that the effect of CADe on the
diagnostic performance of readers may be confounded with the effect due to the additional time
given readers to read each case. Another concern is that a reader may undercall the unaided
reading relative to the reading aided by CADe, producing an enhanced CADe effect. If this
concern is real, mitigation may be to randomize a fraction of the cases to be read only in the
initial unaided reading mode. Randomization is revealed only after the initial unaided read is
made. Randomization of the cases should be done separately for each reader in a way that
ensures that all cases are given a CADe-aided reading by some readers.
An alternative to the sequential reading design is the cross-over design. Cases are read unaided
and aided by the CADe output in two independent reading sessions separated by a washout
period to erase reader memory of the images. Half of the cases are randomly assigned to group A
and the other half to group B. In reading session 1, group A cases are read unaided and group B
cases are read aided by CADe. In reading session 2, the cases are “crossed over” to the other
modality (aided reading for group A, unaided reading for group B). Any effects the particular
reading sessions have on the readings cancel in the comparison of the two modalities. However,
relative to the sequential reading design, the two reading sessions contribute additional variability to the estimate of the CADe effect. The cross-over design may be particularly appropriate
for concurrent reading. It can also be generalized for use in evaluating more than two modalities
(e.g., if in addition to unaided reading, the CADe has two or more modalities itself).
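The following Python sketch illustrates the cross-over bookkeeping described above, assuming hypothetical case identifiers and a random split into groups A and B; it shows the assignment logic only, not a prescribed design:

# Sketch of a two-session cross-over assignment: every case is read once unaided and
# once CADe-aided, with the group-to-modality mapping reversed between sessions.
import random

random.seed(42)
cases = [f"case_{i:03d}" for i in range(1, 21)]          # assumed case identifiers
shuffled = random.sample(cases, k=len(cases))
group_a, group_b = shuffled[:len(cases) // 2], shuffled[len(cases) // 2:]

sessions = {
    "session 1": {"unaided": sorted(group_a), "CADe-aided": sorted(group_b)},
    "session 2": {"unaided": sorted(group_b), "CADe-aided": sorted(group_a)},  # crossed over
}

for session, arms in sessions.items():
    for modality, case_list in arms.items():
        print(f"{session} / {modality}: {case_list}")
# A washout period (e.g., at least four weeks; see Section IV.C) would separate the sessions.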
B. Additional References
1. Gallas, B.D., Chen W., Cole E., Ochs R., Petrick N., Pisano E.D., Sahiner B., Samuelson
F.W., Myers K.J., Impact of prevalence and case distribution in lab-based diagnostic imaging
studies. Journal of Medical Imaging, 2019. 6(1): 015501.
https://dx.doi.org/10.1117%2F1.JMI.6.1.015501.
2. Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L., Lijmer, J.G., Moher, D., Rennie, D., de Vet, H.C., Kressel, H.Y., Rifai, N., Golub, R.M., Altman,
D.G., Hooft, L., Korevaar, D.A., Cohen, J.F., STARD Group, STARD 2015: An Updated
List of Essential Items for Reporting Diagnostic Accuracy Studies. Radiology, 2015.
277(3):826-832. https://doi.org/10.1148/radiol.2015151516.
3. Obuchowski, N.A., Gallas, B.D., Hillis, S.L., Multi-reader ROC Studies with Split-plot
Designs: A Comparison of Statistical Methods. Acad Radiol, 2012. 19(12): 1508-1517.
https://dx.doi.org/10.1016%2Fj.acra.2012.09.012.
4. Obuchowski, N.A., Meziane, M., Dachman, A.H., Lieber, M.L., Mazzone, P.J., What's the
control in studies measuring the effect of computer-aided detection (CAD) on observer
performance? Acad Radiol, 2010. 17(6):761-767. https://doi.org/10.1016/j.acra.2010.01.018.
5. Pepe M.S., Statistical evaluation of medical tests for classification and prediction. Oxford
University Press, 2003. 24(16). https://doi.org/10.1002/sim.2185.
6. Petrick, N., Sahiner, B., Armato, S.G. 3rd, Bert, A., Correale, L., Delsanto, S., Freedman,
M.T., Fryd, D., Gur, D., Hadjiiski, L., Huo, Z., Jiang, Y., Morra, L., Paquerault, S., Raykar,
V., Samuelson, F., Summers, R.M., Tourassi, G., Yoshida, H., Zheng, B., Zhou, C., Chan,
H.P., Evaluation of computer-aided detection and diagnosis systems. Med Phys, 2013. 40(8),
087001. https://doi.org/10.1118/1.4816310.
7. Zhou, X., Obuchowski, N.A., McClish, D.K., Statistical Methods in Diagnostic Medicine,
Second Edition. Wiley, 2011. https://doi.org/10.1002/9780470906514.