COVID-19 Chest X-Ray AI

1 Abstract

X-ray and computed tomography (CT) screening technologies for COVID-19 have gained significant traction in AI research since the start of the coronavirus pandemic. Despite these continuous advancements, many concerns remain about model reliability when used in a clinical setting. Much has been published, but with limited transparency in expected model performance. We set out to address this limitation through a set of experiments to quantify baseline performance metrics and variability for COVID-19 detection in chest x-rays across 12 common deep learning architectures. Specifically, we adopted an experimental paradigm controlling for train-validation-test split and model architecture, in which the sources of prediction variability are model weight initialization, random data augmentation transformations, and batch shuffling. Each model architecture was trained 5 separate times on identical train-validation-test splits of a publicly available x-ray image dataset provided by Cohen et al. (2020). Results indicate that, even within a single model architecture, behavior varies meaningfully between trained models. The best-performing models achieve a false negative rate of 3 out of 20 for detecting COVID-19 in a hold-out set. While these results show promise in using AI for COVID-19 screening, they further support the urgent need for diverse medical imaging datasets that support model training with consistent prediction outcomes. It is our hope that these modeling results accelerate work toward building a more robust dataset and a viable screening tool for COVID-19.
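
The repeated-training paradigm described above — identical train-validation-test splits, with variability arising only from weight initialization, augmentation, and batch shuffling — can be illustrated with a toy sketch. This is a hypothetical stand-in using the standard library, not the study's actual training code:

```python
import random

def train_once(seed, train_set):
    """Toy stand-in for one training run: the data split is fixed, so
    run-to-run variability comes only from the seed, which drives weight
    initialization and batch shuffling (as in the experiments above)."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in train_set]  # random weight init
    order = list(range(len(train_set)))
    rng.shuffle(order)                                  # batch shuffling
    # "Training" here is just a deterministic function of init + batch order.
    return sum(weights[i] * train_set[i] for i in order)

fixed_split = [0.1, 0.4, 0.3, 0.2]  # identical data across all runs
scores = [train_once(seed, fixed_split) for seed in range(5)]  # 5 runs
# Even with identical data, repeated runs produce different outcomes.
assert len(set(scores)) > 1
```

In the actual experiments, each of the 12 architectures was trained 5 times in exactly this fashion, so that any spread in the resulting metrics is attributable to these stochastic factors rather than to the data split.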

1.1 Citation

The work described here has been published on arXiv.

Cite our work (BibTeX):

    @article{goodwin2020intramodel,
      title={Intra-model Variability in COVID-19 Classification Using Chest X-ray Images},
      author={Brian D Goodwin and Corey Jaskolski and Can Zhong and Herick Asmani},
      journal={arXiv preprint},
      year={2020}
    }

2 Introduction

The spread of the novel coronavirus, which causes COVID-19, has caught most of the world off-guard, resulting in severely limited testing capabilities. For example, as of April 15, 2020, almost 3 months since the first case in the US, only about 3.3 million tests had been administered [1], which equates to approximately 1% of the US population. Reverse transcription-polymerase chain reaction (RT-PCR) is the assay commonly used to test for COVID-19, but it is available in extremely limited capacity [2]. In an effort to offer a minimally invasive, low-cost COVID-19 screen via x-ray imaging, AI engineers and data scientists have begun to collect datasets [3] and apply computer vision and deep learning algorithms [4]. All of these efforts seek to leverage an available medical imaging modality for diagnosis and, in the future, for predicting case outcomes. Clinical observations have largely propelled AI research in computer vision for screening COVID-19; these reports cite differentiable lung abnormalities in COVID-19 patients on chest CT [5], x-ray [6], and even ultrasound [7]. Current research also shows that COVID-19 is correlated with specific biomarkers in x-ray [8].

Though these recent efforts are valuable in that they lay the foundation for future work in this area, there are significant flaws both in the methodology and in the behavior of the resultant models. Much of the initial work on COVID-19 prediction from chest x-ray used a training set of a little over 100 images with only 10 test images (which were, in fact, identical to the validation set). Such small test sets do not support sweeping claims of diagnostic value, yet popular media articles have nonetheless hyped these models with hopeful titles like “Coronavirus Neural Network can help spot COVID-19 in Chest X-rays” [9], “How AI Is Helping in the Fight Against COVID-19” [10], and “A.I. could help spot telltale signs of coronavirus in lung X-rays” [11].

Network weights for these published models are not publicly available. We have responded to this shortcoming by providing pre-trained weights for many of the most common deep learning architectures for computer vision, and we have made the code for pre-training freely available. To our knowledge, this repository of pre-trained model weights is the first of its kind in response to the current crisis and the first to report prediction results across multiple architectures on a test set that is held out from both the validation and training sets.

Our goal is to facilitate the advancement of screening technology for COVID-19 and to highlight the need for larger, more diverse datasets. The urgency of a clinical methodology for COVID-19 screening cannot be overstated [12]. Our hope is twofold: 1) that the community advances computer vision for COVID-19 detection via x-ray before recommending its use in a clinical setting, and 2) that pre-trained model weights will help accelerate ongoing development in AI to augment the decision-making process for clinicians at a time when healthcare workers are under a severe amount of stress.

4 Results

Refer to our publication for results and details regarding variability between networks and other performance metrics. In brief, Table 4.1 highlights basic performance metrics on the existing x-ray dataset for detecting COVID-19. Note that since the publication of this work, the dataset sample size has increased.

Table 4.1: Average performance metrics by model architecture. Note that all metrics except ACC, which is multiclass accuracy, are for COVID-19 detection only (i.e., binary classification). TPR: true positive rate (or recall); FPR: false positive rate; FNR: false negative rate; PPV: positive predictive value (or precision); F1: F1-score; ACC: overall accuracy (TP+TN)/n.

Arch            TPR    FPR        FNR    PPV        F1         ACC
mobilenet_v2    0.75   0.0222968  0.25   0.3344489  0.4548416  0.8694500
densenet121     0.71   0.0087307  0.29   0.5394747  0.6063897  0.8775348
resnet18        0.77   0.0260578  0.23   0.2960767  0.4219726  0.8774023
densenet169     0.65   0.0186702  0.35   0.3230125  0.4301389  0.8709079
densenet201     0.79   0.0099396  0.21   0.5193145  0.6248680  0.8837641
resnext50       0.80   0.0123573  0.20   0.4809069  0.5960700  0.8683897
resnet50        0.71   0.0171927  0.29   0.3840065  0.4893698  0.8713055
resnet101       0.73   0.0205507  0.27   0.3348723  0.4556031  0.8791252
resnet152       0.63   0.0135662  0.37   0.3911076  0.4805679  0.8780649
wideresnet50    0.78   0.0146407  0.22   0.4182053  0.5439933  0.8652087
resnext101      0.80   0.0185359  0.20   0.3694735  0.5045063  0.8812459
wideresnet101   0.78   0.0212223  0.22   0.3360063  0.4678163  0.8735586
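
The column definitions in Table 4.1 map directly onto confusion-matrix counts. The following sketch computes them for the binary COVID-19 detection task; the counts here are illustrative only (chosen so that FNR = 3/20, matching the best-performing models reported in the abstract), not the study's actual confusion matrices:

```python
def binary_metrics(tp, fp, fn, tn):
    """Binary-classification metrics as defined in Table 4.1."""
    tpr = tp / (tp + fn)              # true positive rate (recall)
    fpr = fp / (fp + tn)              # false positive rate
    fnr = fn / (tp + fn)              # false negative rate = 1 - TPR
    ppv = tp / (tp + fp)              # positive predictive value (precision)
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of PPV and TPR
    return {"TPR": tpr, "FPR": fpr, "FNR": fnr, "PPV": ppv, "F1": f1}

# Illustrative counts: 17 of 20 COVID-19 cases detected (FNR = 3/20).
m = binary_metrics(tp=17, fp=12, fn=3, tn=468)
assert abs(m["FNR"] - 0.15) < 1e-9
assert abs(m["TPR"] - 0.85) < 1e-9
```

Note that ACC in the table is instead the overall multiclass accuracy, (TP+TN)/n across all classes, which is why it is not derivable from the binary counts alone.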

5 Caveats and Considerations

Due to the rapid pace at which literature on COVID-19 is published, our review of the literature is by no means exhaustive. Our intent is to release an updated summary that highlights the most relevant findings in the context of AI development for COVID-19 detection.

Given the length of the incubation period and the variability in symptom onset latency from infection, it is difficult to control for the time at which an image was acquired relative to the time of infection. However, efforts have been made to include such an offset in COVID-19 imaging datasets by recording the number of days since the start of symptoms or hospitalization [13]. Those who curate these datasets must also contend with inherent ambiguity in medical records, such as image acquisition “after a few days” of symptoms (in this case, Cohen et al. assume 5 days).

As with any diagnostic procedure, we feel AI should be used within these workflows only for decision support, and diagnoses should NOT rely on AI results alone. This imperative is perhaps best exemplified by an x-ray mortality predictor, where the AI task was to predict mortality from x-ray images alone. The results of this test were remarkably accurate [14]. However, flaws in the training dataset led the AI to detect image artifacts arising from the x-ray device and from medical instrumentation attached to the patient. For example, x-ray images acquired in an ambulance (versus a hospital room) were associated with higher mortality rates. Therefore, AI developers must take considerable care to ensure that features elucidated via AI originate from patient anatomy and not from the imaging and/or instrumentation artifacts prominent in many medical imaging datasets.

6 References

2. Ai T, Yang Z, Hou H, Zhan C, Chen C, Lv W, et al. Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: A report of 1014 cases. Radiology. 2020;200642.

3. Cohen JP, Morrison P, Dao L. COVID-19 Image Data Collection. arXiv:200311597 [cs, eess, q-bio]. 2020.

4. Wang L, Wong A. COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images. arXiv:200309871 [cs, eess]. 2020. Accessed 10 Apr 2020.

5. Yan Q, Wang B, Gong D, Luo C, Zhao W, Shen J, et al. COVID-19 chest CT image segmentation – a deep convolutional neural network solution. 2020.

6. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet. 2020;395:497–506.

7. Born J, Brändle G, Cossio M, Disdier M, Goulet J, Roulin J, et al. POCOVID-Net: Automatic detection of COVID-19 from a new lung ultrasound imaging dataset (POCUS). 2020.

8. Apostolopoulos ID, Aznaouridis S, Tzani M. Extracting possibly representative COVID-19 biomarkers from x-ray images with deep learning approach and image data related to pulmonary diseases. 2020.

9. Heaven WD. A neural network can help spot COVID-19 in chest x-rays. 2020. Accessed 20 Apr 2020.

10. Dickson B. How AI is helping in the fight against COVID-19. 2020. Accessed 20 Apr 2020.

11. Dormehl L. A.I. could help spot telltale signs of coronavirus in lung x-rays. 2020. Accessed 20 Apr 2020.

12. Kanne JP, Little BP, Chung JH, Elicker BM, Ketai LH. Essentials for radiologists on COVID-19: An update—Radiology scientific expert panel. 2020.

13. Cohen JP, Morrison P, Dao L. COVID-19 Image Data Collection. arXiv:200311597 [cs, eess, q-bio]. 2020. Accessed 11 Apr 2020.

14. Lu MT, Ivanov A, Mayrhofer T, Hosny A, Aerts HJWL, Hoffmann U. Deep learning to assess long-term mortality from chest radiographs. JAMA Netw Open. 2019;2:e197416. doi:10.1001/jamanetworkopen.2019.7416.
