Atrium Health Wake Forest Baptist

Finding the Missing Millions with COPD

Project Lead: Brian Wells, MD, PhD

Project aim(s): Using logistic regression, a well-established machine learning method, we will identify people who have a high probability of chronic obstructive pulmonary disease (COPD) but have never been screened. Nationally, 30 million people have COPD, but half of them (15 million) are undiagnosed. We will use EHR data from Atrium Health Wake Forest Baptist (AHWFB) to create statistical models that predict the results of pulmonary function tests (PFTs). The data contain the necessary structured PFT results, obtained through text parsing previously completed at AHWFB. The mathematical formulas produced by this project can be used to create equitable direct-to-patient alerts offering diagnostic testing. Data from future diagnostic tests will also be used to validate the model prospectively.
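For orientation, a logistic regression risk formula generally takes the form below. This is generic notation only: the predictors x_1, ..., x_k and coefficients beta_0, ..., beta_k are placeholders, not the fitted values from this project.

```latex
\[
\Pr(\text{COPD} \mid x_1, \dots, x_k)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_j\right)\right)}
\]
```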

Narrative Description

Fifteen million Americans are diagnosed with COPD. An estimated 15 million additional people have the disease but are not yet diagnosed. While undiagnosed, they lack appropriate care, effective medications, and pulmonary rehabilitation. Most patients are not diagnosed until after 10 years of symptoms, and most have lost 50% of their lung function by the time of diagnosis, which negatively affects their quality of life. Nationally, over 150,000 people die each year from COPD. Approximately 7 percent of North Carolina's population (700,000 people) has COPD. The prevalence is twice that value in rural areas.

Disparities in access to care, lack of financial resources, racial disparities, and low educational levels hinder patients in rural North Carolina from getting a diagnosis. Fewer than 50 percent of COPD patients completed high school, and African American males have a high rate of undiagnosed disease. Without awareness of the disease and an opportunity for accurate diagnosis, they go without care. Nearly 25% of the disease occurs in people who have never smoked. Nonsmokers are often not considered candidates for diagnostic testing for COPD using spirometry, so they may be overlooked. We therefore need new ways to improve diagnostic ability for the millions of people with COPD. Predictive analytics conducted on existing clinical and research data is one such way to improve diagnosis for many people.

Our intervention consisted of the following:

  • Data extraction from Clarity using Oracle SQL.
  • Creation of an analytic dataset using R, restricted to patients with: (1) age >=50 years; (2) at least 2 primary care visits within the 3 years before their first PFT; and (3) no ICD code for COPD, including emphysema or chronic bronchitis, in the problem list or encounter diagnoses at the time of the PFT.
  • Creation of a prediction model from the predictor variables and development of a risk equation.
  • Validation of the prediction model using 10-fold cross validation.
  • Creation of an automated SQL query to extract data, match data elements to the final model variables, and calculate COPD probability.
  • Creation of an advisory panel to develop a process for communicating findings to patients and to set expectations and processes for acting on the information once received.
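The dataset criteria and the 10-fold split described in the steps above can be sketched as follows. This is a minimal illustration in Python, not the project's R or SQL code. The `Patient` record layout, the function names, the ICD-10 prefixes (J41 to J44), the choice to measure age at the first PFT, and the use of 3 x 365 days as "3 years" are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumed ICD-10 category prefixes for chronic bronchitis, emphysema, and COPD.
# The source does not list the exact codes that were used.
COPD_ICD10_PREFIXES = ("J41", "J42", "J43", "J44")


@dataclass
class Patient:  # hypothetical record layout
    birth_date: date
    first_pft_date: date
    primary_care_visits: list = field(default_factory=list)  # visit dates
    diagnosis_codes: list = field(default_factory=list)  # (code, date recorded)


def is_eligible(p: Patient) -> bool:
    """Apply the three inclusion criteria relative to the first PFT."""
    pft = p.first_pft_date

    # (1) Age >= 50, measured in whole years at the first PFT (assumption).
    birthday_pending = (pft.month, pft.day) < (p.birth_date.month, p.birth_date.day)
    age = pft.year - p.birth_date.year - birthday_pending
    if age < 50:
        return False

    # (2) At least 2 primary care visits in the 3 years before the PFT.
    window_start = pft - timedelta(days=3 * 365)  # approximate 3 years
    visits = [v for v in p.primary_care_visits if window_start <= v < pft]
    if len(visits) < 2:
        return False

    # (3) No COPD-related ICD code recorded on or before the PFT date.
    has_copd_code = any(
        code.upper().startswith(COPD_ICD10_PREFIXES) and recorded <= pft
        for code, recorded in p.diagnosis_codes
    )
    return not has_copd_code


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> list:
    """Shuffle row indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

In 10-fold cross validation, each fold returned by `k_fold_indices` would be held out once for evaluation while the model is fitted on the remaining nine.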

The following table shows the accuracy (discrimination) as measured by the concordance index with the addition of each variable in the final predictive model. The full model achieved a concordance index of 0.82.


Concordance Index, in order of variable addition (individual values not shown):

  • Coin Flip
  • Ever Smoker
  • Body Mass Index (BMI)*
  • COPD Diagnosis Code
  • Number of Chest Xrays past year
  • Number of ED visits past year*
  • Number of Outpatient visits past year*
  • Number of Beta Agonist Solution Rx Ever*
  • Number of Beta Agonist Inhaler Rx Past year

As always, the EHR data presented some challenges, particularly in classifying medications into the appropriate categories, ensuring that medications were active at a given point in time, and handling combination medications. Literature review and discussions with clinicians and researchers have reinforced the importance of the tool that we have built. The number of patients in our data with undiagnosed or misdiagnosed COPD was astonishing. Our tool, which can screen vast numbers of medical records, has the potential to make identification of COPD patients more efficient through targeted screening. After the tool has been prospectively validated in an unbiased sample of patients, we will contact high-risk patients directly through the EHR portal and through care navigators to offer pulmonary function testing. Using care navigators and contacting patients directly will avoid the need for physician-directed electronic alerts.

Publications and Resources

Norton DL, Saha A, Wells BJ, Maus S, Ohar JA. Effect of Race and Gender on Overdiagnosis of COPD. Chest. 2023;164(4):A5041-A5042.

Diagnostic quality problem type, failure, or category (symptoms, observed problems, gaps in performance) addressed by the intervention

  • Failure in information gathering
  • Failure in information integration
  • Failure in information interpretation
  • Failure to establish an explanation (diagnosis)

Root causes/causative factors addressed by the intervention

  • Health information sharing and accessibility via health IT

Setting of the diagnostic quality improvement intervention

  • Ambulatory medical care setting (e.g., clinic, office, urgent care)