Cluster analysis in severe emphysema subjects using phenotype and genotype data: an exploratory investigation

Michael H Cho1,2*, George R Washko2, Thomas J Hoffmann3, Gerard J Criner4, Eric A Hoffman5, Fernando J Martinez6, Nan Laird3, John J Reilly7 and Edwin K Silverman1,2

Author Affiliations

1 Channing Laboratory, Brigham & Women's Hospital, Boston, MA, USA

2 Division of Pulmonary and Critical Care Medicine, Brigham & Women's Hospital, Boston, MA, USA

3 Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA

4 Division of Pulmonary and Critical Care, Temple University School of Medicine, Philadelphia, PA, USA

5 Department of Radiology, Carver College of Medicine, University of Iowa, Iowa City, IA, USA

6 Division of Pulmonary and Critical Care Medicine, University of Michigan Health System, Ann Arbor, MI, USA

7 University of Pittsburgh Medical Center, Pittsburgh, PA, USA

Respiratory Research 2010, 11:30  doi:10.1186/1465-9921-11-30

Published: 16 March 2010



Numerous studies have demonstrated associations between genetic markers and COPD, but results have been inconsistent. One reason may be heterogeneity in disease definition. Unsupervised learning approaches may assist in understanding disease heterogeneity.


We selected 31 phenotypic variables and 12 SNPs from five candidate genes in 308 subjects in the National Emphysema Treatment Trial (NETT) Genetics Ancillary Study cohort. We used factor analysis to select a subset of phenotypic variables, and then used cluster analysis to identify subtypes of severe emphysema. We examined the phenotypic and genotypic characteristics of each cluster.


We identified six factors accounting for 75% of the shared variability among our initial phenotypic variables. We selected four phenotypic variables from these factors for cluster analysis: 1) post-bronchodilator FEV1 percent predicted, 2) percent bronchodilator responsiveness, and quantitative CT measurements of 3) apical emphysema and 4) airway wall thickness. K-means cluster analysis revealed four clusters, though separation between clusters was modest: 1) emphysema predominant; 2) bronchodilator responsive, with higher FEV1; 3) discordant, with a lower FEV1 despite less severe emphysema and lower airway wall thickness; and 4) airway predominant. Of the genotypes examined, membership in cluster 1 (emphysema predominant) was associated with TGFB1 SNP rs1800470.
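To illustrate the clustering step, the following is a minimal sketch of k-means applied to four standardized variables. It is not the study's code: the data are synthetic, the group centres and spreads in `make_synthetic` are invented for illustration only, and the function names (`standardize`, `kmeans`, `make_synthetic`) are hypothetical. The column order is assumed to be FEV1 percent predicted, bronchodilator responsiveness, apical emphysema, and airway wall thickness.

```python
import random
import statistics


def make_synthetic(n_per_group=30, seed=1):
    """Generate made-up data: 4 groups x 4 variables (NOT study data)."""
    rng = random.Random(seed)
    # Assumed column order: FEV1 % predicted, bronchodilator response,
    # apical emphysema, airway wall thickness. All values are invented.
    centers = [(20, 5, 40, 1.4), (35, 25, 25, 1.5),
               (18, 6, 15, 1.3), (28, 8, 12, 1.8)]
    spreads = (4, 3, 5, 0.1)
    return [[rng.gauss(m, s) for m, s in zip(center, spreads)]
            for center in centers for _ in range(n_per_group)]


def standardize(rows):
    """Z-score each column so variables on different scales weigh equally."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # start from k randomly chosen points
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(row, centroids[j]))
                  for row in rows]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                new_centroids.append(
                    [statistics.fmean(c) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


data = make_synthetic()
z = standardize(data)
labels, centroids = kmeans(z, k=4)
cluster_sizes = [labels.count(j) for j in range(4)]
print(cluster_sizes)
```

Standardizing first matters here because the four variables are measured on very different scales; without it, the variable with the largest numeric range would dominate the distance calculation. The result also depends on the random starting centroids, so in practice several starts are usually compared.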


Cluster analysis may identify meaningful disease subtypes and/or groups of related phenotypic variables even in a highly selected group of severe emphysema subjects, and may be useful for genetic association studies.