Data Sets Used by DS2

Introduction

Public and non-public data sets that have been used or evaluated for DS2 research are documented on this page.

Be cautious of Berkson's fallacy: just because there is a correlation between, say, toxoplasmosis and HIV in hospital discharge diagnoses does not necessarily imply that the correlation exists in outpatients or in the population in general. So when experimenting with a predicate, it's important to be confident that its correlation statistics are relevant. Primary care problem lists differ from hospital discharge diagnoses, which in turn differ from outpatient encounter diagnoses; ideally, different correlation statistics should be used for each. That said, we can still learn a lot from experimenting with a predicate even if its correlations aren't relevant to the setting in which we'd ultimately like it to be used.
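A minimal sketch of how selection into a hospital sample can distort a correlation. It computes exact odds ratios for two conditions that are independent in the population but both raise the chance of admission. All prevalences, admission probabilities, and names below are hypothetical assumptions for illustration; they are not estimates for toxoplasmosis, HIV, or any real condition.

```python
from itertools import product

# Hypothetical prevalences of two conditions that are independent in the population.
P_A, P_B = 0.05, 0.10
# Hypothetical admission probabilities: a baseline, plus one per condition.
H_BASE, H_A, H_B = 0.01, 0.30, 0.30


def p_population(a, b):
    """Joint probability of (a, b) in the population, assuming independence."""
    return (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)


def p_admitted(a, b):
    """Probability of admission when each cause acts independently."""
    stay_out = (1 - H_BASE) * (1 - H_A) ** a * (1 - H_B) ** b
    return 1 - stay_out


def odds_ratio(joint):
    """Odds ratio of a 2x2 table given as {(a, b): weight}."""
    return (joint[1, 1] * joint[0, 0]) / (joint[1, 0] * joint[0, 1])


cells = list(product((0, 1), repeat=2))
population = {c: p_population(*c) for c in cells}
# Unnormalized weights are enough, because the odds ratio is scale-invariant.
hospital = {c: p_population(*c) * p_admitted(*c) for c in cells}

or_population = odds_ratio(population)
or_hospital = odds_ratio(hospital)
print(f"odds ratio in population: {or_population:.3f}")
print(f"odds ratio among admitted: {or_hospital:.3f}")
```

Under these assumed numbers the odds ratio is 1 in the population but clearly differs from 1 among admitted patients. The direction and size of the distortion depend entirely on the assumed admission probabilities.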

Test Data Sets

CDC Hospital Discharge Data

Between 1996 and 2010, the Centers for Disease Control and Prevention (CDC) published public use data files from the National Hospital Discharge Survey (NHDS), an annual probability sample survey of discharges from nonfederal, general, and short-stay hospitals. The files cover discharges, diagnoses, and demographics. We used these files to experiment with classifiers using WEKA, and to test the Predicate Reducer and Inference Analyzer. The data is publicly available for download and no license or application is required.

Documentation at: ftp://ftp.cdc.gov/pub/Healthstatistics/NCHs/DatasetDocumentation/NHDS/

Datasets at: ftp://ftp.cdc.gov/pub/Health_statistics/NCHs/Datasets/NHDS/

From CDC:

NHDS covers discharges from noninstitutional hospitals, excluding District of Columbia. Only short-stay hospitals (hospitals with an average length of stay for all patients of less than 30 days) or those whose specialty is general (medical or surgical) or children's general are included in the survey. These hospitals must also have six or more beds staffed for patient use. These criteria, used from 1988 through the current survey year, are slightly different from those used prior to 1988, specifically with respect to certain aspects of the sampling design. First, the 1988 redesign included a third stage of sampling that was performed using a subsample of primary sampling units (PSUs) that had been selected for 1985-1994 National Health Interview Survey; and second, facility sampling took into account whether or not discharge data were available in electronic format. In 2010, the sample consisted of 239 hospitals. Of these hospitals, 3 were found to be out-of-scope (ineligible) because they went out of business or otherwise failed to meet the criteria for the NHDS universe. Of the 236 in-scope (eligible) hospitals, 203 hospitals responded to the survey for an unweighted response rate of 86 percent. The weighted response rate is 79 percent.

In DS2 we primarily worked with the CDC NHDS 2010 dataset and, when a larger dataset was desired, a concatenation of all the CDC NHDS datasets from 1996 to 2010.

Warning: Weights

NHDS discharges are weighted, and applying the weights may impact classifier training and performance.  See more information in the DS2 Data Scripts "README" file or in the white paper.
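To illustrate why weights matter, here is a minimal sketch comparing the unweighted and weighted share of discharges that carry a given diagnosis. The record layout (a `weight` field and a `diagnoses` list) and the codes `DX_A` and `DX_B` are hypothetical; the real NHDS file layout is described in the CDC documentation and is not reproduced here.

```python
def prevalence(records, code, weighted=True):
    """Share of discharges carrying `code`, optionally using survey weights."""
    total = 0.0
    hits = 0.0
    for rec in records:
        w = rec["weight"] if weighted else 1.0
        total += w
        if code in rec["diagnoses"]:
            hits += w
    return hits / total if total else 0.0


# Made-up records; "DX_A" and "DX_B" are placeholder codes, not real diagnoses.
records = [
    {"weight": 10.0, "diagnoses": ["DX_A"]},
    {"weight": 200.0, "diagnoses": ["DX_B"]},
    {"weight": 40.0, "diagnoses": ["DX_A", "DX_B"]},
]

unweighted = prevalence(records, "DX_A", weighted=False)  # 2 of 3 records
weighted = prevalence(records, "DX_A", weighted=True)     # 50 of 250 weighted discharges
```

In this made-up sample the same diagnosis looks common when records are counted equally and much rarer once weights are applied, which is the kind of shift that can affect classifier training.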

CDC 2010 Discharges

Discharge diagnoses for 151,551 hospital discharges in 2010. The 2010 dataset is special because it has the top 15 discharge diagnoses for each discharge (previous years only had the top 7).

CDC 1996-2010 Discharges

Discharge diagnoses for 4,448,125 hospital discharges between 1996 and 2010. DS2 applied a simple concatenation of the CDC NHDS datasets for every year 1996-2010. Note that the 2010 dataset is special because it has the top 15 discharge diagnoses for each discharge (previous years only had the top 7); but because this combined dataset includes both 2010 and pre-2010 data, the 2010 discharges were truncated for use in DS2 to only include the top 7 just like all the previous years' discharges.
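The truncate-then-concatenate step described above can be sketched as follows. This is an illustration under stated assumptions, not the DS2 data scripts: the dictionary record layout is hypothetical, and "top 7" is assumed to mean the first seven diagnoses in the order listed on each record.

```python
MAX_DX = 7  # pre-2010 files carry at most 7 diagnoses per discharge


def truncate_diagnoses(discharge, max_dx=MAX_DX):
    """Return a copy of the discharge keeping only the first `max_dx` diagnoses."""
    return {**discharge, "diagnoses": discharge["diagnoses"][:max_dx]}


def combine_years(discharges_by_year, max_dx=MAX_DX):
    """Concatenate yearly discharge lists in year order, truncating each record."""
    combined = []
    for year in sorted(discharges_by_year):
        combined.extend(truncate_diagnoses(d, max_dx) for d in discharges_by_year[year])
    return combined


# Made-up example: one 2009 record and one 2010 record with 15 placeholder codes.
sample = {
    2010: [{"year": 2010, "diagnoses": [f"DX{i:02d}" for i in range(1, 16)]}],
    2009: [{"year": 2009, "diagnoses": ["DX01", "DX02"]}],
}
combined = combine_years(sample)
```

Truncating every record, not only the 2010 ones, keeps the function simple; records that already have seven or fewer diagnoses pass through unchanged.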

Files and number of discharges are:

   313259 NHDS00.PU.TXT
   330210 NHDS01.PU.TXT
   327254 NHDS02.PU.TXT
   319530 NHDS03.PU.TXT
   370785 NHDS04.PU.TXT
   375372 NHDS05.PU.TXT
   376328 NHDS06.PU.TXT
   365648 NHDS07_PU.TXT
   165630 NHDS08.PU.TXT
   162151 NHDS09.PU.TXT
   151551 NHDS10.PU.TXT
   282008 NHDS96.PU.TXT
   300464 NHDS97.PU.TXT
   307475 NHDS98.PU.TXT
   300460 NHDS99.PU.TXT
  4448125 total
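As a quick consistency check, the per-file counts in the listing above can be keyed by survey year and summed. The counts are copied from the listing; the variable names are just for illustration.

```python
# Discharge counts per survey year, copied from the listing above.
DISCHARGES_BY_YEAR = {
    1996: 282008, 1997: 300464, 1998: 307475, 1999: 300460,
    2000: 313259, 2001: 330210, 2002: 327254, 2003: 319530,
    2004: 370785, 2005: 375372, 2006: 376328, 2007: 365648,
    2008: 165630, 2009: 162151, 2010: 151551,
}

total = sum(DISCHARGES_BY_YEAR.values())
largest_year = max(DISCHARGES_BY_YEAR, key=DISCHARGES_BY_YEAR.get)
share_2010 = DISCHARGES_BY_YEAR[2010] / total

print(f"{total} discharges across {len(DISCHARGES_BY_YEAR)} files")
print(f"2010 share of the combined dataset: {share_2010:.1%}")
```

The sum matches the stated total of 4,448,125, and the listing also shows that yearly counts drop by more than half from 2008 onward, so later years contribute less to the combined dataset.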

AHRQ HCUP NIS

The Agency for Healthcare Research and Quality (AHRQ) Healthcare Cost and Utilization Project (HCUP) hosts a family of databases at http://www.hcup-us.ahrq.gov/databases.jsp. They are derived from administrative data and contain encounter-level clinical and nonclinical information, including diagnoses and procedures, discharge status, patient demographics, and charges.

According to AHRQ:

The Nationwide Inpatient Sample (NIS) is part of a family of databases and software tools developed for the Healthcare Cost and Utilization Project (HCUP). The NIS is the largest all-payer inpatient health care database in the United States, yielding national estimates of hospital inpatient stays. Unweighted, it contains data from approximately 8 million hospital stays each year. Weighted, it estimates roughly 40 million hospitalizations.

Developed through a Federal-State-Industry partnership sponsored by the Agency for Healthcare Research and Quality (AHRQ), HCUP data inform decisionmaking at the national, State, and community levels.

The HCUP NIS data is available for a nominal fee from AHRQ upon completion of an online training course and a signed Data Use Agreement.  The NIS was used to experiment with various classifiers and dimension reduction techniques.  More information is available at the HCUP NIS website.

Northwestern Memorial Hospital - SHARPS De-identified Data Set

Northwestern Memorial Hospital (NMH) made two de-identified datasets available to SHARPS researchers under a confidential data use agreement: an Audit Log dataset (for audit log research) and an EMR dataset. The EMR dataset consisted of encounter diagnoses, medications, procedures, and problem lists for a subset of de-identified NMH patients. DS2 used it for classifier experimentation with WEKA and for testing the Predicate Reducer and Inference Analyzer.

Other Test Data Sets

We evaluated a number of other data sets.

UMLS Co-occurring Concepts

  • UMLS File: MRCOC.RRF
  • Co-occurrences from three different sources:
    • MEDLINE - Medical journal citations
    • CCPSS - Canonical Clinical Problem Statement System, problem-problem co-occurrences based on real patient data from a 1990s study
    • AI/RHEUM - Artificial intelligence rheumatology consultant system, a disease-finding and finding-disease co-occurrence database based on real patient data from a study in the late 1970s

Other potential sources at HealthData.gov

  • Searchable at: http://healthdata.gov/dataset/search
  • Index of 926 publicly-available healthcare data sets
  • Everything from drug use surveys to hospital discharges to vaccine adverse events
  • CMS, CDC, NIH, AHRQ, NCI, local public health, etc.

Vocabularies/Terminologies and Categorization Schemes

AHRQ CCS

According to AHRQ:

The Clinical Classifications Software (CCS) for ICD-9-CM is a diagnosis and procedure categorization scheme that can be employed in many types of projects analyzing data on diagnoses and procedures. CCS is based on the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM), a uniform and standardized coding system. The ICD-9-CM's multitude of codes - over 14,000 diagnosis codes and 3,900 procedure codes - are collapsed into a smaller number of clinically meaningful categories that are sometimes more useful for presenting descriptive statistics than are individual ICD-9-CM codes.

In our classifier experimentation using WEKA, and in testing the Predicate Reducer and Inference Analyzer, we mapped ICD-9-CM codes to CCS categories to reduce the number of attributes. We also used the NLM's SNOMED CT to ICD-9 mapping to translate SNOMED CT codes into ICD-9-CM codes, which could then be mapped to CCS categories (see below).
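A minimal sketch of the attribute reduction described above: many diagnosis codes collapse into a smaller set of categories. The function name and the mapping entries are hypothetical placeholders; a real mapping would be loaded from the CCS file, whose format is not shown here.

```python
def to_ccs_attributes(icd9_codes, icd9_to_ccs):
    """Collapse a discharge's ICD-9 codes into a sorted list of distinct CCS categories.

    Codes missing from the mapping are returned separately rather than dropped silently.
    """
    categories = set()
    unmapped = []
    for code in icd9_codes:
        category = icd9_to_ccs.get(code)
        if category is None:
            unmapped.append(code)
        else:
            categories.add(category)
    return sorted(categories), unmapped


# Placeholder mapping; real entries would come from the CCS file.
icd9_to_ccs = {"CODE_1": "CCS_X", "CODE_2": "CCS_X", "CODE_3": "CCS_Y"}
categories, unmapped = to_ccs_attributes(
    ["CODE_1", "CODE_2", "CODE_3", "CODE_9"], icd9_to_ccs
)
```

Here four input codes reduce to two category attributes, and the one code without a mapping is reported so it can be reviewed instead of vanishing from the data.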

CCS is publicly available for download at http://www.hcup-us.ahrq.gov/toolssoftware/ccs/AppendixASingleDX.txt and no license or application is required.

SNOMED CT and UMLS Terminology Services at NLM

Via UMLS Terminology Services, licensed users (the license is free in the United States) can access a variety of data and services, including SNOMED CT and:

  • SNOMED CT Browser
  • Metathesaurus Browser (includes ICD-9, ICD-10, SNOMED-CT, DSM-IV, and other coding systems)
  • Semantic Network Browser

UMLS also has a problem list subset of about 6,000 SNOMED CT codes most likely to be used in a problem list: [1]

It also offers an ICD-9-CM to SNOMED CT mapping (one-to-one and one-to-many): [2]

In DS2, we used these tools to learn about concepts and relationships in order to inform the project, traversed SNOMED CT to extract codes for deterministic rules, and mapped SNOMED CT codes to CCS categories using NLM's SNOMED CT to ICD-9 mapping.
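The two uses mentioned above, walking a hierarchy to collect codes for a rule and chaining mappings to reach CCS categories, can be sketched roughly as follows. This is not the DS2 implementation: the function names, the parent-to-children dictionary, and every code shown are hypothetical placeholders, not real SNOMED CT, ICD-9, or CCS content.

```python
from collections import deque


def descendants(concept, children):
    """Breadth-first walk of an is-a hierarchy; returns the concept and all its descendants."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:  # a concept may be reachable through several parents
                seen.add(child)
                queue.append(child)
    return seen


def snomed_to_ccs(snomed_code, snomed_to_icd9, icd9_to_ccs):
    """Chain two mappings; a one-to-many first step can yield several CCS categories."""
    return {
        icd9_to_ccs[icd9]
        for icd9 in snomed_to_icd9.get(snomed_code, ())
        if icd9 in icd9_to_ccs
    }


# Placeholder hierarchy and mappings, invented for this example.
children = {"S_ROOT": ["S_A", "S_B"], "S_A": ["S_A1"], "S_B": ["S_A1"]}
snomed_to_icd9 = {"S_A1": ["I_1", "I_2"], "S_B": ["I_3"]}
icd9_to_ccs = {"I_1": "CCS_X", "I_2": "CCS_Y", "I_3": "CCS_X"}

rule_codes = descendants("S_ROOT", children)
ccs_for_a1 = snomed_to_ccs("S_A1", snomed_to_icd9, icd9_to_ccs)
```

The `seen` set matters because a concept can have more than one parent, so it should be visited only once. Returning a set from the chained mapping makes the one-to-many case explicit to the caller.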