The Small Data Revolution

February 2015

By Greg Chittim
For The Record
Volume 27 No. 2 P. 8

The power of Big Data applications to drive effective health care analytics has the potential to revolutionize population health management. As providers transition to EHR systems (in 2013, nearly six in 10 hospitals, or 59%, had adopted at least a basic EHR system, an increase of 34% from 2012), it's apparent that HIT may be on the cusp of Big Data. The problem is that Big Data assumes the source data are accurate; in other words, it assumes that data quality matches the quality of care delivered.

As millions of health-related records are generated every day, the trustworthiness of the data held within HIT systems has been called into question, while a standard for verifying the quality of these diverse sources has yet to emerge. For hospitals to manage complex populations, they must harness comprehensive and reliable data on patient health and demographics. Unfortunately, many of the actors in the health care ecosystem possess only an incomplete or imprecise view of their care populations, making it difficult to paint a clear picture of patient health. Organizations must streamline processes to organize their patient data in a way that allows providers to make better decisions at the point of care.

Before focusing on Big Data (systems that rely on massive amounts of accurate data to intuit connections and causality without significant human programming), health care organizations must get their small data right. Ensuring that EHRs are accurately capturing data, storing it efficiently, and transporting it directly is a necessary precursor to an effective Big Data program.

Small Data in Action
A large provider in the Northeast with more than 2,000 physicians that participates in numerous value-based contracts found profound variation between its data sources and set out to identify flaws in its existing clinical measure-reporting infrastructure. In that infrastructure, each practice transmitted a data feed from its EHR to a third-party analytic tool that calculated a set of quality measures for reporting purposes. These data feeds were formatted according to vendor-defined continuity of care document (CCD) specifications and were the analytic tool's only source of clinical data.

Consultants and client subject matter experts examined four practices within the organization with a total of 5,800 patients served by 50 providers, all of whom used the same EHR platform.

Data quality can mean many things, ranging from predictable data encoding errors to complete corruption and even absence of data. This organization focused on processes that would cause a lack of correspondence between reported measures and captured data. Therefore, the team focused primarily on the dimensions of concordance, plausibility, and currency as it tried to ascertain whether the EHR offered a valid, plausible, and relevant representation of a patient's state at the time of reporting.

In the process of studying potential data quality gaps, the team tested existing EHR data against a subset of accountable care organization quality measures issued by the Centers for Medicare & Medicaid Services. Each measure was defined by a set of metrics (eg, how many male patients aged 45 and older?), which were the basis for calculating a measure denominator (number of patients applicable to the measure) and numerator (number of patients compliant with the measure).
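The numerator and denominator logic described above can be sketched in a few lines of Python. This is only an illustration: the `Patient` fields, the `measure_rate` helper, and the sample patients are hypothetical and do not come from the organization's actual tooling or from any specific CMS measure definition.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical, simplified patient record for illustration only.
    patient_id: str
    sex: str
    age: int
    screened: bool


def measure_rate(patients, in_denominator, in_numerator):
    """Return (numerator count, denominator count, compliance rate)."""
    # Denominator: patients to whom the measure applies.
    denominator = [p for p in patients if in_denominator(p)]
    # Numerator: applicable patients who are compliant with the measure.
    numerator = [p for p in denominator if in_numerator(p)]
    rate = len(numerator) / len(denominator) if denominator else None
    return len(numerator), len(denominator), rate


# Invented sample data.
patients = [
    Patient("a", "M", 52, True),
    Patient("b", "M", 47, False),
    Patient("c", "F", 60, True),
    Patient("d", "M", 30, True),
]

# Metric from the text: male patients aged 45 and older.
num, den, rate = measure_rate(
    patients,
    in_denominator=lambda p: p.sex == "M" and p.age >= 45,
    in_numerator=lambda p: p.screened,
)
print(num, den, rate)
```

Note that the numerator is drawn only from patients already in the denominator, so a compliant patient who falls outside the measure's population (patient "d" here) does not inflate the rate.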

To isolate the sources of measurement reporting errors, the team calculated metrics using three tiers of inclusion criteria meant to simulate available clinical data at each step in the data flow. These tiers, or "cuts," were based on previous data quality assessment experience, manual examination of patient records, and interviews with the clinical staff.

The reported cut included only data elements available for external reporting. This organization's criteria were based on the vendor-specific CCD feed, which was the primary mechanism for populating external analytic and reporting tools. Although CCD feeds follow a standard specification, vendors may define how the feed is populated. Since the existing analytic platform was limited to clinical data from the CCD feed, the reported cut was the most limited dataset.

The structured cut included only structured data elements. This organization's criteria for structured data elements included any elements represented by a code (eg, national coverage determination, ICD, and vendor-specific codes), number, or date. This level of inspection added patients for inclusion or compliance whose data were mapped to structured fields that would not normally be captured for reporting by the CCD feed.

The unstructured cut included all available, appropriate data elements, both structured and unstructured. Unstructured data elements, which included free-text and note fields, contained data in the form of a known pattern (eg, "mammogram NOT ordered") or an incorrectly formatted string (eg, 7.6% instead of 7.6). For any data to be included in a measure calculation, the data must have been entered in some form by clinical staff, in a known location, and in a predictable format or phrasing. Nonetheless, cases that did not meet these criteria may exist, and they would represent data quality gaps that this process did not quantify.
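The three cuts can be thought of as nested filters over the same data elements. The sketch below is one possible way to express that, not the team's implementation: the `DataElement` fields, the assumption that everything in the CCD feed is also structured, and the two parsing helpers are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class DataElement:
    # Hypothetical representation of one captured data element.
    name: str
    value: str
    structured: bool   # stored as a code, number, or date
    in_ccd_feed: bool  # exported by the vendor's CCD feed


def visible_in_cut(element, cut):
    """Decide whether an element counts toward a given cut."""
    if cut == "reported":
        # Assumption: only structured elements reach the CCD feed.
        return element.structured and element.in_ccd_feed
    if cut == "structured":
        return element.structured
    if cut == "unstructured":
        return True  # structured and free-text elements alike
    raise ValueError(f"unknown cut: {cut}")


def parse_numeric(text):
    """Recover a number from a mis-formatted string such as '7.6%'."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*%?\s*", text)
    return float(match.group(1)) if match else None


def says_not_ordered(text):
    """Detect a known free-text pattern such as 'mammogram NOT ordered'."""
    return re.search(r"\bnot\s+ordered\b", text, re.IGNORECASE) is not None


dx = DataElement("diagnosis", "E11.9", structured=True, in_ccd_feed=True)
lab_coded = DataElement("hba1c", "7.6", structured=True, in_ccd_feed=False)
lab_text = DataElement("hba1c", "7.6%", structured=False, in_ccd_feed=False)

for element in (dx, lab_coded, lab_text):
    cuts = [c for c in ("reported", "structured", "unstructured")
            if visible_in_cut(element, c)]
    print(element.name, element.value, cuts)
```

Under these assumptions, each cut is a superset of the one before it, which is what allows differences between cuts to be read as data lost at a particular step in the data flow.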

Some of the most prevalent data gaps were identified between the reported and the structured cuts. For example, colorectal and breast cancer screenings were reported as 0% compliant using the reported cut, while the structured cut revealed a compliance rate between 80% and 95%. Failure to report information that was properly structured in the system indicates an issue with data transport. However, there also were significant data quality gaps between the structured and unstructured cuts. For example, influenza vaccinations were underreported by 20 to 30 percentage points in both the reported and structured cuts compared with the unstructured cut.
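Comparing compliance rates across the cuts gives a rough way to attribute a gap to a step in the data flow. The sketch below assumes rates are expressed as whole percentages. The function name and the sample figures are invented for illustration; they are chosen only to fall within the ranges quoted above and are not the organization's actual results.

```python
def attribute_gaps(rates):
    """Split the difference between cuts into percentage-point gaps.

    `rates` maps each cut name to a compliance rate in percent.
    """
    return {
        # Data that are properly structured but never reach reporting.
        "transport": rates["structured"] - rates["reported"],
        # Data that exist only in free text or mis-formatted fields.
        "structure": rates["unstructured"] - rates["structured"],
    }


# Invented figures, consistent with the ranges described in the text.
screening = {"reported": 0, "structured": 85, "unstructured": 88}
flu_vaccine = {"reported": 50, "structured": 50, "unstructured": 75}

for name, rates in (("screening", screening), ("flu vaccine", flu_vaccine)):
    gaps = attribute_gaps(rates)
    largest = max(gaps, key=gaps.get)
    print(name, gaps, "largest gap:", largest)
```

This comparison cannot surface data that were never entered at all, since such cases are absent from every cut.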

Recommendations
Based on this analysis, consultants identified several points critical to safeguarding data quality in the data flow, including the following:

• Capture, representing the stage at which clinical professionals and/or automated systems enter data into the EHR. Valid data capture requires that a clinical event (eg, a patient encounter or a returned lab result) happened and that the results were accurately entered into the system.

• Structure, representing the process by which captured data are stored in an appropriate format and location. Valid structure depends both on the way in which the data are entered and on the configuration of the EHR platform. If a number is entered into a text field, its accessibility for reporting and analysis is reduced.

• Transport, representing the process by which data are extracted from storage and made available to external systems for reporting or analysis. Which fields are extracted and how records are selected for inclusion are characteristics of the transport mechanism that impact the quality of outgoing data.

After characterizing all metrics, removing gap-free categories, and combining overlapping categories, the team arrived at six categories of data element sources: diagnoses, vitals, labs/orders, medications, procedures, and contraindications. Each source was associated with the most significant root causes of data quality gaps identified during the analysis and the measures most substantially impacted by those gaps.

While successful population health management depends on dedicated and effective clinical care, it also is ultimately reliant on accurate and comprehensive health care data. Powerful data mining tools are critical in fueling the Big Data revolution, but most will fail without small data success. Identifying and correcting quality gaps in health care data brings providers one step closer to a meaningful and effective infrastructure.

Categorizing specific points of failure and identifying the data element types most prone to those failures allows organizations to gain visibility into when, where, and to what extent data gaps are incurred. It also clarifies what can be done to resolve the issues. Combined with knowledge of the underlying populations, improved data quality enables improved quality of care.

— Greg Chittim is a senior director at Arcadia Healthcare Solutions.