
May 2015

The Deidentification Dilemma
By Elizabeth S. Roop
For The Record
Vol. 27 No. 5 P. 16

As data analytics gains traction, it becomes more difficult for health care organizations to find a balance between reaping its rewards and maintaining patient confidentiality.

Patient data volume is increasing, and so too are requests for those data to be deidentified and shared for use in everything from scientific research and clinical outcomes improvements to operational decisions regarding pricing and utilization rates. Thus, when the Department of Health and Human Services (HHS) issued guidance on the topic, its goal was to explain the two most common methods of deidentification.

But is it enough to protect the holders and users of those data as well as the patients themselves? The answer is not as simple as you'd expect.

"If you're following the guidance and rules on which 18 factors need to be removed to meet the requirements for deidentification … you should be in the clear," says Angela Dinh Rose, MHA, RHIA, CHPS, AHIMA's director of HIM practice excellence. "Can it be reidentified? It depends. I never say we're 100% in the clear because we're human."

Weighing the Risks
According to HHS, the two accepted methods for deidentifying protected health information (PHI) in accordance with the HIPAA Privacy Rule are expert determination and safe harbor. Each has its own set of benefits and risks.

Expert determination, which involves applying statistical or scientific principles, carries a "very small risk that [the] anticipated recipient could identify [the] individual," wrote HHS in "Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule."

Under this method, HHS says a covered entity may determine that health information is deidentified if "a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable" does the following:

• determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, to identify an individual who is a subject of the information; and

• documents the methods and results of the analysis that justify such determination.

According to HHS, safe harbor involves removing 18 identifiers (see sidebar) of the individual and of his or her relatives, employers, and household members, leaving behind "no actual knowledge [or] residual information [that] can identify [the] individual." These include names, Social Security numbers, birth dates, medical record numbers, health plan beneficiary numbers, and biometric identifiers such as finger- and voice prints.
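As a rough illustration, the safe harbor removals amount to scrubbing identifying fields from each record. The field names below are hypothetical, not from any standard schema, and a real implementation would have to cover all 18 identifier categories, including identifiers buried in free text:

```python
# Sketch: strip a hypothetical patient record down to non-identifying fields.
# Field names are illustrative; a production scrubber must address all 18
# safe harbor categories, including identifiers embedded in narrative text.

SAFE_HARBOR_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_beneficiary_number",
    "account_number", "certificate_number", "vehicle_id", "device_id",
    "url", "ip_address", "biometric_id", "photo",
}

def scrub(record: dict) -> dict:
    """Drop directly identifying fields from a record."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_IDENTIFIERS}

patient = {"name": "Jane Doe", "ssn": "000-00-0000", "diagnosis_code": "E11.9"}
print(scrub(patient))  # only the non-identifying diagnosis field remains
```

Note that field removal alone satisfies only part of the rule; the covered entity must also have no actual knowledge that the remaining data could identify the subject.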

While the more commonly applied safe harbor method seems like it would be fairly foolproof in terms of reidentification, as Dinh Rose hints, there is no such thing as 100% certainty. Her caution is echoed by Dixie B. Baker, PhD, a senior partner with Martin, Blanck and Associates, which provides health care consulting services.

"Today's Big Data analytics are extremely powerful, and it's hard to believe that these powerful computing algorithms would not be able to identify an individual from their deidentified data, particularly if the data were longitudinal," she says. "I think this speculation warrants testing and quantification. However, the safe harbor approach is described in law so it should be strong enough to legally protect an organization."

Unlike organizations choosing the safe harbor method, those that opt for expert determination must assess the potential that data could be reidentified. According to David Holtzman, vice president of compliance at CynergisTek, an HIT consulting firm focused on privacy, security, and compliance, that risk can be measured in the following three ways:

• Replicability: Prioritize health information features into risk levels according to the chance each feature will consistently occur in relation to the individual.

• Data source availability: Determine which external data sources contain patient identifiers and replicable features in the health information as well as who is permitted to access the data source.

• Distinguishability: Determine the extent to which the subject's data can be distinguished in the health information.

"The greater the replicability, availability, and distinguishability of the health information, the greater the risk for identification," Holtzman says. "Laboratory values are examples of a low risk of individual identification because while they are very distinguishing, they are generally not independently replicable and rarely disclosed in multiple data sources to which many people have access. The risk of identification is greater for demographics because they are highly distinguishing, highly replicable, and available on public data sources."
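Holtzman's three dimensions lend themselves to a simple qualitative ranking. The numeric scale, equal weighting, and feature scores below are invented for illustration; they are not part of any HHS-endorsed method:

```python
# Sketch: rank data features by reidentification risk along Holtzman's three
# dimensions. Scores (0 = low, 1 = medium, 2 = high) and the example features
# are illustrative assumptions, not an official scale.

def risk_score(replicability: int, availability: int, distinguishability: int) -> int:
    """Higher total = greater reidentification risk."""
    return replicability + availability + distinguishability

features = {
    # feature: (replicability, availability, distinguishability)
    "lab_value": (0, 0, 2),   # distinguishing, but rarely replicable or public
    "zip_code": (2, 2, 1),    # stable, publicly available, moderately distinguishing
    "birth_date": (2, 2, 2),  # stable, publicly available, highly distinguishing
}

ranked = sorted(features, key=lambda f: risk_score(*features[f]), reverse=True)
print(ranked)  # demographics outrank lab values, matching Holtzman's reasoning
```

The ordering mirrors the quote above: demographics score high on all three axes, while laboratory values score high only on distinguishability.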

A Cultural Shift
The HHS guidance does its job in that it offers methods of deidentification that, when followed, mitigate any risk the data owner or users may face in the event of a breach. However, it's impossible to fully deidentify data if the goal of their use is improving quality and outcomes, according to Michael Ebert, a partner in KPMG's Advisory Services Practice and health care leader of its Information Protection (Security, Privacy, and Continuity) practice. "The root problem is that it's increasingly difficult to deidentify data because it's being used in so many ways to improve outcomes," he notes. "To do that, there are many different variables that need to be included."

For example, dates are needed to effectively measure quality because it's necessary to understand the events surrounding an incident, including when it occurred. However, all date elements, including admission and discharge dates, are among the 18 factors that must be excluded when deidentifying data.
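Two common compromises, sketched below, are to keep only the year (which safe harbor permits) or, under expert determination, to shift all of a patient's dates by one consistent random offset so the intervals between events survive. Date-shifting is a widely used technique, not something the HHS guidance mandates:

```python
import datetime
import random

# Sketch of two date-handling strategies for deidentification.
# year_only follows the safe harbor rule (year may be retained);
# shift_dates is an expert-determination-style technique that preserves
# the intervals between a patient's events.

def year_only(d: datetime.date) -> int:
    """Safe harbor: retain only the year of a date element."""
    return d.year

def shift_dates(dates: list[datetime.date], max_days: int = 365) -> list[datetime.date]:
    """Shift all of one patient's dates by the same random offset."""
    offset = datetime.timedelta(days=random.randint(-max_days, max_days))
    return [d + offset for d in dates]

admit = datetime.date(2015, 3, 1)
discharge = datetime.date(2015, 3, 8)
shifted = shift_dates([admit, discharge])
assert (shifted[1] - shifted[0]).days == 7  # the 7-day stay is preserved
print(year_only(admit))  # 2015
```

The shifted dates still support quality measurement (length of stay, time to readmission) without exposing the actual calendar dates.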

Further, Ebert says, the problem isn't so much the method by which data are deidentified, but rather how requests are handled. The guidance does its job of informing how to deidentify data sufficiently to avoid risks but does nothing about the deeper issue of oversharing. "If you partially deidentify data, thumbs up. Good job. That's what they're looking for. But too often we are sending the full data sets [because] it's easier. We need to change how we modify and distribute the information," Ebert says.

The burden for making specific requests falls on both the data owner and the data user. It's easier for everyone to request and receive all the data and then parse out what isn't needed. Doing so, however, increases the risk of exposure and therefore the liability on both sides.

Ebert says a better approach is to submit requests that specify the exact data needed to accomplish the project objective. When overly broad requests are received, the data owner must start pushing back rather than acquiescing and turning over the full data set, he adds.

"If something goes wrong, it's on the person who provided the data," Ebert says. "We're talking about a cultural change. We have to start thinking about what data we want and what we actually need. … Think about the risk and exposure you're dealing with. You have to be careful about the risk you're assuming because you'll ultimately be liable, along with whomever gave you the data.

"We've been training [our clients] constantly now on 'don't take data you don't need.' If they give it to you, give it back. It's changing the culture, the mindset, in which we work today," he says.

Dinh Rose notes that the use of deidentified data is growing exponentially as technology and informatics play increasingly important roles in value- and quality-based health care. For example, effective population health management requires the ability to categorize patient populations by age, chronic conditions, and other specific factors. Under safe harbor, much of those data must be excluded.

"Data analytics is booming, so I would imagine there are even more uses for deidentified data now," Dinh Rose says. "Let's use the information generated from ICD-10, for example. It's part of a patient health record; it's PHI. But that information, when deidentified, can help determine population health at a greater level that ICD-9 can't even grasp."

Patient Rights
Once deidentified, health information is no longer PHI and therefore not subject to the protections provided under HIPAA. This creates an interesting dilemma for those who understand not only the value of deidentified data but also the patient's right to privacy.

"The HHS guidelines simply explain what is in the law, and although I'm not an attorney, my understanding is that if a HIPAA-covered entity deidentifies the PHI it holds in accordance with the HIPAA privacy rule, then those data are no longer considered PHI and therefore fall outside the breach notification rule," Baker says. "The guidelines don't address genomic data, which is increasingly becoming part of individuals' health records."

Twila Brase, RN, president and cofounder of the Citizens' Council for Health Freedom, says HHS admits that the guidance doesn't guarantee privacy. In fact, the data cease to be protected once an expert deems they're no longer PHI or as soon as certain data elements are eliminated, she says, adding that this leaves patient information free to be shared as long as it's not reidentified.

"A lot of people talk about protecting privacy after they've taken the data into their office or their system, but they are only talking about protecting the security of the data," Brase says. "The person's privacy was violated when they accessed the person's data without consent."

Brase, whose organization works to protect patient privacy, says private data are not public just because someone says they have been deidentified. Patients should have consent rights over the use of their own data, including the right to refuse any use to which they object, whether or not the data have been deidentified.

Brase identifies four significant risks when it comes to the use of deidentified data without consent: lack of access to timely, accurate, patient-centric care; the end of patient trust; the elimination of reliable research; and the loss of patient control over treatment decisions.

"Increasingly, patient data are being used for 'quality measurements' and payment methodologies which are being used in an attempt to control physician treatment decisions. These controls threaten individualized care, impose one-size-fits-all treatment protocols, [and] interfere in critical thinking skills—all of which may threaten the lives and quality of life of patients," she says.

"A person's consent rights do not end at the clinic door or patient's bedside," Brase continues. "Although most patients may be technologically savvy, they are not technology or database experts. Most would not understand that deidentified data are often reidentifiable. They won't intuitively know what HHS admits in its guidance, starting off with the caveat 'when properly applied.'"

Balancing Risk and Reward
Though there are few, if any, documented cases of deidentified health data being reidentified in practice, studies have shown that the potential is there. In 2013, a Harvard researcher published results of an experiment in which she was able to reidentify 241 out of 1,130 participants in a genomic surveillance study. In this case, the reidentified individuals had submitted their date of birth, gender, and ZIP code. When combined with public records, those three elements were enough to correctly identify the subjects.
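The mechanics of that reidentification are essentially a join on quasi-identifiers: match the "deidentified" study records against a public record such as a voter roll. The records below are fabricated for illustration:

```python
# Sketch of a linkage attack: join a deidentified study dataset to a public
# record on birth date, gender, and ZIP code. All data here are fabricated.

study = [
    {"birth_date": "1970-01-01", "gender": "F", "zip": "02139", "genome_id": "G-17"},
    {"birth_date": "1985-06-15", "gender": "M", "zip": "90210", "genome_id": "G-42"},
]
public = [
    {"name": "A. Smith", "birth_date": "1970-01-01", "gender": "F", "zip": "02139"},
]

def link(study_rows, public_rows):
    """Reidentify study subjects who match exactly one public record."""
    keys = ("birth_date", "gender", "zip")
    hits = []
    for s in study_rows:
        matches = [p for p in public_rows if all(p[k] == s[k] for k in keys)]
        if len(matches) == 1:  # a unique match ties a name to a genome
            hits.append((matches[0]["name"], s["genome_id"]))
    return hits

print(link(study, public))  # [('A. Smith', 'G-17')]
```

The attack needs no breach and no special access; it works precisely because the three quasi-identifiers are highly replicable, distinguishing, and publicly available.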

Also in 2013, researchers at Whitehead Institute used publicly accessible online resources such as Ancestry.com to identify nearly 50 individuals who had submitted personal genetic material for participation in a genomic study. Holtzman notes that the researchers' identification of public demographic data from the National Institute of General Medical Sciences Human Genetic Cell Repository at New Jersey's Coriell Institute ultimately led to changes in how US genetic information repositories provide information through publicly available databases.

"The broader challenge for discussion is our approach to how technology is evolving to link data sources to identify individuals through data that was previously considered unreplicable," he says. "As society allows greater sharing and collection of information through the use of technology applications, euphemistically referred to as the Internet of Things, the greater the risk of identification of previously deidentified data."

A third study, led by a graduate student at the Massachusetts Institute of Technology Media Lab and published in January 2015, added to the base of evidence regarding the ability to reidentify data. In the study, a group of data scientists analyzed credit card transactions made by 1.1 million people in 10,000 stores over a three-month period. They found that knowing just four random pieces of information was enough to reidentify 90% of the shoppers as unique individuals and to uncover their records. By then combining their unique behavior with publicly available information such as Instagram or Twitter posts, individuals' records could be reidentified by name.
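The MIT result rests on what the authors called "unicity": how often a handful of observed data points matches exactly one record in the dataset. A toy version of the idea, using fabricated (store, day) transaction traces:

```python
# Sketch: measure how many customers a few known transaction points pin down.
# Each trace is a set of (store, day) tuples; all data are fabricated.

records = {
    "cust_1": {("shop_a", 1), ("shop_b", 2), ("shop_c", 5), ("shop_a", 9)},
    "cust_2": {("shop_a", 1), ("shop_d", 3), ("shop_c", 5), ("shop_e", 7)},
}

def matching(known_points: set) -> list[str]:
    """Return every customer whose trace contains all the known points."""
    return [c for c, trace in records.items() if known_points <= trace]

# Two known points leave two candidates; a third makes cust_1 unique.
print(matching({("shop_a", 1), ("shop_c", 5)}))                 # ['cust_1', 'cust_2']
print(matching({("shop_a", 1), ("shop_c", 5), ("shop_b", 2)}))  # ['cust_1']
```

At the scale of the actual study, with millions of sparse, highly individual traces, four such points were enough to isolate 90% of shoppers.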

These studies demonstrate why there have been calls for deidentification methods that can keep pace with HIT advances, particularly as more applications are found for genomics. "I would note that much of my own work has to do with the use and protection of genomic data, which is not specifically addressed by the HIPAA privacy rule," Baker says. "However, a 'biometric' is one of the 18 data elements that must be removed under the safe harbor method. An individual's DNA is arguably the strongest biometric that could be used to identify an individual, so it stands to reason that genomic data cannot be deidentified. Fortunately, that fact seems to be more and more widely accepted."

Ultimately, it comes down to strong yet flexible governance and an informed patient population that understands both how and why data are utilized. When it comes to deidentification, Ebert prioritizes the following three considerations:

• Are all data elements necessary?

• Is third-party access appropriately restricted?

• Are processes in place to properly manage responsibilities and liabilities?

"It comes back [to governance] every time," Ebert says.

— Elizabeth S. Roop is a Tampa, Florida-based freelance writer specializing in health care and HIT.


Sidebar: The 18 Safe Harbor Identifiers (45 CFR §164.514(b))

(2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual are removed:

(A) Names

(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:

(1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

(2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

(D) Telephone numbers

(E) Fax numbers

(F) E-mail addresses

(G) Social security numbers

(H) Medical record numbers

(I) Health plan beneficiary numbers

(J) Account numbers

(K) Certificate/license numbers

(L) Vehicle identifiers and serial numbers, including license plate numbers

(M) Device identifiers and serial numbers

(N) Web Universal Resource Locators (URLs)

(O) Internet Protocol (IP) addresses

(P) Biometric identifiers, including finger- and voice prints

(Q) Full-face photographs and any comparable images

(R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section [Paragraph (c) is presented below in the section "Reidentification"]; and

(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

(c) Implementation specifications: reidentification. A covered entity may assign a code or other means of record identification to allow information deidentified under this section to be reidentified by the covered entity, provided that:

(1) Derivation. The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and

(2) Security. The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for reidentification.
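Two of the sidebar's quantitative rules, the ZIP code truncation and the over-89 age cap, can be sketched as follows. The set of restricted three-digit ZIP prefixes below is a placeholder; a real implementation must derive it from current Census Bureau data, as the rule specifies:

```python
# Sketch of two safe harbor rules: retain only the first three ZIP digits
# (zeroed out when the prefix covers 20,000 or fewer people), and aggregate
# ages over 89 into a single "90+" category. RESTRICTED_PREFIXES is an
# illustrative placeholder, not an authoritative Census-derived list.

RESTRICTED_PREFIXES = {"036", "059", "102"}  # placeholder values

def generalize_zip(zip_code: str) -> str:
    """Keep the three-digit prefix, or '000' for low-population prefixes."""
    prefix = zip_code[:3]
    return "000" if prefix in RESTRICTED_PREFIXES else prefix

def generalize_age(age: int):
    """Aggregate ages over 89 into a single top category."""
    return "90+" if age > 89 else age

print(generalize_zip("03601"))  # '000' (low-population prefix)
print(generalize_zip("90210"))  # '902'
print(generalize_age(93))       # '90+'
```

Both rules trade precision for population size: each retained value must be shared by enough people that it no longer singles anyone out.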