
Winter 2023

Cancer Registries Use Natural Language Processing in Data Abstraction
By Susan Chapman, MA, MFA
For The Record
Vol. 35 No. 1 P. 14

Natural language processing (NLP) is becoming an important tool throughout health care. Over the last several years, the health care industry has increasingly employed this form of artificial intelligence in cancer registries to help with the challenging task of abstracting data from the large volume of documentation that flows into them each day.

The Role of Cancer Registries in Health Care
Yale-New Haven Hospital in New Haven, Connecticut, established the first cancer registry in 1926, and its functions are now considered the precursors of today’s system of collecting, managing, and analyzing information on patients who have been diagnosed with cancer. Over the last century, cancer registries have become integral in helping medical professionals and others learn more about cancer and its treatment.

There are effectively three different types of cancer registries. There are those that are within a health care system or facility. The role of this type of registry is to collect and store information on patients who have been diagnosed and treated for cancer within that organization or facility. Central registries, or those that collect and store cancer information within specific areas, are the second type. Third are the registries that fulfill a specific function, otherwise known as special-purpose registries. Such registries collect information on a particular type of the disease.

Among the information that cancer registries collect are diagnoses; types of cancer; patients’ medical histories and demographics; treatments and therapies; and patient follow-ups, the latter of which can include disease recurrence and ongoing treatments.1

After cancer registries collect their data, they report that information to their respective states according to each state’s reporting laws and requirements. In turn, states send their data to the National Cancer Database (NCDB), which produces statistics and reports nationwide. Throughout the process—from initial data collection to the NCDB’s receipt of those data—all patient information must be kept confidential in accordance with HIPAA requirements, and no identifiers are used in data publication or analysis.

While the state is considered a cancer-incident registry—important because those incidents can help identify pockets of cancer—the NCDB offers comparative data for treatment. The NCDB’s data analysis can help facilities understand if they are on target for national quality measures. Among the information the NCDB provides, for example, are cancer survival rates, which can serve to guide physicians in choosing proper treatment options.

Because of their important role in public health as a whole, cancer registries also assist on a much broader scale, helping public-health officials make critical decisions on the allocation of research funding, the placement of screening programs, and public education. Equally important is that the information cancer registries gather plays a vital role in advancing what is known about treatment efficacy, cancer occurrence, and the disease’s survival rates.

Overseeing the information-management work of cancer registries are cancer registrars, who abstract vast amounts of information and collaborate with physicians, researchers, and health administrators. Registrars collect data on the entire cancer experience, from patient diagnosis through treatment. It is the role of the cancer registrar to ensure data integrity and to confirm that all reporting complies with state and federal standards.1

NLP and Cancer Registries
NLP can support cancer registrars by extracting disease-related data from the EHR. This information “can then be integrated into disease registries, allowing data-rich registries to be utilized both prospectively for clinical trials and retrospectively for informed decisions on disease interventions utilizing real world data.”2

According to George Cernile, chief technology officer and vice president of engineering at Inspirata Inc, and Michele A. Webb, CTR, the company’s clinical product specialist, results from their company’s recent survey conducted with cancer registrars show that three-quarters of responding organizations are looking to automate case-finding [identifying patients who have been diagnosed with cancer]. About one-half aim to automate case follow-up, and every fifth organization would like to apply NLP to abstracting.

“NLP solutions can dramatically enhance cancer registry tasks associated with case-finding and auto-extraction of data elements used in case abstraction,” the Inspirata team states. “NLP engines can process, in milliseconds, what humans take hours to perform. Once the reportability of a case is determined and the predefined data elements, or attributes, are extracted from the EHR using the appropriate standards and guidelines, the registrar can quickly and efficiently verify the accuracy and completeness of the case record.”
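The workflow the Inspirata team describes (determine reportability, extract predefined data elements, then hand the draft to a registrar for verification) can be illustrated with a very small rule-based sketch. This is not Inspirata's engine or any vendor's method. The keyword list, negation cues, field labels, and the names `is_reportable`, `extract_elements`, and `draft_case` are all assumptions made for illustration; production systems rely on standard-setter reportability rules and far richer language models.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword and negation lists, for illustration only.
REPORTABLE_TERMS = ("adenocarcinoma", "carcinoma", "melanoma", "lymphoma", "sarcoma")
NEGATION_CUES = ("no evidence of", "negative for")

# Hypothetical patterns for a few data elements a report might label explicitly.
ELEMENT_PATTERNS = {
    "primary_site": re.compile(r"primary site:\s*([^\n.]+)", re.IGNORECASE),
    "histology": re.compile(r"histology:\s*([^\n.]+)", re.IGNORECASE),
    "tumor_size_cm": re.compile(r"(\d+(?:\.\d+)?)\s*cm\b", re.IGNORECASE),
}


@dataclass
class CaseDraft:
    reportable: bool
    elements: dict = field(default_factory=dict)
    needs_review: bool = True  # the registrar still verifies every draft


def is_reportable(text: str) -> bool:
    """Flag a report when a cancer term appears without a preceding negation cue."""
    for sentence in re.split(r"[.\n]+", text.lower()):
        for term in REPORTABLE_TERMS:
            pos = sentence.find(term)
            if pos == -1:
                continue
            if not any(cue in sentence[:pos] for cue in NEGATION_CUES):
                return True
    return False


def extract_elements(text: str) -> dict:
    """Pull the first match for each predefined data element."""
    found = {}
    for name, pattern in ELEMENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1).strip()
    return found


def draft_case(text: str) -> CaseDraft:
    """Case-finding first; extraction only runs on reportable cases."""
    reportable = is_reportable(text)
    return CaseDraft(reportable, extract_elements(text) if reportable else {})


report = (
    "Primary site: left breast\n"
    "Histology: invasive ductal carcinoma\n"
    "Tumor measures 2.5 cm."
)
print(draft_case(report))
```

Even in this toy form, the draft carries a `needs_review` flag: the output is a starting point for the registrar, not a finished case record.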

Valuable information can also be gleaned from other, more nuanced forms of documentation beyond the EHR. NLP can be useful there as well, though the technology is still evolving to gather and analyze that information. Cernile and Webb say that over the past five years, NLP’s ability to benefit the work of cancer registrars has evolved, particularly in terms of new models such as deep learning. “It was not until the 1980s that computerized data-management systems emerged,” they explain. “Today, we’re challenged by rapid changes in science and medicine that impact reporting requirements for multiple standard-setting agencies. Vendors are having to shorten their development and quality-control cycles in order to deliver software solutions to the end-users to support these changes. While showing promising results in more generic applications, such as chat bots and search queries, this new technology is not as robust in terms of accuracy in deciphering discrete medical data. There are lots of things in medical reports that confound such NLP systems. To decipher medical information, the system needs access to domain knowledge on how to interpret the data and then to use this domain knowledge for inference.”

In the study, “Use of Natural Language Processing to Extract Clinical Cancer Phenotypes From Electronic Medical Records,” published in 2019, coauthor Guergana K. Savova and her fellow researchers acknowledge that documentation beyond the EHR is important to cancer registry data-gathering efforts. In their research, the team surveyed “advances in information extraction from the free-text of electronic medical records as related to the complex domain of oncology.”3 The study notes, “Data produced during the processes of clinical care and research in oncology are proliferating at an exponential rate. In the past decade, use of electronic medical records (EMRs) has increased significantly in the United States.”3 Among the factors the team cites as catalysts for this growing use of EHRs are 2009’s HITECH Act and databases such as the NCDB and the National Cancer Institute’s Surveillance, Epidemiology, and End Results program, among others. Yet, despite the increasing use of EHRs, vital data still exist in free text, and that information is valuable in cancer treatment and research.3

The researchers go on to explain that in spite of widespread use of the EHR, vital information is often detailed only in clinical texts, and NLP can help extract this nuanced information, adding, “As a subfield of artificial intelligence, clinical NLP, which refers to the analysis of clinical or health care texts (as opposed to clinical application, per se) has been around for decades. However, only in recent years have compute power and algorithms advanced sufficiently to demonstrate its power towards broadening oncologic investigation.”3

And from an industry perspective, the Inspirata team concurs that data abstraction from documentation sources beyond the EHR is critical and notes that NLP needs to continue to evolve.

The Value of Clinical Trials
Beyond data abstraction for research, tracking, and treatment purposes, Savova and her fellow researchers additionally highlight the critical role clinical trials play in advancing cancer-patient care and enabling new medical treatments to emerge. But while this area of research is burgeoning, adult participation is low, something that is particularly notable among underrepresented populations. If patients are not availing themselves of clinical trials, then this lack of participation hinders researchers’ abilities to complete their trials and ultimately produce safe and effective treatments. As a way to mitigate this issue, clinical-trial matching has become important.3 Savova and her team note, “This is not a simple problem, given the need to extract information from trial protocols written in natural language and match the findings with characteristics from individual EMRs.”3

A widespread issue in health care is connecting patients with appropriate clinical trials. NLP gives organizations the capability to scan a pathology report in the registry and identify applicable, geographically accessible clinical trials for the patient.
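As a rough illustration of that idea, the sketch below matches terms found in a pathology report against each trial's required terms and then filters by distance from the patient. It is a simplification under stated assumptions: the `Trial` record, the trial identifiers, the coordinates, the 100 km default radius, and the plain substring matching are all hypothetical, and real trial matching must interpret full eligibility criteria written in natural language.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    trial_id: str          # hypothetical identifier, not a real registry number
    required_terms: tuple  # lowercase terms that must all appear in the report
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def match_trials(report_text, patient_lat, patient_lon, trials, max_km=100.0):
    """Return IDs of trials whose terms all appear in the report and whose site is nearby."""
    text = report_text.lower()
    matches = []
    for trial in trials:
        terms_ok = all(term in text for term in trial.required_terms)
        near = haversine_km(patient_lat, patient_lon, trial.lat, trial.lon) <= max_km
        if terms_ok and near:
            matches.append(trial.trial_id)
    return matches


trials = [
    Trial("T-HOU-LUNG", ("lung", "adenocarcinoma"), 29.71, -95.40),  # Houston area
    Trial("T-BOS-LUNG", ("lung", "adenocarcinoma"), 42.36, -71.06),  # Boston area
    Trial("T-HOU-SKIN", ("melanoma",), 29.71, -95.40),
]
report = "Lung, right upper lobe, biopsy: adenocarcinoma."
print(match_trials(report, 29.76, -95.37, trials))  # ['T-HOU-LUNG']
```

The Boston trial matches on diagnosis but is dropped by the distance filter, and the melanoma trial is nearby but does not match the report.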

In Houston, Baylor College of Medicine, in partnership with NLP firm Melax Tech, has developed processes that will automate patient cohort identification and streamline research.4 Melax Tech CEO Andre Pontin states, “There is increasing interest within the cancer community to use NLP for a wide array of purposes. For example, undertaking efforts to develop the ability to support and maintain national cancer registry reporting through the use of NLP.”4

The joint project, expected to conclude at the end of 2023, is hoping to use the abstracted data to advance clinical research studies. Chris Amos, PhD, principal investigator of the project, director of the Institute for Clinical and Translational Research, and associate director for population and quantitative research at the Dan L Duncan Comprehensive Cancer Center at Baylor, explains, “Pathology reports often contain copious and valuable information that are relevant for treating patients, but the data structure makes them difficult to use for research studies.”4

Understanding NLP’s Current Limitations
Cernile and Webb believe that while NLP can play an important role in data abstraction, it’s not without its drawbacks. “NLP is not perfect because it sometimes struggles to decipher medical information. NLP is not a panacea or a catch-all for dealing with medical data. To be successful, it requires a deep understanding of the problem domain and of the desired outcome and outputs. In other words, the system needs to be geared for specific purposes to be effective,” they explain. “Another challenge relates to the format in which some reports exist. Not everything comes in text-based form. In many cases, NLP must deal with low-resolution faxes or PDFs with graphics. This represents a challenge to any computerized system, irrespective of how good its design is.”

To help ensure NLP systems are used effectively, the Inspirata team believes that proper review and understanding of the content and the goals for each specific use case are paramount. Additional best practices include the collection of sample reports as well as robust testing and adjustment to achieve the desired functionality and accuracy. For it to continue to be successful in the future, “NLP must continue to be developed and applied to appropriate tasks and functions in healthcare and the cancer registry. The cancer registrar’s role will transition from data collection to data curation. NLP will enable registrars to collect, visually verify, and apply the standards and guidelines to multiple, complex datasets faster and more accurately,” Webb and Cernile say.

For Savova and her team, collaboration among stakeholders is another best practice. Oncology and cancer research are two separate fields, and finding individuals who are well-versed in both is rare. Because of this, experts in their respective fields must work together in order to prioritize the best uses for NLP.3 They state, “Once an NLP technology is developed, oncologists and cancer researchers should take a primary role in evaluating it to determine its utility for research and their clinical value. While standards for clinical evaluation of software, including artificial intelligence systems, are evolving … NLP tools that directly affect management decisions should be considered for evaluation in a trial setting by clinical investigators familiar with the technology and FDA guidelines. In partnership, computer scientists, oncology researchers, and clinicians can take full advantage of the recent advances in NLP technology to fully leverage the wealth of data stored and rapidly accumulating in our EMRs.”3

It’s clear that advances in NLP technology have played key roles in helping cancer registrars abstract information that ultimately benefits individual patients and the greater population as a whole. And organizations, both public and private, have vested interests in ensuring that this technology continues to evolve, ultimately helping researchers and practitioners learn more about cancer and its treatments on the path to enhancing patient care.

— Susan Chapman, MA, MFA, is a Los Angeles–based freelance writer and editor.


1. Chapman S. Cancer registrars tackle tough assignments. For The Record. 2012;24(7):14.

2. Streamline data abstraction for disease registries with Linguamatics NLP. Linguamatics website. https://www.linguamatics.com/solutions/disease-registries

3. Savova GK, Danciu I, Alamudun F, et al. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Cancer Res. 2019;79(21):5463-5470.

4. Melax Tech and Baylor College of Medicine collaborate on cancer natural language processing project. PR Newswire website. https://www.prnewswire.com/news-releases/melax-tech-and-baylor-college-of-medicine-collaborate-on-cancer-natural-language-processing-project-301409939.html. Published October 27, 2021.