Tribal health leaders have long recognized the necessity of having complete and accurate race data as a first step to addressing health disparities experienced by American Indians/Alaska Natives (AI/AN). Numerous studies have shown a high prevalence of race misclassification for AI/AN in data sources such as vital statistics and cancer registries. This misclassification leads to underestimated morbidity and mortality, hampering public health decision-making and the appropriate allocation of disease control resources.
Using the most complete listing of AI/AN currently available (a roster of individuals who have registered at tribal, Indian Health Service, and urban Indian clinics in the northwest), we perform record linkages with health data systems in Idaho, Oregon, and Washington. The prevalence of misclassified and missing race data in this region can range from 30% to 60%; left uncorrected, this would lead to substantial underestimates of the burden of health outcomes for this population. Our work directly benefits both state partners and tribes by improving the accuracy of race data in state surveillance data systems and by providing more accurate and complete health status data to northwest tribal communities. To date, linkages have been conducted with state cancer registries, death records, hospital discharge data, and STD surveillance systems, as well as for several tribe-specific projects. This work is widely supported by tribal health leaders and our state partners.
What is record linkage?
Record linkage is the process of comparing records across data sets to identify individuals contained in both. In Indian Country, one common example involves taking a data source with accurate information about American Indian/Alaska Native ancestry and linking it with a second dataset to improve the quality of race information in the second database.
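As a minimal sketch of this idea, the Python below links a hypothetical clinic roster to a hypothetical registry extract by exact agreement on last name, first name, and date of birth, then updates the race field for matched records. The field names, the records, and the `correct_race` helper are all invented for illustration; an actual linkage would use more identifiers and, in most cases, the probabilistic methods described further down.

```python
def match_key(record):
    """Build a comparison key from normalized names and date of birth."""
    return (
        record["last"].strip().upper(),
        record["first"].strip().upper(),
        record["dob"],
    )


def correct_race(roster, registry):
    """Return a copy of the registry with race set to AI/AN for records found on the roster."""
    roster_keys = {match_key(r) for r in roster}
    corrected = []
    for rec in registry:
        updated = dict(rec)  # copy so the original registry is left untouched
        if match_key(rec) in roster_keys:
            updated["race"] = "AI/AN"
        corrected.append(updated)
    return corrected


# Hypothetical records, invented for illustration only.
roster = [{"last": "Doe", "first": "Jane", "dob": "1970-01-15"}]
registry = [
    {"last": "DOE ", "first": "jane", "dob": "1970-01-15", "race": "White"},
    {"last": "Roe", "first": "Ann", "dob": "1982-06-30", "race": "Unknown"},
]

linked = correct_race(roster, registry)
print([r["race"] for r in linked])
```

Only the first registry record matches the roster, so only its race value changes; the second record is returned as it was.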
Linkages can supplement or validate data across data sets, as well as identify duplicate records for the same individual within one data set. Common examples include merging death information from a vital statistics file with cancer information from a central cancer registry, or linking data from death certificates, inpatient hospitalizations, and law enforcement citations to generate crash and injury reports, as in NHTSA’s CODES Program. Likewise, the detection of duplicates is a fundamental requirement for the accuracy and validity of event counts in any disease registry.
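Duplicate detection within a single data set can be sketched in the same spirit. The example below groups records that share a normalized name and date of birth and reports any group with more than one record. The `find_duplicates` function and the field names are hypothetical, and exact-key grouping like this will miss duplicates that differ through typos or nicknames.

```python
from collections import defaultdict


def find_duplicates(records, key_fields=("last", "first", "dob")):
    """Return lists of record ids that share the same normalized key."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(str(rec[f]).strip().upper() for f in key_fields)
        groups[key].append(rec["id"])
    # Keep only keys that occur on more than one record.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records, invented for illustration only.
records = [
    {"id": 1, "last": "Doe", "first": "Jane", "dob": "1970-01-15"},
    {"id": 2, "last": "DOE", "first": "jane ", "dob": "1970-01-15"},
    {"id": 3, "last": "Roe", "first": "Ann", "dob": "1982-06-30"},
]

duplicates = find_duplicates(records)
print(duplicates)
```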
Linkages fall into two main categories: deterministic and probabilistic. Deterministic linkage looks for exact matches across the data fields of two records. It is a fairly straightforward process, but it may miss many true matches if there are coding errors or missing data. Probabilistic matching has several advantages over exact matching methods, such as the ability to:
- Account for coding differences between the two files, such as the use of nicknames, middle initials vs. full middle names, and transposed digits in a social security number
- Account for both the likelihood that two records represent the same person (sensitivity), and the likelihood that they do not (specificity)
- Assign score weights depending on the frequency of a value (e.g., your dataset contains many “Smiths” but few “Hoopes” so a match on “Hoopes” would be weighted higher)
- Allow for phonetic name matching (e.g., NYSIIS and Soundex)
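The points above can be illustrated with a small sketch of probabilistic scoring. It is not how Link Plus or any other package is implemented. It assumes the commonly used log-ratio form of field weights, log2(m/u) for agreement and log2((1-m)/(1-u)) for disagreement, where m is the probability that a field agrees for a true match and u is the probability that it agrees by chance. The m and u values, the field names, and the records are invented for illustration; the surname u value is taken from the relative frequency of each Soundex code in the file, so that agreement on a rare name counts for more.

```python
import math
from collections import Counter

# Standard American Soundex letter groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


def relative_frequencies(values):
    """Map each distinct value to its share of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def agreement_weight(m, u):
    """Weight added when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight (usually negative) added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def score_pair(a, b, last_freq, m_last=0.95, m_dob=0.97, u_dob=0.01):
    """Sum field weights for one candidate pair; m and u values are assumed."""
    score = 0.0
    code_a, code_b = soundex(a["last"]), soundex(b["last"])
    if code_a == code_b:
        # Frequency-based u: rare surnames earn a larger agreement weight.
        score += agreement_weight(m_last, last_freq[code_b])
    else:
        # Chance that two random records share a code (needs 2+ distinct codes).
        u_last = sum(f * f for f in last_freq.values())
        score += disagreement_weight(m_last, u_last)
    if a["dob"] == b["dob"]:
        score += agreement_weight(m_dob, u_dob)
    else:
        score += disagreement_weight(m_dob, u_dob)
    return score


# Hypothetical registry file: many "Smith" records and one "Hoops".
registry = [{"last": "Smith", "dob": "1980-03-02"}] * 8 + [
    {"last": "Smyth", "dob": "1975-11-20"},
    {"last": "Hoops", "dob": "1970-01-15"},
]
last_freq = relative_frequencies(soundex(r["last"]) for r in registry)

rare = score_pair({"last": "Hoopes", "dob": "1970-01-15"}, registry[9], last_freq)
common = score_pair({"last": "Smyth", "dob": "1980-03-02"}, registry[0], last_freq)
print(round(rare, 2), round(common, 2))
```

Both pairs agree phonetically on surname and exactly on date of birth, but the pair with the rare surname receives the higher total score. In practice, pairs scoring above a chosen cutoff are accepted, pairs below a lower cutoff are rejected, and those in between go to manual review.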
For more detailed information about linkage concepts, see the Linkage Concepts PowerPoint in the “Resources” section.
The process of pursuing record linkages varies across states, departments, and institutions, but here we offer some tools that may help you get started. First, it is important to contact the manager of the data source with which you wish to link to discuss the project and determine specific approval processes that the organization may have. We generally develop a simple IRB protocol, which often qualifies for expedited review, and negotiate a data sharing agreement with the agency we’ll be linking with. Confidentiality pledges can be used to specify data handling and disclosure protocols required of staff with access to confidential data. Examples of these documents are provided below.
This PowerPoint presentation covers the basic concepts of deterministic and probabilistic linkage.
Several user-friendly software options are available that require very little programming knowledge.
This list is only a sample of the programs available and is not meant to be exhaustive.
The Link King – free public domain linkage and de-duplication software (user manual available).
Link Plus (a component of Registry Plus) – free, publicly available linkage and de-duplication software designed by CDC for use by central cancer registries (but usable with any fixed width or delimited data type). There is no user manual available from the developer, but technical support may be provided by phone and email.
LinkSolv – a commercial linkage solution software for purchase from Strategic Matching. Training and technical support are available.
Link Plus Resources
This project has more experience using Link Plus than other software options. The following resources provide more detail about getting started if you choose to use this program.
This PowerPoint presentation provides an overview of the Link Plus software and its use for probabilistic linkage, including helpful tips for manual review.
This is a manual developed by staff of the Registry/IDEA-NW Project. It is intended to walk a new user through a linkage using Link Plus.
Link Plus Overview
Link Plus Manual
Link Plus Tip Sheet
Here is an example of an IRB Protocol describing linkage methods using Link Plus Software.
This document contains a sample template for a Data Sharing Agreement (may also be called Data Use or Data Exchange Agreement) and Use and Disclosure of Client Information. Within the data sharing agreement there are important areas to consider for inclusion. At a minimum, the agreement should specify the following: parties involved, including contact information; the purpose or need for the data sharing agreement; nature of the data to be collected; access and confidentiality of data; how the data is to be used; how and in what situations the agreement can be severed by either party; and relevant legal authorities (tribal, state, national).
A Confidentiality Pledge may be used to outline the rules for internal access to a data set containing direct personal identifiers, such as a patient registration list or tribal enrollment list, which may be used for record linkages. Technical details of data exchange between multiple parties should be detailed separately in a data sharing agreement.
Sample IRB Protocol
Sample Confidentiality Pledge
Please contact us with questions or comments on this material, using the “Contacts” tab to the right.