The past several years have been characterised by an era of “big data”. During this time, the volume, velocity, and variety of data collected from numerous sources across many fields including medicine have increased.Reference Merelli, Perez-Sanchez, Gesing and D’Agostino 1 In order to make use of this increasing volume of data, multiple different platforms and techniques have been developed aiming to better manage, integrate, analyse, and provide real-time feedback to various industries regarding their data. Data linkages in particular are important as they enable questions to be answered that cannot be addressed with individual data sets alone – for example, in the automotive industry, capturing data generated by electric cars on driving habits impacting battery use such as typical acceleration and braking, and linking this with data regarding frequency and location of battery charging stations, aids in better design of the next generation of vehicles and charging infrastructure. 2
In the field of paediatric cardiovascular disease, numerous clinical registries, administrative databases, research data sets, and other data sources currently exist, and they contain a wealth of important information that can be used to facilitate research and quality improvement. These include multiple data sets that capture information related to paediatric heart failure and transplant.Reference Davies and Pizarro 3 In addition, data are being increasingly captured via a variety of newer modalities including the electronic health record, continuous capture of data generated from medical monitors and devices, genetic and biomarker data, and patient-reported outcomes data regarding quality of life and other important longer-term outcomes; however, many current limitations constrain the knowledge gained from these data sets.Reference Pasquali, Jacobs and Jacobs 4 Each data set contains limited information, most often isolated to a specific procedure or hospitalisation, and there is a primary focus on short-term outcomes only. Databases do not readily communicate with each other, and there are limited mechanisms for efficient “real-time” collection of new or additional data points to answer new or additional clinical questions. There is also limited ability to capture longitudinal follow-up data.
Rationale for linking databases
Linking information across data sources can address many of the limitations associated with the use of individual data sets described in the preceding section.Reference Pasquali, Jacobs and Shook 5 Linking databases expands the pool of available data for analysis and capitalises on the strengths of different types of data sources. Linkage allows analyses otherwise not possible with single-centre data or with individual data sets alone. Finally, linking data sets can be more time-efficient and cost-efficient than creating additional new data sets and can involve several different methodologies.Reference Pasquali, Jacobs and Shook 5
Data linkage methodologies
Linking on unique identifiers
Local patient records, and some larger data sets, contain unique patient identifiers such as social security number that can facilitate linkages with other data sourcesReference Jacobs, Edwards and Shahian 6 – Reference Saleeb, Li, Warren and Lock 8 – for example, investigators have previously linked outpatient records regarding paediatric cardiology visits for chest pain to the National Death Index and Social Security Death Master File to evaluate for subsequent mortality in this cohort.Reference Saleeb, Li, Warren and Lock 8 New limitations on the availability of the Social Security Death Master File for research purposes may pose a greater challenge to the use of this methodology in the future.
Linking on indirect identifiers
Although linkage on direct or unique identifiers is the easiest way to accomplish linkages between data sets, these are often not collected in many databases due to a variety of regulatory requirements and privacy concerns, and may only be available at the local level.Reference Dokholyan, Muhlbaier and Falletta 9 Therefore, a methodology has also been developed to link database records through the use of “indirect” identifiers.Reference Hammill, Hernandez, Peterson, Fonarow, Schulman and Curtis 10 These include date of birth, date of admission, date of discharge, sex, and the centre of hospitalisation. It has been shown that nearly all records at a given centre can be uniquely identified using these indirect identifiers, and that a crosswalk can then be created between two data sets, linking patients based on the values of the centre where hospitalised and the indirect identifiers. This method has been used to successfully link adult cardiac databases.Reference Hammill, Hernandez, Peterson, Fonarow, Schulman and Curtis 10
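The crosswalk described above can be illustrated with a minimal sketch. It is not the published linkage procedure: the field names, record ids, and the rule of skipping any combination of indirect identifiers that is not unique on both sides are assumptions made for illustration.

```python
from collections import defaultdict

# Indirect identifiers named in the text; the field names are hypothetical.
KEY_FIELDS = ("centre", "date_of_birth", "date_of_admission",
              "date_of_discharge", "sex")


def indirect_key(record):
    """Return the tuple of indirect identifiers for one record."""
    return tuple(record[field] for field in KEY_FIELDS)


def index_by_key(records):
    """Group record ids by their indirect-identifier tuple."""
    by_key = defaultdict(list)
    for record in records:
        by_key[indirect_key(record)].append(record["id"])
    return by_key


def build_crosswalk(registry, administrative):
    """Pair ids whose indirect identifiers match exactly one record per side."""
    registry_index = index_by_key(registry)
    admin_index = index_by_key(administrative)
    crosswalk = {}
    for key, registry_ids in registry_index.items():
        admin_ids = admin_index.get(key, [])
        # Assumption: ambiguous or unmatched keys are left unlinked.
        if len(registry_ids) == 1 and len(admin_ids) == 1:
            crosswalk[registry_ids[0]] = admin_ids[0]
    return crosswalk


# Invented example records: R1 and P9 agree on every indirect identifier,
# whereas R2 and P7 differ in centre and are therefore not linked.
registry = [
    {"id": "R1", "centre": "A", "date_of_birth": "2009-01-05",
     "date_of_admission": "2009-02-01", "date_of_discharge": "2009-02-10",
     "sex": "F"},
    {"id": "R2", "centre": "A", "date_of_birth": "2008-07-19",
     "date_of_admission": "2009-03-02", "date_of_discharge": "2009-03-09",
     "sex": "M"},
]
administrative = [
    {"id": "P9", "centre": "A", "date_of_birth": "2009-01-05",
     "date_of_admission": "2009-02-01", "date_of_discharge": "2009-02-10",
     "sex": "F"},
    {"id": "P7", "centre": "B", "date_of_birth": "2008-07-19",
     "date_of_admission": "2009-03-02", "date_of_discharge": "2009-03-09",
     "sex": "M"},
]
crosswalk = build_crosswalk(registry, administrative)
```

Requiring a unique match on both sides trades a somewhat lower match rate for fewer false links; a real linkage project would also need to decide how to handle missing or inconsistently recorded dates.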
Recently, this methodology was adapted to the paediatric cardiovascular population to link a large clinical registry (Society of Thoracic Surgeons Congenital Heart Surgery Database) with a paediatric administrative data set (Pediatric Health Information Systems Database).Reference Pasquali, Jacobs and Shook 5 Linking these two data sets allows utilisation of the detailed operative and outcomes data from the clinical registry, and the valuable resource utilisation data from the administrative data set, to conduct analyses not otherwise possible with each individual database alone. The present linked data set includes records from >60,000 children undergoing congenital heart surgery at 33 different hospitals from 2004 to 2010, with plans to further expand and update the data set. Several comparative effectiveness studies, using the clinical data from the registry and medication data from the administrative set, and analyses of the quality–cost relationship, using the clinical data from the registry and the resource utilisation and cost estimates from the administrative data set, have been successfully conducted.Reference Pasquali, Li and He 11 – Reference Pasquali, Jacobs and He 13 Similar methodology has also been used to merge clinical trial data from the Pediatric Heart Network Single Ventricle Reconstruction Trial with data from the Children’s Hospital Association Case Mix data set in order to perform economic analyses, which are not possible using the trial data alone.Reference McHugh, Pasquali, Hall and Scheurer 14
Centre-level linkages
Linking registry data to other centre-level data through matching on centre can be easily accomplished – for example, survey data regarding intensive care unit care models and nursing education and staffing levels have been successfully linked to the Society of Thoracic Surgeons Congenital Heart Surgery Database.Reference Burstein, Jacobs and Sheng 15 These linkages enable evaluation of the variables collected in the survey in relation to outcomes data collected in the registry.
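As a rough sketch of a centre-level linkage, survey answers can be attached to every registry record from the same centre. The field names (`nurses_per_bed`, `died`) and the choice to drop records from centres that did not return a survey are hypothetical and are not taken from the cited study.

```python
def attach_centre_survey(registry_records, survey_by_centre):
    """Return copies of registry records with their centre's survey answers added."""
    linked = []
    for record in registry_records:
        survey = survey_by_centre.get(record["centre"])
        if survey is None:
            continue  # Assumption: centres without a survey response are excluded.
        linked.append({**record, **survey})
    return linked


# Invented example data.
registry_records = [
    {"patient": 1, "centre": "A", "died": False},
    {"patient": 2, "centre": "B", "died": True},
    {"patient": 3, "centre": "C", "died": False},
]
survey_by_centre = {
    "A": {"nurses_per_bed": 1.0},
    "B": {"nurses_per_bed": 0.5},
}
linked = attach_centre_survey(registry_records, survey_by_centre)
```

Each linked record then carries both a patient-level outcome and a centre-level exposure, which is the structure needed to relate survey variables to registry outcomes.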
Supplementary data modules
Data linkages can also be efficiently accomplished through the development of a modular data-collection system that enables collection and linkages of supplemental data points to the main registry. The modules are generally web-based and can be quickly created and deployed to allow “real-time” collection of additional data needed to answer important clinical questions. They are more time-efficient and cost-efficient compared with traditional research methods that may duplicate data already being collected in a registry. This methodology was recently used successfully by the Pediatric Cardiac Critical Care Consortium to collect data supplemental to its main registry to study the relationship between Vasoactive-Inotropic Score and outcome after infant cardiac surgery, and facilitated efficient data collection with 391 infants prospectively enrolled across four centres in 5 months.Reference Gaies, Jeffries and Niebler 16
Collaboration/partnering between databases
Data can also be shared or linked through collaboration and partnering between different organisations and data sets – for example, the Society of Thoracic Surgeons and the Congenital Cardiac Anesthesia Society recently collaborated to add an anaesthesia section to the surgical data collection forms.Reference Vener, Jacobs, Schindler, Maruszewski and Andropoulos 17 Anaesthesia data are now collected, harvested, reported, and analysed along with surgical data for participating centres. This approach was likely more time-efficient and cost-efficient than creating a separate anaesthesia database in which many of the fields regarding patient characteristics and the operative procedure would have been duplicated between databases. Determining data access, sharing, and governance policies between organisations is important in this type of approach.
Expanding linkages
As described in the preceding sections, most current linkages have involved 1:1 linkages of a certain data set to another to answer a specific question. More comprehensive integration of data across multiple sources would be desirable in order to reduce data entry burden, facilitate research, most efficiently utilise available information, and promote longitudinal outcomes assessment. In order to facilitate such linkages, both information technology solutions and further collaboration among stakeholders are necessary.
An option may be the creation of a global unique identifier – known as a GUID – and collaboration among researchers and professional societies to share and merge data sets containing these identifiers at the national level.Reference Pearson, Kaltman and Lauer 18 Developed by the autism research community, the global unique identifier allows multiple linkages and also maintains privacy. It is generated based on a set of identifiers unique to the patient, and undergoes encryption before being shared with a central system so that identifiers are never transmitted or stored outside the local site.Reference Pearson, Kaltman and Lauer 18 In autism research, the global unique identifier is used to track patients between various research data sets.Reference Johnson, Whitney and McAuliffe 19 Downsides of the global unique identifier are that some of the data elements required to generate it in its present form are not necessarily found in the medical record and require direct patient contact – for example, the data element of “city of birth”. This may not be feasible in the work flow of large registries, which generally capture data directly available in the medical record or other existing sources. In addition, in order to facilitate linkages, the global unique identifier must not only be generated and incorporated into individual data sets, but professional societies and researchers must also agree to collaborate and share their data sets with a central repository so that linkages can be made and analyses performed.
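The general idea of deriving a shareable identifier locally, so that the underlying identifiers never leave the site, can be sketched as follows. This is not the algorithm used by the autism research community: a keyed one-way hash (HMAC-SHA256) is swapped in here for whatever encryption the actual system applies, and the field list, normalisation rules, and secret key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical input fields; "city_of_birth" is the element cited in the text.
FIELDS = ("first_name", "last_name", "date_of_birth", "sex", "city_of_birth")


def normalise(value):
    """Collapse case and whitespace so trivial typing differences do not matter."""
    return " ".join(value.strip().casefold().split())


def pseudonymous_id(patient, secret_key):
    """Derive a one-way identifier locally; only this value would be shared."""
    message = "|".join(normalise(patient[field]) for field in FIELDS)
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: every participating site holds the same secret key, so the same
# patient yields the same identifier wherever the record is entered.
SECRET_KEY = b"hypothetical-key-held-by-participating-sites"

site_one = {"first_name": "Ana", "last_name": "Diaz", "date_of_birth": "2010-04-02",
            "sex": "F", "city_of_birth": "Boston"}
site_two = {"first_name": " ana ", "last_name": "DIAZ", "date_of_birth": "2010-04-02",
            "sex": "f", "city_of_birth": "boston"}
other = dict(site_one, city_of_birth="Denver")

id_one = pseudonymous_id(site_one, SECRET_KEY)
id_two = pseudonymous_id(site_two, SECRET_KEY)
id_other = pseudonymous_id(other, SECRET_KEY)
```

In this sketch, `id_one` and `id_two` are identical despite differences in capitalisation and spacing, whereas `id_other` differs. It also shows the practical drawback noted above: every input field, including city of birth, must be available at each site for the identifier to be generated.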
An alternative option involves supporting local linkages and the creation of a “distributed data network”.Reference Toh, Platt, Steiner and Brown 20 Local linkages between data sources are feasible because most often research and registry data also reside locally at each participant site’s institution in addition to being aggregated into larger multi-centre databases. Local linkages are relatively easy to perform as direct or unique identifiers are readily available. Merged local data sets can then be de-identified, and groups of institutions or heart centres can collaborate to share and aggregate information. Alternatively, data may be recorded at each site and standard algorithms can be developed to query and analyse the data. This approach addresses some of the limitations identified with the use of a global unique identifier, and makes linked information available for both local purposes and aggregate research, but would require more investment at the local level. Either approach would also need to address the limited current data available regarding basic long-term outcomes such as survival and quality of life. In addition, data sharing and governance policies would need to be developed with either approach.
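The second variant, in which a standard algorithm is run at each site and only its results are pooled, might look like the sketch below. The function names, the `died` field, and the choice of mortality as the summary are illustrative assumptions rather than features of any existing network.

```python
def local_summary(site_records):
    """Run at each site: return aggregate counts only, never patient-level rows."""
    deaths = sum(1 for record in site_records if record["died"])
    return {"patients": len(site_records), "deaths": deaths}


def pooled_mortality(summaries):
    """Run centrally: combine the per-site counts into one overall proportion."""
    patients = sum(summary["patients"] for summary in summaries)
    deaths = sum(summary["deaths"] for summary in summaries)
    return deaths / patients if patients else None


# Invented data held locally at two sites.
site_a = [{"died": True}, {"died": False}, {"died": False}]
site_b = [{"died": False}, {"died": False}]

summaries = [local_summary(site_a), local_summary(site_b)]
overall = pooled_mortality(summaries)  # 1 death among 5 patients
```

Because only counts cross institutional boundaries, the same query can be rerun as local data sets grow, at the cost of each site maintaining its data in the format the shared query expects.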
Further data integrations in the future may also be facilitated by technologies such as social media and mobile devices, which allow more efficient engagement with patients and collection of patient-reported quality of life and functional outcomes.Reference Schumacher, Stringer and Donohue 21 Better integration with the electronic health record may also allow for further linkages and reduce data entry burden; however, these efforts will require additional work to improve the quality and standardisation of the data currently contained in the electronic record. There are also several ongoing efforts to better collect and integrate real-time monitoring data across intensive care units and other settings. 22 Integration of these data with clinical outcomes data may allow for improved prediction and treatment of adverse events.
With the expansion in the number and types of data sets and linkages, it remains important to consider several key factors regarding data collection and analysis to ensure accurate scientific investigation. These include issues related to accuracy and completeness of data, appropriate case ascertainment, standardisation – or lack thereof – of data elements, capture, and definitions, as well as the availability of variables within the data set to perform appropriate risk adjustment or adjustment for differences in case mix. The use of linked or integrated data sources does not necessarily mitigate any of these important issues.
Databases and registries on paediatric heart failure and transplantation
Paediatric heart failure and transplantation is uncommon and heterogeneous. It affects ~12,000–35,000 children in the United States of America each year, and encompasses patients with a variety of diagnoses.Reference Hsu 23 Owing to the small sample size and heterogeneity of diagnoses,Reference Davies, Russo and Hong 24 single-centre studies can provide only a limited view of these patients. Multi-institutional data sets provide an important opportunity to investigate the treatment and outcomes of children with heart failure more broadly. Understanding the strengths and weaknesses of existing data sets is essential for critically evaluating the literature, understanding the capabilities of each database, and identifying where linkages between data sets may provide the most utility.
The Scientific Registry of Transplant Recipients/United Network for Organ Sharing Database
Known by several names,Reference Davies and Pizarro 3 the Scientific Registry of Transplant Recipients is a mandatory data set that contains records of all paediatric heart listings and transplants performed in the United States of America since 1988. Members of the United Network for Organ Sharing are required to submit to the data set as a condition of membership. 25 Data are collected at listing, at discharge from the transplant admission, and during the yearly follow-ups thereafter. The two main strengths of the data set are mandatory submission and the public availability of raw data; however, limitations in the number of variables collected are especially problematic for children, where the aetiology of heart failure and surgical history, which result in a wide spectrum of risk,Reference Davies, Russo and Hong 24 , Reference Davies, Russo, Yang, Quaegebeur, Mosca and Chen 26 – Reference Lamour, Kanter, Naftel, Morrow, Clemson and Kirklin 29 are not collected. Audited data are reliable and complete, but the utility of the data set is limited by the high frequency of missing variables in non-audited fields.Reference Davies, Russo, Morgan, Sorabella, Naka and Chen 30 In addition, over time, new variables have been added, others have been removed, and specific definitions have been changed. This can make studies including patients over a wide time span challenging. Public availability of the data set results in variability in research design, including the robustness of statistical methods and the handling of missing data.Reference Davies, Russo, Morgan, Sorabella, Naka and Chen 30 – Reference Russo, Hong and Davies 32 Therefore, critical reading of resultant publications is crucial to ensuring valid conclusions.Reference Davies and Pizarro 3 Finally, data collection during listing and following transplantation occurs at defined time points rather than being event-driven. This limits the precision of certain outcomes of significant interest, such as implantation or removal of ventricular assist devices.Reference Davies, Haldeman, McCulloch and Pizarro 33 , Reference Davies, Russo and Hong 34
International Society for Heart and Lung Transplantation
The International Society for Heart and Lung Transplantation maintains an international registry of thoracic organ transplantation. In the United States of America, data are submitted directly from the Scientific Registry of Transplant Recipients, and therefore the same strengths and weaknesses noted in the preceding section apply to this data set as well;Reference Stehlik, Hosenpud, Edwards, Hertz and Mehra 35 however, it also collects data from centres in 32 other countries, enabling international comparisons and a global perspective on thoracic transplantation.Reference Stehlik, Hosenpud, Edwards, Hertz and Mehra 35
Pediatric Heart Transplant Study Database
The Pediatric Heart Transplant Study Database is a multi-centre registry. At present, it includes 46 centres in the United States of America and five internationally, which account for 70–75% of the transplants performed in the United States of America. A historical weakness of the data set – that it contains only transplants performed at the busiest centres – is being mitigated by broader membership. Data submission is voluntary and event-driven. Data collected since 2010 include details regarding the use of mechanical support and have the potential to answer important questions regarding the outcomes among children requiring devices while awaiting transplantation.Reference Davies and Pizarro 3 Overall, this data set has more information relevant to paediatric heart transplantation compared with other transplant data sources, including diagnosis and procedural history; however, there have been multiple iterations of the data collection forms, resulting in some heterogeneity in the diagnostic categories collected.Reference Davies and Pizarro 3
Data storage and analysis for the Pediatric Heart Transplant Study Database are performed at the University of Alabama at Birmingham. In contrast to the Scientific Registry of Transplant Recipients, study approval and statistical analysis are largely centralised. 36 This provides quality control but limits the number of research projects. In addition, information regarding the frequency of missing data within the data set is not easily obtainable.Reference Davies and Pizarro 3 Despite these caveats, this data set provides the most robust available source of data regarding paediatric heart transplantation.
Pediatric Cardiomyopathy Registry
Transplant data sets have an inherent bias in that they exclude children not considered candidates for transplantation. This could be for a variety of reasons including the following: children who are too well to be transplanted, those who are too sick to be transplanted, those with other co-morbidities precluding transplantation, and those with other potential contraindications. An accurate understanding of outcomes among children with heart failure requires an understanding of how all these children do – not merely those considered candidates for transplantation.
The Pediatric Cardiomyopathy Registry is a registry funded by the National Heart, Lung, and Blood Institute. It consists of both prospective and retrospective cohorts collected in two geographic regions of the United States of America – New England and the Central Southwest – and was designed to provide estimates of the incidence of selected cardiomyopathies in children and evaluate their outcomes.Reference Grenier, Osganian and Cox 37 At present, it contains data on more than 3500 cases of cardiomyopathy in those regions.Reference Wilkinson, Landy and Colan 38 Patients were recruited from hospitals within each of these regions, and thus it is a limited data set with circumscribed geographic coverage. The New England Research Institute functions as the data and statistical coordinating centre, and research using this data set is ongoing.
INTERMACS/PediMACS
INTERMACS is a North American registry initially started as a collaboration between the member institutions, industry, and the federal government to follow up adults implanted with approved durable ventricular assist devices. In the most recent funding period, financial support is moving away from government grants and towards funding through a combination of industry and member fees.Reference Holman 39 The University of Alabama at Birmingham functions as the statistical and data coordinating centre. A variety of data is collected at regulated time points following implantation of a ventricular assist device and at the occurrence of any of a specific set of adverse events. Data include quality of life information rarely collected in other data sets.
PediMACS is the paediatric component of INTERMACS and encompasses both durable and temporary devices in children. The inclusion of temporary devices is critical in paediatrics, where the limited number of devices available often forces the use of temporary devices, and patients may remain supported primarily by “temporary” devices for months.Reference Chen, Richmond and Charette 40 Children are followed up from implant until death or transplant, or one year following explantation. More than 60 centres contribute data to PediMACS. This data set provides comprehensive data regarding clinical condition and outcome among children undergoing device implantation; however, it is limited by its relatively recent inception, as well as the present lack of long-term follow-up data.
Extracorporeal Life Support Organization
The Extracorporeal Life Support Organization is an international consortium of providers of extracorporeal life support. Members submit data regarding patients on extracorporeal support to a national registry. In the most recent year (2014), the registry collected information on over 5000 cases at over 250 centres. 41 Data are collected regarding the clinical indications and condition at the time of support initiation, the incidence of complications during the run, as well as survival to hospital discharge. Data requests can be submitted by any active member. The data set has strengths with regard to extracorporeal support data, but it lacks detailed information regarding cardiac diagnoses and procedures, and also lacks long-term follow-up information. In addition, missing data can be problematic with some variables missing in over 20% of cases.Reference Rajagopal, Almond, Laussen, Rycus, Wypij and Thiagarajan 42 , Reference Thiagarajan, Laussen, Rycus, Bartlett and Bratton 43
Potential benefits of linking paediatric heart failure and transplantation data sets
Linkage of databases containing complementary data could expand the potential for research and capitalise on the strengths of various heart failure and transplant data sets, as well as other data sets in the field of paediatric cardiology and cardiac surgery containing information on this patient population – for example, the United Network for Organ Sharing data set contains long-term follow-up data but little specificity with regard to congenital diagnoses.Reference Davies, Russo, Yang, Quaegebeur, Mosca and Chen 26 In contrast, the Society of Thoracic Surgeons Congenital Heart Surgery Database has extensive information regarding congenital diagnoses and procedures but no information about long-term follow-up. A complete understanding of the long-term outcomes following transplantation in specific congenital heart diagnoses could be supported through combining information from both these data sets. Similarly, information regarding the use of mechanical circulatory support, including dates of implantation and explantation and specific support types, has not been available within the United Network for Organ Sharing data set or the Pediatric Heart Transplant Study data set, until recently, but is more complete within INTERMACS/PediMACS. Combining information from these data sets would provide a better understanding of the short-term impact of ventricular assist devices, as well as whether duration of support, conversion from one device to another, or weaning from mechanical support affect either early or late post-transplant outcomes. Furthermore, although the list of publications resulting from the United Network for Organ Sharing and Pediatric Heart Transplant Study data sets is extensive, both data sets contain only patients who were listed for transplant. 
There are populations of children at risk for heart failure who are often not candidates for transplantation, including children with recent cancer, those with certain forms of muscular dystrophy, and those with a high risk of noncompliance. Especially as technological advances in mechanical circulatory support expand the number of children who are potentially supportable with ventricular assist devices, an understanding of outcomes with both medical and surgical management of heart failure is critical to optimising treatment options and supporting optimal quality and length of life. In each of these cases, no single data set contains the relevant information. Only by linking data sets with diverse purposes, variables, and populations can these and other important questions be answered.
Conclusions
Linkages across a variety of data sets in paediatric cardiovascular disease are possible and can involve several different methodologies. Expanding these linkages and applying similar methodology to the variety of existing paediatric heart failure and transplant data sets could facilitate answering important scientific questions in this area, which cannot be answered with single data sets alone at present.