Healthcare Data

Learning Healthcare Systems seek to capture and generate knowledge from the data flowing from routine care. They then feed this knowledge back into the healthcare system in a way that changes the behavior of actors to improve outcomes. They are fueled by routinely collected data. Much of that data currently comes from sources such as Hospital Episode Statistics (HES) in England, claims data in the US and clinical records. These records are generally organised around the provider rather than the patient (Porter and Teisberg 2006), so a complete picture requires input from primary care, secondary care, payors/commissioners, patients and other systems.

As well as outcomes measures, Learning Healthcare System use cases will require contextual data (patient and other factors) and data relating to the interventions provided. Much of this is recorded in the medical record, traditionally a paper file associated with each patient. Increasingly, these records are being moved onto electronic systems. However, digitisation does not necessarily make the information amenable to analysis (Payne, Corley et al. 2015).

Capturing data

There is almost always a reduction in the depth of meaning when data is recorded during a clinical encounter. In other words, what is recorded often does not reflect exactly what took place. For example, important negatives are often not recorded and context is lost (Foley and Fairmichael 2015). There are doubts about whether the data recorded in the course of treatment is currently rich enough to support statistically firm conclusions (Loder 2015).

Poor data input means that, at best, complex mapping and parsing is required to make sense of it and, at worst, the data is not suitable for secondary uses within the Learning Healthcare System (Brown 2015). There are differing approaches to obtaining data from an EHR (Wallace 2015). Highly structured input fields can be used to ensure high data quality or Natural Language Processing (NLP) can be applied to make sense of free text (Wallace 2015). Currently, much HES data is entered by non-clinical coders who must interpret handwritten paper notes.

Optum Labs, a US research partnership with access to health records for 40 million patients and claims data for a further 150 million, has used both approaches. They found that clinicians do not like to use structured fields and often leave them blank, so around 70% of their data is codified using NLP (Wallace 2015).

Participants reported that NLP is either a very useful tool already (Wallace 2015) or has the potential to progress quickly (Platt 2015), within a research context, but it was not currently considered ready for use in safety critical systems such as decision support (Brown 2015, O’Hanlon 2015).

Intelligent dynamic templating systems, which help to structure what the clinician is recording in real time, offer a hybrid approach that has proven effective in some EHR implementations (Foley and Fairmichael 2015).

Genomic data is expected to become increasingly important within Learning Healthcare Systems. Just as with data from any other source, genomic data that are standardised, comparable, and consistent would be more easily reused for discovery in multiple contexts beyond the original one (Institute of Medicine 2015). The sources of genomic data currently include tests for inborn errors of metabolism (such as newborn screening tests), chromosome studies (such as cytogenetic tests), array comparative genomic hybridization, DNA-based Mendelian disorder testing, and tumor sequencing. Many of these results are currently stored as PDFs in the EHR, making it difficult to achieve any secondary use (Institute of Medicine 2015).

Completeness

Analytics generally requires structured data that is comparable between different patients and different providers.

In the US, claims data provides the most comprehensive picture of patient interactions with the health system across multiple providers (Stowell 2015). However, claims data lacks clinical detail, which would ideally come from EHRs (Wallace 2015). Unfortunately, a single patient may be seen by several different providers, each with their own incompatible EHR, which can make it difficult or impossible to complete the data.

For example, a single hospital may be unable to complete a re-hospitalisation study with just their own data because they are unable to track patients who were admitted to another hospital. Patients who die, move away or remain healthy and are not re-hospitalised would appear the same in such a study. This creates the potential to introduce bias by only using available data (Brown 2015).
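The re-hospitalisation example can be made concrete with a minimal sketch. The discharge records, hospital labels and the `readmission_rate` helper below are hypothetical and invented purely for illustration; they are not drawn from any dataset cited here.

```python
# Hypothetical discharges: (patient_id, index_hospital, readmitted_at).
# readmitted_at is None when the patient was not re-hospitalised anywhere.
DISCHARGES = [
    ("p1", "A", "A"),
    ("p2", "A", "B"),   # readmitted elsewhere: invisible to hospital A
    ("p3", "A", None),
    ("p4", "A", "B"),   # readmitted elsewhere: invisible to hospital A
    ("p5", "A", None),
]


def readmission_rate(discharges, hospital, visible_only):
    """Share of a hospital's discharges that were followed by a readmission.

    With visible_only=True, only readmissions to the same hospital count,
    which is all a single provider can see in its own data.
    """
    index_cases = [d for d in discharges if d[1] == hospital]
    if not index_cases:
        return 0.0
    if visible_only:
        readmitted = [d for d in index_cases if d[2] == hospital]
    else:
        readmitted = [d for d in index_cases if d[2] is not None]
    return len(readmitted) / len(index_cases)


own_view = readmission_rate(DISCHARGES, "A", visible_only=True)
full_view = readmission_rate(DISCHARGES, "A", visible_only=False)
print(f"hospital A own data: {own_view:.0%}, linked data: {full_view:.0%}")
```

In this toy data, hospital A's own records understate the readmission rate because two of the three readmissions happened at another provider.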

In England, HES data, collected by the Health and Social Care Information Centre (HSCIC), provides demographic, diagnostic and intervention data for all inpatient episodes, along with very limited details of outpatient episodes and data on A&E attendances. Data is recorded through Patient Administrative Systems (PAS) at an individual provider level. This includes administrative data such as patient demographic information, as well as diagnostic and procedure codes. This is then submitted to a data warehouse called Secondary Uses Service (SUS). Elements of this data then undergo cleaning and data quality checks before being published as HES data (HSCIC 2014).

In 2013, the care.data project was due to add GP data to the collection. That phase has been delayed due to Information Governance concerns (Dunhill 2015).

Quality of data

There are concerns about the quality of comprehensive claims and HES data. For example, the error rate in hospital data in some circumstances can be greater than 20% (Manning 2015, Sharp 2014). Currently around 70% of hospital records in the UK are still handwritten and preparation of HES data is completed by human coders who effectively practise NLP at the provider level. It is often difficult to assess whether HES data accurately represents what actually happened in the clinical encounter (Manning 2015).

Incomplete data, such as under-coding of comorbidities, is also a big problem. For example, there may be a record of an intervention for diabetes, such as an amputation procedure in secondary care, but the data generated on discharge is often lacking the diabetes flag. The under-coding can be substantial, with the National Diabetes Audit estimating that approximately 30-40% of cases are not flagged (Dunbar-Rees 2015).

These errors can be compensated for by linking primary care datasets with the secondary care data sets (Dunbar-Rees 2015) or by augmenting HES data with additional manual coding from paper records (Morrow 2015). These approaches can be time consuming and can raise Information Governance concerns.
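As a minimal sketch of the linkage idea, the example below fills a missing diabetes flag in secondary care episodes from a primary care register. The field names, patient identifiers and the `fill_missing_flags` helper are all hypothetical; real linkage would use an agreed identifier and would be subject to the governance constraints noted above.

```python
# Hypothetical secondary care episodes; "diabetes_flag" is what was coded on discharge.
episodes = [
    {"patient_id": "n1", "procedure": "amputation", "diabetes_flag": False},
    {"patient_id": "n2", "procedure": "amputation", "diabetes_flag": True},
    {"patient_id": "n3", "procedure": "hip replacement", "diabetes_flag": False},
]

# Hypothetical primary care register of patients with a recorded diabetes diagnosis.
primary_care_diabetes = {"n1", "n2"}


def fill_missing_flags(episodes, register):
    """Return copies of the episodes, setting the flag where the register shows diabetes."""
    linked = []
    for ep in episodes:
        corrected = dict(ep)  # copy, so the submitted record is left untouched
        if ep["patient_id"] in register and not ep["diabetes_flag"]:
            corrected["diabetes_flag"] = True
            corrected["flag_source"] = "primary_care"  # keep provenance of the correction
        linked.append(corrected)
    return linked


linked = fill_missing_flags(episodes, primary_care_diabetes)
recovered = [ep["patient_id"] for ep in linked if ep.get("flag_source") == "primary_care"]
print(recovered)
```

Recording where a corrected flag came from keeps the original coding auditable, which matters when the same data is later used to assess coding quality.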

Geisinger Health System have found that the data items that are most frequently viewed, and that are viewed by several stakeholders, tend to be the most accurate. Sharing data with patients has proven to be an effective way of improving data quality (Foley and Fairmichael 2015).

Storage and access

In Learning Healthcare Systems that span more than one organisation, data can be stored and accessed through centralised or distributed networks. In a centralised network, the data is periodically uploaded to a central repository, from which it can then be accessed for secondary uses. HES data from the English NHS is an example of this approach. It simplifies access to data as only one location needs to be queried. The researcher may also be able to “eyeball” the underlying data. Although patient data can be deidentified, centralised systems can still create security, proprietary, legal and privacy concerns because patients can sometimes be reidentified and the provider, who may be legally responsible for the security of data that they have collected, loses operational control (Brown, Holmes et al. 2010).

The alternative, a distributed network configuration, leaves the data holder in control of their protected data. The FDA Mini-Sentinel program (an active surveillance system for monitoring the safety of FDA regulated medical products) is a good example. Queries are sent to each node (organisation) in the network. Each node returns results (often aggregated) in an agreed common data model to a coordinating centre. A mapping must be agreed between each node and the common data model, and the distributed implementation is more complex than the centralised approach. However, it overcomes many of the privacy issues and the participating organisations maintain operational control of their data (Brown 2015). They may even choose to review each query before releasing the data.
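The distributed pattern described above can be sketched in a few lines. This is an illustrative toy and not the Mini-Sentinel implementation: the node names, the `drug` and `event` fields of the common data model, and the `run_local_query` and `coordinate` functions are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical local records held at each node, already mapped to a shared
# (hypothetical) common data model with "drug" and "event" fields.
NODES = {
    "node_a": [{"drug": "x", "event": True}, {"drug": "x", "event": False}],
    "node_b": [{"drug": "x", "event": True}, {"drug": "y", "event": True}],
}


def run_local_query(records, drug):
    """Runs inside a node: returns aggregate counts only, never patient rows."""
    exposed = [r for r in records if r["drug"] == drug]
    return {"exposed": len(exposed), "events": sum(r["event"] for r in exposed)}


def coordinate(nodes, drug, approve=lambda node_name: True):
    """Coordinating centre: send the query to each node and sum the aggregates.

    `approve` stands in for a node's option to review a query before release.
    """
    totals = Counter()
    for name, records in nodes.items():
        if approve(name):
            totals.update(run_local_query(records, drug))
    return dict(totals)


result = coordinate(NODES, "x")
print(result)
```

Only the aggregated counts leave each node, so the coordinating centre never holds patient-level rows. A node that declines a query simply contributes nothing to the totals.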

Interoperability

Interoperability is the ability of one system to work with another. Learning Healthcare Systems are often networks of networks, rather than a single unified system (Rubin 2010). It is therefore necessary to share and use data that has been collected and stored in different systems. This requires standards for (ONC 2014):
• The terminology used to describe things that exist in the real world
• The content and format of data
• The transport of data
• The security of data

There is no perfect solution to the challenge of interoperability. There are many different approaches and the one that is appropriate depends on the particular use case (Wallace 2015). Interoperability for research and interoperability for patient management are very different things. To simply move data from one provider to another, so that it can be viewed, is relatively straightforward. It does not matter how it is transmitted, providing that it is labelled properly. However, if it is to be analysed, then it needs to be standardised and this is much more difficult (Brown 2015). A feature of the Learning Healthcare System concept is that clinical care and research would become harder to distinguish (Foley and Fairmichael 2015).

Even within one organisation, there are often multiple separate systems that are divided by speciality and function, such as pharmacy, radiology, laboratory etc. This presents problems for linking an individual person’s data together and can lead to duplication of data. This gives rise to a need for a lot of “plumbing” technologies (Foley and Fairmichael 2015). Therefore, there is a lot of work involved in generating a longitudinal record, even for an individual. This can be automated, but each system requires a different approach (Foley and Fairmichael 2015). Longitudinal data presents an additional problem, because systems and data representations often change over time (Foley and Fairmichael 2015). The UK NHS number facilitates this process, while in the US, there is no universal identifier for healthcare.

Institutions with electronic records do not necessarily have good data. Typically, datasets have a lot of inconsistencies and must be checked and cleaned before secondary uses can be exploited (Foley and Fairmichael 2015).

The first point of failure usually occurs at the point of collection. True interoperability requires consistent data collection; however, there is often no agreement on what the acceptable values are. For example, biochemistry results are recorded differently in different places. These could be standardised on their way into the EHR, rather than requiring complex mapping and parsing steps at a later stage (Brown 2015).
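As a hedged illustration of standardising a result on its way into the record, the sketch below converts glucose values to a single unit at the point of entry. Glucose is chosen only as an example analyte; the `glucose_to_mmol_per_l` function and the incoming values are hypothetical. The conversion factor follows from the molar mass of glucose (about 180.16 g/mol).

```python
# Glucose has a molar mass of about 180.16 g/mol, so 1 mmol/L is about 18.016 mg/dL.
MG_DL_PER_MMOL_L = 18.016


def glucose_to_mmol_per_l(value, unit):
    """Convert a glucose result to mmol/L, rejecting units that are not recognised."""
    normalised = unit.strip().lower()
    if normalised == "mmol/l":
        return float(value)
    if normalised == "mg/dl":
        return value / MG_DL_PER_MMOL_L
    # Refusing unknown units at entry avoids guessing during later analysis.
    raise ValueError(f"unrecognised glucose unit: {unit!r}")


# Hypothetical results arriving from two sites that report in different units.
incoming = [(5.4, "mmol/L"), (97.0, "mg/dL")]
standardised = [round(glucose_to_mmol_per_l(v, u), 1) for v, u in incoming]
print(standardised)
```

Rejecting an unrecognised unit at the point of entry is one possible design choice; a real system might instead queue such results for manual review.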

One of the biggest challenges to interoperability is in the semantics, the meaning, of the data. For research, semantics are often more important than in clinical practice because it is crucial to know exactly what a piece of data refers to and how it was measured when comparing across different systems. For example, in the US, there is not an agreed way to define certain types of “visits” and it is often difficult to distinguish between a hospitalisation and an emergency department visit. From outside, it is difficult to make sense of that data. The metadata is often not good enough to describe the dataset or data streams for appropriate use in research. This makes it difficult for the researcher who is trying to re-use this data (Brown 2015).

In the UK, HES, SUS and Read code data are standardised across providers. This overcomes many of the interoperability issues (Morrow 2015). However, when extracting richer data from EHRs and other systems, the coding structures used by providers can create major challenges (Manning 2015). Different coding structures can also make it difficult to link primary and secondary care data. These definitional issues make connectivity of data very challenging (Manning 2015).

There are problems with how EHRs are deployed by different vendors that can make it difficult to share data across provider settings (Simpson 2015). Often the functionality reflected in a particular implementation of an EHR locks its data into that particular use case and scale. It may be highly specialised and is not easily adapted to other purposes. The assumption is often made that huge amounts of data in EHR can be easily processed, but it is simply not always possible (Foley and Fairmichael 2015).

EHR vendors recognise the potential of secondary uses, but they argue that interoperability is not their primary task and that they do not exist to supply data for research purposes. They highlight the fact that there are significant costs incurred from providing data sharing from host systems and that it can compromise their primary objective of providing an uninterrupted clinical records service (O’Hanlon 2015, Foley and Fairmichael 2015). This perspective supports the view that currently, there are insufficient requirements or incentives in place to achieve interoperability (Simpson 2015).

This is recognised by the Institute of Medicine (IoM), who are exploring open source platforms and open APIs. The IoM does not have the mandate to impose a technical solution, so it is trying to achieve consensus. It would be for government to insist that any system, funded from the public purse, should satisfy agreed interoperability requirements (McGinnis 2015). The US Office of the National Coordinator for Health Information Technology has published a 10-year vision to achieve an interoperable health IT infrastructure, capable of supporting a Learning Healthcare System (ONC 2014).

There was consensus that standards have a significant role to play in ensuring data quality and interoperability. It was agreed that the focus of standards should be on how data is collected and processed rather than what is collected, so that standards do not have to be continually rewritten (Foley and Fairmichael 2015).

A central body driving standards is important to progressing this field. The lack of such a body in the past has meant that commercial entities were left to develop their own standards and as competitive organisations, they do not necessarily have the public good as their only concern (Foley and Fairmichael 2015).

In the UK, HSCIC is the industry standards body. They do not seek to control how hospitals handle data within their own systems but they do see it as their role to specify the format of external data exchange. HSCIC is establishing standards for data exchange and reporting. These will be based on the Academy of Medical Royal Colleges documentation for interchanging data (Manning 2015). There was broad support, from our focus group, for the progress that HSCIC have already made (Foley and Fairmichael 2015). There is a role for government in providing legislation and a contractual framework to promote the adoption of standards (Foley and Fairmichael 2015); however, the standard setting process should be as inclusive as possible (Foley and Fairmichael 2015).

Evidence:

Distributed health data networks: a practical and preferred approach to multi-institutional evaluations of comparative effectiveness, safety…

By Brown, J. et al. Abstract. BACKGROUND: Comparative effectiveness research, medical product safety evaluation, and quality measurement will require the ability ...

Biobanks and Electronic Medical Records: Enabling Cost-Effective Research

Data Saves Lives

Enabling a learning health system

NHS Data Collections as a platform for a Learning Health System

NHS England – Global Digital Exemplars

East London Patient Record

Cambridge University Hospitals NHS Foundation Trust (CUH)

Digitising UK General Practice (Section of Wachter Review)

Dr Gerry Morrow Interview

Mr John Loder Interview

Dr Shaun O’Hanlon Interview

Mr Kingsley Manning Interview

Technical Feasibility Focus Group

Improving the Underlying Data Focus Group

Dr Rupert Dunbar-Rees Interview

Dr Lisa Simpson Interview

Dr Michael McGinnis Interview

Dr Caleb Stowell Interview

IBM Watson Site Visit

Professor Richard Platt Interview

Dr Jeff Brown Interview

Dr Paul Wallace Interview

Mr Joshua Rubin Interview

Site visit to Geisinger Health System

Dr Know: A Knowledge Commons in Health

Implementing the Learning Health System: From Concept to Action

Achieving a Nationwide Learning Health System
