
Practice to data

In a Learning Health System of any scale, informatics provides an opportunity to learn from every patient who is treated. The first step is to collect and assemble data that accurately represents what is happening within the system. This can include data on patients that is generated within healthcare organisations or elsewhere, as well as data on staff, facilities, finance and the environment.

Data Capture

There are different approaches to collecting data [40]. In an Electronic Health Record (EHR), highly structured input fields can be used to capture structured data, or Natural Language Processing (NLP) can be applied to make sense of free text [1] (see Optum Labs Box). NLP systems vary in their performance in different contexts and do not currently offer an acceptable solution across all settings. Some providers that lack electronic records still rely on clinical coders to extract data from handwritten patient notes. Structured data, such as genomic or pathology results, can flow from labs when coding standards are agreed.

Optum Labs

Optum Labs is a UnitedHealth-owned US research partnership that has access to the health records of over 100 million patients. It found that clinicians prefer not to use structured fields and often leave them blank. Accordingly, Optum opted to codify around 70% of its data from EHRs using NLP [40]. This real-world evidence has been used to validate and complement Randomised Controlled Trials.

Data collected outside healthcare settings can be useful: for example, from smartphones, apps, wearable technology, online communities and social media [41]. The Covid-19 health system response has shown how patients can be supported remotely, with data on their vital signs (eg, blood oxygen levels) flowing back to clinicians in real time via mobile apps [42].

It was thought that these many sources of data could be routed back into EHRs, which would act as central repositories for each provider. However, issues with performance, the ease of access to interfaces for importing and extracting data to and from EHRs, and the ongoing challenges of interoperability between EHRs have limited this approach. Many providers rely on data warehouses to hold data from multiple sources, and other platforms have emerged to manage data flowing in from outside the clinic and between multiple organisations. Many now advocate completely separating the EHR front end from the data.

The focus is often on quantitative data, but it is not always possible to measure some important factors through quantitative metrics [43]. It is therefore also vital that a Learning Health System can capture qualitative data, such as experience and narrative. This can be achieved through a number of methods, such as interviews and focus groups, as well as mass-participation tools for capturing and visualising narratives from across a population [44]. Health systems can also embed opportunities for comment and feedback into routine practice, such as providing feedback links in electronic decision support screens or free text options in staff or patient surveys.

Data Quality and Provenance

Within a Learning Health System, data are analysed to generate knowledge that can be applied to practice. If the data driving that process are inaccurate, incomplete or out of date, the insights generated will be incorrect, resulting in poor decisions and possibly even harming patients. Data are often derived from multiple sources and may undergo processing by multiple parties. The technique of describing the history of data items (where they came from, how they came to be in their current state and who has acted upon them) is known as data provenance [45].
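The three elements of provenance described above (origin, transformations and actors) can be sketched as a small data structure. This is a minimal illustration only: the class names, field names and the example weight conversion are assumptions made for this sketch, not part of any provenance standard cited here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceEvent:
    """One step in the history of a data item (field names are illustrative)."""
    actor: str           # who acted on the item
    action: str          # what they did to it
    timestamp: datetime  # when they did it


@dataclass
class DataItem:
    name: str
    value: float
    unit: str
    source: str                                  # where the item came from
    history: list = field(default_factory=list)  # ordered ProvenanceEvent entries

    def apply(self, actor, action, new_value, new_unit):
        """Change the value and record who changed it and how."""
        self.history.append(
            ProvenanceEvent(actor, action, datetime.now(timezone.utc))
        )
        self.value = new_value
        self.unit = new_unit


def lineage(item):
    """Return a readable trail: the source first, then each recorded step."""
    steps = [f"{e.actor}: {e.action}" for e in item.history]
    return [f"source: {item.source}"] + steps


# Hypothetical example: a weight recorded in pounds, converted on loading.
item = DataItem("weight", 154.0, "lb", source="ward admission form")
item.apply(
    "data warehouse load job",
    "converted pounds to kilograms",
    round(154.0 * 0.45359237, 1),
    "kg",
)
```

Calling `lineage(item)` then shows the original source followed by each processing step, which is the kind of trail an analyst would need in order to judge whether a value can be trusted.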

Routinely collected data are particularly prone to poor quality because they are often generated as a by-product of care provision. Those generating the data are often unaware of how they will be used and interpreted beyond direct care. Improving data quality requires a multilevel, sociotechnical effort with patients and staff at the front line, clinical coders, system designers and analysts (see DQMI Box).

Data Quality Maturity Index

The Data Quality Team within NHS Digital have created the automated Data Quality Maturity Index (DQMI), which is used to assess the quality of data submitted by providers across four metrics:

• Coverage: has data been received from all expected data suppliers?
• Completeness: do data items contain all expected values?
• Validity: does data satisfy the agreed standards and business rules?
• Use of default values: percentage of times that the default value is selected.

Results of the Data Quality Maturity Index are published and fed back to data providers to drive improvements.
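To make the four metrics concrete, the sketch below computes a simple proportion for each over a toy data set. It is a hypothetical simplification and does not reproduce NHS Digital's actual DQMI methodology; the function name, record layout, allowed codes and the choice of "X" as the default value are all assumptions.

```python
def dq_metrics(records, field_name, valid_values, default_value, expected_suppliers):
    """Return four simple proportions for one data item.

    Hypothetical simplification; not the published DQMI calculation.
    """
    suppliers_seen = {r["supplier"] for r in records}
    values = [r.get(field_name) for r in records]
    populated = [v for v in values if v not in (None, "")]
    valid = [v for v in populated if v in valid_values]
    defaults = [v for v in valid if v == default_value]
    expected = set(expected_suppliers)
    return {
        # Share of expected suppliers that submitted anything at all
        "coverage": len(suppliers_seen & expected) / len(expected),
        # Share of records where the field is populated
        "completeness": len(populated) / len(values),
        # Share of populated values that are allowed codes
        "validity": len(valid) / len(populated) if populated else 0.0,
        # Share of valid values that are the default code
        "default_use": len(defaults) / len(valid) if valid else 0.0,
    }


records = [
    {"supplier": "A", "sex": "F"},
    {"supplier": "A", "sex": "M"},
    {"supplier": "B", "sex": "X"},       # assumed default ("not known") code
    {"supplier": "B", "sex": "female"},  # populated but not an allowed code
    {"supplier": "B", "sex": None},      # missing
]
result = dq_metrics(records, "sex", {"F", "M", "X"}, "X", ["A", "B", "C", "D"])
```

In this toy example only two of the four expected suppliers submitted data, so coverage is 0.5, while completeness is 0.8 and validity is 0.75. None of these figures says whether a recorded value is actually true of the patient.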

This index cannot, however, detect several important aspects of data quality. For example, it would not necessarily pick up inaccurate data flowing from a provider. This currently requires familiarity with the data set, the providers, historical data and the clinical environment where the data are generated.

Geisinger Health System

Geisinger Health System is a leading US non-profit integrated healthcare system. It has aspired to become a Learning Health System since 2014 and has embraced patient-engaged research. When investigating data quality, the Geisinger team found that the most accurate data items tend to be those that are most frequently viewed, and that are viewed by several stakeholders. Sharing data with patients has therefore proved an effective way of improving data quality.

Data Storage and Access

In Learning Health Systems that span more than one organisation, data can be stored and accessed through centralised or distributed networks. In a centralised network, the data are uploaded, in real time or periodically, to a central repository, from which they can be accessed for secondary uses. Hospital Episode Statistics (HES) data from the English NHS is an example of this approach. Each hospital in the English NHS submits data to NHS Digital on a monthly basis. These data are combined into a centralised database, elements of which can be disseminated to others [46].

A benefit of this approach is that it simplifies access to data, as only one location needs to be queried. However, although patient data can be de-identified, centralised systems can raise security, proprietary, legal and privacy concerns. Patients can sometimes be reidentified, and the provider, who may be legally responsible for the security of the data it has collected, loses operational control [47].

By contrast, a distributed network configuration leaves the data holder in control of its protected data. The US Food and Drug Administration (FDA)’s Sentinel program [48], an active surveillance system for monitoring the safety of FDA-regulated medical products, is a good example.

Through this approach, queries are sent to each node (organisation) in the network. Each node returns (often aggregated) results to a coordinating centre, using an agreed common data model. A mapping must be agreed between each node and the common data model, if the individual nodes represent data differently. Moreover, multiple queries are needed to obtain complete data.
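The query flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how Sentinel is implemented: the two-field common data model, the `Node` class, the local coding schemes and the example rows are all hypothetical.

```python
from collections import Counter


def to_common_model(row, mapping):
    """Translate one locally coded row into the shared representation."""
    return {
        "sex": mapping["sex_codes"][row[mapping["sex_field"]]],
        "exposed": row[mapping["exposure_field"]] == mapping["exposed_value"],
    }


class Node:
    """A data holder that keeps its patient-level rows private."""

    def __init__(self, name, rows, mapping):
        self.name = name
        self._rows = rows
        self._mapping = mapping

    def run(self, query):
        """Run the query locally; only aggregate counts leave the node."""
        mapped = (to_common_model(r, self._mapping) for r in self._rows)
        return Counter(query(r) for r in mapped)


def exposure_by_sex(row):
    """The query, written once against the common data model."""
    return (row["sex"], row["exposed"])


def coordinate(nodes, query):
    """Coordinating centre: send the query to every node and sum the results."""
    totals = Counter()
    for node in nodes:
        totals.update(node.run(query))
    return totals


# Two nodes that represent the same facts differently (hypothetical data).
node_a = Node(
    "A",
    [{"gender": 1, "drug": "Y"}, {"gender": 2, "drug": "N"}],
    {"sex_field": "gender", "sex_codes": {1: "male", 2: "female"},
     "exposure_field": "drug", "exposed_value": "Y"},
)
node_b = Node(
    "B",
    [{"sex": "F", "on_drug": True}, {"sex": "F", "on_drug": False},
     {"sex": "M", "on_drug": True}],
    {"sex_field": "sex", "sex_codes": {"F": "female", "M": "male"},
     "exposure_field": "on_drug", "exposed_value": True},
)
totals = coordinate([node_a, node_b], exposure_by_sex)
```

The per-node mapping is where most of the effort lies in practice: the query itself is written once, but each organisation must agree how its local codes translate into the common model.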

Distributed implementation is therefore more complex than the centralised approach. However, it overcomes many of the privacy issues, and the participating organisations maintain operational control of their data [49]. They may even choose to review each query before releasing the data.

Data can be disseminated for analysis (bringing the data to the analysis). Alternatively, the analysis can be carried out within a secure environment operated by the data controller (bringing the analysis to the data). The latter option has become increasingly popular as a way to reduce the risk of patient-identifiable data being accessed without authorisation. Traditionally, queries were performed in a secure room, from which data could not be removed. More recently, the secure room has been replaced by a secure web-based portal, where analysis can be conducted. NHS Digital has developed such a Data Access Environment, which permits authorised organisations to analyse data and extract aggregate results, but not patient-level data [50].

Information Governance

Information Governance (IG) is critical to maintaining trust in a Learning Health System, ensuring that individuals can trust organisations to use their data fairly and responsibly [51]. Data must be obtained, held, used and shared within a robust, ethically based IG framework.

In the UK, IG is underpinned by the General Data Protection Regulation (GDPR). This European Union regulation is unaffected by Brexit, having been passed into UK law by the Data Protection Act 2018 [52]. The legislation applies to anyone who collects information about individuals; it is upheld by the Information Commissioner’s Office [53], which provides detailed guidance on its application. In the US, the use and disclosure of patients’ data is regulated by the Health Insurance Portability and Accountability Act (HIPAA) [54].

In the United States, access to identifiable “protected health information” is governed by the Health Insurance Portability and Accountability Act. Covered entities (clinicians, insurers, medical service providers, business associates) must obtain written permission from patients to share data unless it is needed to facilitate treatment, payment or operations, or for legal reasons. Most of the activities of an LHS would fall into the category of facilitating treatment or operations.

General Data Protection Regulation

Under GDPR and the UK Data Protection Act, organisations must establish and publish a basis for the lawful processing of data. There are six lawful bases [53]:

  1. Consent: The individual has given clear consent for you to process their personal data for a specific purpose.
  2. Contract: The processing is necessary for a contract you have with the individual, or because they have asked you to take specific steps before entering a contract.
  3. Legal obligation: The processing is necessary for you to comply with the law (not including contractual obligations).
  4. Vital interests: The processing is necessary to protect someone’s life.
  5. Public task: The processing is necessary for you to perform a task in the public interest or for your official functions, and the task or function has a clear basis in law.
  6. Legitimate interests: The processing is necessary for your legitimate interests or the legitimate interests of a third party, unless there is a good reason to protect the individual’s personal data which overrides those legitimate interests. This lawful basis cannot apply if you are a public authority processing data to perform your official tasks.

Data protection regulations have often been viewed as a challenge for Learning Health Systems [49]. They must be carefully considered and resourced when planning a Learning Health System.


Interoperability

Interoperability is the ability of one system to work with another. Learning Health Systems are often networks of networks, rather than single unified systems [55]. It is therefore usually necessary to share and use data that has been collected and stored in different systems. This requires standards for [56]:

  • The terminologies and classifications used to describe things that exist in the real world
  • The structure and format of data
  • The transport of data
  • The security of data

There are many different approaches to achieving interoperability [57], with the appropriate choice depending on the particular use case [1]. Moving data from one provider to another so that it can be viewed is relatively straightforward: it does not matter how the data is transmitted, provided it is appropriately labelled. However, if the data is to be analysed, it needs to be standardised, which is much more difficult [1].

Even within a single organisation, there are often multiple separate systems that are divided by speciality and function, such as pharmacy, radiology, or laboratory. This presents problems for linking a person’s data together and can lead to duplication of data. It gives rise to the need for many “plumbing” technologies [58]. Considerable work is therefore involved in generating a longitudinal record, even for an individual. Longitudinal data presents an additional problem because systems and data representations often change over time [59]. The nations of the UK each have an NHS number that facilitates this process, while in the US, there is no universal healthcare identifier.

Data collection is often the first point of failure. While true interoperability requires consistent data collection, there is often no agreement on acceptable values. For example, biochemistry results are recorded differently in different places. These could be standardised on their way into the EHR, rather than requiring complex mapping steps at a later stage [49].
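As an illustration of standardising results on their way into the record, the sketch below converts biochemistry values to a single agreed unit per analyte at the point of entry. The choice of target units, the function name and the lookup table are assumptions for this example; the conversion factors are the commonly used approximations (1 mmol/L of glucose is about 18.016 mg/dL, and 1 mg/dL of creatinine is about 88.4 µmol/L).

```python
# Assumed target unit for each analyte (hypothetical local agreement).
TARGET_UNITS = {"glucose": "mmol/L", "creatinine": "umol/L"}

# Multipliers from a reported unit to the target unit (approximate values).
FACTORS = {
    ("glucose", "mmol/L"): 1.0,
    ("glucose", "mg/dL"): 1 / 18.016,
    ("creatinine", "umol/L"): 1.0,
    ("creatinine", "mg/dL"): 88.4,
}


def standardise(analyte, value, unit):
    """Convert a result to the agreed unit, or refuse if no rule exists."""
    try:
        factor = FACTORS[(analyte, unit)]
    except KeyError:
        raise ValueError(f"No agreed conversion for {analyte} in {unit}") from None
    return round(value * factor, 1), TARGET_UNITS[analyte]


glucose = standardise("glucose", 90, "mg/dL")
creatinine = standardise("creatinine", 1.0, "mg/dL")
```

Rejecting unknown units outright, rather than storing them unchanged, is a deliberate choice in this sketch: it surfaces the lack of an agreed value at the point of collection instead of leaving it to be discovered during later mapping.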

In England, the NHS Digital Data Standards Team (IReS) [60] maintains terminology and classification products for the nation. These include:

  • SNOMED CT: A structured clinical vocabulary used in EHRs [61]
  • The Dictionary of Medicines and Devices (dm+d): Identifiers and descriptors of licensed medicines
  • ICD-10: A classification of diagnoses [63]
  • OPCS-4: A classification of interventions and surgical procedures [64]

NHS Digital has legal powers in England to mandate the submission of data by providers, in a pre-set format at given intervals [65]. This overcomes many interoperability issues [65]. However, when extracting richer data from EHRs and other systems, the coding structures used by providers can create major challenges [66].

At the regional level in England, there have been efforts [199] to create interoperability networks between providers [200]. More recently, many of these have been funded centrally through the Local Health and Care Records (LHCR) programme. However, each region has taken a different approach to sharing data for direct care and secondary uses [21]. There are opportunities for researchers, policymakers, managers and industrial partners to collaborate around the new LHCR infrastructure.

In addition to those administered by NHS Digital, other standards commonly applied within LHSs include:

Terminologies and classifications:

  • LOINC: A classification of health measurements (eg, laboratory studies, radiology results, vital signs)
  • CPT: Billing codes for procedures

Structure and format:

  • Consolidated-CDA: an XML-based standard format for export of clinical data
  • OMOP: A clinical data model to organise data into different domains (eg, diagnoses, laboratory studies, medications)
  • i2b2: A clinical data warehousing and analytics research platform primarily used by researchers


Security:

  • HTTPS: a secure protocol for data transport

Transport:

  • FHIR: A standard for defining data resources and transmitting them via an API
  • HL7: An older standard for defining and transmitting data
  • DICOM: the standard format for transmission of imaging data
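To show how the terminology and transport standards listed above fit together, the sketch below builds a FHIR Observation resource for a heart rate reading, coded with LOINC and with units expressed in UCUM. It only constructs and serialises the resource and does not contact any server; the function name, the patient identifier and the timestamp are hypothetical.

```python
import json


def heart_rate_observation(patient_id, beats_per_minute, when_iso):
    """Build a minimal FHIR Observation for a heart rate reading."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [
                {
                    "system": "http://loinc.org",  # terminology: LOINC
                    "code": "8867-4",
                    "display": "Heart rate",
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": when_iso,
        "valueQuantity": {
            "value": beats_per_minute,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",  # units: UCUM
            "code": "/min",
        },
    }


# Hypothetical reading; a FHIR API would typically receive this JSON over HTTPS.
payload = json.dumps(
    heart_rate_observation("example", 72, "2020-04-01T09:30:00Z"), indent=2
)
```

Because the code, units and subject are all expressed through shared standards, a receiving system can interpret the reading without knowing anything about the system that produced it.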

There is a long-held view that there are insufficient requirements or incentives in place to achieve interoperability [58]. In the US, the Office of the National Coordinator for Health IT (ONC) has implemented regulations set out in the 21st Century Cures Act (2016) [67], which aim to improve interoperability by using an Application Programming Interface (API) approach [59]. While many in the industry have welcomed this move, it has been opposed by some EHR vendors and others, who cite concerns over privacy, feasibility and costs [68]. Careful monitoring will be needed to determine whether this measure succeeds and whether it can provide a model for other countries [69].