Once data are consistently obtained in a standardised, comprehensive, exchangeable, analysable form, they must be used to derive knowledge. Within Learning Health Systems and clinical informatics more generally, the emphasis has often been on the collection, storage, analysis and dissemination of data. Too often, health systems collect reams of data but lack the means of converting them into reproducible, generalisable knowledge. This section will signpost some important methods for deriving knowledge from data.
Data and knowledge
Knowledge is the insight produced by processing data. The data might be of any form, and the processing could be conducted by humans or machines. Knowledge can include instructions to complete a task, a predictive model, the results of an experiment, a physical law, a practice guideline, or anything else that can be known.
Knowledge can be generated using a wide range of qualitative and quantitative methods. In a Learning Health System, the data collected are often themselves the product of a knowledge-generating process. For example:
- Data that incorporates clinician judgment, assessment, and expertise, such as structured reasons for ordering a test or stopping a medication
- Data that synthesises more basic data elements, such as risk scores or predictions
- Data that indicates trajectory or trend
- Data that patients directly contribute and that reflects lived experience, such as patient-reported outcomes, experience and satisfaction
- Data that reflects staff experience
- Data generated through randomisation or quasi-randomisation to minimise bias
NYU Langone CTPA
At NYU Langone Health, a clinical decision support tool was created to help clinicians optimise the ordering of CT pulmonary angiograms for the detection of pulmonary emboli. The tool incorporated an automatic risk calculation. During the design phase, clinicians noted that they trusted their clinical intuition more than the risk score. The decision support tool therefore incorporated specific reasons for ignoring the risk score, including “high clinical suspicion”. In a number of cases, patients with low risk scores were still sent for tests, based on high clinical suspicion; subsequent analysis showed that these patients were just as likely to have a pulmonary embolism as those with high risk scores. This indicated that “high clinical suspicion” was a highly valuable predictive data element.
Randomisation and the Learning Health System
Among the most sophisticated methods of creating value-added data is to embed trials within routine care. Randomised trials are now the mainstay of knowledge generation in clinical medicine and have been widely adopted by other industries, rebranded as A/B testing. Randomisation is particularly valuable in situations in which observational data are likely to carry substantial bias (eg selection bias, regression to the mean, differential loss to follow-up). While randomised studies remain rare in health system operations and implementation, there are many opportunities for them beyond tests of novel therapeutics. Interventions can be studied versus usual care (no additional intervention).
However, healthcare providers sometimes object to randomisation because they do not want to withhold a perceived beneficial intervention. Capacity-constrained interventions (such as intensive case management or post-discharge telephone calls) are excellent opportunities to randomise while still providing the intervention to the same number of people: converting data collected on a convenience (and often biased) sample into data collected on an unbiased, randomised population. Interventions being given to all those eligible are also excellent candidates for randomisation without withholding care, by randomising the intervention’s form, content, frequency or delivery.
Randomisation can take place at any level (patient, clinician, hospital unit, ambulatory practice, health system) depending on the target of the intervention. Randomisation can be simple (intervention vs none) or can involve more complex designs. Factorial designs allow simultaneous testing of multiple different interventions – important given the multicomponent nature of many health system interventions – or of multiple variations of the same intervention (eg changes in both content and timing). New statistical methods allow for “fractional” factorial designs that do not test every possible combination, but efficiently test those of most interest. Adaptive trials allow for changes in group allocation or intensity as the trial progresses, optimising sample size and trial speed.
Pure randomisation is not always practical when embedding randomisation into routine care. In such cases, quasi-randomisation (eg, week on/week off, randomisation by odd vs even medical record number) is often an option, though it is important to minimise potential bias. For instance, randomisation by odd vs even record number is likely to be more truly “random” than randomisation by first letter of last name, since patients from certain racial or ethnic backgrounds may be concentrated under certain letters. Randomisation by sequential week on/week off is likely to be less biased than randomisation by day of the week, given known differences in clinical severity and volume by day of week.
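As an illustration, quasi-randomisation schemes like these reduce to simple deterministic assignment rules. The sketch below (with hypothetical function names, not a production allocation system) shows allocation by record-number parity and by sequential week on/week off:

```python
from datetime import date

def arm_by_mrn_parity(mrn: int) -> str:
    # Odd vs even medical record number: a quasi-random allocation
    # unlikely to correlate with patient characteristics.
    return "intervention" if mrn % 2 == 0 else "control"

def arm_by_week(visit_date: date, study_start: date) -> str:
    # Sequential week on/week off: whole weeks alternate between arms,
    # avoiding known day-of-week differences in severity and volume.
    weeks_elapsed = (visit_date - study_start).days // 7
    return "intervention" if weeks_elapsed % 2 == 0 else "control"
```

Because the rules are deterministic, the same patient or week is always allocated to the same arm, which simplifies implementation within routine care.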
In cases where it is desirable for the intervention to be given as-is to everyone, a randomised stepped wedge design can be employed, in which clinicians or units/practices begin the intervention at randomly assigned sequential time periods. This allows for control group comparisons across the intervention period while still enabling everyone to receive the intervention.
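A stepped wedge schedule amounts to randomly ordering the units and staggering their crossover times. A minimal sketch, assuming one unit crosses over per period:

```python
import random

def stepped_wedge_schedule(units, seed=None):
    # Randomly order the units, then stagger the start of the
    # intervention: the i-th unit in the shuffled order crosses over
    # at period i + 1, so period 0 is an all-control baseline and the
    # final period has every unit receiving the intervention.
    rng = random.Random(seed)
    order = list(units)
    rng.shuffle(order)
    return {unit: step + 1 for step, unit in enumerate(order)}
```

Each unit's pre-crossover periods contribute control observations, so the design retains a concurrent comparison group throughout.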
NYU Langone Health Exemplar Box
NYU Langone Health in New York, USA, established a rapid randomised quality improvement project unit and randomised over a dozen interventions in its first year. The health system was able to rapidly determine the effectiveness of a variety of interventions, including community health worker facilitation, post-discharge telephone calls, mailer outreach, and electronic clinical decision support. A similar system has been implemented at Vanderbilt University Medical Center, with the aim of creating generalisable knowledge.
Randomised interventions can often use very simple analytics, such as chi-square tests for categorical outcomes or t-tests for normally distributed continuous outcomes. Non-randomised interventions require more complex analyses, because only randomisation ensures that unobserved confounders are, in expectation, equally distributed between groups; without it, bias between groups is likely. For situations in which randomisation is not possible, quasi-experimental analytic methods can often be used. These methods all have in common some form of comparison with the intervention group, whether over time or against a different cohort. All of these methods can also incorporate statistical adjustment for confounding factors, such as differences in demographics or comorbidities between patients.
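For example, a two-arm comparison of a binary outcome reduces to a chi-square test on a 2x2 table, which can be computed directly. A sketch (in practice a statistics library would also return the p-value):

```python
def chi_square_2x2(a, b, c, d):
    # Pearson chi-square statistic for a 2x2 table:
    #              outcome+   outcome-
    # group 1:        a          b
    # group 2:        c          d
    # Uses the closed form n*(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

A statistic above 3.84 corresponds to p < 0.05 on one degree of freedom.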
Difference-in-differences
Also known as a controlled before and after study, this method compares the change in average outcomes before and after an intervention in a control population with the change in the intervention group. It assumes that trends in the control and intervention groups were similar in the pre-intervention period, and that those trends would have remained similar in the absence of the intervention. It further assumes that no changes besides the intervention affected the intervention group, and that the intervention did not affect the control group. It is essential to explore the validity of these assumptions when conducting such an analysis, because they are often not met.
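The difference-in-differences estimate itself is simple arithmetic on group means. A sketch (real analyses typically use regression models so that confounders can be adjusted for):

```python
def diff_in_diff(ctrl_pre, ctrl_post, int_pre, int_post):
    # Each argument is a list of outcome values for one group/period.
    # The estimate is the pre-to-post change in the intervention group
    # minus the pre-to-post change in the control group.
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(int_post) - mean(int_pre)) - (mean(ctrl_post) - mean(ctrl_pre))
```

Subtracting the control group's change removes any background trend shared by both groups, which is exactly the parallel-trends assumption described above.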
Interrupted time series
This method (also known as a repeated measures study) essentially examines whether an intervention a) changed the absolute level of an outcome and b) changed the trend by which that outcome was moving over time. This gives it some advantages over difference-in-differences analysis, in that it accounts for pre-intervention trends and can be conducted without a control group. However, it requires similar assumptions: that underlying trends would have continued unchanged, and that no other changes affected the population in the intervention period. A comparative interrupted time series can also be constructed, which compares differences in outcomes over time between control and intervention groups.
Regression discontinuity
This method is useful when there is a specific cut point at which an intervention is applied (eg a target laboratory result, a qualifying income, a population percentage). It examines differences in outcomes for subjects immediately on either side of the intervention qualifying point. These subjects are expected to be otherwise similar, differing only in that the arbitrary cut point qualified some of them for the intervention. If a substantial difference is found between those just on one side of the dividing line and those on the other, the intervention can be considered effective.
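In its simplest form, the estimate compares mean outcomes within a narrow bandwidth either side of the cut point. A deliberately naive sketch (real analyses fit local regressions on each side of the cutoff):

```python
def rdd_estimate(running, outcomes, cutoff, bandwidth):
    # `running` is the variable that determines eligibility (eg a
    # laboratory value); subjects just below vs just above the cutoff
    # are compared within the chosen bandwidth.
    below = [y for x, y in zip(running, outcomes)
             if cutoff - bandwidth <= x < cutoff]
    above = [y for x, y in zip(running, outcomes)
             if cutoff <= x < cutoff + bandwidth]
    return sum(above) / len(above) - sum(below) / len(below)
```

The bandwidth embodies the key trade-off: narrower bands make the two groups more comparable but leave fewer subjects to compare.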
Observational studies are typically limited by an inability to detect causality; we can conclude that outcome B is often associated with intervention A, but not necessarily that A caused B. A range of statistical techniques has been developed to try to disentangle causality from association. Structural equation modelling analyses causal paths prespecified by the investigator on the basis of hypothesised relationships [78, 79]. Directed acyclic graphs (DAGs), commonly used in Bayesian analyses, can similarly be constructed to model and study causal relationships. Instrumental variables can serve as artificial randomisers to help mitigate selection bias.
Artificial control groups
There is a range of methods for constructing artificial control groups for use in standard analyses through matching or similarity analyses. While these should be employed with caution, given the high risk of unmeasured confounding that cannot be accounted for in the matching process, they can nonetheless reduce bias compared with unselected control groups. Propensity score matching, Mahalanobis distance matching and coarsened exact matching are commonly used methods.
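Of these, coarsened exact matching is the simplest to illustrate: each covariate is coarsened into bins, and treated subjects are matched only to controls falling in the same stratum. A sketch with hypothetical covariates and bin widths:

```python
def coarsened_exact_match(treated, controls, bin_widths):
    # Each subject is a dict of numeric covariates. Two subjects match
    # if every covariate falls into the same coarse bin.
    def stratum(subject):
        return tuple(int(subject[k] // w) for k, w in bin_widths.items())

    controls_by_stratum = {}
    for c in controls:
        controls_by_stratum.setdefault(stratum(c), []).append(c)

    # Keep only treated subjects with at least one matched control.
    return [(t, controls_by_stratum[stratum(t)])
            for t in treated if stratum(t) in controls_by_stratum]
```

Unmatched treated subjects are dropped, which trades sample size for comparability; wider bins retain more subjects but allow coarser matches.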
Statistical process control
Borrowed from manufacturing, statistical process control methods examine whether outcomes vary over time only within expected statistical limits (common cause variation) or in a non-random fashion (special cause variation). This is an alternative method of examining time-series data. In the case of an intervention, one would look for special cause variation in the intervention period using control limits established in the baseline period. Special cause variation has been defined as:
- Any point more than three standard deviations above or below the mean (limits expected to contain 99.7% of points)
- A run of at least eight consecutive observations above or below the mean, or 12 of 14 successive points above or below the mean
- Two of three points more than two standard deviations away from the mean (and on the same side of the mean)
- At least four out of five successive points on the same side of the mean and more than one standard deviation from the mean
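The first two rules above can be checked programmatically against control limits established in a baseline period. A sketch (SPC software also handles the remaining rules and chart-specific limit calculations):

```python
def special_cause_flags(values, centre, sigma):
    # `centre` and `sigma` come from the baseline period, not from the
    # intervention-period data being tested.
    beyond_3sigma = any(abs(v - centre) > 3 * sigma for v in values)

    # Longest run of consecutive points strictly on one side of the
    # centre line (a run of 8+ signals special cause variation).
    longest_run = run = 0
    last_sign = 0
    for v in values:
        sign = (v > centre) - (v < centre)
        run = run + 1 if sign != 0 and sign == last_sign else (1 if sign != 0 else 0)
        longest_run = max(longest_run, run)
        last_sign = sign
    return beyond_3sigma, longest_run >= 8
```

Either flag being true suggests the intervention-period data are behaving differently from the baseline process.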
Machine learning and artificial intelligence
The newest methods of generating knowledge from data have come from the field of artificial intelligence. Broadly speaking, these approaches use all data available, rather than a prespecified subset, to learn patterns. While the promise of AI in healthcare has so far exceeded its application, there have been some promising examples of new knowledge generation from AI analyses of data – for instance, discovering undiagnosed disease, identifying new subtypes of disease, predicting future events [85, 86], optimising treatment selection and managing complex medications.
Generating knowledge from local experience
In some cases, pure descriptive statistics may suffice to generate useful insight. For example, consider the “Green Button” service developed at Stanford University: if there is a question about a patient’s optimal treatment or ultimate prognosis, the tool can rapidly query the EHR for all other, similar patients and return a descriptive summary of what happened in those similar cases, which could help guide clinician decisions in the absence of stronger evidence. Of course, this approach can be limited by small samples and is prone to bias. More complex statistics could also be applied, to more robustly match similar patients or to generate stronger predictions.
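The idea can be sketched as a simple filter-and-summarise over historical records (the field names here are hypothetical; a real implementation would query the EHR and match on many more attributes):

```python
def similar_patient_summary(history, index_patient, match_keys, outcome_key):
    # Find prior patients who match the index patient on the chosen
    # attributes, then summarise their outcomes descriptively.
    similar = [p for p in history
               if all(p[k] == index_patient[k] for k in match_keys)]
    outcomes = [p[outcome_key] for p in similar]
    return {"n": len(outcomes),
            "rate": sum(outcomes) / len(outcomes) if outcomes else None}
```

Reporting the count alongside the rate makes the small-sample limitation visible to the clinician reading the summary.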
Engineering approaches to generating knowledge from data
It is not always necessary to generate reams of complex statistics to gain knowledge from data. Visual or qualitative representation of data can also often generate important insights.
Process mapping enables a visual representation of the various stages of a particular activity and can identify redundancies, waste and gaps. Alternate versions of process maps include swimlane maps, which organise steps by location or discipline; service blueprinting, which divides steps into those that are visible to the end-user and those that are not; patient experience mapping, which depicts typical emotional reactions and lived experience at each stage of a process; and lean value stream mapping, which focuses on time, waste and material use at each step of the process.
Knowledge generation within a Learning Health System is not the preserve of machines. Many of the most powerful insights come from people, working alone or in groups. People are uniquely well-equipped to make sense of and manage complex sociotechnical environments that may defy quantification. Each person has a unique perspective, which can be captured within a Learning Health System.
Qualitative methods are invaluable to the Learning Health System. Interviews, focus groups, direct observations/ethnography, user feedback and free-text comments in surveys are examples of approaches that can generate insights into WHY and HOW systems are functioning. This in turn provides essential insights into WHAT can be measured quantitatively. Indeed, it is difficult to imagine a successful Learning Health System that does not make use of qualitative methods.
Learning Communities have a place at the heart of any Learning Health System. A Learning Community is a group of stakeholders who come together in a safe space to reflect and share their judgements and uncertainties about their practice and to discuss ideas or experiences to collectively improve. This can extend to governance and improvement of the Learning Health System, along with knowledge generation.
Learning Communities require members, facilitators and sponsors. They must be co-designed. The Learning Community Handbook contains detailed evidence on the rationale for Learning Communities, as well as guidance on establishing and running such groups. It suggests a four-phase cyclical development process:
- Phase 1 – Negotiating the Space: Working with the sponsor to give permission, resources and time for people to join a safe space. An agreement must be reached on the facilitator, metrics and reporting mechanisms.
- Phase 2 – Co-Design Process: The group takes ownership of the Learning Community. Ground rules and processes are agreed.
- Phase 3 – Facilitating Learning Communities: Sessions take place, including presentations and discussions. Collective learning is captured, and reporting arrangements agreed. The community reflects on the process.
- Ongoing – Reflection and Evaluation: This forms part of each session and can also include regular, standalone sessions.
Learning Communities should foster a positive error culture, where people feel comfortable talking about their mistakes and organisations see them as an opportunity for improvement. As well as generating knowledge, Learning Communities can build trust, capacity, skills, confidence and agency for change among members. They can challenge members, provide reassurance and help members deal with uncertainty. They can be action-focused and sustainable, with low overhead costs.
The Department of Learning Health Sciences at the University of Michigan has produced a practical guide to operationalising a Learning Community for a Learning Health System. This provides detailed guidance on building a Learning Community around a problem of interest, illustrated by the case study of a gastroenterology community.