By Dr Tom Foley and Dr Fergus Fairmichael.
Dr Brown is an Associate Professor in the Department of Population Medicine (DPM) at Harvard Medical School and the Harvard Pilgrim Health Care Institute. He is Associate Director and Director of Scientific Operations for the FDA’s Mini-Sentinel project. Dr Brown is the lead architect of PopMedNet (www.popmednet.org), an open-source software platform that facilitates the creation and operation of large-scale interoperable distributed health data networks.
There are various approaches to using research data across large networks of sites. One example is the FDA Mini-Sentinel program, an active surveillance system for monitoring the safety of FDA-regulated medical products. It uses a distributed network rather than a centralised approach. The distributed approach helps to protect patient privacy and the sensitivities of partners, which is important where multiple data partners are each responsible for protecting their own data. These partners prefer to maintain operational control of their data and its uses.
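The distributed pattern can be sketched in a few lines. This is an illustrative toy, not the actual Mini-Sentinel or PopMedNet software: the site names, record structure and drug code are invented. The key point is that the query travels to each partner, runs against data the partner controls, and only aggregate results come back.

```python
# Illustrative sketch of a distributed query (hypothetical data and names,
# not the real Mini-Sentinel/PopMedNet API). Each partner runs the query
# locally; only counts, never patient-level rows, leave the site.

def count_exposed_patients(local_records, drug_code):
    """Executed by each partner against its own data."""
    return sum(1 for r in local_records if drug_code in r["dispensings"])

# Hypothetical partner datasets -- each stays under its owner's control.
partner_data = {
    "site_a": [{"dispensings": {"warfarin"}}, {"dispensings": {"aspirin"}}],
    "site_b": [{"dispensings": {"warfarin", "aspirin"}}],
}

# The coordinating centre sees only the aggregated counts.
results = {site: count_exposed_patients(records, "warfarin")
           for site, records in partner_data.items()}
print(results)  # {'site_a': 1, 'site_b': 1}
```

The design choice this illustrates is that privacy protection comes from the architecture itself: no central pooled database ever exists, so each partner retains operational control.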
One of the biggest challenges to interoperability lies in the semantics of the data. Semantics matter much more for research than in clinical practice. The real question is: what needs to be interoperable, and what are you trying to do? If the answer is that you are trying to do everything, then you won’t get far. If you are simply trying to move information from one provider to another, then it does not matter how it is transmitted, provided that it is labelled properly. However, if the data is to be computed on, then it needs to be standardised, which we have failed to do well in the past. We have standards such as HL7, but these do not always work well: sometimes a box is labelled well, but what is actually inside it is not what was expected.
Barriers to interoperability
The collection of data at the point of care could be viewed as a common point of failure. If we want true interoperability, data needs to be collected consistently. However, there is often no agreement on what the acceptable values are. For example, biochemistry results are recorded differently in different places. These could be standardised on their way into the EHR, rather than requiring complex mapping and parsing steps at a later stage.
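Standardising on the way in might look something like the following sketch. The analyte name, units and conversion factor are illustrative assumptions (the common mg/dL-to-mmol/L approximation for glucose), not a real terminology mapping such as LOINC or UCUM.

```python
# Hypothetical sketch: normalise a biochemistry result at the point of entry
# into the EHR, so downstream users see one canonical unit per analyte.
# Analyte names and conversion factors here are illustrative only.

CANONICAL_UNITS = {
    "glucose": ("mmol/L", {"mg/dL": 1 / 18.0, "mmol/L": 1.0}),
}

def normalise(analyte, value, unit):
    """Convert an incoming result to the agreed canonical unit."""
    canonical_unit, factors = CANONICAL_UNITS[analyte]
    if unit not in factors:
        raise ValueError(f"No agreed mapping for {analyte} in {unit}")
    return round(value * factors[unit], 2), canonical_unit

print(normalise("glucose", 90, "mg/dL"))    # (5.0, 'mmol/L')
print(normalise("glucose", 5.5, "mmol/L"))  # (5.5, 'mmol/L')
```

Doing this once, at ingestion, is what removes the need for every downstream researcher to rediscover and re-implement the same mapping and parsing logic.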
Even if values such as these were standardised, we would still be stuck with the semantic problem from a research perspective. For example, at present there is no good way to define certain types of “visits”, and it is often difficult to distinguish between a hospitalisation and an emergency department visit. From the outside it is difficult to make sense of that data. The metadata is not good enough to describe the dataset or data streams for appropriate use in research, which makes life difficult for the researcher trying to re-use the data.
Data sources and quality
Data is often of poor quality, and we need to do a better job on the clinical/input side for this to improve. Currently the most interesting data is often in clinical notes. However, care is needed around the use of natural language processing, because the impacts of wrong conclusions are significant:
• Economic impact on a company affected by the removal of a treatment from the market
• Impact on the patient if an effective therapy is removed from the market erroneously
• Impact on the patient if an ineffective or dangerous treatment remains available
One particular problem in the US is that data streams are often incomplete, because patients move between payers and providers, so you may never receive all of the results. You can never be sure that you have all of the data. We are currently in a position where a single hospital cannot complete a re-hospitalisation study with its own data alone, because it is unable to track patients who were admitted to another hospital. Patients who die, move away, or remain healthy and are not re-hospitalised all look the same in such a study. Linking insurance data helps, as the insurer would have a record if further care was paid for. However, incompleteness remains the underlying fear of the epidemiologist: there is a worry that using only the available data may introduce bias.
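The ambiguity described above can be made concrete with a toy example (the patients and outcomes are entirely invented): four very different trajectories produce identical records in a single hospital's data.

```python
# Illustrative only: hypothetical patients whose true outcomes differ,
# but whose local hospital records are indistinguishable.

true_outcomes = {
    "pt1": "readmitted at another hospital",
    "pt2": "died",
    "pt3": "moved away",
    "pt4": "remained healthy",
}

# What the hospital actually observes: its own admissions and nothing else.
local_view = {pt: ["index admission"] for pt in true_outcomes}

# All four look the same locally, so a readmission rate computed from
# local data alone silently undercounts re-hospitalisations.
distinct_local_histories = {tuple(v) for v in local_view.values()}
print(len(distinct_local_histories))  # 1
```

This is why linked payer data matters: a claims record would separate pt1 (care paid for elsewhere) from the others, though it still cannot distinguish all four cases.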
We often hear the question, “Is imperfect data better than nothing, or worse than nothing?” The answer comes down to putting the right data to the right use. We should endeavour to use data up to the point at which it no longer makes sense. Deciding whether data is good enough is difficult and often depends on the use case. The data, its intended uses and the methodology all have to be right to draw meaning from them.
At present we are more likely to get wrong answers from comparative effectiveness research, because we do not understand the data well enough.