Thursday, January 13, 2011

A Universal Exchange Language Example

In preparation for the PCAST Workgroup discussion, the Workgroup chairs asked Wes Rishel and me to find examples of the Universal Exchange Language proposed by the report. We asked Sean Nolan from Microsoft for his comments. His guest post is below:

"As requested, I’ve put together the thoughts and samples below from HealthVault and Amalga and offered my perspective on how they reinforce the core ideas in the PCAST recommendation. I hope they are useful, and I look forward to any follow-up discussion.

To me, the most compelling aspect of the PCAST recommendation is the idea of data-atomicity … the idea that where appropriate we start encouraging exchange of the most granular data elements possible, rather than aggregates, snapshots and pre-normalized information. And further, that we defer concerns about harmonization and structure as late as possible in the information lifecycle, rather than requiring all data sources to conform to a fixed set of standards. This is something we embody both in our HealthVault and Amalga product lines.

Note that much of this is most immediately applicable in the “secondary use”/research/CER context --- which is where I know PCAST started out. The concepts will certainly apply more broadly, but that’s at least how I’ve been thinking about it.

The challenge with the “traditional” approach to secondary use is that by definition we lose information as we pass the data through filters and “clean things up.” We make pre-judgments about which metadata is important and which is not. We lose granularity by aggregating along dimensions we think are the important ones --- and lose the ability to re-aggregate against others in the future. And so on.

Of course, there is a conservation of energy --- in order to compute on data, it has to be transformed at some point. But in the traditional approach, every source system bears the burden of the work and encodes a fixed set of transformations. This is incredibly brittle and lossy. Instead, with a data-atomic approach the work is delayed until the last moment, when the needs are best understood, the technology can be deployed efficiently, and technology for normalization may have advanced since the original data elements were captured.

The Extreme Case: EAV

This philosophy leads us to develop exchange mechanisms that are supportive of maximal completeness rather than maximal normalization. In its simplest form, in some cases in Amalga we reduce storage to an open-ended “Entity-Attribute-Value” (EAV) structure, where each data element is stored as a bucket of arbitrary name/value pairs. This structure supports the “extrusion” of multiple transformed views of the same information. For example, in the extreme:

* Entity ID
* Blood Pressure
* Given Name
* Family Name
* Personal Physicians HealthCare
* Device Name: Omron 7 Series
* Device Model
* Device Serial Number

The idea is to capture as much metadata as possible from the source system and make sure it survives alongside all of the other data with the item. Most important, while the otherwise meaningless Entity ID provides a natural “grouping” for the item, any attribute can serve as the means to construct just-in-time entities for different purposes. For example, I may want a view of all of John’s readings, so I use the demographic or other patient ID attributes to create an extrusion pivoted on the patient. Or I may want to understand the penetration of different device models around the country, so I can use the “device model” attribute to create another extrusion for that. In the Amalga implementation, these “extrusions” are often created automatically as physical transformations under the covers in response to dynamic query patterns.
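The pivot-on-any-attribute idea above can be sketched in a few lines. This is a toy illustration, not Amalga's implementation: the entity IDs, attribute names, and readings are invented for the example.

```python
from collections import defaultdict

# Toy EAV store: each row is (entity_id, attribute, value).
# All identifiers and values here are illustrative.
rows = [
    ("e1", "patient_id", "john"),
    ("e1", "systolic", 132),
    ("e1", "diastolic", 84),
    ("e1", "device_model", "Omron 7 Series"),
    ("e2", "patient_id", "john"),
    ("e2", "systolic", 128),
    ("e2", "diastolic", 79),
    ("e2", "device_model", "Omron 7 Series"),
    ("e3", "patient_id", "mary"),
    ("e3", "systolic", 141),
    ("e3", "diastolic", 91),
    ("e3", "device_model", "AcmeBP 100"),
]

def extrude(rows, pivot_attr):
    """Group whole items by the value of one attribute --
    a just-in-time 'extrusion' pivoted on that attribute."""
    items = defaultdict(dict)
    for entity, attr, value in rows:
        items[entity][attr] = value
    view = defaultdict(list)
    for attrs in items.values():
        view[attrs.get(pivot_attr)].append(attrs)
    return dict(view)

by_patient = extrude(rows, "patient_id")   # all of John's readings
by_device = extrude(rows, "device_model")  # penetration of device models
```

Because the rows themselves are never reshaped, the same data supports both views, and any future attribute can become a pivot without touching the source system.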

The “envelope” format in this case is almost trivial --- we typically use XML as a convenient format but virtually anything will work.

Softening the Approach: codable values and common data

The EAV approach works really well for source systems (they just send what they can and forget about it) and it can be super-effective in many cases, primarily intra-institution. However, everything is a balance, and the burden EAV puts on the receiving system can be inordinately high. In order to encourage an easier onramp to interoperability, we have adopted a number of techniques that are evident in the HealthVault data model. For example:

We capture a common set of core metadata for every item, including:
* a codified item “type” such as “blood pressure reading”
* various meaningful timestamps (e.g., “created”, “updated”, “effective”)
* audit information about the entity that submitted or updated the item

In many cases, there is common data that “just is” as part of that data type. For example, a blood pressure is not a blood pressure without “systolic” and “diastolic”. So we create very simple schemas for the 80% case of data elements that just have to be there to make sense.

We provide “slots” for other common structured data that may or may not be available --- for example, “pulse” is often present with a bp reading but not always --- so we have a place to put it if available, but it is not required.
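The "required core plus optional slots" pattern from the two paragraphs above can be sketched as a simple type definition. This is an illustrative sketch, not the real HealthVault schema; the field names are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the "80% case" idea: systolic and diastolic are required
# (a blood pressure is not a blood pressure without them), while common
# companions like pulse get optional "slots".
@dataclass
class BloodPressure:
    systolic: int                     # required core data
    diastolic: int                    # required core data
    pulse: Optional[int] = None       # often present, never required
    irregular_heartbeat: Optional[bool] = None  # another optional slot

bp = BloodPressure(systolic=120, diastolic=80)           # valid without pulse
bp_full = BloodPressure(systolic=118, diastolic=76, pulse=64)
```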

Wherever data can be coded, we create constructs that facilitate the capture of those codes without allowing data to be lost. We talk about this as the “codable value” --- an XML construct that allows an item to be identified both with “display text” and zero or more codes that self-describe their codeset. Note that this model follows very closely the constructs originally created as part of the ASTM CCR.
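The codable-value pattern can be sketched as follows. The class and field names are illustrative, not the actual HealthVault XML element names; the SNOMED CT code shown is a real concept code, used only as an example.

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the "codable value": display text always survives, and zero
# or more codes each self-describe the code set they come from.
@dataclass
class Code:
    value: str       # e.g. "271649006"
    code_set: str    # e.g. "SNOMED CT"

@dataclass
class CodableValue:
    display_text: str                       # never lost, even uncoded
    codes: List[Code] = field(default_factory=list)

# Uncoded free text is still representable ...
cuff_note = CodableValue("large cuff, left arm")
# ... and coded data carries human- and machine-readable forms together.
systolic = CodableValue("Systolic blood pressure",
                        [Code("271649006", "SNOMED CT")])
```

The key property is that coding is additive: a receiver that understands SNOMED CT can compute on the code, while any receiver can at least display the text.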

We always provide the capability for other metadata to be associated and “travel with” the item --- to ensure the completeness principle.

Examples of these elements can be seen in the following HealthVault item XML fragments. The first is a complete blood pressure reading imported from John’s CCD and shows common metadata and core type information. The second and third are from my record and demonstrate optional items and codable values respectively.

    <thing>
        <thing-id version-stamp="f4dd8faa-2ba6-410e-b367-b0b5f96f0aaa">15657ae9-7955-4a1d-9f23-bf56e76b640d</thing-id>
        <type-id name="Blood Pressure Measurement">ca3c57f4-f4c1-4e15-be67-0a3caf5414ed</type-id>
        <created>
            <app-id name="Microsoft HealthVault">9ca84d74-1473-471d-940f-2699cb7198df</app-id>
            <person-id name="Sean Nolan">11141dc8-eb3c-4923-99aa-0094bd4d0648</person-id>
        </created>
        <updated>
            <app-id name="Microsoft HealthVault">9ca84d74-1473-471d-940f-2699cb7198df</app-id>
            <person-id name="Sean Nolan">11141dc8-eb3c-4923-99aa-0094bd4d0648</person-id>
        </updated>
        <data-xml>
            ...
            <common>
                <source>Personal Physicians HealthCare</source>
                <related-thing>
                    <relationship-type>Extracted from CCD</relationship-type>
                </related-thing>
            </common>
        </data-xml>
    </thing>


Each of these techniques is designed to allow us to hold COMPLETE information first, enable flexible representation of STRUCTURE and CODING, and ease the burden on CONSUMERS of the data where possible. That last requirement is the one that will “give” when needed, because if we have all the data --- we can always improve and reinterpret it over time.

Item Provenance

Especially in a consumer-controlled environment, the ability to track the provenance of data is very important. Internally, HealthVault maintains a full audit log of changes to information (see the common metadata in the fragments above) --- but as a more permanent provenance mechanism, the platform allows digital signatures to be applied to any data atom. The fragment below shows a sample signature block from a real HealthVault item.

There is no reason that this mechanism could not be applied within the PCAST context. As certificates are becoming more prevalent for systems such as Direct and Federal identity initiatives, the ability to trace the integrity of information back to its source will become more and more important.

    <signature xmlns="">
        <canonicalizationmethod algorithm="" />
        <signaturemethod algorithm="" />
        <reference uri="">
            <transform algorithm="">
                <xs:transform version="1.0" xmlns:xs="">
                    <xs:template match="thing">
                        <xs:copy-of select="data-xml" />
                        <xs:value-of select="data-other" />
                    </xs:template>
                </xs:transform>
            </transform>
            <digestmethod algorithm="" />
        </reference>
    </signature>
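Conceptually, signing a data atom means canonicalizing it, digesting it, and signing the digest. The sketch below uses standard-library HMAC as a stand-in for the certificate-based XML digital signatures HealthVault actually uses; the key, the JSON canonicalization, and the sample atom are all assumptions made for the example.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # stand-in for a certificate-backed private key

def canonicalize(atom: dict) -> bytes:
    # Stable serialization so the same atom always yields the same bytes
    # (the XML-DSig analogue is the canonicalization method).
    return json.dumps(atom, sort_keys=True, separators=(",", ":")).encode()

def sign(atom: dict) -> str:
    return hmac.new(SECRET, canonicalize(atom), hashlib.sha256).hexdigest()

def verify(atom: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(atom), signature)

atom = {"type": "blood-pressure", "systolic": 132, "diastolic": 84}
sig = sign(atom)
tampered = dict(atom, systolic=118)  # any change breaks verification
```

The point of the sketch is the property, not the mechanism: once a signature travels with the atom, any downstream consumer can detect modification regardless of how many hands the data passed through.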

Privacy Intent vs. Anonymization

The PCAST document includes a recommendation to apply “patient privacy wishes” as part of the common metadata that would travel with each data atom. A number of organizations have worked to define “languages” by which this can be represented, and there is no reason to think that information could not easily be transmitted along with any other metadata. Enforcing those wishes in the diverse healthcare IT environment is a very daunting challenge indeed. Digital rights management technology has advanced significantly over the past few years, and it fares best where attacks are not massively distributed: a community of thousands may attack the DRM on a newly-released movie, while many fewer are likely to target any single data atom. Still, this seems like a recommendation that might be considered more directional than immediate.

An alternative --- especially in the case of secondary use --- may be to start with anonymization. If the identifying elements in each data atom are masked and/or skewed before they are submitted to the DEAS environment, there may be an option here that would kickstart the ecosystem without having to solve all of the technical problems at once.
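A minimal sketch of the masking and skewing idea, under stated assumptions: this is not a certified de-identification method, and the salt, field names, and 30-day window are invented for illustration. Direct identifiers are replaced with a keyed hash, and dates are shifted by a deterministic per-patient offset so intervals within one record stay meaningful.

```python
import hashlib
import random
from datetime import date, timedelta

SALT = "deployment-specific-secret"  # hypothetical; never hard-code in practice

def mask_id(patient_id: str) -> str:
    # One-way keyed hash: stable pseudonym, not reversible without the salt.
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:16]

def skew_date(d: date, patient_id: str, max_days: int = 30) -> date:
    # Deterministic per-patient offset: absolute dates are hidden,
    # but relative timing between a patient's records is preserved.
    rng = random.Random(SALT + patient_id)
    return d + timedelta(days=rng.randint(-max_days, max_days))

record = {"patient_id": "john-1234", "taken": date(2011, 1, 3),
          "systolic": 132, "diastolic": 84}
anon = dict(record,
            patient_id=mask_id(record["patient_id"]),
            taken=skew_date(record["taken"], record["patient_id"]))
```

The clinical values ride through untouched, which is what makes the atoms still useful for research even after the identity has been masked.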

Both Amalga and HealthVault represent privacy and security within their own environments --- that is, there is full control over who sees what information, but once the information is disclosed it is not tracked further. So these comments are speculative only and not the result of our direct experience.

I hope these samples and thoughts are helpful to the committee as it does its work. I would be more than happy to continue the discussion, answer questions and offer clarification of anything in the above. Note you can also read more about the HealthVault data model and see samples at the following links:


I’ve also posted a few different blog entries about Amalga and its internal data structures. This post is a good start."


dmccallie said...

Sean provides a nice explanation of the EAV approach to granular data storage. This well-proven approach works well for those data elements that do not require a lot of "structure" to qualify their meaning. A routine BP provides a good example.

But many medical data elements are more complex, and require structured modifiers to be adequately captured. Even a BP can be modified by relevant factors like "large cuff, left arm, standing, automatic device." To store this kind of structured data in an EAV model requires either a combinatorial explosion of pre-coordinated "entity names" or some kind of non-trivial post-coordination of multiple "rows" of EAV data.
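Dr. McCallie's point about post-coordination can be made concrete: the modifiers become extra EAV rows for the same entity, which avoids a combinatorial explosion of pre-coordinated names but shifts non-trivial reassembly work onto the receiver. The rows below are invented for illustration.

```python
# Modifiers stored as additional rows for the same entity ("e7" is made up).
rows = [
    ("e7", "type", "blood-pressure"),
    ("e7", "systolic", 132),
    ("e7", "diastolic", 84),
    # post-coordinated modifiers, one row each:
    ("e7", "cuff_size", "large"),
    ("e7", "site", "left arm"),
    ("e7", "position", "standing"),
    ("e7", "method", "automatic device"),
]

def reassemble(rows, entity):
    """Receiver-side post-coordination: fold the rows back into one item.
    Real data needs care about which modifiers qualify which measurement."""
    return {attr: value for ent, attr, value in rows if ent == entity}

bp = reassemble(rows, "e7")
```

The alternative, a pre-coordinated name like "BP-large-cuff-left-arm-standing-automatic", would need a new entity name for every combination of modifiers.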

Another issue is that if only granular data is stored, then the clinical "story" can get lost in all the numbers. Granular EAV models work well for flow-sheet data, vital signs, labs, etc. But if the "cloud" doesn't contain the textual story, a clinician may have trouble correctly interpreting all the EAV-coded facts.

The HL7 CDA (clinical document architecture) offers a mixed approach to creating structured documents which allow the preservation of both the "story" and the discrete data. Unfortunately, CDA-encoded documents are still rare in most settings, and of course the vast historical archive of documents is not CDA-encoded.

Another approach to preserving the narrative text while exposing codified data is to store the original documents in the cloud and also to use NLP (language parsing) tools to recognize and extract the clinically relevant data elements (SNOMED, etc.) from the text. These machine-encoded documents can then be stored in a searchable index along with the original text and the EAV data, providing searchable, structured access to both discrete and narrative data.
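A real system would use a clinical NLP engine; the toy sketch below only shows the shape of the idea of extracting coded elements from narrative text. The two SNOMED CT codes are real concept codes, but the lookup table and matching logic are deliberately naive (for one thing, it ignores negation, so "no diabetes mellitus" still matches).

```python
import re

# Hypothetical term-to-code table for the sketch (codes are real SNOMED CT
# concepts; a production terminology would have hundreds of thousands).
TERM_CODES = {
    "hypertension": ("38341003", "SNOMED CT"),
    "diabetes mellitus": ("73211009", "SNOMED CT"),
}

def extract_codes(narrative: str):
    """Return (term, code, code_set) for each recognized term.
    Naive: no negation detection, no disambiguation."""
    found = []
    for term, (code, code_set) in TERM_CODES.items():
        if re.search(r"\b" + re.escape(term) + r"\b", narrative, re.IGNORECASE):
            found.append((term, code, code_set))
    return found

note = "Patient with long-standing hypertension, no diabetes mellitus."
codes = extract_codes(note)
```

Stored alongside the original text, such extracted codes give structured search a foothold without discarding the narrative "story".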

Of course, these are not mutually-exclusive approaches. The PCAST vision will require combinations of all our best practices.

David McCallie, MD

Anonymous said...

Where does the information for "Device manufacturer", "Model #" and "Serial #" come from? Does it have to be manually entered every time a blood pressure is taken? Seems like a very expensive approach!

Unknown said...

This captures the reality of data transfer: we face many idioms, many means of expression, and there's no Esperanto that will ever capture it all.

But I think this shows up best not with vitals or medications, which reduce to standard equivalents with little sweat, and not even with labs --- let's see how we do with them in the coming year --- but with all those questionnaires and checklists that populate nursing notes: ADLs, functional status. They're not free text, but neither are they codes, and they reflect the approach to care at a given institution.

Anonymous said...

So could someone explain why it is that a report published with the seal of the President so directly recommends three technologies:
- personal health records
- data aggregation middleware
- cloud computing and search/indexing

... and has two authors from industry - Craig Mundie from Microsoft who leads the Microsoft Health Solutions Group, and Eric Schmidt, CEO of Google.

... who just happen to have launched products that target precisely the same technologies recommended: HealthVault for a personal health record, Amalga for data aggregation middleware and Google/Bing for search and indexing.

The report has plenty of simplistic recommendations in its pages. One thing it doesn't contain is a statement of the clear and real conflict of interest that its authors have.

I guess that's what $1.6 million of lobbying buys you (Microsoft's declared spending on lobbying last year)

Anonymous said...

But this approach has exactly the same problem as any other proprietary approach at defining standards. Microsoft HealthVault has decided that a blood pressure measurement consists of a systolic and diastolic reading, at a certain date/time with a certain device/manufacturer number.

And if every device manufacturer and every application uses that definition then it will all work. If they don't they'll have to transform it to their use.

But in reality, if there's just one use case where a blood pressure measurement is different - say it needs to record whether the reading was for the patient standing or lying down - and that different attribute is not in the HealthVault model - then the whole thing falls apart.

HealthVault (and any other simplistic model like it) is based on the premise that Microsoft is big enough to dictate the standard (a proprietary standard) then force everyone else to use it.

A conceptual model that may work much better is the ISO13606 approach which splits the information concepts into a structural component and an information component, using the concept of information archetypes to provide a flexible and extensible model. This approach uses XML but has a much more sophisticated approach to semantics, structure, composition and context.

Vidjinnagni Amoussou said...

Dr. David McCallie said it very well: "PCAST vision will require combinations of all our best practices." And that's the challenge: building a unified framework that combines the wisdom of many communities of interest (HL7, OpenEHR, biomedical ontologies, etc.)

See my thoughts here.

Vidjinnagni Amoussou said...

BTW an RDF triple is made of a subject, a predicate (property), and an object --- a direct analogue of entity, attribute, and value. RDF is a standardized and vendor-neutral incarnation of the EAV model and is already used today in the fields of medical terminologies and biomedical ontologies.
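The EAV-to-RDF correspondence the comment describes can be shown with plain tuples: entity maps to subject, attribute to predicate, value to object. The namespace URI below is a hypothetical example.

```python
# Hypothetical base namespace for the illustration.
BASE = "http://example.org/"

# EAV rows (entity, attribute, value) ...
eav_rows = [
    ("patient/john", "systolic", 132),
    ("patient/john", "diastolic", 84),
]

def to_triple(entity, attribute, value):
    """Map an EAV row onto an RDF-style (subject, predicate, object) triple."""
    return (BASE + entity, BASE + "attr/" + attribute, value)

triples = [to_triple(*row) for row in eav_rows]
```

A library such as rdflib would add proper URI and literal types plus serialization, but the structural mapping is exactly this one-to-one correspondence.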