Tuesday, January 11, 2011

A Primer on XML, RDF, JSON, and Metadata

A new workgroup, formed under the auspices of the HIT Policy Committee and the HIT Standards Committee, is beginning its work to help ONC analyze public comments on the President’s Council of Advisors on Science and Technology (PCAST) report, discuss the implications of the report on current ONC strategies, assess the feasibility and impact of the PCAST report on ONC programs, and elaborate on how these recommendations could be integrated into the ONC strategic framework.

Membership includes:
Paul Egerman, Entrepreneur, Chair
William Stead, Vanderbilt University, Vice-Chair
Dixie Baker, SAIC
Hunt Blair, Vermont HIE
Tim Elwell, Misys Open Source
Carl A. Gunter, University of Illinois
John Halamka, Beth Israel Deaconess Medical Center, HMS
Leslie Harris, Center for Democracy & Technology
Stan Huff, Intermountain Healthcare
Robert Kahn, Corporation for National Research Initiatives
Gary Marchionini, University of North Carolina
Stephen Ondra, Office of Science & Technology Policy
Jonathan Perlin, Hospital Corporation of America
Richard Platt, Harvard Medical School
Wes Rishel, Gartner
Mark Rothstein, University of Louisville
Steve Stack, American Medical Association
Eileen Twiggs, Planned Parenthood

To advise ONC about the report's recommendations, workgroup members need to understand terms such as XML, RDF, JSON, and metadata, as well as learn about the standards efforts to date to create human readable and computable data elements for healthcare.

XML is an abbreviation for Extensible Markup Language, a set of rules for encoding documents in machine-readable form. Here's an example of data about me in XML, which is both human readable and computable:
<name><fullname>John David Halamka, M.D.</fullname><firstname>John</firstname><lastname>Halamka</lastname></name>
<address><address1>Beth Israel Deaconess Med Ctr</address1><address2>Information Systems, 6th Fl</address2><address3>1135 Tremont St</address3><address4>Roxbury Crossing, MA 02120</address4><telephone>617/754-8002</telephone><fax>617/754-8015</fax><latitude>42.33555200000000</latitude><longitude>-71.08822700000000</longitude></address>

It's a machine-friendly form of my Harvard Catalyst Profiles web page, with discrete data elements that any computer language can interpret and search. The complete XML document about me is available here.
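As a minimal sketch of what "computable" means here, the snippet below reads a trimmed copy of the record with Python's standard library. The wrapping <person> root element and the opening <address> tag are assumptions added so the fragment parses; the field names are the ones used in the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example record. The <person> root and the opening
# <address> tag are assumptions added to make the fragment well-formed.
PERSON_XML = """
<person>
  <name>
    <fullname>John David Halamka, M.D.</fullname>
    <firstname>John</firstname>
    <lastname>Halamka</lastname>
  </name>
  <address>
    <address1>Beth Israel Deaconess Med Ctr</address1>
    <address4>Roxbury Crossing, MA 02120</address4>
    <telephone>617/754-8002</telephone>
    <latitude>42.33555200000000</latitude>
    <longitude>-71.08822700000000</longitude>
  </address>
</person>
"""

root = ET.fromstring(PERSON_XML)

# Each discrete data element can be addressed by its path in the tree.
last_name = root.findtext("name/lastname")
telephone = root.findtext("address/telephone")
latitude = float(root.findtext("address/latitude"))

print(last_name, telephone, latitude)
```

Because every value sits in its own named element, a program can pull out just the telephone number or the coordinates without scraping free text.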

XML has been used to describe healthcare data by HL7 using the Clinical Document Architecture (CDA) and by ASTM using the Continuity of Care Record (CCR).

Here's an example of CDA that illustrates immunizations:
<informationsource><author><authortime value="20000407130000+0500"><authorname><prefix>Dr.</prefix><given>Robert</given><family>Dolin</family></authorname></authortime></author></informationsource>
<immunizations><immunization><administereddate value="199911"><medicationinformation><codedproductname code="88" codesystem="2.16.840.1.113883.6.59" displayname="Influenza virus vaccine"><freetextproductname>Influenza virus vaccine</freetextproductname></codedproductname></medicationinformation></administereddate></immunization></immunizations>

Metadata is "data about data" - the details behind this data such as who gathered it, when, and for what purpose.

The metadata in the CDA example includes an Object Identifier (OID) of 2.16.840.1.113883.6.59, which identifies the Centers for Disease Control and Prevention's CVX immunization vocabulary. Code 88 is the CVX code for influenza virus vaccine. The vaccine was administered in November 1999. The information source is Bob Dolin. The full CDA summary is available here.
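The reading of the CDA fragment described above can be sketched in a few lines of Python. This is an illustration only: the tag names are the lowercased ones printed in this post, and the KNOWN_CODE_SYSTEMS table is a hypothetical stand-in for a real OID registry lookup.

```python
import xml.etree.ElementTree as ET

# Immunization fragment with the tag names as printed in this post.
CDA_XML = """
<immunizations><immunization><administereddate value="199911">
<medicationinformation>
<codedproductname code="88" codesystem="2.16.840.1.113883.6.59"
                  displayname="Influenza virus vaccine">
<freetextproductname>Influenza virus vaccine</freetextproductname>
</codedproductname></medicationinformation>
</administereddate></immunization></immunizations>
"""

# Hypothetical lookup table; a real system would consult an OID registry.
KNOWN_CODE_SYSTEMS = {"2.16.840.1.113883.6.59": "CVX"}

root = ET.fromstring(CDA_XML)
vaccinations = []
for imm in root.iter("immunization"):
    date = imm.find("administereddate").get("value")  # "YYYYMM"
    product = imm.find(".//codedproductname")
    vaccinations.append({
        "vocabulary": KNOWN_CODE_SYSTEMS.get(product.get("codesystem"), "unknown"),
        "code": product.get("code"),
        "name": product.get("displayname"),
        "year": int(date[:4]),
        "month": int(date[4:6]),
    })

print(vaccinations)
```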

Here's an example of CCR that illustrates immunizations
<address><type><text>Home</text></type><line1>11 Alden Road</line1><city>Wellesley</city><state>MA</state><postalcode>02481</postalcode></address>

<immunization><ccrdataobjectid>BB0024</ccrdataobjectid><datetime><type><text>Date Updated</text></type><exactdatetime>2011-01-08T19:49:19Z</exactdatetime></datetime><datetime><type><text>Start date</text></type><exactdatetime>2010-10-11T04:00:00Z</exactdatetime></datetime><type><text>Immunization</text></type><actor><actorid>AA0001</actorid></actor><product><productname><text>Tetanus</text><code><value>35</value><codingsystem>HL7 CVX</codingsystem><version>2.5</version></code><code><value>396412003</value><codingsystem>SNOMEDCT</codingsystem><version>2005</version></code><code><value>C0039619</value><codingsystem>UMLS Concept ID</codingsystem><version>2005</version></code></productname></product></immunization>
<directions><direction><route><text>IM</text></route><site><text>Right Arm</text></site></direction></directions>

The metadata in the CCR example includes that the patient is John Halamka, born 5/23/1962, male, living in Wellesley. Additional metadata identifies that a tetanus shot exists in the record. The concept "Tetanus shot" is described using the Centers for Disease Control and Prevention's CVX immunization vocabulary, the SNOMED-CT vocabulary, and the National Library of Medicine's UMLS Metathesaurus. Metadata about the reliability of the information includes who reported the tetanus shot and when it was reported. The metadata in my record describes me as the source of the reported information, updated January 8, 2011. The full CCR summary is available here.
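To show how the three vocabularies sit side by side in the CCR fragment, here is a small Python sketch that collects every code attached to the tetanus entry. It uses a shortened copy of the immunization element from this post and nothing beyond the standard library.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the CCR immunization element shown in this post.
CCR_XML = """
<immunization>
  <ccrdataobjectid>BB0024</ccrdataobjectid>
  <datetime><type><text>Start date</text></type>
    <exactdatetime>2010-10-11T04:00:00Z</exactdatetime></datetime>
  <product><productname><text>Tetanus</text>
    <code><value>35</value><codingsystem>HL7 CVX</codingsystem></code>
    <code><value>396412003</value><codingsystem>SNOMEDCT</codingsystem></code>
    <code><value>C0039619</value><codingsystem>UMLS Concept ID</codingsystem></code>
  </productname></product>
</immunization>
"""

root = ET.fromstring(CCR_XML)

# Map each coding system to the code it assigns to the same concept.
codes = {c.findtext("codingsystem"): c.findtext("value") for c in root.iter("code")}

# Pick the timestamp whose type label is "Start date".
start_date = next(
    d.findtext("exactdatetime")
    for d in root.iter("datetime")
    if d.findtext("type/text") == "Start date"
)

print(codes, start_date)
```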

XML is a very general construct. Anyone can create any tags for data and metadata. HL7 has chosen to create a Reference Information Model (RIM) to describe the meaning of its tags and metadata. ASTM has created a well-described fixed set of data elements. The challenge that different XML tagging creates is that you have to figure out where to look for the information you want. For the XML example above about my name and address, everyone creating a person directory could create the XML differently. In one directory, a person's "lastName" could be the root element, in another it could be a child of an element called "name", and in another it could be an attribute of a "person" element. The XML below is just as valid a way to describe my address as the example above:
<address city="Boston" postalcode="02120" state="MA" streetaddress="1135 Tremont">
    <phonenumber number="617 754-8002" type="home"></phonenumber>
    <phonenumber number="617 754-8015" type="fax"></phonenumber>
</address>
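The cost of heterogeneous tagging shows up as soon as you write code against it. In the sketch below, the same fax number lives in a child element in one layout and in an attribute in the other, so each layout needs its own lookup. The function names are hypothetical, and the closing </address> tag on the second layout is assumed.

```python
import xml.etree.ElementTree as ET

# Layout A: values stored as child elements, as in the first example.
LAYOUT_A = "<address><telephone>617/754-8002</telephone><fax>617/754-8015</fax></address>"

# Layout B: values stored as attributes (closing tag assumed).
LAYOUT_B = """
<address city="Boston" postalcode="02120" state="MA" streetaddress="1135 Tremont">
    <phonenumber number="617 754-8002" type="home"></phonenumber>
    <phonenumber number="617 754-8015" type="fax"></phonenumber>
</address>
"""


def fax_from_layout_a(xml_text):
    """Layout A keeps the fax number as the text of a <fax> child."""
    return ET.fromstring(xml_text).findtext("fax")


def fax_from_layout_b(xml_text):
    """Layout B keeps it as the 'number' attribute of a typed <phonenumber>."""
    node = ET.fromstring(xml_text).find("phonenumber[@type='fax']")
    return node.get("number") if node is not None else None


print(fax_from_layout_a(LAYOUT_A))
print(fax_from_layout_b(LAYOUT_B))
print(fax_from_layout_a(LAYOUT_B))  # wrong lookup for this layout
```

Both documents are valid XML, yet code written for one finds nothing in the other.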

The Resource Description Framework (RDF) is a metadata model that provides a standardized approach to describing web resources. The general idea is to provide a subject-predicate-object model such that the predicate includes a definition of what is being described. RDF was created to solve the problem of organizations implementing XML tags heterogeneously.

Here's an RDF description of me
<rdf:Description rdf:about="http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:core="http://vivoweb.org/ontology/core#" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:vitro="http://vitro.mannlib.cornell.edu/ns/vitro/public#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing"></rdf:type>
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"></rdf:type>
<rdf:type rdf:resource="http://purl.org/ontology/bibo/core#Faculty"></rdf:type>
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Agent"></rdf:type>
<rdfs:label xml:lang="en-US">John David Halamka, M.D.</rdfs:label>
<rdf:type rdf:resource="http://vivoweb.org/ontology/core#FacultyMember"></rdf:type>
<core:preferredTitle>Associate Professor of Medicine</core:preferredTitle>
</rdf:Description>

The subject is my Harvard Catalyst Profiles Page.

The predicates include "the subject has a lastname, a firstname, and a preferred title".

The objects are Halamka, John, and Associate Professor of Medicine.

The definitions of lastname, firstname, and preferred title are found in two places: the Friend of a Friend (FOAF) definition site and the VIVO site. The complete RDF document about me is available here.

Thus, RDF provides a means of displaying metadata while also enabling easy access to the definitions of data elements used.

With RDF, data is always represented as subjects, predicates, and objects, so reading, parsing, and storing it is consistent across all applications. It also enables query of different systems via a common approach. For example, if I exist as a faculty member in Profiles and as a provider in a clinical system that uses RDF, it should be possible to query for topics where I have both faculty and clinical expertise, without having to transform one data source into the other's schema. Similarly, if the government makes all grants, publications, trials, etc. available in RDF, then these things should automatically be available to tools like Profiles, without having to write any additional code.
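A rough way to picture this, without any RDF library, is to hold each fact as a (subject, predicate, object) tuple. In the sketch below the "ex:" predicates and the topic names are hypothetical placeholders, not real vocabulary terms or real data; the point is only that two sources sharing the triple shape can be merged with a set union and queried together.

```python
# Subject URI taken from the RDF example in this post.
ME = "http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf"

# Triples as a faculty-profile system might hold them.
# The "ex:" predicates and topic names are hypothetical placeholders.
profiles_triples = {
    (ME, "rdf:type", "core:FacultyMember"),
    (ME, "core:preferredTitle", "Associate Professor of Medicine"),
    (ME, "ex:facultyExpertise", "Topic A"),
    (ME, "ex:facultyExpertise", "Topic B"),
}

# Triples as a clinical system might hold them (also hypothetical).
clinical_triples = {
    (ME, "rdf:type", "ex:Provider"),
    (ME, "ex:clinicalExpertise", "Topic B"),
    (ME, "ex:clinicalExpertise", "Topic C"),
}

# Same shape on both sides, so merging is a set union with no schema mapping.
merged = profiles_triples | clinical_triples


def objects(graph, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


shared_topics = objects(merged, ME, "ex:facultyExpertise") & objects(
    merged, ME, "ex:clinicalExpertise"
)
print(shared_topics)
```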

There is a standard query language called SPARQL that can be used to search RDF resources.
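To give a feel for what a SPARQL query does, the sketch below imitates a single triple pattern in plain Python. It is not a SPARQL engine, and the match function is a hypothetical helper; the query it imitates is shown in the comment, using the VIVO namespace from the RDF example.

```python
# The SPARQL query this sketch imitates:
#
#   PREFIX core: <http://vivoweb.org/ontology/core#>
#   SELECT ?person ?title
#   WHERE { ?person core:preferredTitle ?title }

ME = "http://connects.catalyst.harvard.edu/profiles/profile/person/46034/viewas/rdf"

TRIPLES = [
    (ME, "rdfs:label", "John David Halamka, M.D."),
    (ME, "rdf:type", "foaf:Person"),
    (ME, "core:preferredTitle", "Associate Professor of Medicine"),
]


def match(triples, pattern):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Pattern terms starting with '?' are variables; all others must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to one value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


results = list(match(TRIPLES, ("?person", "core:preferredTitle", "?title")))
print(results)
```

A real deployment would hand the query text to a SPARQL endpoint or library, which also handles joins across several patterns, filters, and namespaces.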

Finally, there is an emerging alternative to XML called JavaScript Object Notation (JSON) that is more compact than XML and easier for computer languages to manipulate. Here's an example of my address information in JSON:
{
     "firstName": "John",
     "lastName": "Halamka",
     "age": 48,
     "address": {
         "streetAddress": "1135 Tremont",
         "city": "Boston",
         "state": "MA",
         "postalCode": "02120"
     },
     "phoneNumber": [
         {
           "type": "office",
           "number": "617-754-8002"
         },
         {
           "type": "fax",
           "number": "617-754-8015"
         }
     ]
}

JSON has replaced XML as a data interchange format in many social networking applications.   It does have the same issue as XML that authors can create arbitrary formats, so there could be a person object containing firstname and lastname or lastname could be an object - you have to know the way the author organized the data before you can use it.
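Both points can be seen in a short Python sketch: JSON maps directly onto native dictionaries and lists, but code still has to know which layout an author chose. The "address" and "phoneNumber" key names, the second layout, and the last_name helper are assumptions made for illustration.

```python
import json

# The example record as a complete JSON document. The "address" and
# "phoneNumber" key names are assumptions for illustration.
PERSON_JSON = """
{
  "firstName": "John",
  "lastName": "Halamka",
  "age": 48,
  "address": {
    "streetAddress": "1135 Tremont",
    "city": "Boston",
    "state": "MA",
    "postalCode": "02120"
  },
  "phoneNumber": [
    {"type": "office", "number": "617-754-8002"},
    {"type": "fax", "number": "617-754-8015"}
  ]
}
"""

# A second, hypothetical author nests the same name differently.
OTHER_JSON = '{"person": {"name": {"last": "Halamka", "first": "John"}}}'


def last_name(document):
    """Return the last name from either layout, or None if neither matches."""
    data = json.loads(document)
    if "lastName" in data:  # flat layout
        return data["lastName"]
    name = data.get("person", {}).get("name", {})  # nested layout
    return name.get("last")


# Parsing yields plain dicts and lists, with numbers already typed.
person = json.loads(PERSON_JSON)
fax = next(p["number"] for p in person["phoneNumber"] if p["type"] == "fax")

print(person["address"]["city"], person["age"], fax)
print(last_name(PERSON_JSON), last_name(OTHER_JSON))
```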

In summary, CDA and CCR already provide XML data for healthcare that is "data atomic", metadata rich, and searchable using standard tools. RDF is a standardized way of describing metadata. JSON is an efficient way of representing, transmitting, and interpreting data that is similar to XML but more compact.

Our report is due in April.  I welcome the discussion with the PCAST workgroup over the next 3 months!


David C. Kibbe, MD MBA said...

Bravo! DCK

The Road of Life said...

Excellent Primer. The PCAST report has brought a lot of different concepts together to tackle the Healthcare systems, especially for DAES (Data Access Element Services).

caregraf said...

A while back, I put Dr Halamka's CCD into RDF (demo, RDF).

RDF is a succinct and very flexible way of representing a patient's record and is easy to serialize as a CCD. It is ideal for healthcare data which takes the form of "directed graphs" where nodes point to other nodes. As a result, it's a natural fit for data in systems like the VA's VistA (Semantic VistA).

This is great news, that it is under consideration by the ONC.

Ray Holt said...

Anything related to the efficiency and accuracy in transmission of patient data is good news!

Keith W. Boone said...

John, I am grateful that you are available to provide this expertise to the committee. But, I have to admit to great alarm that such a primer is needed.

PB said...

Concise yet comprehensive!

Thanks Dr Halamka.

Joel Amoussou said...

Brilliant! RDF and OWL open up new possibilities in terms of model consistency checking and reasoning. They can also help with knowledge integration (for example in translational medicine through the RDF merging or linking of genomic and clinical data). Linked Open Data (LOD) principles can be helpful here as well.

Ragheed said...

Nice summary!

I suspect part of the reason social networks are among the first to abandon XML in favor of JSON is because at such a large scale the challenges associated with generating, parsing, and manipulating XML are magnified.

If a national-scale HIE is the vision, we should be mindful of emerging standards which can simplify data exchange.

Of course, JSON may not be a good fit for all tasks, but I can see it being useful for relatively simple ones such as transmitting prescriptions and lab orders.

Health Perspectives said...

The primer is needed because it brings JSON and RDF into the same conversation as the more traditional approaches

steve said...

This was an excellent post, but it left out one important difference between JSON and XML. While it's certainly possible for content producers to encode arbitrary elements using each, XML (unlike JSON) has well-defined and widely-used mechanisms for defining and enforcing data schemata (the set of allowable elements and the ways in which they may be combined). There are several different efforts in the JSON world to do these sorts of things, but nothing's gotten too much traction- largely because, IMHO, the act of imposing that level of structure on JSON removes one of JSON's big advantages over XML- its lightweight and dynamic nature.

This means that it's easier to encode semantically rich content (such as health-related data) in XML than in JSON, which is really more of a data serialization format than a data interchange format. That's not to knock JSON- I use it all the time, and I think there are definitely possible uses for it in HIE-related applications. It's just a matter of using the right tool for the right job.

Incidentally, I personally think that RDF+OWL represent the Right Way to solve the technical aspects of HIE, but I'm not holding my breath for wide-scale adoption. RDF is such a simple concept- subject/predicate/object- yet, in practice, it's a surprisingly hard approach for most programmers (myself included, at first) to get their heads around.

EconomyWonk said...

I think we need to discuss some more detailed ideals for JSON and XML as regarding the CCD. The great advantage the CCD (and the CCR) have is that in CCD there are structures which are verified against that force the CCD to comply with a set of standards that prove useful to the "end user" of the document, namely, the physician. I am not aware of this same convenience available with JSON. JSON seems much simpler for intuitive understanding, but what good is it if there is no structure behind the instance?

Sandeep said...

Dr. Halamka, I wish to disagree somewhat with your comments on the HL7 Reference Information Model (RIM). In a way RIM does provide meaning to XML tags, but XML is just a serialisation format for information encoded in a model structure conforming to RIM. RIM aims to standardise the structure and semantics of information so as to facilitate consistent generation, transmission, and consumption of information across systems. In that sense RIM goes beyond metadata and is closer to RDF as far as structure of information is concerned.

Tyler Tallman said...

Would you consider discussing JSON-LD in this or following posts?