Wednesday, March 2, 2011

Freeing the Data

I'm keynoting this year's Intersystems Global Conference on the topic of "Freeing the Data" from the transactional systems we use today such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM),  Electronic Health Records (EHR), etc.  As I've prepared my speech,  I've given a lot of thought to the evolving data needs we have in our enterprises.

In healthcare and in many other industries, it's increasingly common for users to ask IT for tools and resources to look beyond the data we enter during the course of our daily work. For one patient, I know the diagnosis, but what treatments were given to the last 1000 similar patients? I know the sales today, but how do they vary over the week, the month, and the year? Can I predict future resource needs before they happen?

In the past, such analysis typically relied on structured data, exported from transactional systems into data marts using Extract/Transform/Load (ETL) utilities, followed by analysis with Online Analytical Processing (OLAP) or Business Intelligence (BI) tools.

In a world filled with highly scalable web search engines,  increasingly capable natural language processing technologies, and practical examples of artificial intelligence/pattern recognition (think of IBM's Jeopardy-savvy Watson as a sophisticated data mining tool), there are novel approaches to freeing the data that go beyond a single database with pre-defined hypercube rollups.   Here are my top 10 trends to watch as we increasingly free data from transactional systems.

1.  Both structured and unstructured data will be important

In healthcare, the HITECH Act/Meaningful Use requires that clinicians document the smoking status of 50% of their patients. In the past, many EHRs did not have structured data elements to support this activity. Today's certified EHRs provide structured vocabularies and specific pulldowns/checkboxes for data entry, but what do we do about past data? Ideally, we'd use natural language processing, probability, and search to examine unstructured text in the patient record and figure out smoking status, including the context of the word smoking, such as "former", "active", "heavy", "never", etc.
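As a rough illustration of the idea, here is a minimal sketch that scans a free-text note for context phrases around smoking. The phrase lists and the `smoking_status` function are my own assumptions for illustration; a real system would rely on a trained NLP pipeline with probabilities rather than a handful of regular expressions.

```python
import re

# Hypothetical phrase patterns, checked in order; illustrative only.
STATUS_PATTERNS = [
    ("never", re.compile(r"\b(never smoked|never a smoker|non-?smoker|denies smoking)\b", re.I)),
    ("former", re.compile(r"\b(former smoker|ex-?smoker|quit smoking|stopped smoking)\b", re.I)),
    ("current", re.compile(r"\b(current smoker|active smoker|heavy smoker|smokes)\b", re.I)),
]


def smoking_status(note: str) -> str:
    """Return the label of the first pattern found in the note, else 'unknown'."""
    for label, pattern in STATUS_PATTERNS:
        if pattern.search(note):
            return label
    return "unknown"


if __name__ == "__main__":
    for text in ["Former smoker, quit in 2005.", "Heavy smoker.", "No complaints today."]:
        print(text, "->", smoking_status(text))
```

Even this toy version shows why context matters: the word "smoker" alone says nothing until the qualifier next to it is read.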

Businesses will always have a combination of structured and unstructured data.   Finding ways to leverage unstructured data will empower businesses to make the most of their information assets.

2.  Inference is possible by parsing natural language

Watson on Jeopardy provided an important illustration of how natural language processing can really work. Watson does not understand the language and it is not conscious/sentient. Watson's programming enables it to assign probabilities to expressions. When asked "does he drink alcohol frequently?", finding the word "alcohol" associated with the word "excess" is more likely to imply a drinking problem than finding "alcohol" associated with "to clean his skin before injecting his insulin". Next generation Natural Language Processing tools will provide the technology to assign probabilities and infer meaning from context.
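To make the notion of assigning probabilities concrete, here is a toy sketch, not Watson's actual method. It sums hypothetical log-odds weights for context words and converts the total to a probability with a logistic function. The weights, the prior, and the function name are all assumptions chosen for illustration.

```python
import math

# Hypothetical log-odds weights for words appearing alongside "alcohol".
CUE_WEIGHTS = {
    "excess": 2.0,
    "abuse": 2.5,
    "clean": -2.0,
    "skin": -1.5,
}


def drinking_problem_probability(text: str, prior_log_odds: float = -1.0) -> float:
    """Score a phrase by summing cue weights, then squash to a probability."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    score = prior_log_odds + sum(CUE_WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    print(drinking_problem_probability("alcohol use in excess"))
    print(drinking_problem_probability("uses alcohol to clean his skin before injecting his insulin"))
```

The first phrase scores well above one half and the second well below it, which is the kind of contextual ranking the example in the text describes.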

3.  Data mining needs to go beyond single databases owned by a single organization.

If I want to ask questions about patient treatment and outcomes, I may need to query data from hundreds of hospitals to achieve statistical significance. Each of those hospitals may have different IT systems with different data structures and vocabularies. How can I query a collection of heterogeneous databases? Federation will be possible by normalizing the queries through middleware. For example, data might be mapped to a common Resource Description Framework (RDF) exchange language using standardized SPARQL query tools. At Harvard, we've created a common web-based interface called SHRINE that queries all our hospital databases, providing aggregate de-identified answers to questions about diagnosis and treatment of millions of patients.
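The middleware idea can be sketched in a few lines. This is not how SHRINE or a SPARQL endpoint works internally; it is a simplified illustration in which a hypothetical concept map translates one common question into each site's local field and code, and only an aggregate count comes back. The site names, records, and mapping are invented for the example.

```python
from typing import Dict, List

# Hypothetical sites, each storing diagnoses under its own field and coding.
SITE_RECORDS: Dict[str, List[dict]] = {
    "hospital_a": [{"dx": "DM2"}, {"dx": "HTN"}, {"dx": "DM2"}],
    "hospital_b": [{"diagnosis_code": "250.00"}, {"diagnosis_code": "401.9"}],
}

# Hypothetical mapping from a common concept to each site's (field, local code).
CONCEPT_MAP = {
    "type_2_diabetes": {
        "hospital_a": ("dx", "DM2"),
        "hospital_b": ("diagnosis_code", "250.00"),
    },
}


def federated_count(concept: str) -> int:
    """Translate the concept per site and return only the combined count."""
    total = 0
    for site, (field, local_code) in CONCEPT_MAP[concept].items():
        total += sum(1 for rec in SITE_RECORDS[site] if rec.get(field) == local_code)
    return total


if __name__ == "__main__":
    print(federated_count("type_2_diabetes"))
```

Returning a count rather than rows keeps the answer aggregate and de-identified, which is the property that makes cross-organization queries acceptable.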

4.  Non-obvious associations will be increasingly important

Sometimes, it is not enough to query multiple databases. Data needs to be linked to external resources to produce novel information. For example, at Harvard, we've taken the address of each faculty member, examined every publication they have ever written, geo-encoded the location of every co-author, and created visualizations of productivity, impact, and influence based on the proximity of colleagues. We call this "social networking analysis".

5.  The President's Council of Advisors on Science and Technology (PCAST) Healthcare IT report will offer several important directional themes that will accelerate "freeing the data".

The PCAST report suggests that we embrace the idea of universal exchange languages, metadata tagging with controlled vocabularies, privacy flagging, and search engine technology with probabilistic matching to transform transactional data sources into information, knowledge and wisdom.    For example, imagine if all immunization data were normalized as it left transactional systems and pushed into state registries that were united by a federated search that included privacy protections.  Suddenly every doctor could ensure that every person had up to date immunizations at every visit.

6.  Ontologies and data models will be important to support analytics

Creating middleware solutions that enable federation of data sources requires that we generally know what data is important in healthcare and how data elements relate to each other. For example, it's important to know that an allergy has a substance, a severity, a reaction, an observer, and an onset date. Every EHR may implement allergies differently, but by using a common detailed clinical model for data exchange and querying we can map heterogeneous data into comparable data.
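Here is a minimal sketch of what such a common model might look like in code. The `Allergy` class uses the five elements named above; the two source formats ("EHR A" and "EHR B") and their field names are hypothetical, invented only to show two different shapes mapping into one comparable record.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Allergy:
    substance: str
    severity: str
    reaction: str
    observer: str
    onset_date: date


def from_ehr_a(row: dict) -> Allergy:
    # Hypothetical EHR A: flat row with abbreviated column names and ISO dates.
    return Allergy(row["subst"], row["sev"], row["rxn"], row["obs_by"],
                   date.fromisoformat(row["onset"]))


def from_ehr_b(row: dict) -> Allergy:
    # Hypothetical EHR B: nested details with the onset stored as (year, month, day).
    details = row["details"]
    return Allergy(row["allergen"], details["severity"], details["reaction"],
                   details["reported_by"], date(*details["onset_ymd"]))


if __name__ == "__main__":
    a = from_ehr_a({"subst": "penicillin", "sev": "severe", "rxn": "hives",
                    "obs_by": "nurse", "onset": "2010-05-01"})
    b = from_ehr_b({"allergen": "penicillin",
                    "details": {"severity": "severe", "reaction": "hives",
                                "reported_by": "nurse", "onset_ymd": (2010, 5, 1)}})
    print(a == b)
```

Once both sources land in the same structure, records can be compared, merged, or queried without caring how each EHR stored them.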

7.  Mapping free text to controlled vocabularies will be possible and should be done as close to the source of data as possible.

Every industry has its jargon. Most clinicians do not wake up every morning thinking about SNOMED-CT concepts or ICD-10 codes. One way to leverage unstructured data is to turn it into structured data as it is entered. If a clinician types "Allergy to Penicillin", it could become SNOMED-CT concept 294513009 for Penicillins. As more controlled vocabularies are introduced in medicine and other industries, transforming text into controlled concepts for later searching will be increasingly important. Ideally, this will be done as the data is entered, so it can be checked for accuracy. If not at entry, then transformations should be done as close to the source systems as possible to ensure data integrity. With every transformation and exchange of data from the original source, there is increasing risk of loss of meaning and context.
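A simple way to picture mapping at the point of entry is a fuzzy lookup against a vocabulary table, so that a misspelled entry still resolves to a concept the clinician can confirm. The sketch below uses Python's standard `difflib` for the fuzzy match; the vocabulary table, the placeholder second entry, and the `to_concept` function are assumptions for illustration, and only the penicillin code comes from the example above.

```python
import difflib
from typing import Optional

# Only the penicillin code comes from the post; the latex entry is a placeholder.
VOCABULARY = {
    "allergy to penicillin": "294513009",
    "allergy to latex": "PLACEHOLDER-LATEX",
}


def to_concept(free_text: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the code of the closest vocabulary term, or None if nothing is close."""
    key = " ".join(free_text.lower().split())
    matches = difflib.get_close_matches(key, list(VOCABULARY), n=1, cutoff=cutoff)
    return VOCABULARY[matches[0]] if matches else None


if __name__ == "__main__":
    print(to_concept("Allergy to Pencillin"))  # misspelled on purpose
    print(to_concept("broken arm"))
```

Doing this while the clinician is still at the keyboard means a wrong match can be corrected immediately, before the data travels anywhere else.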

8.  Linking identity among heterogeneous databases will be required for healthcare reform and novel business applications.

If a patient is seen in multiple locations, how can we combine their history so they get the maximum benefit of alerts, reminders, and decision support? Among the hospitals I oversee, we have persistent linkage of all medical record numbers between hospitals - a master patient index. Surescripts/RxHub does a real-time probabilistic match on name/gender/date of birth for over 150 million people. There are other interesting creative techniques such as those pioneered by Jeff Jonas for creating a unique hash of data for every person, then linking data based on that hash. For example, John, Jon, Jonathan, and Johnny are reduced to one common root name, John. "John" and the other demographic fields are then hashed using SHA-1. The hashes are compared between records to link similar hashes. In this way, records about a person can be aggregated without ever disclosing who the person really is - it's just hashes that are used to find common records.
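The hashing approach can be sketched as follows. This is a simplified illustration, not Jeff Jonas's actual implementation: the nickname table, the field order, and the `link_key` function are assumptions, and the sketch links records only when their hashes are exactly equal.

```python
import hashlib

# Hypothetical nickname table that reduces name variants to a common root.
NAME_ROOTS = {"john": "john", "jon": "john", "jonathan": "john", "johnny": "john"}


def link_key(first: str, last: str, gender: str, dob: str) -> str:
    """Normalize the demographic fields, then return their SHA-1 hex digest."""
    first_norm = first.strip().lower()
    root = NAME_ROOTS.get(first_norm, first_norm)
    fields = [root, last.strip().lower(), gender.strip().lower(), dob.strip()]
    return hashlib.sha1("|".join(fields).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    site_a = link_key("Johnny", "Smith", "M", "1970-01-01")
    site_b = link_key("Jon", " smith", "m", "1970-01-01")
    print(site_a == site_b)
```

Two sites can exchange only the keys and still discover that they hold records for the same person. In practice a keyed or salted hash would likely be needed, since plain hashes of common names and birth dates can be guessed by brute force.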

9.  New tools will empower end users

All users, not just power users, want web-based or simple to use client server tools that allow data queries and visualizations without requiring a lot of expertise.  The next generation of SQL Server and PowerPivot offer this kind of query power from the desktop.    At BIDMC, we've created web-based parameterized queries in our Meaningful Use tools, we're implementing PowerPivot, and we're creating a powerful hospital-based visual query tool using I2B2 technologies.

10.  Novel sources of data will be important

Today, patients and consumers are generating data from apps on smart phones, from wearable devices, and from social networking sites. Novel approaches to creating knowledge and wisdom will source data from consumers as well as traditional corporate transactional systems.

Thus, as we all move toward "freeing the data" it will no longer be sufficient to use just structured transaction data entered by experts in a single organization, then mined by professional report writers. The speed of business and the need for enhanced quality and efficiency are pushing us toward near real time business intelligence and visualizations for all users. In a sense this mirrors the development of the web itself, evolving from expert HTML coders, to tools for content management for non-technical designated editors, to social networking where everyone is an author, publisher, and consumer.

"Freeing the data" is going to require new thinking about the way we approach application design and requirements.   Just as security needs to be foundational, analytics need to be built in from the beginning.

I look forward to my keynote in a few weeks.  Once I've delivered it, I'll post the presentation on my blog.


Anonymous said...

When talking about freeing data, at least at a large healthcare facility or IDN level, the idea of Vendor Neutral Archiving continues to be brought up as a relevant and viable solution. What are your thoughts on VNA and how it fits into the larger picture of "freeing data"?

Carl Frappaolo said...

Great post John. I wish I could be there to hear your presentation. What you summarize here has been a long time coming - and applicable to virtually every industry. I too have been using the IBM Jeopardy Watson search engine as a mainstream popularized example of what NLP and intelligent search is all about. For several years most would not believe me when I spoke of such things. Watson may do for intelligent search what Google did for "basic" search.

I look forward to your posting of your presentation.

Bob Rogers said...

We are very energized by this blog entry. While we are well-aligned with your vision in general, we have added some specific comments on our experience developing search and reconciliation technology.

“1.  Both structured and unstructured data will be important”

We agree. This short video demonstrates how we’ve taken both your personal CCD and unstructured PDF medical records and made them instantly available in a unified, comprehensive view in response to a directed clinical query by the physician (for example “palpitations”).

“2.  Inference is possible by parsing natural language”

This is paramount for increasing search precision and extracting useful information. We find that by applying statistical semantic analysis techniques one can infer context and improve search quality. As a real example, a search for “anxiety” returns the problem “stress” but not the procedure “stress test,” a common cardiology assessment.

“3.  Data mining needs to go beyond single databases owned by a single organization.”

At the HIMSS11 Interoperability Showcase (see our white paper), we have demonstrated the ability to run a live query across many IHE actors, thereby supporting key clinical information sharing. The Direct Project holds great promise as well.

“5.  The President's Council of Advisors on Science and Technology (PCAST) Healthcare IT report will offer several important directional themes that will accelerate "freeing the data".”

We are in favor of a Universal Exchange Language and are ready to contribute to its development. In the meantime, we have found a way to combine and reconcile multi-source data using a suite of natural language processing (NLP) techniques. Perhaps an alternate solution is to develop a "Universal way of Exchanging Language."

“6.  Ontologies and data models will be important to support analytics”

We too believe that middleware solutions that enable federation of data are the keystone to data access and analytics. Our current solution is to allow any provider or consumer application to query clinical information from our system via REST and SOAP APIs.

“7.  Mapping free text to controlled vocabularies will be possible and should be done as close to the source of data as possible.”

We absolutely agree with you. The clinician does not think about search in terms of SNOMED-CT or ICD-10 codes. We find that our users prefer to search using “doctor-friendly terms” such as “chest pain” to retrieve related information such as CKMB results, “GERD” and “Holter monitor.”

While we agree with your vision of associating controlled vocabularies with free text as the data is entered, in today’s environment we find overwhelming amounts of incomplete structured data and diverse textual data. To ensure data integrity in the face of such data heterogeneity, our DisplayMerge technology reconciles and merges data at display time and provides simple ways for users to control quality and change data associations (merges). Changes made by any user persist within the network in a Wiki fashion, allowing errors to be resolved over time.

“9.  New tools will empower end users”

There is a lot of work to be done in this space, but at this point we believe search is the easy button of healthcare information. (John- Next time we meet, we would love to show you our next generation smart search capabilities.)

“10.  Novel sources of data will be important”

We have learned that it is possible to extract medical knowledge from the Web to create associations between medical terms.

This is an exciting time in our industry. While today’s technology allows clinical data coming from multiple sources to be instantly accessible and searchable, our biggest challenge lies in how we can expand these capabilities to broad provider and consumer networks, while maintaining privacy and security.

We look forward to your keynote presentation and future dialog.

Bob Rogers
Chief Scientist
Apixio Inc.

Medical Quack said...

You are right on the money with talking about how we design applications for connectivity and thus perhaps we are moving up to the next level again:)

Mapping the free text once it gets there will really enhance knowledge and I'm still watching to see how that end of the scheme progresses, like you said to not lose meaning.

I liked Watson too and actually found a needier place than healthcare I think:) With speech recognition, connection to the web, it could open doors for our lawmakers to get the timely connected information they need to query too in order to create more effective laws with fewer unintended consequences,(just like we want in healthcare) I would like that as much as anyone out there.

I did see where Watson made a cameo appearance but would be nice if it were parked in DC full time:)

Natural language processing might be an asset there too as well as in healthcare:)

Anonymous said...

It was great to see you in the Apple iPad video.

Ahier said...

Data Liberación!!!!!

Unknown said...

Great post John. I really like the idea of leveraging unstructured data. I think this is often overlooked but can probably give the user some of the most valuable information. We have bookmarked this post on our community for IM professionals. Look forward to reading your work in the future.

Doktor Towerstein said...

Just a small comment: all the technology you mention is already out t/here :)

Instead, I'd ask you: what are the biggest obstacles to their deployment in a healthcare environment?

Thanks, and greetings,

Javier Torres