Your NLP Solution is Only As Good As The Data You Give It.
Manas Ranjan Kar
The success of Natural Language Processing (NLP) solutions for healthcare requires more than just a solid AI program behind them. Alongside complex machine learning algorithms, NLP solutions require extensive domain expertise on the part of the programmers and coders who create them. Additionally, underneath every good NLP engine is a good set of training data: a dataset put together to show the machine the right way to make connections between entries. For instance, if a set of training data places “diabetes” next to the code “E119” enough times, the program learns that when it finds “diabetes” on a medical chart it analyzes, “E119” is the most probable code. This example is fairly straightforward, but few cases are this simple, and most require a robust dataset to teach the machine the proper connections between codes and diagnoses. Keep reading to see how NLP works and what to look for in a vendor to make sure their NLP solution can return accurate results.
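To make the “diabetes” and “E119” example concrete, here is a minimal sketch of the idea in Python. It simply counts how often human coders paired a term with a code and returns the most frequent pairing. The training pairs, the function names, and the counting approach are all assumptions made for illustration; a production NLP engine would rely on far richer models and far more data.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs: (term found in a chart, code a human coder assigned).
# These rows are invented for illustration and are not real chart data.
TRAINING_PAIRS = [
    ("diabetes", "E119"),
    ("diabetes", "E119"),
    ("diabetes", "E119"),
    ("diabetes", "E109"),
    ("hypertension", "I10"),
]


def train(pairs):
    """Count how often each term appears next to each code."""
    counts = defaultdict(Counter)
    for term, code in pairs:
        counts[term.lower()][code] += 1
    return counts


def most_probable_code(counts, term):
    """Return (code, share of examples) for the most frequent pairing, or None."""
    seen = counts.get(term.lower())
    if not seen:
        return None  # the term never appeared in the training data
    code, n = seen.most_common(1)[0]
    return code, n / sum(seen.values())


model = train(TRAINING_PAIRS)
print(most_probable_code(model, "diabetes"))
print(most_probable_code(model, "asthma"))
```

With only a handful of examples the estimate is fragile, which is why the size and quality of the training dataset matter so much.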
Healthcare Data: Structured vs. Unstructured
Healthcare data typically comes as what’s called “unstructured data”: data that can be any combination of text, images, or multimedia. Critically, this data isn’t separated or labelled in a way that distinguishes important information on a document, like a diagnosis, from unimportant information, like a page number.
The lack of structure makes NLP much more complicated, and it isn’t an easy problem to solve. Not only does the structure of healthcare data vary widely between organizations, but there may even be multiple ways of putting together medical records within a single organization. On top of those issues, the data structure is seldom, if ever, passed on to the medical record retrieval and analytics vendor (like Episource).
NLP solutions are powerful because they help us make sense of this unstructured data, but they need a guide. That’s where the training dataset comes in. Coders create a blueprint of how all of the data points fit together by manually structuring unstructured data. The NLP program then analyzes this dataset to learn the connections between diagnosis (Dx) codes and Hierarchical Condition Category (HCC) codes.
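As a rough picture of what “manually structuring unstructured data” can look like, the sketch below marks the part of a chart that matters (the diagnosis) with a label and a code, leaving the page number unlabelled. The record layout, the field names, and the sample chart text are assumptions for illustration only, not a description of any vendor’s actual annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One labelled span of chart text (hypothetical layout)."""
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # what the span is, e.g. "diagnosis"
    code: str    # the code a human coder assigned to it


def annotate(text, phrase, label, code):
    """Locate the first occurrence of `phrase` in `text` and wrap it in an Annotation."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return Annotation(start, start + len(phrase), label, code)


# Invented chart text: the page number is noise, the diagnosis is the signal.
chart_text = "Page 3 of 12. Assessment: diabetes, stable on current medication."

record = {
    "text": chart_text,
    "annotations": [annotate(chart_text, "diabetes", "diagnosis", "E119")],
}

for ann in record["annotations"]:
    print(ann.label, ann.code, repr(chart_text[ann.start:ann.end]))
```

Many such records, each reviewed by a coder, are what make up the training dataset the NLP program learns from.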
Difficulties with Healthcare Data
Creating structured datasets to train NLP solutions is more complicated in healthcare than in most other industries, largely because of the specialized knowledge required to assemble a correct dataset. Coders creating a structured training dataset need a breadth of knowledge that allows them to identify codes and properly contextualize them.
For example, there is no standardization of abbreviations. One doctor might use “dia” to mean diabetes, while another might use “DM2.” Similarly, one abbreviation can refer to multiple conditions. PVD, for instance, could be Peripheral Vascular Disease or Posterior Vitreous Detachment. Understanding the context within the rest of the chart and having deep medical knowledge is the only way to properly identify these codes.
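The PVD example can be sketched as a toy disambiguation step that looks at the other words on the chart. The keyword lists below are invented for illustration and the keyword-overlap scoring is a deliberately simple stand-in; it is an assumption, not how a real clinical NLP engine resolves abbreviations.

```python
import re

# Hypothetical context keywords for each possible meaning of an abbreviation.
SENSES = {
    "PVD": {
        "Peripheral Vascular Disease": {"leg", "claudication", "pulse", "artery", "vascular"},
        "Posterior Vitreous Detachment": {"eye", "floaters", "retina", "vision", "vitreous"},
    }
}


def expand_abbreviation(abbrev, chart_text):
    """Pick the meaning whose keywords overlap most with the chart text.

    Returns None when the abbreviation is unknown or no keyword matches.
    """
    senses = SENSES.get(abbrev.upper())
    if senses is None:
        return None
    words = set(re.findall(r"[a-z]+", chart_text.lower()))
    scores = {name: len(words & keywords) for name, keywords in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(expand_abbreviation("PVD", "Patient reports floaters in left eye; PVD noted."))
print(expand_abbreviation("PVD", "PVD with claudication, diminished pedal pulse in right leg."))
```

When no keyword matches, the function returns None rather than guessing, which is the point at which a human coder’s judgment is still needed. This toy version also does not handle ties between meanings.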
How to Make Sure Your NLP Solution is Actually a Solution
Creating a solid NLP solution takes more than just good technology. It also takes the resources and domain expertise to create a set of training data that teaches the solution how to properly identify and contextualize HCC and Dx codes. Episource has implemented three ways to make sure that our training data is high quality:
We employ over 3,000 domain experts and medical coders
We code millions of charts each year, providing a robust amount of data to pull from
We perform three levels of QA on our data and annotate it
Each of these checkpoints allows us to make sure that our NLP solution brings together the best of technology and human expertise to create a robust and accurate coding process. To learn more, contact us today.