Saturday, May 30, 2009

Speech Recognition Software - The Case For Using A Medical Vocabulary And Language Model In The Generation Of EMR/EHR Documents


Dragon NaturallySpeaking (DNS), the most widely used front-end speech recognition software application, comes in several different versions. They have many basic features in common, a number of which I’ll discuss in this post. One version, the medical version, has a set of specialized vocabularies (shown in list below) that make it uniquely suitable for use throughout the healthcare industry.



This post will touch upon a number of the features that all of the versions of DNS have in common as well as some of the special capabilities of the medical version.

Because every person's voice is different, and words can be spoken with a wide range of nuance, tone, and emotion, the computational task of successfully recognizing spoken words is considerable and has been the subject of many years of continuing research around the world.

A variety of different approaches are used, with the most widely used underlying technology being the Hidden Markov Model (discussed briefly below). These techniques all attempt to search for the most likely word sequence given the fact that the acoustic signal will also contain a lot of background noise. The task is made easier if the system can be trained to recognize one person's voice pattern rather than that of many people, and it is also easier if isolated words are to be recognized rather than continuous speech. Similarly, the task is easier if the vocabulary is small, the grammar constrained and the context well-defined.

The complexity of these problems has meant that most of the voice recognition systems developed to date cannot recognize continuous speech from a wide variety of people and with a wide vocabulary as successfully as any human listener.

Nonetheless, despite these challenges, the present technology for speaker-dependent large-vocabulary speech recognition systems now works quite well on a PC. And, today, many healthcare-industry applications are well suited for use with this technology. For example, speech recognition is being implemented in both the front-end and back-end of the medical documentation process.

Front-end speech recognition (SR) is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. The document never goes through a medical transcriptionist (MT)/editor.

Back-end SR or deferred SR is where the provider dictates into a digital dictation system, and the voice is routed through a speech-recognition machine and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. Both front-end and back-end SR are being used widely in the healthcare industry today.

Many electronic medical records (EMR) applications are more efficient when deployed along with a speech-recognition engine. That is, searches, queries, and form filling may all be faster when data is entered by voice rather than by keyboard.



Average data-entry times for a paragraph using various input mechanisms

The next figure shows an EHR system with both front-end and back-end capabilities. In this post, I’ll focus on the former.




Before proceeding, here is a review of a few of the basic terms used in any discussion of speech recognition:

* Homophone
* Phoneme
* Acoustic model
* Vocabulary
* Language model
* Bi-gram, tri-gram and quad-gram
* Hidden Markov Model (HMM)
* Health Level 7 Clinical Document Architecture (HL7 CDA)

A homophone is a word that is pronounced the same as another word but differs in meaning. The words may be spelled the same, such as rose (flower) and rose (past tense of "rise"), or differently, such as carat, caret, and carrot, or to, two and too. Homophones that are spelled the same are also both homographs and homonyms. The term "homophone" may also apply to units longer or shorter than words, such as phrases, letters, or groups of letters that are pronounced the same as another phrase, letter, or group of letters.

DNS doesn't apply any rules of English grammar when attempting to "understand" your dictation. It does, however, use the statistical probability of words occurring together in the English language.

A phoneme is the smallest linguistically distinctive unit of sound. Phonemes carry no semantic content themselves.

In effect, a phoneme is a group of slightly different sounds which are all perceived to have the same function by speakers of the language in question. An example of a phoneme is the /k/ sound in the words kit and skill. In English there is a very poor match between spelling and phonemes.

When you dictate into DNS, it compares your utterances to the acoustic model, which contains your pronunciation of words (phonemes) set up by reading the enrollment passage(s). The phonetic equivalents are then sent to the vocabulary, which contains not only words and phrases but also their phonetics. Finally, homophones and near homophones are resolved using the language models, which look at the statistical probability of the phonetics of words appearing together in the installed language. The models look at pairs (e.g., "right away"), triplets (e.g., "write a letter"), and quadruplets. Throughout, the language model works on the phonetics making up the words.
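
To make that concrete, here is a minimal sketch, in Python, of how a bi-gram language model can choose between homophones that the acoustic model cannot tell apart. The word pairs and counts are invented purely for illustration; this is not how Dragon is actually implemented.

# Candidate words for the same phonetic sequence (homophones of /r ay t/)
homophones = ["right", "write", "rite"]

# Invented bigram counts: how often (previous_word, word) occur together in a corpus
bigram_counts = {
    ("turn", "right"): 120,
    ("turn", "write"): 0,
    ("turn", "rite"): 1,
    ("please", "write"): 95,
    ("please", "right"): 4,
    ("please", "rite"): 0,
}

def pick_word(previous_word, candidates):
    """Return the candidate that most often follows previous_word in the corpus."""
    scored = [(bigram_counts.get((previous_word, w), 0), w) for w in candidates]
    return max(scored)[1]

print(pick_word("turn", homophones))    # -> right
print(pick_word("please", homophones))  # -> write

Dragon's tri-gram and quad-gram models extend the same idea to longer stretches of context within a single utterance.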

Accurate transcription requires good acoustic input (i.e., a good microphone and sound card) and clear enunciation by the user. Local "dialect" can obviously produce inaccuracy. For example, in both the U.S. and U.K., some regional "dialects" drop the "g" from the end of a word, as in "beginnin' " -- without training, DNS is likely to transcribe this as "begin in".

Assuming the acoustic component is clear, the language model provides the most likely words for the phonetic equivalents according to the statistical probability of "words" occurring together. Initially the Dragon language models (produced by professional linguists) are based on "standard" grammatical English (or another language). The language model is modified as you correct misrecognitions; consequently, over time, you develop a more accurate one.

The bi-gram, tri-gram and quad-gram models look at associations within a single utterance. Hence you could possibly improve your recognition accuracy by choosing to dictate in shorter phrases. Normally, though, this is not the best way to dictate, as accuracy increases (for "standard" English) when you use complete sentences or even full paragraphs.

How Medical Specialty Vocabularies Provide a Better Experience

The greatest determinant of speech recognition accuracy is the appropriateness of the vocabulary and language model. To demonstrate the difference between Dragon Medical and Dragon Professional, here’s a comparison of how the two vocabularies handle the word “embolism.”



Dragon Professional is far more likely to transcribe "embolism" as "symbolism", because "symbolism" is rated higher than "embolism": it is far more commonly used by business professionals. In Dragon Medical, by contrast, "embolism" is rated much higher, because the statistical likelihood of "embolism" occurring in medical dictation is far greater.

Adding Medical Terms to Non-Medical Recognizers Won’t Bridge the Accuracy Gap

Simply having clinicians train and add hundreds of medical terms to Dragon will not adequately raise its accuracy for use in medical settings. Without Dragon Medical's language model, which carries knowledge of the relative frequency of use of both individual words and phrases, a non-medical speech recognizer cannot use the context of the words to provide that additional boost in accuracy.

If a clinician spoke "cerebral embolism" and not just "embolism", Dragon Medical would be far more likely to recognize the phrase than Dragon Professional because it recognizes the context in which "embolism" was spoken. Because the language models take into account not only the frequency of words but also the frequency of multi-word phrases, Dragon Medical is significantly more accurate for medical dictation.
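
As a rough illustration of why the domain-specific language model wins here, the sketch below scores the two candidate words under a hypothetical "business" model and a hypothetical "medical" model. All frequencies are invented; they merely stand in for the statistics that Nuance's linguists build from real corpora.

# Invented relative word frequencies (per million words of training text)
general_model = {"symbolism": 40.0, "embolism": 0.5}
medical_model = {"symbolism": 0.2, "embolism": 55.0}

# Invented bigram boosts: how strongly the previous word predicts the next
general_bigrams = {}
medical_bigrams = {("cerebral", "embolism"): 200.0}

def score(word_freqs, bigrams, previous_word, candidate):
    # Unigram frequency, multiplied by any boost from the preceding word
    return word_freqs.get(candidate, 0.01) * bigrams.get((previous_word, candidate), 1.0)

for name, freqs, bigrams in [("Professional", general_model, general_bigrams),
                             ("Medical", medical_model, medical_bigrams)]:
    best = max(["embolism", "symbolism"],
               key=lambda w: score(freqs, bigrams, "cerebral", w))
    print(name, "->", best)
# Professional -> symbolism
# Medical -> embolism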

Other examples of phrases having far better recognition by Dragon Medical are below.



EHR and speech recognition can also help with the standardization of content within a clinical document. Physicians often use different words with the same intent; for example, they may dictate "history," "HPI," or "history of present illness," or they may refer to "findings" or "results." DNS allows the health care provider to implement best practices by coding all common terms alike to normalize the sections or subsections of the medical report and to enable comprehensive retrieval of this information.

Note: Health Level 7 Clinical Document Architecture (HL7 CDA) is a way to establish commonality of terms within a document, and it serves as the basis to enlarge and enrich the flow of data into the EHR.

The Vocabulary Editor shows you all the active words (the most commonly used words) in the Dragon Medical vocabulary. You can open Vocabulary Editor to find out whether a word is in the active vocabulary. If it’s not there, you can add it. If it is, you can create a different spoken form.

To overcome many of these shortcomings, you can use DNS’s Vocabulary Editor to

* Add words that are spoken one way but written a different way. This feature lets you add a word that, for example, types your phone number whenever you say “phone number line.” (Discussed below)

* Change the formatting properties of a word, such as whether Dragon Medical should type a space before or after the word. You can do this by using the Word Properties dialog box. (Discussed below)

The next three figures illustrate the use of the Vocabulary Editor and training that enabled me to speak “Code44” and watch “Halitosis” appear in a Microsoft Word document.







Note: The red underlining in the figure above was put in by me after the fact, using the Windows Paint application.

However, this trivial example of one-to-one translation can be extended to one-to-many translation: That is, in a matter of seconds, you could “program” Dragon Medical to type out a whole sentence in response to your speaking just a single word or code into the microphone. And, vice versa: you could “program” Dragon Medical to type out just a single word or code in response to your speaking a whole sentence into the microphone.
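
Conceptually, what you are defining here is just a mapping from a spoken form to a written form. The toy sketch below (not the Dragon API; the phrases and expansions are made up) shows the one-to-one, one-to-many, and many-to-one cases side by side:

# A made-up custom vocabulary: spoken form -> written form
custom_vocabulary = {
    # one-to-one: a spoken code produces a single word (like the "Code44" example above)
    "code forty four": "Halitosis",
    # one-to-many: a short spoken phrase produces a whole sentence
    "normal cardiac exam": "Heart: regular rate and rhythm, no murmurs, rubs, or gallops.",
    # many-to-one: a long spoken phrase produces a short code
    "history of present illness": "HPI",
}

def written_form(spoken_phrase):
    """Return the written form for a spoken phrase, or the phrase itself if unmapped."""
    return custom_vocabulary.get(spoken_phrase.lower(), spoken_phrase)

print(written_form("Code forty four"))        # -> Halitosis
print(written_form("normal cardiac exam"))    # -> the full sentence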


In addition, you can use this dialog box to view and further customize the formatting properties of the words in your active vocabulary. Click here for further details.

Before continuing, here’s a very brief overview of the anatomy and physiology of speech production and the models and technology used in speech-to-text translation.




The figure below, from an M.I.T. Lincoln Labs – Nuance Communications, Inc. presentation, shows a display of this output spectrum as a function of time.


Now for an overview:


Audio input: A microphone provides the audio input. The microphone captures the spoken words as sound waves, which are converted from analog to digital format. The microphone is connected to a computer with a sound card installed. Digital voice recorders do not require the use of a sound card. The spoken words are processed to remove any noise. The quality of the microphone also influences the recognition rate; a good microphone should cancel out ambient noise.

Speech engine: There are two speech engines, one for recognizing speech and the other for converting text to speech. Converting text to speech is called speech synthesis.

Language model: This is a very large list of words used in voice recognition. The language model contains the words and the probabilities of their use in a voice recognition application. A language model is sometimes called a dictionary or lexicon. For example, a radiology language model contains all the words most likely to be used when doing a radiology report. Examples of other models are cardiology and pathology.

Grammar: In a speech recognition system, a grammar file consists of a list of words or phrases which are recognized by the speech engine and are used to drive the application. Grammars are used to constrain what users can say in a voice recognition application. For example, a grammar can define voice commands that let the user save a radiology report, print it, or close the application.
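
As a rough sketch of the idea, the snippet below defines a tiny command grammar and rejects anything outside it. The command phrases and action names are hypothetical, not those of any real radiology application:

import re

# Each entry: a pattern for an allowed phrase and the action it triggers
command_grammar = [
    (re.compile(r"^save (the )?report$"), "SAVE_REPORT"),
    (re.compile(r"^print (the )?report$"), "PRINT_REPORT"),
    (re.compile(r"^close (the )?application$"), "CLOSE_APPLICATION"),
]

def match_command(utterance):
    """Return the action for an utterance, or None if it is out of grammar."""
    text = utterance.strip().lower()
    for pattern, action in command_grammar:
        if pattern.match(text):
            return action
    return None  # out-of-grammar speech is rejected rather than guessed at

print(match_command("Print the report"))   # -> PRINT_REPORT
print(match_command("order lunch"))        # -> None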

Acoustic Model: When voice is captured by a microphone, the analog signal is converted into a digital signal. Using digital signal processing, the signal is divided into speech frames of 10 ms (illustrated in a figure above). These frames are analyzed using an acoustic model, which compares them against its stored patterns to obtain probabilities that a certain word has been spoken by the user. There are a number of acoustic models which can be used for speech recognition, but the majority of speech engines available today use the Hidden Markov Model (HMM).
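
The framing step is simple enough to show directly. The sketch below slices a recording into 10 ms frames with NumPy; the file name and the 16-bit, single-channel format are assumptions made for the sake of the example, and the feature extraction and HMM scoring that follow are not shown.

import wave
import numpy as np

FRAME_MS = 10  # frame length in milliseconds, as described above

# Assumes a 16-bit, single-channel WAV file; "dictation.wav" is a placeholder name
with wave.open("dictation.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()   # e.g. 16000 samples per second
    samples = np.frombuffer(wav_file.readframes(wav_file.getnframes()), dtype=np.int16)

samples_per_frame = sample_rate * FRAME_MS // 1000   # 160 samples at 16 kHz
usable = len(samples) - len(samples) % samples_per_frame
frames = samples[:usable].reshape(-1, samples_per_frame)

# Each row of `frames` is one 10 ms slice that the acoustic model would turn into
# features and score against its phoneme models
print(frames.shape)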

The HMM is a statistical model and is favored because it is easy to understand, easy to implement, faster, and requires less training compared to other models. Some speech recognition systems use a hybrid combination of models, for example the Hidden Markov Model and an Artificial Neural Network (ANN). The ANN model is loosely based on the biological model of the human neural system.

Hidden Markov Model

* Chain of phonemes that makes up a word
* Used first on words and then on sentences
* Statistical analysis based on previous phrases (similar to predictive text messaging on cell phones)
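
To make those bullets a little more concrete, here is a toy Viterbi decoder over three phoneme states, meant only to illustrate how an HMM combines per-frame acoustic scores with transition probabilities to pick the most likely phoneme sequence. Every number is invented, and a real recognizer works with far larger models:

import numpy as np

phonemes = ["k", "ae", "t"]   # toy states, as in the word "cat"

# Probability of moving from one phoneme state to the next (rows sum to 1)
transition = np.array([[0.6, 0.4, 0.0],
                       [0.0, 0.6, 0.4],
                       [0.0, 0.0, 1.0]])

# Invented acoustic scores: P(frame | phoneme) for five 10 ms frames
emission = np.array([[0.8, 0.1, 0.1],
                     [0.6, 0.3, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.1, 0.2, 0.7],
                     [0.1, 0.1, 0.8]])

def viterbi(emission, transition, start_state=0):
    """Return the most likely state sequence for the given frame scores."""
    n_frames, n_states = emission.shape
    log_t = np.log(transition + 1e-12)
    log_e = np.log(emission + 1e-12)
    score = np.full(n_states, -np.inf)
    score[start_state] = log_e[0, start_state]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        new_score = np.empty(n_states)
        for s in range(n_states):
            candidates = score + log_t[:, s]
            back[t, s] = int(np.argmax(candidates))
            new_score[s] = candidates[back[t, s]] + log_e[t, s]
        score = new_score
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [phonemes[s] for s in reversed(path)]

print(viterbi(emission, transition))   # -> ['k', 'k', 'ae', 't', 't']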

A Guide To Using DNS Medical

DNS 10 Medical is a very powerful tool that can help its users deliver better healthcare at lower cost. But its users have to know how to take care of it before they can benefit, long term, from this resource.

Failing to observe good practices with any computer application is like failing to perform regular maintenance on your car, such as changing the oil, getting regular tune-ups, maintaining the proper inflation of your tires, etc. If you don't do this, the chances are very good that your car won't last longer than 50,000 miles at best. The same applies to DNS.

Over the course of a day of continuous dictation, your voice, your dictation style, your enunciation, and other factors that affect the performance of DNS all change. We don't start out in the morning dictating in one manner and end the day dictating in the same manner. These changes affect how well DNS recognizes what you say.

There are several features introduced in DNS 9 that affect overall accuracy over time. These are the PelAudio Acoustic Scale Score and a feature called SilentAdapt, which uses the PelAudio Acoustic Scale Score to analyze your dictation during the course of any session, whether short or long. These can have a positive effect on your accuracy, but they can also have a negative impact.

Every time you dictate anything, DNS analyzes what you say using the PelAudio Acoustic Scale Score and assigns a confidence level. That is, it determines how consistently you say the same things in the same way and assigns a score to each set of words and utterances. The positive effect is that, via the SilentAdapt feature, DNS learns to repeatedly recognize what you say based on the assignment of PelAudio Acoustic Scale Scores. Another positive effect is that DNS learns, from your dictation via the same methodology, to ignore those words or phrases that you do not use frequently. For example, if you add a word or phrase to your vocabulary via the Vocabulary Editor but don't use it again for a predetermined period of time, DNS learns to ignore it. This methodology is used to avoid misrecognitions that might otherwise occur during the course of dictation.

The negative effect is that if you don't perform corrections and simply dictate for hours, leaving your corrections to the end of your dictation session, DNS will tend to place a higher PelAudio Acoustic Scale Score on misrecognitions, thus tending to repeat them rather than making the correct recognition. This does not occur immediately, but it does occur over time because of these features/functions. Therefore, it behooves all users to proofread what they have dictated at reasonable intervals and make the appropriate corrections, training them as necessary.

Proofreading documents dictated using DNS is different from standard proofreading. DNS does not make spelling errors: all the words that DNS recognizes are spelled correctly, even when the overall recognition is not correct. Therefore, it's important to learn how to recognize grammar and context errors, as well as how to properly train DNS to correct them. Users frequently pick up misrecognized words and phrases when proofreading. One way of dealing with this issue is to make frequent use of DNS's "playback" feature, which plays back what you said exactly the way you said it so that you can compare it to the actual recognized text. This is often helpful to inexperienced users in learning how to proofread dictated documents, by teaching them how to recognize these types of dictation errors.

You shouldn't dictate for many hours without closing and saving your user profile and then relaunching it. Over time, the active user profile stores volumes of information about your dictation, corrections, and other data that DNS uses to improve your accuracy. Dictating with a single user profile over many hours can cause "bloat" that needs to be cleared by periodically closing and saving the profile: unnecessary information is cleared from memory, leaving your user profile relatively clean and current. Obviously, running DNS's Acoustic and Language Model Optimizer does a better job of optimizing your user profile, but periodically closing and reopening your user profile has a moderately similar effect on overall performance.

We all normally adjust to the changes in our dictation style and voice. While we generally don't detect these changes, simply because of the way the human brain works, DNS is particularly sensitive to them, and this sensitivity is what generally causes much of the degradation in accuracy that users experience when using their user profiles over a long period of time.

In addition, putting the microphone to sleep does not turn it off: DNS continues to listen to anything coming into the microphone while waiting for the wake-up command. Although this generally does not have a negative impact, it can, depending upon what DNS is hearing. So it is generally better to turn the microphone off rather than leaving it on/asleep for any length of time, particularly if there is significant background speech and/or noise.

Finally, it is always a good idea to rerun the Audio Setup Wizard whenever you begin to detect an increase in the number of misrecognitions. This readjusts the microphone settings based on both the current environment (background) and any changes in your voice or manner of dictation. In short, it readjusts the settings to reflect anything that may affect recognition accuracy, particularly if there is any significant difference between the settings used when you first started dictating in the morning and the current state of your voice, dictation style, etc.

Lastly, remember that constant use of your system in terms of opening and closing applications, dictation using DNS, and other interactive factors occurring in the background during the course of a day have an impact on Windows performance and resources. Periodically reboot your system. This cleans memory entirely and lets you start over again from square one with full access to all the Windows resources and memory. This may seem like a pain, but it is, from time to time, essential to the proper performance of DNS, as well as the proper performance of Windows itself. Remember that as goes Windows, so goes DNS. Not the other way around.

A Recap

NaturallySpeaking and the Acoustic Model

The way you speak is totally distinctive; no one on earth sounds exactly the same as you do. Dragon NaturallySpeaking relies on this individuality to create a unique mathematical model of your voice's sound patterns.

NaturallySpeaking analyzes each sound you make and compares it to a database of thousands of possible syllables in the English language. As it becomes more familiar with your speech patterns (a process greatly enhanced by training the application when creating a new user profile), it becomes more accurate at identifying individual sounds. For example, the way you pronounce a "th" sound changes how Dragon NaturallySpeaking responds to any word with that sound in its pronunciation.

As the acoustic model recognizes sounds, it’s the vocabulary’s task to relate those sounds to actual words.

NaturallySpeaking and vocabularies

A vocabulary in Dragon NaturallySpeaking is compiled from a body of information that typically includes a word list and a language model. The word list adds words to Dragon NaturallySpeaking's active vocabulary (which is loaded into RAM and allows instant recognition) and to the backup dictionary (which has an expanded number of words for correction purposes), improving the language model and recognition accuracy when the vocabulary is compiled. The language model contains usage and context information about all the words.

Therefore, Dragon NaturallySpeaking uses a vocabulary to recognize words correctly based not only on the sounds of the words, but also on the context of those words within your current document.

All words in the vocabulary have an initial set of pronunciations. The acoustic model uses these pronunciations to decide which words most closely match what was spoken. A word may have more than one pronunciation assigned to it, such as the word "either," which may be pronounced "EE-ther" or "EYE-ther"; and in turn a pronunciation may have more than one word assigned to it, such as the words "to", "too" and "two". In this case, Dragon NaturallySpeaking's language model assesses the context of the word within the sentence to determine which word is most correct.

Narrative Paradigm

There is often a considerable difference between what is typed or hand written into a report and what is put into a report that's created by a speech recognition system. The latter is often narrative based, capturing important nuances in addition to the bare facts.

A note on full-URL links vs. compressed links

I've been asked why I didn't use link-shrinkers in earlier posts. Here's why:

First, I should say that there are some things I like about link compression: Some link-shrinkers let you personalize the new address with a unique phrase such as your name, or show you how many people click the link after you've posted it. Furthermore, link compression is just the beginning. More and more of these outfits allow users to see all sorts of details like where a link is showing up around the Web and where the people clicking on it are located.

However, this convenience may come at a cost. The tools add another layer to the process of navigating the Web, potentially leaving a trail of broken links if a service suddenly closes shop. They can also make it harder to tell what you're really clicking on, which may make these Lilliputian links attractive to spammers and scammers.

But popularity and convenience don't eliminate the potential risks of these link loppers. With so many services springing up, chances are some will just as quickly disappear. And if a URL shortening service goes down, the links created with it could lead nowhere.

Another worry is that you're not likely to know exactly where a truncated link will take you. So you could be directed to unsavory or illegal content or something malicious like a computer worm. This means URL shortening services need to keep an eye on the kinds of sites their users are linking to.