Friday, 25 March 2011

Recognizing specialized vocabulary with large dictionaries

One of the goals of the work which inspired this blog is to integrate a speech recognition engine into a lecture capture system (specifically, integrating CMU Sphinx into Opencast Matterhorn).

Many university lectures include a high proportion of specialist terms (e.g. medical and scientific terms, discipline-specific terminology and jargon). These are important words. They are the "content anchors" of the lecture, and are likely to be used as search terms should a student want to locate a particular lecture dealing with a topic, or jump to a section of a recording.

Hence applications of speech recognition in an academic context need to pay special attention to recognizing these words correctly. ASR engines use linguistic resources to recognize words: a pronunciation dictionary, which maps words to typical pronunciations, and a language model, which is a statistical model of the frequency with which words and word combinations (n-grams) occur in a body of text.
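As a rough illustration of the language-model idea (a toy maximum-likelihood sketch, not the format Sphinx actually uses), bigram probabilities can be estimated from word-pair counts:

```python
from collections import Counter

def bigram_model(words):
    """Estimate a maximum-likelihood bigram model:
    P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
```

Real ASR language models add smoothing and back-off so that unseen n-grams still receive a small probability; this sketch leaves that out for brevity.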

This post examines the "size and shape" of dictionary that would be required to recognize most specialist terms correctly in a particular domain. The reference text is an edited transcript of a lecture delivered to undergraduate Health Sciences (Medical School) students on "Chemical Pathology of the Liver".

The dictionaries evaluated come from a variety of sources. Google's ngram dictionary is a list of words from English-language books with a minimum frequency cutoff of 40. BEEP and CMU are ASR pronunciation dictionaries. The Bing dictionary is a list of the 100,000 most frequent terms in documents indexed by Bing, and WSJ 5K is a small vocabulary from the Wall Street Journal (WSJ) corpus.

The Wikipedia dictionaries were created from a plain text list of sentences from Wikipedia articles. The complete list of words was sorted by descending frequency of use, with a cutoff of 3. Wikipedia 100K, for example, contains the most frequent 100,000 terms from Wikipedia.
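A minimal sketch of this dictionary-building process (illustrative code, not the actual scripts used to produce the Wikipedia dictionaries):

```python
from collections import Counter

def build_dictionary(sentences, min_freq=3, top_n=None):
    """Count word frequencies across a corpus, drop words below a
    frequency cutoff, and return the rest sorted most-to-least frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # Apply the frequency cutoff, then sort by descending frequency.
    kept = [(w, c) for w, c in counts.items() if c >= min_freq]
    kept.sort(key=lambda wc: wc[1], reverse=True)
    ranked = [w for w, _ in kept]
    return ranked[:top_n] if top_n is not None else ranked
```

With `top_n=100000`, this would correspond to a "Wikipedia 100K"-style dictionary: the 100,000 most frequent terms surviving the cutoff.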

The dictionaries all list variant forms (e.g. speak, speaker, speaks) as separate words rather than reducing them to stems. The comparison of the lecture text to each dictionary considers only words of 3 or more characters, on the assumption that 1- and 2-letter English words are not problematic in this context, and because excluding them from the Wikipedia dictionaries avoids some noise.

The reference text contains 7810 words which meet this requirement, using a vocabulary of 1407 unique words. Compared against the candidate dictionaries, we find:

Dictionary              Size        OOV    OOV%    Unique OOV   Unique OOV%
Google 1gram Eng 2009   4 631 186     12   0.15%        8          0.57%
Wikipedia Full          1 714 417     22   0.28%       13          0.92%
Wikipedia 1M            1 000 000     27   0.35%       16          1.14%
Wikipedia 500K            500 000     41   0.52%       23          1.63%
Wikipedia 250K            250 000    112   1.43%       43          3.06%
Wikipedia 100K            100 000    269   3.44%       90          6.40%
BEEP 1.0                  257 560    413   5.29%      124          8.81%
CMU 0.7.a                 133 367    455   5.83%      146         10.38%
Bing Top100K Apr2010       98 431    514   6.58%      125          8.88%
WSJ                         4 986  2 177  27.87%      696         49.47%

So if we are hoping to find more than 99% of the words in our lecture in a generic English dictionary, i.e. an out-of-vocabulary (OOV) rate of < 1%, we require a dictionary of between 250K and 500K terms.
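The OOV comparison described above can be sketched as follows (an illustrative implementation, including the 3-character filter; not the actual evaluation scripts):

```python
def oov_stats(transcript_words, dictionary):
    """Compare transcript words (3+ characters only) against a
    dictionary, returning token and unique-word OOV counts and rates."""
    dict_set = {w.lower() for w in dictionary}
    tokens = [w.lower() for w in transcript_words if len(w) >= 3]
    oov_tokens = [w for w in tokens if w not in dict_set]
    return {
        "tokens": len(tokens),
        "oov": len(oov_tokens),
        "oov_pct": 100.0 * len(oov_tokens) / len(tokens),
        "unique_oov": len(set(oov_tokens)),
    }
```

The "OOV" column in the table counts tokens (every occurrence of a missing word), while "Unique OOV" counts each missing word once.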

Looking at the nature of the words which are OOV at different dictionary sizes, 250K to 500K is also the region where the number of unrecognized general English words becomes insignificant, leaving only specialist vocabulary. So in Wikipedia 250K, missing words include:
sweetish, re-expressed, ex-boss
which are slightly unusual but arguably generic English. Using Wikipedia 500K, the remaining missing words are almost completely domain-specific, for example:
sulfhydryls, aminophenyl, preicteric, methimine, fibrosed, haematemesis, paracetamols, prehepatic, icteric, urobilin, clottability, hepatoma, sclerae, hypergonadism, extravasates, clottable, necroses, necrose
So the unsurprising conclusion is that a lecture on a narrow, specialist topic may contain many words which are very infrequent in general English. Another way of visualizing this is to compare the word frequency distribution of a lecture transcript to that of a text from another genre.

This scatter plot shows term frequency in the transcript against dictionary rank (i.e. the position of the word in a dictionary sorted from most-to-least frequent), for the lecture transcript (blue) and the first 10,000 words or so of Alice's Adventures in Wonderland (i.e. a similar word count to the lecture).

The narrative fictional text shows the type of distribution we would expect from Zipf's law. The lecture text shows many more outliers -- for example terms with a document frequency of between 10 and 100, and a dictionary rank of 10,000 and below.
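The data behind such a scatter plot can be computed by pairing each word's frequency in a text with its rank in a frequency-sorted dictionary (a hypothetical sketch; words absent from the dictionary are simply skipped here, though in practice they are the interesting outliers):

```python
from collections import Counter

def rank_frequency_points(text_words, dictionary):
    """Pair each distinct word's frequency in the text with its
    1-based rank in a frequency-sorted dictionary."""
    rank = {w: i + 1 for i, w in enumerate(dictionary)}
    freq = Counter(w.lower() for w in text_words)
    return [(rank[w], c) for w, c in freq.items() if w in rank]
```

Plotting these (rank, frequency) pairs on log-log axes for the two texts would reproduce the comparison described above.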

So is the solution to recognizing these terms to use a very large dictionary? In this case, larger is not always better. While we may want to recognize a word such as "fibrosed" which occurs with frequency 3 only in the full 1.7M Wikipedia dictionary, in practical terms a dictionary is only as useful as the accompanying language model.

LMs generated with an unrestricted vocabulary from a very large text corpus such as Wikipedia are not only impractical to use (requiring significant memory), but also lose an essential element of context, which is that a lecture is typically about one topic, rather than the whole of human knowledge. Hence we need to take into account that "fibrosed" is significantly more likely to occur in a lecture on liver pathology than "fibro-cement".

This leads to the specialist topic of language model adaptation, the subject of future posts.
