Sunday, 18 December 2011

Sphinx4 speech recognition results for selected lectures from Open Yale Courses

About this dataset
Data on the performance of large-vocabulary, continuous speech recognition engines in real contexts is sometimes hard to find. This dataset describes the performance of the CMU Sphinx Speech Recognition Toolkit (specifically the Sphinx4 Java implementation) in recognizing selected lectures from Open Yale Courses using the HUB4 acoustic and language models.

This is the same data summarized on Slide 10 of "Speech Recognition in Opencast Matterhorn". It forms part of a larger research project on adapting language models to improve the searchability of automated transcriptions of recorded lectures.

Source material
The audio files are MP3 recordings from Open Yale Courses (OYC), which helpfully provides both transcripts and a research-friendly Creative Commons license (CC-BY-NC-SA) permitting reuse and derivative works.

The 13 recordings were selected for audio quality, a reasonable match between the speaker's accent and North American English (presumed to align reasonably with the acoustic model), a variety of speakers and topics, consistent length (around 50 minutes each), and a single primary speaker (i.e. minimal audience involvement). Of the 13 lectures, 11 are by male speakers and 2 by female speakers.

The transcripts provided by OYC have been normalized to a single continuous set of words without punctuation or linebreaks for calculating speech recognition accuracy, and are also provided as a separate dataset.
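As a rough illustration of this kind of normalization, here is a minimal Python sketch. It is not the script used to prepare the dataset: the function name is hypothetical, and lowercasing and keeping apostrophes (so that words like "it's" stay intact) are assumptions that go beyond what is described above.

```python
import re

def normalize_transcript(text: str) -> str:
    """Collapse a transcript into one line of words without punctuation.

    Assumptions (not taken from the original dataset): text is lowercased,
    apostrophes are kept, and any non-ASCII characters are dropped.
    """
    text = text.lower()
    # Replace everything except letters, digits, apostrophes and whitespace with a space.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return " ".join(text.split())

print(normalize_transcript("Hello, world!\nIt's a test."))
```

The result is a single space-separated word sequence, which is the form a word-level accuracy calculation needs.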

Sphinx4 configuration
The Sphinx4 configuration is for large-vocabulary, continuous speech recognition, using the HUB4 US English Acoustic Model, HUB4 Trigram Language Model and CMUdict 0.7a dictionary. The HUB4 vocabulary contains 64,000 terms and, in the worst case below (4.3% OOV), matches just over 95% of the words in a lecture (though not much of the specialist vocabulary, which is a different topic).
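The out-of-vocabulary figures reported in the results are consistent with dividing the OOV word count by the total word count of the lecture. A minimal Python sketch of that calculation follows; the function name and the toy vocabulary are hypothetical, and counting every occurrence of an OOV word (rather than each distinct word once) is an assumption.

```python
def oov_stats(transcript_words, vocabulary):
    """Return (oov_count, oov_rate) for a list of transcript words.

    Assumption: every occurrence of a word missing from the vocabulary
    is counted, and matching is case-insensitive.
    """
    vocab = {w.lower() for w in vocabulary}
    oov = [w for w in transcript_words if w.lower() not in vocab]
    rate = len(oov) / len(transcript_words) if transcript_words else 0.0
    return len(oov), rate

# Toy example with a hypothetical four-word vocabulary.
count, rate = oov_stats("the cat sat on the zygote".split(), ["the", "cat", "sat", "on"])
print(count, f"{rate:.1%}")
```

Applied to the first row of the results, 110 OOV words out of 6704 gives 110 / 6704, or about 1.6%.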

There are many ways to adjust Sphinx's configuration depending on the task at hand, and the configuration used here may not be optimal, though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy.

Results

In the table below, click on the title to go to the OYC page for the lecture (which includes links to audio and transcript), and click on the WER to see the Sphinx recognition output.

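Lecture    Words  Word Error Rate (WER)  Perplexity (sentence transcript)  Out of vocabulary (OOV) words  OOV %
[missing]  6704   [missing]              228                               110                            1.6%
[missing]  7385   [missing]              307                               164                            2.2%
[missing]  6974   [missing]              211                               96                             1.4%
[missing]  5795   [missing]              331                               145                            2.5%
[missing]  7350   [missing]              535                               314                            4.3%
[missing]  6201   [missing]              379                               174                            2.8%
[missing]  6701   [missing]              274                               265                            4.0%
[missing]  7902   [missing]              309                               74                             0.9%
[missing]  6643   [missing]              252                               212                            3.2%
[missing]  6603   [missing]              475                               97                             1.5%
[missing]  5473   [missing]              357                               103                            1.9%
[missing]  7085   [missing]              275                               119                            1.7%
[missing]  8196   [missing]              286                               91                             1.1%
Average    6847   41%                    324                               151                            2.2%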

The full data set with more detailed statistics is available at http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/ (start with the README).

Analysis
The error rate varies widely, from a minimum of 32% to a maximum of 61%, with the output ranging from just readable to nonsensical. While the perplexity and OOV figures show some mismatch between the language model and the text, it is likely that acoustic issues (audio quality and/or speaker/model mismatch) have the biggest impact on performance, particularly for the outliers with word error rates above 45%.
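For readers who want to compare their own results, word error rate is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, divided by the number of words in the reference. The Python sketch below implements that standard definition with a word-level edit distance. It is an illustration only, not necessarily the scoring tool used to produce the figures above, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted as errors, WER computed this way can exceed 100% when the recognizer outputs many more words than the reference contains.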

Feedback and comparisons
If you have suggestions for improving the Sphinx configuration to produce better accuracy for this dataset, or have comparative results for this set of lectures from another speech recognition engine or third-party service, please add a comment below.

8 comments:

  1. Wow, I was looking at this solution right before the break to do a similar evaluation. Thanks so much for posting!

  2. Many thanks, it really helped me! Keep up the good work, best wishes.

  3. How did you break up the transcript into sentences? Was this done manually?

  4. Hi, can you share the Java source code for this? I'm looking for the option for continuous recognition in Sphinx4. Thanks.

  5. Here's a Java wrapper for running Sphinx for continuous recognition:

    http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/wrapper/Transcribe.java

    with a config file like this:

    http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/sphinx-custom.xml

  6. Hi, how can I use it with other languages, e.g. Portuguese or Chinese, from Java, as you did in Transcribe.java?

  7. How can I do this with PocketSphinx? I mean, how do I create the language model using PocketSphinx (on Linux), without using Sphinx4?

  8. "though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy." speech recognition program
