Truly Madly Wordly: Sphinx4 speech recognition results for selected lectures from Open Yale Courses

About this dataset
Data on the performance of large-vocabulary, continuous speech recognition engines in real contexts is sometimes hard to find. This dataset describes the performance of the CMU Sphinx Speech Recognition Toolkit (specifically the Sphinx4 Java implementation) in recognizing selected lectures from Open Yale Courses using the HUB4 acoustic and language models.

This is the same data summarized on Slide 10 in Speech Recognition in Opencast Matterhorn and forms part of a larger research project on adapting language models to improve the searchability of automated transcriptions of recorded lectures.

Source material
The audio files are MP3 recordings from Open Yale Courses (OYC), which helpfully provides both transcripts and a research-friendly Creative Commons license (CC-BY-NC-SA) permitting reuse and derivative works.

The 13 recordings were selected for audio quality, a reasonable match between the speaker's accent and a North American English accent (presumed to align reasonably with the acoustic model), a variety of speakers and topics, consistent length (around 50 minutes each), and minimal audience involvement (i.e. primarily a single speaker). Of the 13 lectures, 11 are by male speakers and 2 by female speakers.

The transcripts provided by OYC have been normalized to a single continuous set of words without punctuation or linebreaks for calculating speech recognition accuracy, and are also provided as a separate dataset.
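As a rough illustration of this kind of normalization, here is a minimal Python sketch. The exact rules used for this dataset are not described here, so the choices below (lowercasing, keeping in-word apostrophes, dropping all other punctuation, ASCII-only text) are assumptions, and `normalize_transcript` is a hypothetical helper name.

```python
import re

def normalize_transcript(text: str) -> str:
    """Collapse a transcript into a single line of lowercase words without punctuation.

    Assumptions (not taken from the dataset): ASCII text, lowercase output,
    and apostrophes kept only when they sit inside a word (e.g. "it's").
    """
    text = text.lower()
    # Anything that is not a letter, digit or apostrophe becomes a space.
    text = re.sub(r"[^a-z0-9']+", " ", text)
    # Remove apostrophes at the start or end of a word (quote marks, plural possessives).
    text = re.sub(r"(?<![a-z0-9])'|'(?![a-z0-9])", " ", text)
    # Collapse all whitespace, including line breaks, into single spaces.
    return " ".join(text.split())
```

For example, `normalize_transcript("Hello, world!\nIt's 'fine'.")` gives `hello world it's fine`.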

Sphinx4 Configuration
The Sphinx4 configuration is for large-vocabulary, continuous speech recognition, using the HUB4 US English Acoustic Model, HUB4 Trigram Language Model and CMUdict 0.7a dictionary. HUB4 contains 64,000 terms and, in the worst case below (Lycidas, with 4.3% out-of-vocabulary words), covers just over 95% of the words in the transcript (though not much of the specialist vocabulary, which is a different topic).
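The out-of-vocabulary figures in the results can be reproduced in principle by checking each transcript word against the language model vocabulary. The sketch below assumes the vocabulary is available as a plain collection of words and that matching is case-insensitive; `oov_stats` is a hypothetical helper, not part of Sphinx4.

```python
def oov_stats(transcript_words, vocabulary):
    """Return (oov_count, oov_percent) for a list of word tokens.

    Every token is counted, so a repeated OOV word counts each time it occurs.
    """
    vocab = {w.lower() for w in vocabulary}
    oov = [w for w in transcript_words if w.lower() not in vocab]
    if not transcript_words:
        return 0, 0.0
    return len(oov), 100.0 * len(oov) / len(transcript_words)
```

The OOV % column in the table is consistent with this ratio: for Lycidas, 314 OOV words out of 7350 is about 4.3%.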

There are many ways to adjust Sphinx's configuration depending on the task at hand, and the configuration used here may not be optimal, though experimenting with settings such as beam width and word insertion probability did not have a significant effect on accuracy.

Results

In the table below, click on the title to go to the OYC page for the lecture (which includes links to audio and transcript), and click on the WER to see the Sphinx recognition output.

Lecture	Words	Word Error Rate (WER)	Perplexity (sentence transcript)	Out of vocabulary (OOV) words	OOV %
Dark Energy and the Accelerating Universe and the Big Rip	6704	35%	228	110	1.6%
Cell Culture Engineering	7385	35%	307	164	2.2%
Biomolecular Engineering: Engineering of Immunity	6974	32%	211	96	1.4%
Mating Systems and Parental Care	5795	40%	331	145	2.5%
Lycidas	7350	43%	535	314	4.3%
Thomas Pynchon, The Crying of Lot 49	6201	32%	379	174	2.8%
The Postmodern Psyches	6701	47%	274	265	4.0%
The Logic of Resistance	7902	40%	309	74	0.9%
Maximilien Robespierre and the French Revolution	6643	61%	252	212	3.2%
Personal identity, Part IV; What matters?	6603	49%	475	97	1.5%
Socratic Citizenship: Plato, Apology	5473	32%	357	103	1.9%
What Is It Like to Be a Baby: The Development of Thought	7085	42%	275	119	1.7%
The "Afterlife" of the New Testament and Postmodern Interpretation	8196	41%	286	91	1.1%
Average	6847	41%	324	151	2.2%
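For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The Python sketch below implements that standard definition; it is not the scoring tool used to produce the figures above, and it assumes both inputs have already been normalized to space-separated words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 100% when the output is much longer than the reference.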

The full data set with more detailed statistics is available at http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/ (start with the README).

Analysis
The error rate varies widely, from a minimum of 32% to a maximum of 61%, with the output ranging from just readable to nonsensical. While the perplexity and OOV figures show some mismatch between the language model and the text, it is likely that acoustic issues (audio quality and/or speaker/model mismatch) have the biggest impact on performance, particularly for the outliers with word error rates above 45%.
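One simple way to explore how closely the language-model measures track the error rate is to compute a Pearson correlation over the per-lecture figures in the table. The sketch below does that with the values copied from the results table. With only 13 lectures, any such figure is indicative at best, and the choice of Pearson correlation is an assumption for illustration, not part of the original analysis.

```python
from math import sqrt

# Per-lecture figures copied from the results table, in the same row order.
wer = [35, 35, 32, 40, 43, 32, 47, 40, 61, 49, 32, 42, 41]
perplexity = [228, 307, 211, 331, 535, 379, 274, 309, 252, 475, 357, 275, 286]
oov_percent = [1.6, 2.2, 1.4, 2.5, 4.3, 2.8, 4.0, 0.9, 3.2, 1.5, 1.9, 1.7, 1.1]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(f"WER vs perplexity: r = {pearson(wer, perplexity):.2f}")
print(f"WER vs OOV %:      r = {pearson(wer, oov_percent):.2f}")
```

A weak correlation here would be consistent with the suggestion that acoustic factors, rather than language model mismatch, account for most of the variation.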

Feedback and comparisons
If you have suggestions for improving the Sphinx configuration to produce better accuracy for this dataset, or have comparative results for this set of lectures from another speech recognition engine or third-party service, please add a comment below.

9 comments:

B. Sutherland, 28 December 2011 at 05:10
Wow, I was looking at this solution right before the break to do a similar evaluation. Thanks so much for posting!
Unknown, 13 January 2012 at 08:37
Many thanks to you, it really helped me! Keep up the good work, best wishes.
Unknown, 26 July 2013 at 19:16
How did you break up the transcript into sentences? Was this done manually?
me, 29 April 2014 at 19:04
Hi, can you share the Java source code for this? I'm looking for the option for continuous recognition in Sphinx4. Thanks.
Stephen Marquard, 1 May 2014 at 09:56
Here's a Java wrapper for running Sphinx for continuous recognition:

http://source.cet.uct.ac.za/svn/people/smarquard/sphinx/wrapper/Transcribe.java

with a config file like this:

http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/sphinx-custom.xml
TOUCH FEEL, 2 September 2014 at 09:00
Hi, how can I use it with other languages, e.g. Portuguese or Chinese, with Java, as you did in Transcribe.java?
shadir, 26 August 2015 at 14:33
How can I do this with PocketSphinx? I mean, how do I create the language model using PocketSphinx (Linux), without using Sphinx4?

Sunday 18 December 2011
