OpenAI’s Whisper: Revolutionizing Speech Recognition

An Overview of OpenAI’s Speech Recognition Model

Mohit Singh
Aug 12, 2023
Illustration: Ruby Chen [1]

Introduction

In the constantly evolving landscape of artificial intelligence, OpenAI has sparked a wave of innovation following its large-scale launch of ChatGPT in late 2022. By January 2023, ChatGPT had earned the title of the fastest-growing consumer software application in history, with over 100 million users [4]. One of OpenAI’s groundbreaking achievements, Whisper, is an advanced automatic speech recognition (ASR) system that leverages large multilingual datasets to transcribe spoken language into text with remarkable accuracy and adaptability [1]. OpenAI states:

“We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing… We hope Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.” [1]

Applications Across Industries

Speech recognition is a vital component of many software services; below are some examples of practical uses of speech recognition:

  • Accessibility: By providing real-time transcription services, speech recognition enhances accessibility for individuals with hearing impairments.
  • Content Creation: Content creators can use speech recognition services to transcribe interviews, podcasts, and discussions effortlessly. This not only saves time but also makes content more discoverable through searchable transcripts (a minimal sketch follows this list).
  • Language Learning: Real-time transcription alongside audio provides a valuable tool for individuals learning a new language.
  • Legal Documentation: Speech recognition streamlines the legal documentation process by automating transcription. Similarly, in customer service, automatic transcripts help teams analyze calls and enhance customer experiences.
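
As a concrete example of the content-creation use case above, here is a minimal sketch of transcribing an interview with the open-source whisper Python package. The file name interview.mp3 and the “base” model size are placeholder choices, not prescriptions:

```python
# pip install openai-whisper
import whisper

# Load one of the pretrained checkpoints; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe an audio file (placeholder name) into text.
result = model.transcribe("interview.mp3")
print(result["text"])

# Each segment also carries timestamps, useful for searchable transcripts.
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```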

The Power of Whisper’s Technology

Whisper’s exceptional ability to transcribe spoken language into written text comes from its training on a dataset of 680,000 hours of multilingual and multitask supervised data. This immense and diverse dataset serves as the foundation of Whisper’s capabilities, allowing it to excel at handling accents, background noise, and technical language. Of that total, 117,000 hours were non-English audio. In all, Whisper supports 97 different languages and has 1.6 billion parameters [3].
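
To illustrate the multilingual side, the same transcribe call accepts a language hint and a task switch. The sketch below, with a placeholder Hindi audio file, transcribes speech in its original language and then translates it into English:

```python
import whisper

model = whisper.load_model("base")

# Transcribe non-English speech in its original language...
hindi = model.transcribe("hindi_audio.mp3", language="hi")

# ...or translate it directly into English with the task switch.
english = model.transcribe("hindi_audio.mp3", task="translate")

print(hindi["text"])
print(english["text"])
```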

Unfortunately, since Whisper was trained on such a large and diverse dataset rather than fine-tuned to a specific one, it underperforms models that specialize on LibriSpeech, a well-known benchmark for speech recognition. However, OpenAI emphasizes Whisper’s zero-shot performance compared to those models:

“When we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.” [1]

The Architecture

Whisper’s architecture employs a simple end-to-end approach (refer to Figure 1). At Whisper’s core is an encoder-decoder Transformer framework, an established architecture validated for scaling and handling complex language tasks. First, audio input is split into 30-second chunks, resampled to 16,000 Hz, and converted into a log-Mel spectrogram computed over 25-millisecond windows with a stride of 10 milliseconds. After preprocessing, the input is passed to the encoder, which produces a sequence of audio representations with positional information preserved. The decoder then predicts the corresponding text caption, along with special tokens that direct it to identify the language, translate or transcribe speech, and produce timestamps [1, 2].

Fig. 1. Architecture of Whisper
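
The open-source package exposes these stages directly. The sketch below mirrors the pipeline described above: pad or trim the audio to a 30-second window, compute the log-Mel spectrogram, detect the language, and decode a text caption (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# Load audio (resampled to 16,000 Hz) and pad/trim it to 30 seconds.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder consumes.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder is also trained for language identification.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the 30-second window into a text caption.
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```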

Ethics

Whisper, alongside other AI models, has raised concerns about the ethics of training on large, loosely regulated datasets. On May 16, 2023, OpenAI’s CEO Sam Altman testified before Congress about the need to regulate AI capabilities at OpenAI and other companies such as Google and Microsoft [5].

“I think if this technology goes wrong, it can go quite wrong. And we want to be vocal about that… We want to work with the government to prevent that from happening.” — Sam Altman

The discussion of ethical AI usage is not new; when models such as Whisper are trained on such large-scale datasets, it becomes difficult to vet the datasets’ sources for reliability and bias. As AI technology develops, ethical usage will need continual scrutiny.

Closing Notes

OpenAI’s Whisper emerges as a game-changing force in speech recognition technology. While its architecture may seem straightforward and its precision may trail other well-known speech recognition systems on specialized benchmarks, it offers many underlying benefits for research and experimentation. Specifically, users can fine-tune Whisper for a specific task, customizing the model for their own purposes and producing strong results far faster than training from scratch (see the sketch below). Moreover, OpenAI has publicly shared the code, allowing deeper customization than simply calling OpenAI’s Whisper API.
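
As a sketch of what fine-tuning can look like, one common route is the Hugging Face Transformers port of Whisper. The snippet below runs a single supervised training step on a dummy example; the checkpoint name, audio, transcript, and learning rate are all placeholders, and a real run would iterate over a labeled dataset:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Placeholder checkpoint; larger ones exist on the Hugging Face Hub.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()

# Dummy 16 kHz audio and a reference transcript, standing in for real data.
audio = torch.zeros(16000 * 5).numpy()  # 5 seconds of silence
input_features = processor(audio, sampling_rate=16000,
                           return_tensors="pt").input_features
labels = processor.tokenizer("hello world", return_tensors="pt").input_ids

# One gradient step: the model learns to predict the transcript tokens.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(input_features=input_features, labels=labels).loss
loss.backward()
optimizer.step()
```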

If you are interested in Whisper and would like to know more, here is a link to Whisper’s GitHub page: https://github.com/openai/whisper.

Feel free to connect with me on LinkedIn.

Citations

[1] OpenAI Editorial Team, “Introducing Whisper,” OpenAI Research, https://openai.com/research/whisper

[2] A. Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision,” OpenAI Papers, https://cdn.openai.com/papers/whisper.pdf

[3] A. Alford, “OpenAI Releases 1.6 Billion Parameter Multilingual Speech Recognition AI Whisper,” InfoQ, https://www.infoq.com/news/2022/10/openai-whisper-speech/

[4] UBS Editorial Team, “Let’s chat about ChatGPT,” UBS, https://www.ubs.com/us/en/wealth-management/insights/marketnews/article.1585717.html

[5] C. Kang, “OpenAI’s Sam Altman urges A.I. Regulation in Senate hearing,” The New York Times, https://www.nytimes.com/2023/05/16/technology/openai-altman-artificial-intelligence-regulation.html?smid=url-share

Written by Mohit Singh

CS @ Rutgers University | 2023 NSF REU Grant Recipient
