Speech & Machine Learning

The Machine Learning Group at Mozilla is tackling speech recognition and voice synthesis as its first project. Speech is powerful. It brings a human dimension to our smartphones, computers and devices like Amazon Echo, Google Home and Apple HomePod. Speech interfaces enable hands-free operation and can assist users who are visually or physically impaired.

It’s simpler than ever to build high-quality speech applications using today’s advanced speech algorithms. However, there are still barriers that hamper community-based development of competing, open speech platforms. The missing pieces include:

  • Affordable, production-quality voice data for training new applications
  • Open source engines for speech recognition and speech synthesis
  • An ecosystem that encourages open research and development of different speech platforms

Mozilla’s goal is to make voice data and deep learning algorithms available to the open source world.

Project Common Voice

Project Common Voice by Mozilla is a campaign asking people to donate recordings of their voices to an open repository. Mozilla will release audio files and transcripts along with limited demographic information about the speakers. With a large enough data set, it’s possible to train speech-to-text (STT) systems so they meet production-quality standards. The Common Voice project begins this summer, and we expect to launch the repository in the fall.
Participate in Mozilla’s Project Common Voice
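
To make this concrete, here is a minimal Python sketch of how a released clip-and-transcript corpus might be consumed for training. The file layout and column names (a clips.csv index with path, transcript, age and gender columns) are assumptions for illustration only; the actual release format has not been finalized.

    # Minimal sketch: iterate over a hypothetical clips.csv index that pairs
    # audio files with transcripts and limited demographic fields.
    # The file name and column names are assumptions, not the actual format.
    import csv
    import wave

    def load_corpus(index_path):
        """Yield (audio_path, transcript, demographics) for each donated clip."""
        with open(index_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                demographics = {"age": row.get("age"), "gender": row.get("gender")}
                yield row["path"], row["transcript"], demographics

    def clip_duration_seconds(audio_path):
        """Return a WAV clip's duration, useful for bucketing utterances by length."""
        with wave.open(audio_path, "rb") as w:
            return w.getnframes() / float(w.getframerate())

    if __name__ == "__main__":
        for path, text, info in load_corpus("clips.csv"):
            print(path, clip_duration_seconds(path), text, info)

Even this small amount of structure, audio plus transcript plus limited demographics, is enough to split training and test sets by speaker characteristics and check that a corpus is balanced.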

How a Speech Application Learns

Speech-to-Text at Mozilla

Production-quality STT is currently the domain of a handful of companies that have invested heavily in research and development of these technologies. To access proprietary STT services, newcomers typically pay on the order of one cent per utterance, a cost that becomes prohibitive for applications that scale to millions of users. To open up this area for development, Mozilla plans to open source its STT engine and models so they are freely available to the programmer community. The Mozilla open source STT engine is designed to run on server-class machines and can scale to serve large user populations.
Visit Mozilla’s GitHub
Read the GitHub wiki
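
As an illustration of the server-side use case, the sketch below wraps an STT engine behind a simple HTTP endpoint so that many clients can share one machine. The stt_engine module, its Model class and stt() method, the output_graph.pb file name, and the raw-PCM request format are all hypothetical placeholders, not the actual API of Mozilla's engine.

    # Sketch of serving speech-to-text over HTTP from a server-class machine.
    # "stt_engine" and its Model/stt() interface are hypothetical placeholders.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import stt_engine  # hypothetical: loads an acoustic model and language model

    MODEL = stt_engine.Model("output_graph.pb")  # hypothetical exported model file

    class TranscribeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Assume the request body is raw 16 kHz, 16-bit mono PCM audio.
            length = int(self.headers.get("Content-Length", 0))
            audio = self.rfile.read(length)
            transcript = MODEL.stt(audio).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(transcript)))
            self.end_headers()
            self.wfile.write(transcript)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), TranscribeHandler).serve_forever()

A real deployment would of course add batching, authentication and load balancing; the sketch only shows the shape of the interface.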

Pipsqueak Engine

Because online STT services send users' audio to remote servers, they can introduce security and privacy risks. Mozilla researchers aim to create a competitive offline STT engine called Pipsqueak that promotes security and privacy. This implementation of a deep learning STT engine can be run on a machine as small as a Raspberry Pi 3. Our goal is to disrupt the existing trend in STT that favors a few commercial companies, and to stay true to our mission of making safe, open, affordable technologies available to anyone who wants to use them.
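
The contrast with an online service is easiest to see in code: all of the audio stays on the device. In the sketch below, the pipsqueak module, its Model class and stt() method, and the model file name are hypothetical placeholders, since the engine has not yet been released.

    # Sketch of fully offline recognition on a small device such as a Raspberry Pi 3.
    # No network calls: audio is read and transcribed locally.
    # "pipsqueak" and its Model/stt() interface are hypothetical placeholders.
    import wave

    import numpy as np
    import pipsqueak  # hypothetical on-device STT engine

    def transcribe_locally(wav_path, model_path="pipsqueak_model.bin"):
        model = pipsqueak.Model(model_path)  # assumed small enough for a Pi 3
        with wave.open(wav_path, "rb") as w:
            rate = w.getframerate()
            pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        return model.stt(pcm, sample_rate=rate)

    if __name__ == "__main__":
        print(transcribe_locally("command.wav"))

Because nothing leaves the device, the privacy properties follow from the architecture itself rather than from a service provider's policy.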

Machine Learning for Better Accuracy

Now anyone can access the power of deep learning to create new speech-to-text functionality. Mozilla is using open source code, algorithms and the TensorFlow machine learning toolkit to build its STT engine. The Mozilla deep learning architecture will be available to the community as a foundation technology for new speech applications. We plan to create and share models that improve the accuracy of speech recognition and also produce high-quality synthesized speech.
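
For readers who want a feel for what such an architecture looks like, the following is a deliberately simplified TensorFlow sketch of an end-to-end acoustic model trained with CTC loss: audio features go in, per-frame character scores come out, and no hand-engineered alignment components are needed. The feature count, alphabet size and layer sizes are illustrative assumptions, not Mozilla's actual architecture.

    # Simplified sketch of an end-to-end acoustic model trained with CTC loss.
    # Feature count, alphabet size and layer sizes are illustrative assumptions.
    import tensorflow as tf

    NUM_FEATURES = 26   # e.g. MFCC coefficients per audio frame (assumption)
    NUM_CHARS = 29      # a-z, space, apostrophe, plus the CTC blank (assumption)

    def build_acoustic_model():
        # Input: a batch of variable-length sequences of audio feature frames.
        frames = tf.keras.Input(shape=(None, NUM_FEATURES))
        x = tf.keras.layers.Dense(512, activation="relu")(frames)
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(256, return_sequences=True))(x)
        # Unnormalized per-frame character scores (logits), as tf.nn.ctc_loss expects.
        logits = tf.keras.layers.Dense(NUM_CHARS)(x)
        return tf.keras.Model(frames, logits)

    def ctc_loss(labels, label_lengths, logits, logit_lengths):
        # CTC aligns the character sequence to the audio frames automatically,
        # which removes the need for hand-engineered alignment steps.
        return tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_lengths,
            logit_length=logit_lengths,
            logits_time_major=False,
            blank_index=NUM_CHARS - 1))

A trained model of this kind is typically paired with a language model and a decoder that turn the per-frame character scores into final text.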

Powerful Speech Algorithms

Today’s speech algorithms enable developers to create speech interfaces using significantly simplified software architectures. Improvements include:

  • More accurate speech recognition, especially in noisy environments
  • Better machine learning to train speech systems
  • No need to hand-engineer components or design complex process flows
  • Less data maintenance required

Writing to the Web Speech API

In time, we plan to use the Web Speech API to bring speech recognition to web sites and applications. We will update this page when we have work to share. Stay tuned!