Speech To Text
Speech To Text (STT) is the conversion of spoken words into text. There are several STT systems, for example Pocketsphinx, Kaldi, DeepSpeech, a remote HTTP server, or an external command. In this project we focus on DeepSpeech.
MQTT API
When the MQTT message `hermes/asr/startListening` is sent with a `sessionId`, the STT service starts listening to the audio frames on `hermes/audioServer/<siteId>/audioFrame`. When silence is detected, the message `hermes/asr/stopListening` is sent with the same `sessionId` as in the `hermes/asr/startListening` message. The transcribed text is sent with the message `hermes/asr/textCaptured`; it is in the `text` attribute. If an error occurs, the STT service publishes the message `hermes/error/asr`.
DeepSpeech
DeepSpeech combines the acoustic model and pronunciation dictionary into a single neural network. It still uses a language model.
- Acoustic Model: Maps acoustic/speech features to likely phonemes in a given language
- Pronunciation Dictionary: Needed to train an acoustic model and to do speech recognition
- Grapheme to phoneme (G2P Model): Can be used to guess the phonetic pronunciation of words
- Language Model: Gives the probability that certain words follow others. The probabilities are estimated from example text
- Sentence Fragments: The language model does not contain probabilities for entire sentences, only sentence fragments. To recognize an entire sentence, the speech recognizer needs a few tricks to stitch these fragments together
- Language Model Training: The main goal is to generate a language model based on the intent graph obtained during the initial stage of training
- Language Model Mixing: The custom language model can be mixed with a pre-built, general-purpose model (see the sketch below for how these pieces fit together)
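As an illustration of how these pieces fit together at inference time, here is a minimal sketch assuming the `deepspeech` 0.9.x Python package and its released model and scorer files (the file names are examples, not part of this project): the `.pbmm` file is the single neural network that subsumes the acoustic model and pronunciation dictionary, and the `.scorer` file is the external language model.

```python
import wave
import numpy as np
import deepspeech

# The single neural network: acoustic model plus pronunciation knowledge
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
# The separate language model ("scorer") that re-weights candidate transcriptions
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio
with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```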
Open Transcription
By default DeepSpeech only knows the words you wrote in sentences.ini. For us that is sufficient to recognize the user's intents, but when you want to add a chat functionality to your voice assistant, it is useful to be able to transcribe open text and not only the words from sentences.ini. You can activate open transcription by setting `speech_to_text.deepspeech.open_transcription` in your profile.json to `true`, or by checking the checkbox for open transcription in the Rhasspy settings menu under speech to text.
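For reference, a sketch of the relevant part of profile.json; only the `speech_to_text.deepspeech.open_transcription` path comes from the setting named above, and the nesting follows from that dotted path:

```json
{
    "speech_to_text": {
        "deepspeech": {
            "open_transcription": true
        }
    }
}
```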
When you restart Rhasspy, it asks you to download about 2 GB of data. After the download is done, Rhasspy starts the training, and open transcription is available.
Silence Detection
If you want to fine-tune how Rhasspy detects the start and end of a voice command after the wake word, you can adjust these options in your profile:
"command": {
"webrtcvad": {
"skip_sec": 0,
"min_sec": 1,
"speech_sec": 0.3,
"silence_sec": 0.5,
"before_sec": 0.5,
"vad_mode": 3
}
}
- `skip_sec` is how many seconds of audio should be ignored before recording
- `min_sec` is the minimum number of seconds a voice command should last
- `speech_sec` is the seconds of speech before a command is considered started
- `silence_sec` is the seconds of silence after a command before it is considered ended
- `before_sec` is how many seconds of audio before a command starts are kept
- `vad_mode` is the sensitivity of speech detection (3 is the least sensitive); a sketch of how these settings map onto webrtcvad follows below
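To make `vad_mode` and `silence_sec` concrete, here is a minimal sketch of the underlying idea using the `webrtcvad` package that these options configure; the `record_command` helper and its defaults are illustrative, not Rhasspy's actual implementation:

```python
import webrtcvad

SAMPLE_RATE = 16000  # webrtcvad supports 8000, 16000, 32000 or 48000 Hz
FRAME_MS = 30        # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(3)  # vad_mode: 0 (most sensitive) to 3 (least sensitive)

def record_command(frames, silence_sec=0.5):
    """Collect frames of one voice command, stopping after silence_sec of silence.

    frames: iterable of FRAME_BYTES-sized chunks of 16-bit mono PCM audio.
    """
    needed_silence = int(silence_sec * 1000 / FRAME_MS)
    command, silence_run, started = [], 0, False
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            started, silence_run = True, 0
            command.append(frame)
        elif started:
            silence_run += 1
            command.append(frame)
            if silence_run >= needed_silence:
                break  # enough trailing silence: the command has ended
    return b"".join(command)
```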