Speech to Text in WPF


The System.Speech library, part of the .NET Framework since version 3.0, is a collection of classes that enables speech recognition (speech-to-text) and speech synthesis (text-to-speech).

As a follow-up to the earlier article Text to Speech in WPF, here is a small sample that recognizes speech and displays the resulting text. The System.Speech.Recognition namespace lets you add speech recognition to desktop applications. It offers two recognizer classes:
  1. SpeechRecognizer 
  2. SpeechRecognitionEngine
The difference is that SpeechRecognizer uses the shared recognizer, the same one that Windows Vista and Windows 7 use for speech recognition, so the user interacts with it through the Windows speech toolbar. SpeechRecognitionEngine runs entirely inside your application's own process, so the speech toolbar is not available and you must explicitly tell it when to start recognition.
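To make the contrast concrete, here is a minimal console sketch of both setups. The class name RecognizerChoices and the helper method names are made up for illustration, and the sketch assumes a microphone and an installed recognizer; treat it as an outline rather than a finished implementation.

```csharp
using System;
using System.Speech.Recognition;

class RecognizerChoices
{
    // Shared recognizer: Windows Speech Recognition supplies the audio input,
    // and the user turns listening on and off from the speech toolbar.
    static SpeechRecognizer CreateShared()
    {
        var shared = new SpeechRecognizer();
        shared.LoadGrammar(new DictationGrammar());
        shared.SpeechRecognized += (s, e) => Console.WriteLine("Shared: " + e.Result.Text);
        return shared;
    }

    // In-process engine: the application chooses the audio input and
    // starts recognition itself.
    static SpeechRecognitionEngine CreateInProcess()
    {
        var engine = new SpeechRecognitionEngine();
        engine.LoadGrammar(new DictationGrammar());
        engine.SetInputToDefaultAudioDevice();
        engine.SpeechRecognized += (s, e) => Console.WriteLine("In-process: " + e.Result.Text);
        engine.RecognizeAsync(RecognizeMode.Multiple); // explicit start
        return engine;
    }

    static void Main()
    {
        using (var engine = CreateInProcess())
        {
            Console.WriteLine("Speak now; press Enter to quit.");
            Console.ReadLine();
        }
    }
}
```

Swapping CreateInProcess() for CreateShared() in Main should hand control of listening to the Windows speech toolbar instead of the application.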

Managed applications access the speech recognition engine directly through the classes in System.Speech.Recognition; unmanaged applications use the Speech API (SAPI) instead.

1.gif

2.gif
 
Here is a small sample that uses System.Speech.Recognition. First, add a reference to the System.Speech assembly.

3.gif
 
Create a WPF window as shown below:

<Window x:Class="Speech_to_Text.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="Speech to Text" Height="300" Width="525" Loaded="Window_Loaded">
    <Grid>
        <Grid>
            <Grid.RowDefinitions>
                <RowDefinition Height="*"/>
                <RowDefinition Height="30"/>
                <RowDefinition Height="25"/>
            </Grid.RowDefinitions>
            <Grid.ColumnDefinitions>
                <ColumnDefinition Width="120"/>
                <ColumnDefinition Width="120"/>
                <ColumnDefinition Width="120"/>
                <ColumnDefinition Width="*"/>
            </Grid.ColumnDefinitions>
            <TextBox Name="TextBox1"  Grid.Row="0" Grid.Column="0" Grid.ColumnSpan="4"  TextWrapping="Wrap" />
            <Label Name="LabelHypothesized" Grid.Row="1" Grid.Column="0" Foreground="Green" >Hypothesized</Label>
            <Label Name="LabelRecognized" Grid.Row="1" Grid.Column="1" Foreground="Green" >Recognized</Label>
            <Button Name="ButtonStart" Grid.Row="1" Grid.Column="3" Content="Start" Click="ButtonStart_Click" Width="80" IsEnabled="False"></Button>
            <Label Name="LabelStatus" Grid.Row="2" Grid.Column="0" FontSize="10" Foreground="Red">Status:</Label>
            <Label Name="Label1" Grid.Row="2" Grid.Column="3" FontSize="10">Speak "End Dictate" to stop.</Label>
        </Grid>
    </Grid>
</Window>

Now let's start with the code:
  1. Add the using directives

    using System.Linq;
    using System.Speech.Recognition;
    using System.Speech.Synthesis;
    using System.Threading;

  2. Declare the SpeechRecognitionEngine field

    private SpeechRecognitionEngine recognizer;

  3. Initialize the recognizer and attach its event handlers when the window loads

    private void Window_Loaded(object sender, RoutedEventArgs e)
    {
        // Initialize the recognizer and synthesizer
        InitializeRecognizerSynthesizer();
    }

    /// <summary>
    /// Initializes the recognizer and synthesizer along with their events.
    /// </summary>
    private void InitializeRecognizerSynthesizer()
    {
        // Pick an installed recognizer that matches the current culture
        var selectedRecognizer = (from r in SpeechRecognitionEngine.InstalledRecognizers()
                                  where r.Culture.Equals(Thread.CurrentThread.CurrentCulture)
                                  select r).FirstOrDefault();
        if (selectedRecognizer == null)
        {
            LabelStatus.Content = "Status: no recognizer installed for the current culture";
            return;
        }
        recognizer = new SpeechRecognitionEngine(selectedRecognizer);
        recognizer.AudioStateChanged += recognizer_AudioStateChanged;
        recognizer.SpeechHypothesized += recognizer_SpeechHypothesized;
        recognizer.SpeechRecognized += recognizer_SpeechRecognized;
        // The engine needs a grammar and an audio source before RecognizeAsync can be called
        recognizer.LoadGrammar(new DictationGrammar());
        recognizer.SetInputToDefaultAudioDevice();
        synthesizer = new SpeechSynthesizer();
        // The XAML disables the Start button, so enable it once the recognizer is ready
        ButtonStart.IsEnabled = true;
    }

  4. Add event handlers 

    private void recognizer_AudioStateChanged(object sender, AudioStateChangedEventArgs e)
    {
        switch (e.AudioState)
        {
            case AudioState.Speech:
                LabelStatus.Content = "Listening";
                break;
            case AudioState.Silence:
                LabelStatus.Content = "Idle";
                break;
            case AudioState.Stopped:
                LabelStatus.Content = "Stopped";
                break;
        }
    }
    private void recognizer_SpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
    {
        Hypothesized++;
        LabelHypothesized.Content = "Hypothesized: " + Hypothesized.ToString();
    }
    private void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        Recognized++;
        LabelRecognized.Content = "Recognized: " + Recognized.ToString();
        if (RecogState == State.Off)
            return;
        string phrase = e.Result.Text;
        // Dictation may return the phrase in a different case, so compare case-insensitively
        if (string.Equals(phrase, "End Dictate", StringComparison.OrdinalIgnoreCase))
        {
            RecogState = State.Off;
            ButtonStart.Content = "Start";
            recognizer.RecognizeAsyncStop();
            ReadAloud("Dictation Ended");
            return;
        }
        TextBox1.AppendText(" " + phrase);
    }

  5. Finally, handle the Start button's Click event

    private void ButtonStart_Click(object sender, RoutedEventArgs e)
    {
        switch (RecogState)
        {
            case State.Off:
                RecogState = State.Accepting;
                ButtonStart.Content = "Stop";
                recognizer.RecognizeAsync(RecognizeMode.Multiple);
                break;

            case State.Accepting:
                RecogState = State.Off;
                ButtonStart.Content = "Start";
                recognizer.RecognizeAsyncStop();
                break;
        }
    }
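
The steps above use several members that the listing never declares: synthesizer, Hypothesized, Recognized, RecogState, the State enum and ReadAloud. The sketch below is one plausible way to fill them in. The names come from the code above, but the types and the body of ReadAloud are assumptions, not the original implementation.

```csharp
using System.Speech.Synthesis;

namespace Speech_to_Text
{
    public partial class MainWindow
    {
        // Assumed recognition states; only Off and Accepting appear in the listing
        private enum State { Off, Accepting }

        private State RecogState = State.Off;

        // Counters shown in LabelHypothesized and LabelRecognized (assumed to be ints)
        private int Hypothesized;
        private int Recognized;

        // Created in InitializeRecognizerSynthesizer
        private SpeechSynthesizer synthesizer;

        // Assumed helper: speaks the text without blocking the UI thread
        private void ReadAloud(string text)
        {
            if (synthesizer != null)
                synthesizer.SpeakAsync(text);
        }
    }
}
```

Because MainWindow is a partial class, these members can sit in the same code-behind file as the handlers or in a separate file in the same namespace.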

The running application looks like this:
 
4.gif
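
One possible refinement concerns the "End Dictate" stop phrase handled in step 4. Free dictation may not always transcribe the phrase exactly, so a small command grammar can be loaded next to the dictation grammar and results can be checked by the grammar that produced them. The StopCommand class, its grammar name and the 0.7 confidence threshold below are illustrative assumptions, and the sketch assumes the recognizer's culture matches the current thread culture, as it does in the sample.

```csharp
using System.Speech.Recognition;

static class StopCommand
{
    // Hypothetical name used to tell the command grammar apart from dictation
    public const string GrammarName = "StopCommand";

    // Loads a one-phrase command grammar alongside free dictation
    public static void LoadGrammars(SpeechRecognitionEngine engine)
    {
        var stop = new Grammar(new GrammarBuilder("End Dictate")) { Name = GrammarName };
        engine.LoadGrammar(stop);
        engine.LoadGrammar(new DictationGrammar { Name = "Dictation" });
    }

    // True when the result came from the command grammar with enough confidence
    public static bool IsStopCommand(RecognitionResult result, float minConfidence = 0.7f)
    {
        return result.Grammar != null
            && result.Grammar.Name == GrammarName
            && result.Confidence >= minConfidence;
    }
}
```

With this in place, the initialization code could call StopCommand.LoadGrammars(recognizer), and recognizer_SpeechRecognized could test StopCommand.IsStopCommand(e.Result) instead of comparing the recognized text. The threshold would need tuning against a real microphone.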