Text to Speech, Speech to Text, and Speech Translation using Azure Cognitive Services SDK on Python

Introduction

In today's fast-paced world, effective communication plays a crucial role in breaking down barriers and fostering connections. Thanks to advancements in technology, solutions such as speech to text, text to speech, and speech translation have become integral in enhancing communication experiences. One such powerful tool is the Azure Cognitive Services Software Development Kit (SDK), which offers an extensive range of features for speech-related tasks. In this article, we will explore how to leverage the Azure Cognitive Services SDK on Python to transform the way we interact with speech and text.

Understanding Speech Services

  • Text to Speech: This enables your applications, tools, or devices to convert text into human-like synthesized speech.
  • Speech to Text: Enables real-time and batch transcription of audio streams into text.
  • Speech Translation: Enables real-time, multilingual translation of speech to your applications, tools, and devices. 

Setting up Azure Speech resource

Go to the Azure Portal, search for Speech Service, then click Create.

Choose the subscription, resource group, region, pricing tier, and type the resource name. Then, click on Review + create.

Once the resource is created, go to Keys and Endpoint to copy your credentials.

Testing Speech Services using Azure Cognitive Services SDK on Python

You need to install the Azure Cognitive Services Speech SDK. You can do this by running the following command in your Python environment:

pip install azure-cognitiveservices-speech

Once the Azure Cognitive Services Speech SDK is installed, we can import the library.

import azure.cognitiveservices.speech as speechsdk

Define the following constants using the key and region you copied previously.

SPEECH_KEY = "<YOUR_API_KEY>"
SPEECH_REGION = "<YOUR_REGION>"
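Hardcoding credentials in source files is risky. As an alternative sketch (the environment-variable names below are my own convention, not something the SDK requires), you can load the key and region from the environment instead:

```python
import os

# Read the Speech credentials from environment variables so the key
# never lands in source control. SPEECH_KEY / SPEECH_REGION are
# illustrative names, not an SDK convention.
def load_speech_credentials():
    key = os.environ.get("SPEECH_KEY")
    region = os.environ.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set the SPEECH_KEY and SPEECH_REGION environment variables.")
    return key, region
```

You would then call `load_speech_credentials()` once at startup and pass the values to `speechsdk.SpeechConfig`.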

We are going to start with Text to Speech. You can go to this page for further details about this service. To set up a different voice, visit this page to see the supported languages per speech feature.

def text_to_speech(text: str, audio_config):
    speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)

    # The language of the voice that speaks
    speech_config.speech_synthesis_voice_name = 'en-US-GuyNeural'

    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

    if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized for text [{}]".format(text))
    elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_synthesis_result.cancellation_details
        print("Speech synthesis canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            if cancellation_details.error_details:
                print("Error details: {}".format(cancellation_details.error_details))
                print("Did you set the speech resource key and region values?")

To test the method above, pass a text string and an audio config. For this example, we create an audio file from the given text input (note that the SDK writes its default WAV output format unless you change it, even when the filename ends in .mp3).

text = "New York is a great city to visit"
audio_config = speechsdk.audio.AudioOutputConfig(filename="file1.mp3")
text_to_speech(text, audio_config)
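If you synthesize several different texts, a fixed filename like file1.mp3 gets overwritten on every call. A small hypothetical helper (my own addition, not part of the SDK) can derive a filesystem-safe filename from the input text:

```python
import re

# Derive a filesystem-safe output filename from the synthesized text,
# so repeated calls produce distinct files instead of overwriting one.
def audio_filename_for(text: str, extension: str = "mp3", max_length: int = 40) -> str:
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{slug[:max_length]}.{extension}"
```

For example, `audio_filename_for("New York is a great city to visit")` yields `new-york-is-a-great-city-to-visit.mp3`, which you can pass as the `filename` argument of `AudioOutputConfig`.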

Let's continue with Speech to Text. To test this feature, call the speech_to_text method below; it uses your default microphone.

def speech_to_text():
    speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
    speech_config.speech_recognition_language="en-US"

    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    print("Speak into your microphone.")
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Recognized: {}".format(speech_recognition_result.text))
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
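Calls like `recognize_once_async().get()` can come back as Canceled on transient network problems. One minimal, generic sketch is to wrap the call in a retry helper; the function below is my own utility, not part of the Speech SDK, and in real code you would catch a narrower exception type:

```python
import time

# Retry a callable a few times with linear backoff before giving up.
# Intended as a thin wrapper around flaky network-bound calls.
def with_retries(func, attempts: int = 3, delay: float = 1.0):
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:  # narrow this in production code
            last_error = error
            time.sleep(delay * (attempt + 1))
    raise last_error
```

Usage would look like `with_retries(lambda: speech_recognizer.recognize_once_async().get())`, assuming failures surface as exceptions.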

And finally, Speech Translation. For this example, we use the audio file generated by the text_to_speech method.

def translate_speech_to_text(target_language: str, audio_config):
    speech_translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
    speech_translation_config.speech_recognition_language="en-US"

    speech_translation_config.add_target_language(target_language)

    translation_recognizer = speechsdk.translation.TranslationRecognizer(translation_config=speech_translation_config, audio_config=audio_config)

    translation_recognition_result = translation_recognizer.recognize_once()

    if translation_recognition_result.reason == speechsdk.ResultReason.TranslatedSpeech:
        print("Recognized: {}".format(translation_recognition_result.text))
        print("""Translated into '{}': {}""".format(
            target_language, 
            translation_recognition_result.translations[target_language]))
    elif translation_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(translation_recognition_result.no_match_details))
    elif translation_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = translation_recognition_result.cancellation_details
        print("Speech translation canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")

We are going to translate from English to Portuguese.

audio_config = speechsdk.audio.AudioConfig(filename="file1.mp3")
translate_speech_to_text("pt", audio_config)

This is the result:

Recognized: New York is a great city to visit.
Translated into 'pt': Nova York é uma ótima cidade para visitar.
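The target_language argument takes a language code such as "pt". If you expose this choice to users, a small validation helper can fail fast on unsupported codes before making a network call. The mapping below is an illustrative subset I chose for the sketch, not the service's full list of supported languages:

```python
# Illustrative subset of translation target-language codes; the service
# supports many more (see the supported-languages page).
TARGET_LANGUAGES = {
    "pt": "Portuguese",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
}

def describe_target(code: str) -> str:
    name = TARGET_LANGUAGES.get(code)
    if name is None:
        raise ValueError(f"Unknown target language code: {code}")
    return f"Translating into {name} ({code})"
```

Calling `describe_target("pt")` before `translate_speech_to_text("pt", audio_config)` gives users a readable confirmation of the language they picked.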

You can find the full source code here.

Conclusion

The Azure Cognitive Services SDK on Python opens up a world of possibilities for transforming speech and text interactions. Whether you need to transcribe audio, create natural-sounding synthesized speech, or enable real-time speech translation, the SDK provides the necessary tools to enhance communication experiences. By leveraging the power of Azure Cognitive Services, developers can build innovative applications that break down language barriers, improve accessibility, and revolutionize how we connect with one another. So, why not explore the capabilities of the Azure Cognitive Services SDK today and take your Python projects to new heights?

Thanks for reading

Thank you very much for reading. I hope you found this article interesting and that it may be useful to you in the future. If you have any questions or ideas you need to discuss, it will be a pleasure to collaborate and exchange knowledge.
