SSML In Alexa Skill Kit - Part Two


In the previous article, SSML in Alexa Skill Kit Part 1, we looked at what SSML is and at some of the tags supported by the Alexa Skills Kit (ASK). This article continues from there and covers the remaining tags.


The <p> tag represents a paragraph. It works like a break, but it adds an extra-strong pause before and after the tag. It is equivalent to <break strength="x-strong"/>.
Let’s look at an example:
  <speak>
     <p>This is the first paragraph.
        There should be a pause after this text is spoken.
     </p>
     <p>This is the second paragraph.</p>
  </speak>
Here is how to return this SSML in the response from the Lambda function:
  // build the SSML response
  var speech = new SsmlOutputSpeech();
  speech.Ssml = @"<speak><p>This is the first paragraph. There should be a pause after this text is spoken.</p><p>This is the second paragraph.</p></speak>";
  // build the response using ResponseBuilder
  var finalResponse = ResponseBuilder.Tell(speech);
  return finalResponse;
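Hand-written SSML strings are easy to break with a missing space or an unescaped character. As a minimal sketch, assuming you are free to build the markup with System.Xml.Linq before assigning it to speech.Ssml, the helper below wraps each paragraph in a <p> tag and escapes special characters. The name SsmlParagraphs.Build is hypothetical and is not part of Alexa.NET.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlParagraphs
{
    // Wraps each text in a <p> element and the whole set in <speak>.
    // XElement escapes characters such as & and < in the text.
    public static string Build(params string[] paragraphs)
    {
        var speak = new XElement("speak",
            paragraphs.Select(text => new XElement("p", text)));
        return speak.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(Build(
            "This is the first paragraph.",
            "This is the second paragraph."));
        // prints: <speak><p>This is the first paragraph.</p><p>This is the second paragraph.</p></speak>
    }
}
```

The returned string can be assigned to speech.Ssml in the same way as the literal string in the example above.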


The <s> tag represents a sentence. It works like a break, but it adds a strong pause before and after the sentence. It is equivalent to ending a sentence with a period or to <break strength="strong"/>.
  <speak>
     <s>this is a first sentence</s>
     <s>this is the second sentence there should be a short pause before this second sentence</s>
     This is third and ends with a period and should have the same pause.
  </speak>


The <emphasis> tag changes the rate and volume of speech. Its level attribute accepts the following values:
  • strong: increases the volume and slows down the speaking rate, so the speech is louder and slower.
  • moderate: has the same effect as strong, but to a lesser degree.
  • reduced: decreases the volume and speeds up the speaking rate, so the speech is softer and faster.
Let’s look at an example:
  // build the SSML response
  var speech = new SsmlOutputSpeech();
  speech.Ssml = @"<speak>I already told you I <emphasis level=""strong"">really like</emphasis> that person.</speak>";
  // build the response using ResponseBuilder
  var finalResponse = ResponseBuilder.Tell(speech);
  return finalResponse;
If you do not provide a level, the default is moderate.
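A misspelled level is easy to miss in a hand-written string. The sketch below, which assumes the markup is built with System.Xml.Linq, rejects any level other than the three listed above and falls back to moderate when none is given. SsmlEmphasis and Emphasis are hypothetical names, not Alexa.NET APIs.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlEmphasis
{
    static readonly string[] Levels = { "strong", "moderate", "reduced" };

    // Builds an <emphasis> element; "moderate" mirrors the documented default.
    public static XElement Emphasis(string text, string level = "moderate")
    {
        if (Array.IndexOf(Levels, level) < 0)
            throw new ArgumentException("Unsupported emphasis level: " + level, nameof(level));
        return new XElement("emphasis", new XAttribute("level", level), text);
    }

    public static void Main()
    {
        var speak = new XElement("speak",
            "I already told you I ",
            Emphasis("really like", "strong"),
            " that person.");
        Console.WriteLine(speak.ToString(SaveOptions.DisableFormatting));
        // prints: <speak>I already told you I <emphasis level="strong">really like</emphasis> that person.</speak>
    }
}
```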


The <lang> tag speaks a phrase in a specific language or locale, such as French, US English, or Indian English. Add the xml:lang attribute to specify the language, as shown below:
  <speak>
     In India, they pronounce it
     <lang xml:lang="en-IN">India</lang>
  </speak>
In the Lambda function, specify it as shown below:
  var speech = new SsmlOutputSpeech();
  speech.Ssml = @"<speak>In India, they pronounce it <lang xml:lang=""en-IN"">India</lang>.</speak>";
  // build the response using ResponseBuilder
  var finalResponse = ResponseBuilder.Tell(speech);
  return finalResponse;
The xml:lang attribute supports the following values:
hi-IN, en-US, en-IN, en-AU, de-DE, en-CA, es-ES, it-IT, ja-JP, fr-FR, en-GB.
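The locale list above can be checked in code before the SSML is sent. This is a sketch under two assumptions: the markup is built with System.Xml.Linq, and the accepted locales are exactly those listed in this article (the set Alexa accepts may change over time). SsmlLang and Lang are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlLang
{
    // Locales taken from the list in this article.
    static readonly HashSet<string> Locales = new HashSet<string>
    {
        "hi-IN", "en-US", "en-IN", "en-AU", "de-DE", "en-CA",
        "es-ES", "it-IT", "ja-JP", "fr-FR", "en-GB"
    };

    public static XElement Lang(string text, string locale)
    {
        if (!Locales.Contains(locale))
            throw new ArgumentException("Unsupported locale: " + locale, nameof(locale));
        // XNamespace.Xml + "lang" is serialized as the xml:lang attribute.
        return new XElement("lang", new XAttribute(XNamespace.Xml + "lang", locale), text);
    }

    public static void Main()
    {
        var speak = new XElement("speak",
            "In India, they pronounce it ", Lang("India", "en-IN"), ".");
        Console.WriteLine(speak.ToString(SaveOptions.DisableFormatting));
    }
}
```

Building the attribute through XNamespace.Xml avoids typing the xml: prefix by hand, which is where a stray space or line break tends to creep in.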


The <phoneme> tag lets you give a phonemic/phonetic pronunciation for the text it contains. It has two attributes: alphabet, which sets the phonetic alphabet (ipa or x-sampa), and ph, which gives the pronunciation.
Let’s look at an example:
  <speak>
     You say,
     <phoneme alphabet="ipa" ph="ˈbɑ.təl">bottle</phoneme>.
  </speak>
In the above example, we set alphabet to ipa and ph to ˈbɑ.təl, so the word is pronounced as bottle. With alphabet set to x-sampa, the same phonemes are written as bA.t@l.
  var speech = new SsmlOutputSpeech();
  speech.Ssml = @"<speak>You say, <phoneme alphabet=""ipa"" ph=""ˈbɑ.təl"">bottle</phoneme>.</speak>";
  // build the response using ResponseBuilder
  var finalResponse = ResponseBuilder.Tell(speech);
  return finalResponse;


The <prosody> tag modifies the volume, pitch, and rate of speech through its volume, pitch, and rate attributes. Each attribute accepts the values described below:
  • volume: changes the volume of speech. It accepts the following values:
    • silent
    • x-soft
    • soft
    • medium
    • loud
    • x-loud
  • pitch: raises or lowers the tone of speech. It accepts the following values:
    • x-low
    • low
    • medium
    • high
    • x-high
  • rate: changes the rate of speech. It accepts the following values:
    • x-slow
    • slow
    • medium
    • fast
    • x-fast
Let’s look at an example:
  <speak>
     Normal volume for the first sentence.
     <prosody volume="x-loud">Louder volume for the second sentence</prosody>.
     When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.
     I can speak with my normal pitch,
     <prosody pitch="x-high">but also with a much higher pitch</prosody>,
     and also <prosody pitch="low">with a lower pitch</prosody>.
  </speak>
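Because all three prosody attributes are optional, a helper with optional parameters keeps the call sites short. The following is a sketch that assumes the markup is built with System.Xml.Linq; SsmlProsody and Prosody are hypothetical names, and the values are passed through without validation.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlProsody
{
    // Adds only the attributes that were supplied; values are not validated here.
    public static XElement Prosody(string text, string volume = null,
                                   string pitch = null, string rate = null)
    {
        var element = new XElement("prosody", text);
        if (volume != null) element.SetAttributeValue("volume", volume);
        if (pitch != null) element.SetAttributeValue("pitch", pitch);
        if (rate != null) element.SetAttributeValue("rate", rate);
        return element;
    }

    public static void Main()
    {
        var speak = new XElement("speak",
            "When I wake up, ",
            Prosody("I speak quite slowly", rate: "x-slow"),
            ".");
        Console.WriteLine(speak.ToString(SaveOptions.DisableFormatting));
        // prints: <speak>When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.</speak>
    }
}
```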


The <say-as> tag specifies how text should be interpreted, for example whether a number is read as individual digits or as a cardinal number. It has two attributes: interpret-as, which tells Alexa how to interpret the text, and format, which is optional and is only used with date.
The interpret-as attribute accepts the following values:
  • characters or spell-out: spells out each letter of the text.
  • cardinal or number: interprets the value as a cardinal number.
  • ordinal: interprets the value as an ordinal number.
  • digits: speaks each digit individually.
  • fraction: interprets the value as a fraction.
  • unit: interprets the value as a measurement.
  • time: interprets the value as a duration in minutes and seconds.
  • telephone: interprets the value as a 7-digit or 10-digit telephone number.
  • date: interprets the value as a date. The format attribute specifies the date format.
  • expletive: bleeps out the content inside the tag.
  • interjection: makes Alexa speak the text in a more expressive voice.
  • address: interprets the value as part of a street address.
Let’s look at an example:
  <speak>
     number is spoken with each digit separately:
     <say-as interpret-as="digits">12345</say-as>.
     number spoken as a cardinal number:
     <say-as interpret-as="cardinal">12345</say-as>.
     word is spelled out:
     <say-as interpret-as="spell-out">hi</say-as>
  </speak>
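The rule that format only applies to date can be enforced in code. The sketch below assumes the markup is built with System.Xml.Linq and that the accepted interpret-as values are characters, spell-out, cardinal, number, ordinal, digits, fraction, unit, time, telephone, date, expletive, interjection, and address. SsmlSayAs and SayAs are hypothetical names, and the mdy format value in the usage example is an assumption to check against the ASK reference.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlSayAs
{
    static readonly HashSet<string> Values = new HashSet<string>
    {
        "characters", "spell-out", "cardinal", "number", "ordinal", "digits",
        "fraction", "unit", "time", "telephone", "date", "expletive",
        "interjection", "address"
    };

    public static XElement SayAs(string text, string interpretAs, string format = null)
    {
        if (!Values.Contains(interpretAs))
            throw new ArgumentException("Unsupported interpret-as: " + interpretAs, nameof(interpretAs));
        if (format != null && interpretAs != "date")
            throw new ArgumentException("format is only used with date", nameof(format));

        var element = new XElement("say-as", new XAttribute("interpret-as", interpretAs), text);
        if (format != null) element.SetAttributeValue("format", format);
        return element;
    }

    public static void Main()
    {
        Console.WriteLine(SayAs("12345", "digits"));
        // prints: <say-as interpret-as="digits">12345</say-as>
        Console.WriteLine(SayAs("09/22/2020", "date", "mdy")); // "mdy" is an assumed format value
    }
}
```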


The <w> tag is similar to say-as, but it customizes pronunciation by specifying the word’s part of speech, such as a present simple verb, a past participle, or a noun. Its role attribute accepts the following values:
  • amazon:NN: interprets the word as a noun.
  • amazon:VB: interprets the word as a verb (present simple).
  • amazon:VBD: interprets the word as a past participle.
  • amazon:SENSE_1: uses the non-default sense of the word.
Let’s look at an example. The word "read" is pronounced differently as a present simple verb and as a past participle:
  <speak>
     The word <say-as interpret-as="characters">read</say-as>
     may be interpreted as either the present simple form
     <w role="amazon:VB">read</w>,
     or the past participle form
     <w role="amazon:VBD">read</w>.
  </speak>


The <voice> tag speaks the text in a specific Amazon Polly voice. Its name attribute selects the voice, and the available voices depend on the language:
  • English, American (en-US): Ivy, Kendra, Matthew, Joanna, Joey, Salli, Justin, Kimberly
  • Hindi (hi-IN): Aditi
  • English, British (en-GB): Amy, Brian, Emma
  • English, Indian (en-IN): Raveena, Aditi
  • Spanish, Castilian (es-ES): Conchita, Enrique
  • Japanese (ja-JP): Mizuki, Takumi
  • German (de-DE): Hans, Vicki, Marlene
  • French (fr-FR): Lea, Celine, Mathieu
  • English, Australian (en-AU):  Russell, Nicole
  • Italian (it-IT): Carla, Giorgio
Let’s look at an example:
  <speak>
     I want to tell you a secret.
     <voice name="Aditi">I am not a real human.</voice>
     Can you believe it?
  </speak>
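Since each voice belongs to specific locales, a lookup table can catch a voice and locale mismatch before the response is sent. This sketch assumes the markup is built with System.Xml.Linq and copies the voice names from the list above (unaccented, as written in this article). SsmlVoice and Voice are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of Alexa.NET.
public static class SsmlVoice
{
    // Voice names per locale, as listed in this article.
    static readonly Dictionary<string, string[]> Voices = new Dictionary<string, string[]>
    {
        ["en-US"] = new[] { "Ivy", "Kendra", "Matthew", "Joanna", "Joey", "Salli", "Justin", "Kimberly" },
        ["hi-IN"] = new[] { "Aditi" },
        ["en-GB"] = new[] { "Amy", "Brian", "Emma" },
        ["en-IN"] = new[] { "Raveena", "Aditi" },
        ["es-ES"] = new[] { "Conchita", "Enrique" },
        ["ja-JP"] = new[] { "Mizuki", "Takumi" },
        ["de-DE"] = new[] { "Hans", "Vicki", "Marlene" },
        ["fr-FR"] = new[] { "Lea", "Celine", "Mathieu" },
        ["en-AU"] = new[] { "Russell", "Nicole" },
        ["it-IT"] = new[] { "Carla", "Giorgio" }
    };

    public static XElement Voice(string text, string name, string locale)
    {
        if (!Voices.TryGetValue(locale, out var names) || Array.IndexOf(names, name) < 0)
            throw new ArgumentException("Voice " + name + " is not listed for " + locale);
        return new XElement("voice", new XAttribute("name", name), text);
    }

    public static void Main()
    {
        Console.WriteLine(Voice("I am not a real human.", "Aditi", "en-IN"));
        // prints: <voice name="Aditi">I am not a real human.</voice>
    }
}
```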


In this article, we covered the remaining SSML tags supported by the Alexa Skills Kit: p, s, emphasis, lang, phoneme, prosody, say-as, w, and voice.