AI-102 Microsoft Azure AI – Implement Knowledge Mining Solutions
- Overview of Azure Cognitive Search
So in this section, we're moving on from the text analytics stuff to the Speech service. Now, the Speech service in some ways is completely different from text, and in other ways you'll often use them together. If you want to translate audio from one language to another, you'll do speech to text, run a translation step, and then do text to speech. The Speech service contains those two capabilities, text to speech and speech to text, which, if you think about it, are quite different:
understanding spoken words and turning them into text, and reading text aloud, are basically two different skill sets, but they all sit under the same API. Now, before we look at the code, we should look at the different languages. In terms of turning speech to text, the Speech service supports a lot of languages. If you look down here, you might be surprised that it understands not only different languages but also dialects and regional differences, so it can tell the difference between Egyptian Arabic, Iraqi Arabic, Jordanian Arabic, and so on. If we scroll down, we can see all the different languages; there have got to be 50 or 100 of them. And even between English speakers, it can understand the difference between
American English from the United States and British English. So it does understand the speakers; this is the speech to text list here, so it understands speakers of these dialects and can transcribe them to text. Now, if we keep going, we get to the opposite, which is text to speech. Text to speech is when the computer synthesizes a human voice. There are different types of voices: standard voices, custom voices, and neural voices. Neural is the newest type and it is the most realistic; it actually uses artificial intelligence and deep neural networks to create the synthesized speech. With neural voices, the synthesized speech is nearly indistinguishable from a human. And this is where you can get into trouble, because we've seen projects in the past where companies have had a computer phone businesses and ask them a question without telling them it was a computer calling. People are often fooled by synthesized speech when it's actually computer generated. So neural is relatively new and, again, the most realistic.
And this is text to speech, so we're looking at a list of realistic voices. Sometimes, for the same language, they have a couple of different voices. If we go down to British English, United Kingdom English, we can see that there are two female voices, and they are distinguished by human names: Libby Neural and Mia Neural. So these are two different imitations of two different women speaking British English. In that case there's only one male English neural voice. Same with the United States: there are two female neural voices, Aria and Jenny. Now, I mentioned there are two other types. There's also the classic, or standard, type. If I scroll down to the standard voices here, these are standard synthesized speech, and you can tell that it's a computer talking; this is the more stilted kind of speech you might be used to from computer-generated voices. And if we go down to English here, under Canadian English we have a voice named Linda, and when we look at our GitHub repository for text to speech, we're going to be using Linda's voice.
So it's going to sound a bit stilted: clearly computer generated, trying to be understood rather than trying to fully imitate a human. In this video the course will cover text to speech, and a couple of sections later we'll cover speech to text. The last type of voice is called custom voice, and with custom voice you are able to train the voice yourself. It is available in standard and neural tiers, and you basically provide training data to train your custom voice. Now, I don't think most people are going to be training their own voices. Maybe out of vanity you want your own voice to be spoken by the computer, or maybe you're trying to imitate some famous person. So the ability to train it is there, but it's probably not that common; most of the time you'll choose either the standard or the neural option. So let's switch over to the code. We're going to go into the repository under Speech, Text to Speech. In this case we are using the SDK, so we import the speech module from Cognitive Services and alias it as speechsdk, and we configure it with our subscription key and our region. And we're choosing, again, this Canadian Linda voice, which obviously sounds synthetic.
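Here's a minimal sketch of what that setup might look like, assuming the azure-cognitiveservices-speech package is installed; the key, region, and the "en-CA-Linda" voice name are placeholders based on what's described above, not copied from the course repo.

```python
import azure.cognitiveservices.speech as speechsdk

# Subscription key and region come from your Speech resource in the Azure portal (placeholders here)
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# Pick the Canadian English standard voice named Linda (voice name assumed from the description above)
speech_config.speech_synthesis_voice_name = "en-CA-Linda"
```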
Now it's pretty straightforward. You create your synthesizer and then, in this particular case, you call speak_text_async with the text, and then call the get method on that, which waits for the result. Running that line of code should play the text through the speaker, so as we run this we expect the computer to actually speak the text that we enter. Now, playing it on the speaker might not be the result you want, so we also support running the same text but sending the output to audio: we pass in a configuration, which means it goes to a WAV file instead of the speaker. Both call the same command, just with different parameters, as sketched below. So when we run this in PyCharm, we expect the audio to play: "The quick brown fox jumps over the lazy dog." Thank you, Linda. And that did sound quite stilted. We can also see that it saved a file to our local machine, so if we navigate there we can play it: "The quick brown fox jumps over the lazy dog." It's the same output because we have the same input, obviously. So in this video, we saw that we can pass text to Cognitive Services and generate audio, which we can save to a file or play for the end user.
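A rough sketch of those two synthesis calls (speaker output and WAV file output), again with a placeholder key, region, voice name, and output file name:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-CA-Linda"  # placeholder voice name

text = "The quick brown fox jumps over the lazy dog"

# 1) Synthesize to the default speaker: no audio config is passed
speaker_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
speaker_synthesizer.speak_text_async(text).get()  # .get() blocks until synthesis finishes

# 2) Synthesize to a WAV file instead: pass an AudioOutputConfig pointing at a file
file_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")
file_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=file_config)
file_synthesizer.speak_text_async(text).get()
```

Note that both cases call the same speak_text_async method; the only difference is whether an audio output configuration is supplied when the synthesizer is created.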
- Implement a Cognitive Search solution
So in this video we are switching from text to speech to speech to text. We can go to the GitHub repository and look for the speech-to-text Python script. We are using the Cognitive Services SDK, specifically the speech API under that. We set it up with our key and our region, creating the speech configuration. In this first example, we grab a WAV file from our local machine, set that up in an audio configuration, and create the SpeechRecognizer object based on the speech configuration and the audio configuration. We call the recognize_once method on the SpeechRecognizer, which uploads the WAV file to Microsoft Azure Cognitive Services, and it recognizes the speech and turns it into text. Now, remember we said earlier that the speech-to-text SDK and API support many different languages and dialects.
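A minimal sketch of that first example, assuming a hypothetical local file named speech.wav and placeholder credentials:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# Point the recognizer at a local WAV file (hypothetical file name)
audio_config = speechsdk.audio.AudioConfig(filename="speech.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# recognize_once sends the audio to the service and returns the first utterance it finds
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```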
Now, it's important to note that sometimes the dialect can be determined from the speech itself, and sometimes Azure isn't quite good enough to tell the dialects apart. For example, with Arabic it only recognizes the Egyptian dialect automatically, but if you explicitly specify the language as, say, Kuwaiti or Lebanese Arabic or one of the other Arabic dialects, it may do a better job of converting the speech to text. You can see all the different languages. Somehow it can even recognize the difference between Australian English and Canadian English, which is impressive.
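If you do know the dialect up front, you can set it on the configuration before creating the recognizer. A small sketch, with the Kuwaiti Arabic locale code used purely as an illustration:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# Tell the service which dialect to expect (locale code shown here is illustrative)
speech_config.speech_recognition_language = "ar-KW"
```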
Just want to say "good eye might" to my Australian friends, and hopefully that will confuse things if anyone feeds this audio into speech recognition; we'll see if it comes out as Canadian or Australian. So let's switch over to the code again. We called the recognize_once method to get that first result. In the second example we still create a SpeechRecognizer, but this time we ask for the audio to come in through the default input, because we don't pass in an audio config at all, and the default input is basically the microphone.
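A sketch of that microphone example: note there is no audio config passed to the recognizer, so it falls back to the default microphone. Key and region are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# No AudioConfig here, so the recognizer listens on the default microphone
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Speak into your microphone...")
result = recognizer.recognize_once()  # returns after the first recognized utterance
print(result.text)
```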
And so we basically speak into our microphone, it calls recognize_once on that, and it turns the speech into text. So we have two examples in this script. We load it into PyCharm and run it. Remember, the first example uploads a WAV file and does the recognition. In my experience this takes a little longer, and as we can see it's taking more than a few seconds, but it does come back with the recognized text: "The quick brown fox jumps over the lazy dog." So we can see that Cognitive Services correctly transcribed that speech. Now, once you have this as text, you can go on to the Text APIs to do sentiment analysis, key phrase extraction, translation, and all that great stuff. So Cognitive Services has both text-to-speech and speech-to-text capabilities.