Get to know MDN better
Cette page a été traduite à partir de l'anglais par la communauté. Vous pouvez contribuer en rejoignant la communauté francophone sur MDN Web Docs.
L'API Web Speech fournit deux fonctionnalités différentes — la reconnaissance vocale, et la synthèse vocale (aussi appelée "text to speech", ou tts) — qui ouvrent de nouvelles possibiités d'accessibilité, et de mécanismes de contrôle. Cet article apporte une simple introduction à ces deux domaines, accompagnée de démonstrations.
La reconnaissance vocale implique de recevoir de la voix à travers un dispositif de capture du son, tel qu'un microphone, qui est ensuite vérifiée par un service de reconnaissance vocale utilisant une "grammaire" (le vocabulaire que vous voulez faire reconnaître par une application donnée). Quand un mot ou une phrase sont reconnus avec succès, ils sont retournés comme résultat (ou une liste de résultats) sous la forme d'une chaîne de caractères, et d'autres actions peuvent être initiées à la suite de ce résultat.
L'API Web Speech a une interface principale de contrôle — SpeechRecognition — plus un nombre d'interfaces inter-reliées pour représenter une grammaire, des résultats, etc. Généralement, le système de reconnaissance vocale par défaut disponible sur le dispositif matériel sera utilisé pour la reconnaissance vocale — la plupart des systèmes d'exploitation modernes ont un système de reonnaissance vocale pour transmettre des commandes vocales. On pense à Dictation sur macOS, Siri sur iOS, Cortana sur Windows 10, Android Speech, etc.
Note : Sur certains navigateurs, comme Chrome, utiliser la reconnaissance vocale sur une page web implique de disposer d'un moteur de reconnaissance basé sur un serveur. Votre flux audio est envoyé à un service web pour traitement, le moteur ne fonctionnera donc pas hors ligne.
Pour montrer une simple utilisation de la reconnaissance vocale Web speech, nous avons écrit une demo appelée Speech color changer. Quand l'écran est touché ou cliqué, vous pouvez dire un mot clé de couleur HTML et la couleur d'arrière plan de l'application sera modifié par la couleur choisie.
Pour lancer la demo, vous pouvez cloner (ou directement télécharger) le dépôt Github dont elle fait partie, ouvrir le fichier d'index HTML dans un navigateur pour ordinateur de bureau le supportant comme Chrome, ou naviguer vers l'URL de démonstration live, sur un navigateur pour mobile le supportant comme Chrome.
Le support de la reconnaissance vocale via l'API Web Speech est actuellement limité à Chrome pour ordinateur de bureau et pour mobiles sur Android — Chrome le supporte depuis la version 33 mais avec des interfaces préfixées spécifiques, donc vous avez besoin d'inclure des versions d'interfaces préfixées définies, comme : webkitSpeechRecognition.
The HTML and CSS for the app is really trivial. We simply have a title, instructions paragraph, and a div into which we output diagnostic messages.
The CSS provides a very simple responsive styling so that it looks ok across devices.
Let's look at the JavaScript in a bit more detail.
As mentioned earlier, Chrome currently supports speech recognition with prefixed properties, therefore at the start of our code we include these lines to feed the right objects to Chrome, and any future implementations that might support the features without a prefix:
The next part of our code defines the grammar we want our app to recognise. The following variable is defined to hold our grammar:
The grammar format used is JSpeech Grammar Format (JSGF) — you can find a lot more about it at the previous link to its spec. However, for now let's just run through it quickly:
The next thing to do is define a speech recogntion instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.
We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation of other grammars available in the list (can be from 0 to 1 inclusive.) The added grammar is available in the list as a SpeechGrammar object instance.
We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on:
After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start(). The forEach() method is used to output colored indicators showing what colors to try saying.
Once the speech recognition is started, there are many event handlers that can be used to retrieve results, and other pieces of surrounding information (see the SpeechRecognition event handlers list.) The most common one you'll probably use is SpeechRecognition.onresult, which is fired once a successful result is received:
The second line here is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognised words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then return its transcript property to get a string containing the individual recognised result as a string, set the background color to that color, and report the color recognised as a diagnostic message in the UI.
We also use a SpeechRecognition.onspeechend handler to stop the speech recognition service from running (using SpeechRecognition.stop()) once a single word has been recognised and it has finished being spoken:
The last two handlers are there to handle cases where speech was recognised that wasn't in the defined grammar, or an error occured. SpeechRecognition.onnomatch seems to be supposed to handle the first case mentioned, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognised anyway:
SpeechRecognition.onerror handles cases where there is an actual error with the recognition successfully — the SpeechRecognitionError.error property contains the actual error returned:
Speech synthesis (aka text-to-speech, or tts) involves receiving synthesising text contained within an app to speech, and playing it out of a device's speaker or audio output connection.
The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesised (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.
To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis. This includes a set of form controls for entering text to be synthesised, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter/Return to hear it spoken.
To run the demo, you can clone (or directly download) the Github repo it is part of, open the HTML index file in a supporting desktop browser, or navigate to the live demo URL in a supporting mobile browser like Chrome, or Firefox OS.
Support for Web Speech API speech synthesis is still getting there across mainstream browsers, and is currently limited to the following:
The HTML and CSS are again pretty trivial, simply containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option>s via JavaScript (see later on.)
Let's investigate the JavaScript that powers this app.
First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis. This is API's entry point — it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis.
To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name), the language of the voice (grabbed from SpeechSynthesisVoice.lang), and -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true.)
We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.
When we come to run the function, we do the following. This is because Firefox doesn't support SpeechSynthesis.onvoiceschanged, and will just return a list of voices when SpeechSynthesis.getVoices() is fired. With Chrome however, you have to wait for the event to fire before populating the list, hence the if statement seen below.
Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter/Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.
Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.
Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak(), passing it the SpeechSynthesisUtterance instance as a parameter.
In the final part of the handler, we include an SpeechSynthesisUtterance.onpause handler to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and name that the speech was paused at.
Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.
The last part of the code simply updates the pitch/rate values displayed in the UI, each time the slider positions are moved.
Cette page a été modifiée le 17 déc. 2024 par les contributeur·ice·s du MDN.
Votre modèle pour un internet meilleur.
Visitez la société mère à but non lucratif de Mozilla Corporation, la Fondation Mozilla.
Certaines parties de ce contenu sont protégées par le droit d'auteur ©1998—2026 des contributeurs individuels de mozilla.org. Contenu disponible sous une licence Creative Commons.