The Internet of Speech is here, transforming the way we interact with our devices.
Siri signals your next turn in an unfamiliar city. Google Assistant searches the web for instructions on how to grill salmon and reads them to you as you work. The voice bot on the other end of the customer service line delivers results without making you wait through menus or press buttons. Call it the Conversational Computing Era, and the computer side of these conversations is supported by a digital technology called text-to-speech, or TTS for short.
But TTS is not just for new and sophisticated voice computing applications. It has been used for years as an accessibility tool, as educational technology (edtech), and as an audio alternative to reading. In 2021, almost one in four American adults listened to audiobooks, and TTS may have helped make some of these experiences possible. All these examples just scratch the surface of what TTS can do.
In this article, we describe the standard meaning of text-to-speech and list some of the demographics that benefit from TTS. Next, we look at some ways organizations can use language technology to achieve mission-critical goals. Finally, we'll take you through the history of this ever-evolving field. Here's your definitive introduction to TTS technology, starting with a basic question:
What is TTS? In other words, what does TTS mean?
Curious to know what today's leading TTS actually sounds like? Discover ReadSpeaker's TTS voices, complete with audio samples.
Text-to-speech: meaning and science behind the term
Text-to-speech technology is software that takes text as input and produces audible speech as output. In other words, it goes from text to speech, making TTS one of the more aptly named technologies of the digital revolution. A TTS system includes software that predicts the best possible pronunciation of a given text. It also contains the program that generates the speech sound waves; this is called a vocoder.
Text-to-speech is a multidisciplinary field that requires in-depth knowledge of a variety of sciences. If you want to build a TTS system from scratch, you need to study the following topics:
- Linguistics, the scientific study of language. In order to synthesize coherent speech, TTS systems need a way to recognize how a human speaker pronounces written language. This requires knowledge of linguistics down to the level of phonemes, the sound units that together make up language, like the /k/ sound in cat. To achieve truly lifelike TTS, the system must also predict correct prosody, which includes speech elements beyond the phoneme, such as stress, pauses, and intonation.
- Audio signal processing, the creation and manipulation of digital representations of sound. Audio (speech) signals are electronic representations of sound waves. The speech signal is digitally represented as a sequence of numbers. In the context of TTS, researchers use different feature representations that describe discrete aspects of the speech signal, making it possible to train AI models to generate new speech.
- Artificial intelligence, specifically deep learning, a type of machine learning that uses a computing architecture called a deep neural network (DNN). A neural network is a computational model inspired by the human brain. It consists of complex webs of processors, each of which performs a processing task before sending its output to another processor. Through training, a DNN learns the best processing route to get accurate results. This model offers great computational power, making it ideal for handling the large number of variables required for high-quality speech synthesis.
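To make the idea of a speech signal as "a sequence of numbers" concrete, here is a minimal Python sketch that samples a pure tone into 16-bit integer values. The function name `sample_tone`, the 16 kHz sample rate, and the 440 Hz tone are illustrative assumptions, not part of any ReadSpeaker product; real speech is far more complex than a single sine wave.

```python
import math

def sample_tone(freq_hz, duration_s, sample_rate=16000):
    """Return a sine tone as a list of signed 16-bit integer samples.

    Hypothetical helper for illustration: it only shows how a sound
    wave becomes a sequence of numbers.
    """
    n_samples = round(duration_s * sample_rate)
    peak = 32767  # largest value a signed 16-bit sample can hold
    return [
        round(peak * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_samples)
    ]

# 10 milliseconds of a 440 Hz tone at 16,000 samples per second
samples = sample_tone(freq_hz=440, duration_s=0.01)
print(len(samples))  # 160
print(samples[:5])   # the first few numbers of the digital signal
```

Feature representations used in TTS research are computed from sequences like this one, so a model can learn from compact descriptions of the signal rather than from raw samples alone.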
ReadSpeaker linguists conduct research and practice in all these areas and continually develop TTS technology. These researchers are producing lifelike TTS voices for brands and developers, voices that let businesses express themselves wherever customers listen, whether on a smartphone, through a smart speaker, or in a voice-enabled mobile app. In fact, TTS voices are emerging on a growing number of devices, for a growing number of applications (and users).
Who uses TTS?
People with visual and reading disabilities were early adopters of TTS. It makes sense: TTS makes the Internet experience easier for the 1 in 5 people who have dyslexia. It also helps people with other reading and learning disabilities by taking the stress out of reading and presenting information in an optimal format. We are moving towards a more accessible Internet of the future, and TTS is an integral part of that movement.
Many forward-thinking content owners and publishers already offer TTS solutions to make the web a place for everyone. Businesses and buildings are required to provide access for wheelchair users and people with reduced mobility. Shouldn't the Internet be accessible to everyone, too? As the technology advances, though, the uses and users of TTS are evolving as well. You may not need TTS, but you will almost certainly want it. Text-to-speech can make your life easier and more efficient, no matter who you are.
These are just some of the demographics already benefiting from TTS technology:
1. Students
Current studies suggest that students benefit more from blended presentations. Some students retain more information when it is presented in both audio and visual formats, an approach also known as bimodal learning. A popular educational framework called Universal Design for Learning (UDL) recommends bimodal learning to help all students succeed. Teachers at all grade levels who promote UDL use a combination of auditory, visual, and kinesthetic techniques with the help of technology and adaptive lesson plans.
Even if you identify as a kinesthetic or visual learner, research suggests that adding an auditory method can help you retain information. Last but not least, TTS makes reviewing and proofreading your own writing much easier.
2. Readers on the move
If you want to know what's new, podcasts and audiobooks will only take you so far. So if there is an in-depth profile in The New Yorker or a long article in The Guardian you want to read, TTS can read it aloud for you. That way, you can drive, exercise, or clean at the same time. Or maybe you simply prefer listening to reading. According to leading technology experts, online content will soon be automatically converted to audio so more people can enjoy content on the go.
Dharmesh Shah's keynote at INBOUND 2016
3. Multitaskers
The shortcuts TTS can provide are endless, from reading recipes aloud while you cook to reciting instruction manuals while you assemble furniture. The only limit to how much it can help is your own imagination.
4. Mature Readers
Older adults understandably want to avoid straining their eyes to read tiny text on a smartphone. Text-to-speech can solve this problem by making online content easy to consume, regardless of your technical know-how.
5. Younger generations
Give young people technology and they are likely to use it, whether or not it is strictly 'necessary' for them. In 2022, 70% of 18- to 25-year-old consumers "mostly" turned on closed captions while watching video content, not because they were hard of hearing, but because it was convenient. Many TikTok users have likewise embraced that app's TTS feature, and competitor Instagram released its own TTS feature in 2021. Meanwhile, one student survey found that only 5% of respondents had a disability that required the use of assistive technology, yet at least 18% of students considered such technology 'necessary'. The point is, Gen Z uses TTS not just as an accessibility tool, but as a preference.
6. Readers with visual impairment or sensitivity to light
Older adults aren't the only ones who want to avoid squinting at screens. Many people have mild visual impairments or are sensitive to light. For example, think of people with chronic migraines. Thanks to TTS, these users can be more productive on days when looking at screens seems too painful.
In fact, medical studies suggest that exposure to light at night, particularly blue light from computer screens, has adverse health effects. Not only does it disrupt our internal clocks, it may also increase the risk of cancer, diabetes, heart disease, and obesity. Text-to-speech offers users a safer way to consume written content without looking at a screen.
7. Foreign language students
Studies show that listening to another language helps students learn it. Text-to-speech can help with this. ReadSpeaker is an international TTS software company with 50+ languages and 150+ voices, all based on native speakers.
With ReadSpeaker, foreign language learners can get a sense of pronunciation, cadence, and accent. A particularly useful feature in this regard is the ability to highlight words as they are read aloud, which can help students feel more confident when pronouncing new vocabulary.
8. Multilingual readers
New generations growing up in multilingual homes may understand their grandparents' language, but may not feel fluent enough to read, write, or speak it. This is common in many communities where the mother tongue is not taught in schools. For second and third generations who want to maintain or strengthen their ties to their home countries, ReadSpeaker can make articles, journals, and other literature accessible and understandable through speech.
9. People with severe speech impediments
A speech-generating device (SGD), also known as a voice output communication aid (VOCA), is useful for those who have severe speech impediments and would otherwise be unable to communicate verbally. Grouped under the term "augmentative and alternative communication (AAC)," SGDs and VOCAs can now be integrated into mobile devices such as smartphones.
Stephen Hawking, who had ALS, and the well-known film critic Roger Ebert were among the best-known SGD users relying on TTS technology. So who uses TTS? Many people, for many different reasons. And if you're looking for a way to tackle today's business challenges, TTS might be the technology for you.
For more information about ReadSpeaker TTS services, consult our Products or FAQ pages.
TTS technology for enterprises
When ReadSpeaker AI began working on speech synthesis in 1999, TTS was primarily used as an accessibility tool. Text-to-speech makes written content available across platforms to people with visual impairments, low literacy, cognitive impairments, and other barriers to access. And while accessibility remains a core value of ReadSpeaker solutions, the rise of voice computing has led to a growing range of applications for TTS across all devices, especially in business.
Here are just a few of the powerful enterprise use cases for TTS in today's voice-first world:
- Conversational interactive voice response (IVR) systems, as in customer service call centers
- Voice commerce applications, such as shopping on an Amazon Alexa device
- Voice guidance and navigation tools, like GPS map apps
- Smart home devices and other voice-enabled Internet of Things (IoT) tools
- Standalone virtual assistants, like Apple's Siri but for your own brand
- Experiential marketing and advertising solutions, such as interactive voice ads on music streaming services or branded smart speaker apps
- Video game development, with dynamic runtime TTS for accessibility features, scene prototyping, and AI non-player characters
- Corporate training and marketing videos that let creators revise voiceovers without bringing back the original voice talent for repeated recording sessions
Chances are you've already experienced TTS in some or all of these examples. If you run a business, you may have even helped create a voice-first device or experience. With such widespread adoption, it's safe to say that TTS is here to stay. But it's not exactly a new technology.
Types of TTS technology, then and now
Mechanical attempts at synthetic speech date back to the 18th century. Electrical synthetic speech has been around since Homer Dudley's Voder of the 1930s. But the first system that went directly from text to speech in English arrived in 1968 and was designed by Noriko Umeda and a team at Japan's Electrotechnical Laboratory.
Since then, researchers have developed a cascade of new TTS technologies, each working in different ways. You might be wondering, "How does text-to-speech work?" The answer depends on the TTS technology used. Here's a brief overview of the dominant forms of TTS, past and present, from early experiments to the latest AI features.
Formant synthesis and articulatory synthesis
Early TTS systems used rule-based technologies such as formant synthesis and articulatory synthesis, which achieved a similar result through slightly different strategies. Pioneering researchers recorded a speaker and extracted acoustic characteristics from that recorded speech: formants, the resonances that define the qualities of speech sounds, in formant synthesis, and manner of articulation (nasal, plosive, vowel, etc.) in articulatory synthesis. They then programmed rules that recreated these parameters with a digital audio signal. This TTS was quite robotic; these approaches necessarily abstract away much of the variation you find in human speech, things like variation in pitch and stress, because they only allow programmers to write rules for a few parameters at a time. But formant synthesis is not just a historical novelty: it is still used in the open source TTS synthesizer eSpeak NG, which synthesizes speech for NVDA, one of the leading free screen readers for Windows.
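As a rough illustration of the rule-based idea, the Python sketch below shapes a buzzing pulse train with three resonators tuned to formant frequencies, producing a crude "ah"-like vowel. This is a simplified source-filter sketch, not how eSpeak NG or any production synthesizer is implemented. The function names, the formant frequencies and bandwidths (approximate textbook values for an adult male "ah"), and the 120 Hz pitch are all assumptions chosen for illustration.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator that emphasizes energy near freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    b = 2 * r * math.cos(2 * math.pi * freq_hz / sample_rate)
    c = -r * r
    a = 1 - b - c  # scales the filter so its gain at 0 Hz is 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesize_vowel(formants, pitch_hz=120, duration_s=0.3, sample_rate=16000):
    """Filter an impulse train (a stand-in for vocal fold pulses) through formants."""
    n_samples = round(duration_s * sample_rate)
    period = round(sample_rate / pitch_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range -1.0..1.0

# Assumed (frequency, bandwidth) pairs in Hz for an "ah"-like vowel
AH_FORMANTS = [(730, 90), (1090, 110), (2440, 170)]
vowel = synthesize_vowel(AH_FORMANTS)
print(len(vowel))  # 4800 samples, i.e. 0.3 seconds at 16 kHz
```

Changing only the three formant pairs changes which vowel the output resembles, which is the sense in which these systems are driven by a handful of hand-written parameters. It also hints at why the result sounds robotic: pitch and timing stay perfectly constant.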
Diphone synthesis
The next big development in TTS technology is called diphone synthesis, which researchers developed in the 1970s and which was still in widespread use at the turn of the millennium. Diphone synthesis creates machine speech by combining diphones, single units that capture a combination of phonemes and the transition from one phoneme to the next: not just the /k/ in the word cat, but /k/ plus half of the following sound /ae/. Researchers record between 3,000 and 5,000 individual diphones, which the system combines into a coherent utterance.
Diphone synthesis TTS technology also includes software models that predict the duration and pitch of each diphone for a given input. With these two systems layered on top of each other, the system concatenates the diphone signals and then processes the result to correct pitch and duration. The end result is synthetic speech that sounds more natural than formant synthesis, but it is still far from perfect, and listeners can easily distinguish it from a human speaker.
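The first step of this approach, turning a phoneme sequence into the diphone units to look up, can be sketched in a few lines of Python. The function name `to_diphones`, the `_` symbol for silence, and the `a-b` naming scheme are hypothetical conventions chosen for this example, not a standard.

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into diphone names, padded with silence ("_")."""
    padded = ["_"] + list(phonemes) + ["_"]
    # Each diphone spans one phoneme and the transition into the next
    return [f"{first}-{second}" for first, second in zip(padded, padded[1:])]

print(to_diphones(["k", "ae", "t"]))
# ['_-k', 'k-ae', 'ae-t', 't-_']
```

A diphone synthesizer would then fetch the recording for each name, join the recordings end to end, and adjust pitch and duration in the separate processing step described above.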
Unit selection synthesis
In the 1990s, a new form of TTS technology took hold: unit selection synthesis, which is still ideal for low-footprint TTS engines. Where diphone synthesis added the appropriate duration and pitch through a second processing system, unit selection synthesis skips this step: it starts with a large database of recorded speech (about 20 hours or more) and selects the sound fragments that already have the duration and pitch the text input requires for natural-sounding speech.
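One common way to frame this selection is as a search for the sequence of recorded units with the lowest total cost, where a target cost measures how well a unit matches the desired pitch and duration and a join cost penalizes audible jumps between neighboring units. The Python sketch below illustrates that framing with a tiny, made-up database. The `Unit` class, both cost functions, and all the numbers are assumptions for illustration; the article does not describe the specific costs any real engine uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    """One recorded fragment in a (toy) speech database."""
    phoneme: str
    pitch_hz: float
    duration_ms: float

def target_cost(unit, want_pitch, want_duration):
    # Crude mismatch measure that simply adds Hz and ms differences
    return abs(unit.pitch_hz - want_pitch) + abs(unit.duration_ms - want_duration)

def join_cost(previous, current):
    # Penalize pitch jumps between consecutive units
    return abs(previous.pitch_hz - current.pitch_hz)

def select_units(targets, database):
    """Pick the cheapest unit sequence with dynamic programming.

    targets: list of (phoneme, pitch_hz, duration_ms) tuples.
    """
    best = {}  # candidate unit -> (total cost so far, path ending in that unit)
    for index, (phoneme, pitch, duration) in enumerate(targets):
        candidates = [u for u in database if u.phoneme == phoneme]
        if not candidates:
            raise ValueError(f"no recorded unit for phoneme {phoneme!r}")
        new_best = {}
        for unit in candidates:
            cost = target_cost(unit, pitch, duration)
            if index == 0:
                new_best[unit] = (cost, [unit])
            else:
                prev_cost, prev_path = min(
                    ((c + join_cost(p, unit), path) for p, (c, path) in best.items()),
                    key=lambda item: item[0],
                )
                new_best[unit] = (prev_cost + cost, prev_path + [unit])
        best = new_best
    return min(best.values(), key=lambda item: item[0])[1]

database = [
    Unit("k", 120, 60), Unit("k", 180, 90),
    Unit("ae", 125, 110), Unit("ae", 200, 100),
    Unit("t", 118, 70), Unit("t", 210, 65),
]
targets = [("k", 120, 60), ("ae", 130, 100), ("t", 120, 70)]
chosen = select_units(targets, database)
print([unit.pitch_hz for unit in chosen])  # [120, 125, 118]
```

Because the chosen fragments already sit close to the desired pitch and duration, little or no signal processing is needed afterward, which is the advantage over diphone synthesis described above.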
Unit selection synthesis produces human-like speech without much signal modification, but it is still identifiably artificial. Throughout these decades of development, computer processing power and available data storage increased rapidly. The stage was set for the next era of TTS technology, which, like much of today's computing, relies on artificial intelligence to provide incredible predictive power.
Neural synthesis
Remember the deep neural networks we mentioned earlier? This is the technology that is driving today's advances in TTS and is the key to the realistic results that are now possible. Like its predecessors, neural TTS starts with voice recordings. That's one input. The other is text, the written script that the source voice talent used to create those recordings. Feed these inputs into a deep neural network, and it learns the best possible mapping between a short piece of text and its acoustic characteristics.
Once trained, the model can predict realistic sound for new texts: with a trained neural TTS model, the system, together with a vocoder trained on the same data, can produce speech that is remarkably similar to the source voice talent's when confronted with virtually any new text. This similarity between source and output is why neural TTS is sometimes referred to as "voice cloning."
There are all sorts of signal processing tricks you can use to change the resulting synthesized voice so that it doesn't sound exactly like the source speaker. The most important fact to remember is that the best AI-generated TTS voices still start with a human speaker, and TTS technology is becoming more and more human. Current research is resulting in TTS voices that speak with emotional expression, single voices that speak multiple languages, and increasingly realistic audio quality. Discover available languages and voices with ReadSpeaker TTS.
That's probably more technical detail than you need, but it covers the basics of text-to-speech and then some. And if you still have questions, follow the links below.
For more information on text-to-speech, help creating your own brand voice, or access to market-ready TTS voices in more than 30 languages, contact ReadSpeaker today.