Amazon is the latest tech giant to unveil a voice AI model. According to Amazon, its Nova Sonic is “a new foundation model that unifies speech understanding and speech generation into a single model, to enable more human-like voice conversations in AI applications.” Nova Sonic will compete with similar models from OpenAI, Google, and other tech companies.
Nova Sonic understands more than words
Nova Sonic doesn’t just understand the speaker’s words; it also processes their tone, style, and pace. The AI voice generator adapts to the conversational context, so dialogue flows more naturally than with the stilted models of the first generations of Alexa. Nova Sonic can do this because it combines multiple speech-processing and speech-generation functions into a single AI model rather than chaining several separate models together.
Traditionally, AI voice tools involved running multiple models in sequence: a speech recognition model would convert speech to text, then a large language model (LLM) would process the input text and generate responses, and finally a text-to-speech model would convert text back to audio. This complex pipeline often stripped away the tone, style, and pacing of the speaker’s original dialogue.
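The three-stage cascade described above can be sketched in a few lines. This is a minimal illustrative mock-up, not real code from any vendor: the stub functions and their return values are hypothetical placeholders that stand in for the speech recognition, LLM, and text-to-speech models, to show where acoustic information gets lost.

```python
# Hypothetical sketch of the traditional three-model voice pipeline.
# All functions below are illustrative stubs, not a real API.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: a speech recognition model transcribes audio to text.
    Tone, style, and pacing are discarded at this hand-off."""
    return "what's the weather like today"  # placeholder transcript

def llm_respond(prompt: str) -> str:
    """Stage 2: a large language model generates a text response."""
    return f"Here is a response to: {prompt}"  # placeholder response

def text_to_speech(text: str) -> bytes:
    """Stage 3: a text-to-speech model synthesizes audio from text.
    It never sees the original speaker's delivery."""
    return text.encode("utf-8")  # placeholder "audio" bytes

def voice_pipeline(audio: bytes) -> bytes:
    # Each hand-off passes only text, so the acoustic context of the
    # input speech never reaches stages 2 and 3.
    transcript = speech_to_text(audio)
    reply_text = llm_respond(transcript)
    return text_to_speech(reply_text)
```

Because the only thing flowing between stages is a text string, a unified model like Nova Sonic, which keeps the audio signal in view end to end, can respond to cues this pipeline structurally cannot see.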
Since Nova Sonic combines all of this in one model, it can adapt to the acoustic context of the input speech. It also responds more naturally to the cadences of human speech; for instance, it won’t interrupt when the speaker hesitates or pauses to take a breath.
How to get Nova Sonic
Nova Sonic is currently available via a new API in Amazon Bedrock, the company’s platform for building enterprise AI applications, where it is intended to simplify the development of voice applications.
What developers need to know about Amazon Nova
The tech giant recently introduced Amazon Nova Act, a new AI model trained to perform actions within a web browser. In addition, there is an Amazon Nova SDK for developers to explore. Among the foundation models is Nova Canvas, which generates high-quality images; other models in the family generate text from different input modalities, as well as videos from text and image input.