Speech generation that keeps up with an auctioneer?

Fast voice synthesis, deepfakes and LLM compute charts

Welcome back to the Enterprise AI Playbook, Issue 2. Here are the successes, cautionary tales and deep dives from this week.

Successful launches - Fast Text-to-Speech

A new model, Sonic, released by Cartesia.AI, generates Voice faster than previous synthesis offerings like ElevenLabs. This release, along with the continued rapid development in the Voice synthesis space by OpenAI, Google and ElevenLabs, is sparking a renaissance in Voice applications. IVR, outbound calling and real-time translation will likely accelerate thanks to these advances.

Sonic model in 5 key points

Unlike standard LLM architectures built on transformers, the Sonic model is based on a newer (2021) architecture called state space models (Cartesia's founders wrote some of the original papers). Using these models for Voice is quite novel, so we’ll see how they power the growth of next-gen Voice apps.
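For intuition, here's a minimal sketch of the linear state space recurrence that the original papers (e.g. S4) build on. Everything below, the shapes, the parameter values, the toy input, is illustrative only and is not Sonic's actual architecture:

```python
import numpy as np

# Minimal sketch of a discrete linear state space layer, the core idea
# behind SSM architectures like S4. All shapes and values here are
# illustrative assumptions, not Sonic's real parameters.

def ssm_scan(A, B, C, D, u):
    """Recurrence: x_t = A @ x_{t-1} + B * u_t, output y_t = C @ x_t + D * u_t."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                 # sequential form; in training, SSMs exploit
        x = A @ x + B * u_t       # an equivalent convolutional view for speed
        ys.append(C @ x + D * u_t)
    return np.array(ys)

rng = np.random.default_rng(0)
N = 16                                  # hidden state size (illustrative)
A = rng.normal(scale=0.1, size=(N, N))  # state transition matrix
B = rng.normal(size=N)                  # input projection
C = rng.normal(size=N)                  # output projection
D = 1.0                                 # direct feedthrough / skip term

signal = rng.normal(size=100)           # stand-in for a 1-D audio signal
print(ssm_scan(A, B, C, D, signal).shape)  # -> (100,)
```

The key property for fast Voice generation: at inference time the model only carries the fixed-size state forward, so each new sample costs the same regardless of sequence length, unlike a transformer's ever-growing attention window.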

Cautionary Tales - Misleading media and AI’s impact - Google

While the AI space continues to move quickly with interesting advancements, Google has released a paper, AMMEBA, which underscores the importance of understanding misinformation on the modern internet. Deepfakes, and even simpler “shallow” fakes, continue to proliferate as visual, audio and video content becomes harder to distinguish from the real thing. In response, independent fact checkers and various cryptography techniques will become increasingly important.

One of the interesting examples presented in the paper examines how many fact checks are done for images: the image of Pope Francis went viral while also attracting fact-checking bodies thanks to its uncanny blend of the suspicious and the realistic.

The Pope deepfake generated previously unseen levels of independent fact checking

Read more in the paper.

Deep dive - How much compute is needed for top models?

Models have been growing immensely in size, and the Epoch AI team has created some great visualizations on the topic, highlighting the 4-5x yearly growth in training compute for top-performing models. This growth has been powered by exponential gains in training efficiency along with exponentially larger budgets; thankfully, what was cutting edge even a year ago can now run on consumer hardware. We are likely to see continued growth and refinement of models in the coming years as investment dollars continue to flow.
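To get a back-of-the-envelope feel for what 4-5x yearly compounding means, here's a quick projection. The baseline FLOP count is an assumption for illustration, not a figure from the Epoch AI article:

```python
# Project frontier training compute forward at the ~4-5x/year growth
# rate Epoch AI describes. The 1e25 FLOP baseline is an assumed
# starting point, not a sourced number.
base_flops = 1e25
for years in range(1, 4):
    low, high = base_flops * 4**years, base_flops * 5**years
    print(f"+{years}y: {low:.1e} to {high:.1e} FLOPs")
```

Three years of compounding already spans roughly 64x to 125x the starting budget, which is why the capex numbers below get so large.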

Exponential Growth in LLM Training costs

However, teams looking to train frontier models still need tens of thousands, or even hundreds of thousands, of GPUs, which translates into billions in capex requirements. These models also have shelf lives of less than a year, so most teams are better off investing in open-source models, fine-tuning and prompting for their own applications. More notes in the article.
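To see where “billions” comes from, here's the rough arithmetic. The per-GPU price is my assumption (roughly H100-class pricing) and covers accelerators only; networking, power and data centers can add significantly on top:

```python
# Rough capex arithmetic: GPU count x unit price. The ~$30k unit
# price is an assumption, not a figure from the article, and covers
# accelerators only.
for gpus in (10_000, 100_000):
    cost_billions = gpus * 30_000 / 1e9
    print(f"{gpus:,} GPUs -> ~${cost_billions:.1f}B in accelerators alone")
```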

Questions to ask your team

What user research are we running to measure the ROI for lower latency and more powerful models?

Until next week,
Denys - Enterprise AI @ Voiceflow