Serving Models, Not Burgers

A new Claude model, voice drive-thrus, and LLMs at scale

Welcome back to the Enterprise AI Playbook, Issue 5. Here are the successes, cautionary tales and deep dives from this week.

Successful launches - Claude 3.5 Sonnet and the Artifacts interface

Anthropic launched Claude 3.5 Sonnet, an update to the Claude 3 family, with a promise to update Haiku and Opus by year’s end. Sonnet had been in a strange position, not strong enough for larger tasks and not cheap enough for easier tasks, warranting a well-deserved upgrade. The new Sonnet scores better than GPT-4o on a number of benchmarks and comparisons:

Benchmarks shared by Anthropic on LLMs

On other benchmarks, GPT-4o is still better, but pulling into a 1A/1B position represents a significant victory for Anthropic. Beyond benchmarks, another evaluation method is to have users vote on model responses head to head; the most common comparison platform is the lmsys arena. Looking at the current results, Claude 3.5 Sonnet’s answers are rated slightly below GPT-4o’s on overall user preference, and slightly ahead on coding preference.

User preferences in head to head prompt comparisons

User preferences for coding prompt comparisons
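
For readers curious how those head-to-head votes turn into a leaderboard: arena-style rankings are built on Elo-style ratings that get nudged after every pairwise vote. Below is a minimal sketch of that update; the starting ratings and K-factor are illustrative assumptions, not the arena’s exact methodology.

```python
# Minimal Elo-style update from head-to-head votes.
# Ratings and K-factor are illustrative, not the arena's actual parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1200; model A wins one matchup.
a, b = update(1200.0, 1200.0, a_won=True)
print(round(a, 1), round(b, 1))  # prints: 1216.0 1184.0
```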

Beyond benchmarks and Elo comparisons, most consumers of models are doing hard-to-measure tasks like summarization, role play, and rephrasing, while researchers construct task-specific benchmarks. The Claude press release included the graphic below, which aims for simple communication rather than precision, likely for a general consumer audience.

Anthropic’s visualization of Claude Sonnet’s capabilities

The other, and very important, release was Artifacts, a new interface for building with Claude. According to the press release, “This creates a dynamic workspace where they can see, edit, and build upon Claude’s creations in real-time, seamlessly integrating AI-generated content into their projects and workflows.”

This new interface, combined with strong code generation capabilities, lets Claude users iterate more quickly and test web projects immediately in the Claude interface. This announcement, as well as OpenAI’s desktop and mobile focus, emphasizes the importance of interfaces to LLM usage. An LLM needs to be part of a full workflow, not just a tool.

OpenAI also acquired the desktop collaboration app Multi on Monday, which the press release indicated was a deeper dive into the workflow collaboration space. “Recently, we’ve been increasingly asking ourselves how we should work with computers. Not on or using computers, but truly with computers. With AI. We believe it’s one of the most important product questions of our time.”

You can watch a quick demo of Claude Artifacts here.

Cautionary Tales - The voice of drive-thru past

McDonald’s has halted the rollout of its automated voice-based drive-thrus. After working with IBM on the project, McDonald’s is now ending the experiment with minimal commentary. While the voice and AI space in general has received significant investment and attention, it still struggles to provide the reliability needed to automate common tasks. Drive-thrus are, by nature, noisy environments with hungry, often hangry, patrons and a corporation looking to hit key metrics around time to order. This does point to an interesting frontier: a competition between process optimization and automation in the effort to decrease costs.

This is not a complete failure, however. The nature of research and development is continued iteration on new technologies, and even in canceling this rollout, McDonald’s leadership indicated it will continue to invest in the technology in future years.

Deep dive - Optimizing LLM inference

The ability to run models at scale has always been a challenge, and the past two years have shown immense progress in the space. Last week, Character.AI released a short blog post highlighting some of its progress with five techniques. The impact is profound.

“We have reduced serving costs by a factor of 33 compared to when we began in late 2022”

This kind of cost reduction is business-defining from a financial perspective: a business that had been breaking even would reach a gross margin of roughly 97% with such a change. These savings and techniques are also focused on the challenges of the business, rather than built in a technology vacuum.
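
As a quick back-of-the-envelope check (assuming serving cost was the entire cost of revenue and exactly equaled revenue before the change):

```python
# Back-of-the-envelope: a 33x serving-cost reduction for a break-even business.
revenue = 1.00           # normalize revenue to 1
old_cost = 1.00          # break-even: costs equal revenue
new_cost = old_cost / 33 # the 33x reduction from the post
gross_margin = (revenue - new_cost) / revenue
print(f"{gross_margin:.0%}")  # prints: 97%
```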

“On Character.AI, the majority of chats are long dialogues; the average message has a dialogue history of 180 messages. As dialogues grow longer, continuously refilling KV caches on each turn would be prohibitively expensive.”
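
To make that cost concrete, here is a rough sketch of how many tokens must be prefilled per turn with and without reusing a persistent KV cache. The 180-message history comes from the post; the tokens-per-message figure is an illustrative assumption.

```python
# Rough cost model: tokens prefilled per turn with vs. without a reusable KV cache.
history_messages = 180       # average dialogue history, per the post
tokens_per_message = 50      # illustrative assumption
new_tokens = tokens_per_message

history_tokens = history_messages * tokens_per_message

# Without cache reuse: every turn re-encodes the entire dialogue history.
prefill_without_cache = history_tokens + new_tokens

# With a persistent cache: only the newly added tokens are encoded;
# keys/values for the history are read back from the cache.
prefill_with_cache = new_tokens

print(prefill_without_cache, prefill_with_cache)  # prints: 9050 50
```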

While KV caching is commonly used for LLM serving, every technology choice has tradeoffs. A well-connected product and engineering team can focus on use-case-specific techniques to solve its challenges, and this coordination extends beyond product and engineering. Many of the key techniques mentioned require engineering-wide collaboration:

  1. Training and running inference in 8-bit precision, a challenging problem that would not work if the modeling team simply threw the model over to the deployment team and hoped it worked in production.

  2. Changing the model architecture away from the “Open Source Standard,” using Multi-Query Attention instead of the more common Grouped-Query Attention to allow better KV caching (see the sketch after this list).

  3. Close coordination between hardware, architecture and inference teams. “With these [KV Cache] techniques, GPU memory is no longer a bottleneck for serving large batch sizes”. Removing hardware bottlenecks with better software and architecture is every engineering team’s dream.
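
To illustrate why items 1 and 2 compound, here is a rough KV-cache sizing sketch comparing full multi-head attention, grouped-query attention, and multi-query attention at 16-bit and 8-bit precision. The model dimensions are illustrative assumptions, not Character.AI’s actual architecture.

```python
# Rough KV-cache size per sequence:
# 2 (K and V) * layers * kv_heads * head_dim * context_length * bytes_per_value.
# All dimensions below are illustrative assumptions.
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

layers, query_heads, head_dim, context = 32, 32, 128, 8192

mha_fp16 = kv_cache_bytes(layers, query_heads, head_dim, context, 2)  # one KV head per query head
gqa_fp16 = kv_cache_bytes(layers, 8, head_dim, context, 2)            # 8 shared KV heads
mqa_int8 = kv_cache_bytes(layers, 1, head_dim, context, 1)            # 1 shared KV head, 8-bit values

for name, size in [("MHA fp16", mha_fp16), ("GQA fp16", gqa_fp16), ("MQA int8", mqa_int8)]:
    print(f"{name}: {size / 2**30:.2f} GiB per sequence")
# prints roughly: 4.00 GiB, 1.00 GiB, 0.06 GiB
```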

The last takeaway I’ll mention is that running an AI product or company is not just about the model; good engineering and product work will be key factors in its success, since it’s an end-to-end system.

Full blog post here.

Question to ask your team

How well are your APIs and data exposed to integrate with LLMs and modern copilots?

Until next week,
Denys - Enterprise AI @ Voiceflow