Chip Huyen’s first newsletter: selling my startup + model routing


Hi, it’s Chip. You probably subscribed to my blog sometime in the past year, and this is the first email!

Selling my first startup

Claypot AI, the startup I co-founded, has been acquired by Voltron Data. It’s bittersweet. We didn’t start a company to sell this early, but I believe it was the right move for the team, and the team is happy. We like Voltron Data’s vision and culture. Our product, a real-time AI platform, is complementary to Voltron Data’s GPU-native distributed systems. They have a great GPU engineering team, and I think now is a great time to learn about GPU optimization.

Now that I’m no longer running a company, I want to spend time learning about challenging technical topics and making them more accessible to readers. I also want to do experiments to test out new ideas. Hence this newsletter!

What to expect in my newsletters

For each newsletter, I’m thinking of the following format:

  • Ideas: Three questions that I’m exploring and might write about in the future.
  • Deep dive: One topic that I’ve thought a lot about. I usually write a blog post on it first; the newsletter then expands on the post, incorporating readers’ feedback.

I use this newsletter as a channel for discussion. I’d love to hear from you. Please feel free to respond to this email with your thoughts on any discussed topic, what you want me to dive deeper into, or other interesting topics.

If this isn’t what you have in mind, you can unsubscribe.

Ideas

  1. Batch inference optimization: Batch inference usually has a higher latency budget than online inference. This allows for techniques such as model splitting (splitting a model across GPU, CPU, and disk). See Efficiently Scaling Transformer Inference (Google) and FlexGen.
  2. Long-term memory management: How to get AI to remember all previous conversations with the same user. See MemGPT. This requires two steps (see the sketch after this list):
    1. Step 1: Update the memory bank (e.g. if you tell the model that you’ve changed your job, your job information should be updated).
    2. Step 2: Fetch/retrieve information from the memory bank (which is related to RAG).
  3. Data processing on GPUs: GPUs are mostly used for training and inference today. Can/should they also be used for data processing?
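
Here’s a minimal sketch of the two-step memory flow from idea 2. Everything in it (the MemoryBank class, the key-value scheme, the substring retrieval) is an illustrative assumption, not MemGPT’s actual design:

```python
# Hypothetical sketch of the two-step memory flow: update, then retrieve.

class MemoryBank:
    def __init__(self):
        self.facts = {}  # e.g. {"job": "engineer at Voltron Data"}

    def update(self, key: str, value: str):
        # Step 1: overwrite stale facts (e.g. the user changed jobs).
        self.facts[key] = value

    def retrieve(self, query: str) -> list[str]:
        # Step 2: fetch facts relevant to the query (RAG-style; a real
        # system would use embeddings instead of substring matching).
        return [f"{k}: {v}" for k, v in self.facts.items() if k in query.lower()]


bank = MemoryBank()
bank.update("job", "engineer at Voltron Data")   # user mentions their job
bank.update("job", "founder of a new startup")   # later update overwrites the stale fact
print(bank.retrieve("what is my job?"))          # -> ['job: founder of a new startup']
```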

Deep dive: Predictive Human Preference and Model Routing


Leaderboards like Chatbot Arena help us identify the best model overall. I wonder if we could take it a step further: predict which model users would prefer for each prompt.

One use case of predictive human preference is model routing. For example, if we know in advance that, given a prompt, users will prefer Claude Instant’s response over GPT-4’s, and Claude Instant is cheaper and faster than GPT-4, we can route this prompt to Claude Instant. Done well, model routing can increase response quality while reducing cost and latency: it turns the quality/cost/latency tradeoff into an optimization problem that can be solved systematically.
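
To make the optimization concrete, here’s a minimal sketch of a cost-aware router. The models, prices, threshold, and the predict_preference stub are all made up for illustration; a real router would call a learned preference predictor:

```python
# Sketch: route to the cheap model when it's predicted to (nearly) tie the
# strong model, otherwise pay for the strong model.

COST_PER_1K_TOKENS = {"gpt-4": 0.03, "claude-instant": 0.0008}  # illustrative prices

def predict_preference(prompt: str, model_a: str, model_b: str) -> float:
    """Stub for a learned predictor: P(users prefer model_a over model_b)."""
    return 0.48 if len(prompt) < 100 else 0.20  # placeholder heuristic

def route(prompt: str, strong: str = "gpt-4", cheap: str = "claude-instant",
          threshold: float = 0.45) -> str:
    # If the cheap model is predicted to win (or nearly tie) often enough,
    # send the prompt there and save cost/latency; otherwise use the strong model.
    p_cheap_wins = predict_preference(prompt, cheap, strong)
    return cheap if p_cheap_wins >= threshold else strong

print(route("hello, how are you?"))                      # -> claude-instant (easy prompt)
print(route("Explain why the Planck length " + "…" * 60))  # -> gpt-4 (hard prompt)
```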

Model routing is related to expert routing, which is at the heart of Mixture of Experts (Mixtral) and Composition of Experts (Samba-1). The router can be trained jointly with the experts or independently.

Experiment

I built a toy preference predictor using LMSYS’s crowd-sourced data, which seems to work. Given a tuple of (prompt, model_a, model_b), it can predict the model that users would prefer 76.2% of the time. This is better than just picking the higher-ranking model in each pair.
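
For a sense of what such a predictor’s interface looks like, here’s a toy sketch: featurize the (prompt, model_a, model_b) tuple and train a binary classifier on crowd-sourced preference labels. The data, features, and classifier below are placeholders, not my actual setup:

```python
# Toy preference predictor: label 1 means users preferred model_a.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example folds the prompt and the model pair into one string.
# Real training data would come from crowd-sourced battles like LMSYS's.
examples = [
    ("hello, how are you? [gpt-4 vs claude-instant]", 0),
    ("prove the halting problem is undecidable [gpt-4 vs claude-instant]", 1),
    ("what's 2+2? [gpt-4 vs llama-2-7b]", 0),
    ("derive the Planck length from first principles [gpt-4 vs llama-2-7b]", 1),
]
texts, labels = zip(*examples)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Predicted probability that users prefer model_a for a new prompt.
print(clf.predict_proba(["write a sonnet about GPUs [gpt-4 vs claude-instant]"]))
```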

Despite being a toy predictor, the model seems to be able to capture different models’ performance patterns. One pattern is that for simple prompts, weak models can do (nearly) as well as strong models. For challenging prompts, however, users are more likely to prefer stronger models. Here’s a visualization of predicted human preference for an easy prompt (“hello, how are you?”) and a challenging prompt (“Explain why Planck length …”).

Given the predicted preference for all model pairs for a prompt, I used a Bradley-Terry model (the same ranking algorithm that LMSYS uses) to create a leaderboard for this prompt.
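
Here’s a minimal sketch of that step: a Bradley-Terry fit via the classic minorization-maximization update, treating predicted pairwise preferences as fractional win counts. The models and numbers are illustrative:

```python
# Bradley-Terry: P(i beats j) = p_i / (p_i + p_j). Fit strengths p from
# pairwise preferences with the standard MM update.
import numpy as np

models = ["gpt-4", "claude-instant", "llama-2-7b"]
# win[i][j] = predicted probability users prefer models[i] over models[j]
win = np.array([
    [0.00, 0.60, 0.80],
    [0.40, 0.00, 0.70],
    [0.20, 0.30, 0.00],
])

p = np.ones(len(models))  # initial strengths
for _ in range(100):
    for i in range(len(models)):
        wins_i = win[i].sum()  # total (fractional) wins for model i
        denom = sum((win[i, j] + win[j, i]) / (p[i] + p[j])
                    for j in range(len(models)) if j != i)
        p[i] = wins_i / denom
    p /= p.sum()  # normalize; strengths are only identifiable up to scale

# Per-prompt leaderboard: sort models by fitted strength.
for name, score in sorted(zip(models, p), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```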

Model routing as a business

Model routing, if it works, sounds like a no-brainer. There are already a dozen model routing startups, the most notable being Martian (which raised $9M), Pulze, and NotADiamond. I’m also excited about the model routing work by LMSYS, which I think is a natural progression from their work on Chatbot Arena.

However, there are many challenges with building a model routing business. Some of them are:

  1. Can your model router generalize to a wide range of use cases and domains?
  2. How defensible is your model router? Model routing can be done by inference services (e.g. Replicate, Anyscale, or Fireworks) or even model providers (like Samba-1).
  3. Model routing adds extra latency, which makes it a hard sell for applications with strict latency budgets.
  4. Several companies I’ve talked to are skeptical that model routing works. They see it as added complexity on top of already complex systems. Model routing also hinges on the ability to predict a model’s response quality: if evaluating a model’s response quality is hard, predicting it is even harder.

Thank you for making it this far. I’d love to hear your thoughts on any of the topics above, and pointers to where I can learn more about them!

You can read previous newsletters here.

Chip Huyen

I help companies deploy machine learning into production. I write about AI applications, tooling, and best practices.
