A few months ago, we started building Diarupt AI to make it possible to add fluid, video-based AI conversations to your product. (See our "Introducing Diarupt" post if you missed it.)
I'll walk you through some of the hurdles we faced while trying to achieve the results you see in this video.
We had to ask ourselves what makes up a human-to-human conversation and translate each of those elements into an engineering framework.
Let's start with Realism.
We knew the foundation of everything we were building was based on how close to reality it felt and sounded.
We began exploring various methods for generating human-like avatars. We tested several models, but none provided both speed and quality. We experimented with Audio-driven Live Portrait Synthesis using only a picture and syncing specific facial landmarks, but the outcome was poor. We also attempted various NeRF models, but they were too sluggish. We conducted lots of experiments in search of a solution.
We realised we needed the speed of Audio-driven Live Portrait Synthesis and the quality of NeRF models, so we started building a different framework that incorporates both. After about two months, we had our first near real-time, realistic model, and that was just video synthesis. (We will talk about our wars with quality and fast audio synthesis in a future post.)
One would think the struggle was over, but we were onto the next one.
There are a lot of moving parts to a fluid conversation. A simplified version looks something like this.
user speaks -> process speech -> LLM responds -> process response -> synthesise audio & video response -> send response back to user -> repeat
Each step in the pipeline was made of other embedded processes. How do we make this all happen without noticeable lag? That took us back to the drawing board to understand how people naturally communicate without lag.
Let's do a quick breakdown of steps #2 (processing user input) and #3 (LLM response), for instance, which needed to be processed in less than 0.3 seconds for the whole pipeline to be real-time.
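To make that budget concrete, here is a minimal Python sketch of timing each stage against it. It is not our production code: the `process_speech` and `llm_respond` functions are hypothetical stand-ins, and treating 0.3 seconds as one combined budget for both steps is an assumption made for the example.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Assumption: the 0.3 s figure is treated as one combined budget
# for steps #2 and #3.
BUDGET_SECONDS = 0.3

Stage = Tuple[str, Callable[[Any], Any]]


def run_with_budget(stages: List[Stage], payload: Any,
                    budget: float = BUDGET_SECONDS):
    """Run stages in order; return the output, per-stage timings,
    and whether the total time stayed within the budget."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings, sum(timings.values()) <= budget


# Hypothetical stand-ins for the real speech and LLM stages.
def process_speech(audio: str) -> str:
    return audio.strip().lower()


def llm_respond(text: str) -> str:
    return f"You said: {text}"


result, timings, within_budget = run_with_budget(
    [("process_speech", process_speech), ("llm_respond", llm_respond)],
    "  Hello there ",
)
print(result, within_budget)
```

Measuring each stage separately, rather than only the total, shows which step is eating the budget when the pipeline starts to lag.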
After multiple iterations, we discovered that a core part of fluid communication was that the human brain doesn't wait until the other person is done talking before it starts to process a response (light bulb moment).
It sounds easy, but in implementation, it wasn't. Let's run a quick example: assume someone wants to say the following sentence.
"Last night, I was in India, having a wonderful conversation with my soulmate, until I realised it was a dream."
If we start to process the speech before "until I realised it was a dream", the response will be entirely different, and maybe wrong, because the last phrase changes the context of the sentence.
Due to this and other complex behaviours we analysed, we needed to build a new framework to analyse user speech faster and start composing the AI's response even before the user was done speaking, which allowed for more fluid responses.
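One simple way to picture this idea is to draft a response for every partial transcript and keep the draft only if the final transcript matches it. The Python sketch below illustrates that under stated assumptions; it is not the framework we built. The `SpeculativeResponder` class and the `generate` callback are hypothetical names, and a real system would need something smarter than an exact string match.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    transcript: str  # the words the draft was based on
    response: str    # the response composed for those words


class SpeculativeResponder:
    """Drafts a reply while the user is still speaking."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate  # hypothetical stand-in for the LLM call
        self._draft: Optional[Draft] = None
        self.calls = 0             # how many times generate() has run

    def _call(self, transcript: str) -> str:
        self.calls += 1
        return self._generate(transcript)

    def on_partial(self, transcript: str) -> None:
        # Compose a draft for what has been heard so far.
        self._draft = Draft(transcript, self._call(transcript))

    def on_final(self, transcript: str) -> str:
        # Reuse the draft only if the user said nothing more after it;
        # otherwise the context may have changed, so start again.
        if self._draft is not None and self._draft.transcript == transcript:
            return self._draft.response
        return self._call(transcript)


responder = SpeculativeResponder(lambda text: f"reply to: {text}")
responder.on_partial("Last night, I was in India")
final = responder.on_final(
    "Last night, I was in India, until I realised it was a dream."
)
print(final, responder.calls)
```

In this run the draft is thrown away because the sentence continued, which is exactly the "it was a dream" problem: the early draft is cheap to discard, and it saves time whenever the user really was finished.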
Another honourable mention is interruptions. When we speak to each other, it's not always a turn-by-turn conversation as with text-based communication; we interrupt each other, which, surprisingly, is a huge part of everyday conversations.
Interruptions are diverse and introduce a new set of hurdles. They can be as simple as "oh, okay" or carry more complex thoughts. There are many ways to respond, and the challenge is doing so efficiently without losing context or fluidity.
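As a rough illustration, the Python sketch below separates short acknowledgements from real interruptions and keeps track of what the AI had not yet said, so that context is not lost. The phrase list, the `decide` function and the word-count bookkeeping are all assumptions made for this example, not a description of how Diarupt handles interruptions.

```python
import string
from dataclasses import dataclass

# Assumption: a small hand-written list of acknowledgement phrases.
BACKCHANNELS = {"oh okay", "okay", "ok", "uh huh", "right", "i see", "yeah"}


@dataclass
class Decision:
    action: str    # "continue" speaking, or "yield" the turn to the user
    unspoken: str  # part of the planned reply not yet delivered


def normalise(text: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def decide(user_text: str, planned_reply: str, words_spoken: int) -> Decision:
    """Decide how to react to the user speaking over the AI."""
    words = planned_reply.split()
    unspoken = " ".join(words[words_spoken:])
    if normalise(user_text) in BACKCHANNELS:
        # A simple acknowledgement: keep talking.
        return Decision("continue", unspoken)
    # Anything else: stop, but remember what was left unsaid.
    return Decision("yield", unspoken)


reply = "The flight leaves at nine and lands at noon"
print(decide("Oh, okay", reply, 5))
print(decide("Wait, what about Friday?", reply, 5))
```

Keeping the unspoken remainder lets the next response pick up where the AI was cut off instead of starting from scratch.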
Once we had all the pieces working individually, we ran into several pipeline issues while bringing them together. Video and audio data can be large, so we were moving heavy data around the pipeline, which, if not done efficiently, would result in wasted resources.
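One common way to cut that waste, shown here only as an illustration and not as what our pipeline does, is to hand stages views into a buffer instead of copies of it. The Python sketch below uses the standard `memoryview` type; the `iter_chunks` helper and the chunk size are hypothetical.

```python
from typing import Iterator


def iter_chunks(buffer: bytes, chunk_size: int) -> Iterator[memoryview]:
    """Yield fixed-size views into the buffer without copying its bytes."""
    view = memoryview(buffer)
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview creates a new view, not a new bytes object.
        yield view[start:start + chunk_size]


# Hypothetical stand-in for an encoded audio or video frame.
frame = bytes(range(10))
chunks = list(iter_chunks(frame, 4))

print([len(chunk) for chunk in chunks])
print(all(chunk.obj is frame for chunk in chunks))
```

Each chunk still points at the original buffer, so passing chunks between stages does not duplicate the underlying data.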
We've learned a lot during this journey and will continue to learn.
We are just getting started.
Diarupt has been exciting for us, and we are even more thrilled by its possibilities.
We're currently in Private Alpha and will roll out access gradually over the next few weeks. If you want to try it out as soon as we have a spot, join the waitlist here at https://diarupt.ai/early-access, and we'll keep in touch.