MultiModal AI Assistant, My GPT-4o Alternative

OpenAI isn’t available in Hong Kong and I really, really wanted to use the GPT-4o voice model that they were showing off everywhere online to practice speaking Thai, Mandarin, Cantonese, and a bit of Spanish. I got so fed up that I ended up just pulling a bunch of APIs to make my own “GPT-4o” model that I could communicate with using text or voice input that would respond with lifelike audio outputs, simulating natural conversations.

I used Llama 3.1 for communication, Azure for the audio outputs, and Gemini Flash 1.5 for screen and camera computer vision as a bonus. The system could communicate in 100+ languages with 400+ realistic voice options. While this wasn’t really anything innovative, it was fun to be able to solve an annoying problem of mine.

Enjoy Reading This Article?