An advanced AI model that generates speech, song, or general audio synchronized with video or text input. It supports multimodal conditioning and efficient generation, with strong lip-sync and temporal alignment.
AudioGen-Omni is a unified multimodal diffusion transformer (MMDiT) model developed by researchers at China University of Mining and Technology and Kuaishou Technology. Its goal is to generate high-fidelity audio, speech, and song synchronized with input video, text, or both.