A research framework for text-to-audio generation that uses synthetic captions and confidence scoring to filter noisy data and improve the quality and faithfulness of generated audio.
CosyAudio is a framework designed to improve text-to-audio (TTA) generation by leveraging synthetic captions and confidence scores. The motivation is that many large audio datasets are weakly labeled or have noisy or inaccurate captions, which degrades the performance of audio generation models trained on them.
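To make the filtering idea concrete, here is a minimal Python sketch of dropping audio-caption pairs whose confidence score falls below a threshold. It is an illustration only, not CosyAudio's actual code: the `CaptionedClip` record, the `filter_by_confidence` helper, the assumption that scores lie in [0, 1], and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaptionedClip:
    """One audio-caption pair (hypothetical record layout)."""
    audio_path: str
    caption: str
    confidence: float  # assumed to be a caption-quality score in [0, 1]


def filter_by_confidence(clips, threshold=0.5):
    """Keep only clips whose confidence is at or above the threshold."""
    return [clip for clip in clips if clip.confidence >= threshold]


# Toy data: file names, captions, and scores are made up for illustration.
clips = [
    CaptionedClip("a.wav", "a dog barks twice", 0.92),
    CaptionedClip("b.wav", "music", 0.31),
    CaptionedClip("c.wav", "rain falls on a metal roof", 0.77),
]

kept = filter_by_confidence(clips, threshold=0.5)
print([clip.audio_path for clip in kept])  # prints ['a.wav', 'c.wav']
```

A hard threshold is the simplest possible policy. The right cutoff depends on how the scores are distributed, and a framework could just as well keep every pair and use the score as a weight or conditioning signal instead.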