Source: AIGC Open Community
Image Source: Generated by Wujie AI
Shanghai AI Lab, the Data Science Institute of the Chinese University of Hong Kong, and the Shenzhen Big Data Research Institute have jointly open-sourced a toolset named Amphion for audio, music, and speech generation.
Amphion lets developers study text-to-speech, music, and other audio generation tasks within a single framework, addressing common pain points such as black-box generation models, scattered code bases, and the lack of unified evaluation metrics.
Amphion includes data processing, general modules, optimization algorithms, and other basic infrastructure. It also provides specific frameworks, models, and development instructions for tasks such as text-to-speech, voice conversion, and text-to-audio generation, and incorporates various neural speech codecs and evaluation metrics.
Amphion is also easy to pick up, especially for beginners in generative AI development.
Open-source address: https://github.com/open-mmlab/Amphion
Paper address: https://arxiv.org/abs/2312.09911
Below are various models included in Amphion.
Text-to-Speech Synthesis
Amphion's built-in text-to-speech synthesis models cover a range from traditional to state-of-the-art technologies. For example, FastSpeech 2 uses a feed-forward Transformer architecture for rapid speech synthesis; VITS integrates conditional variational autoencoders for end-to-end speech synthesis; VALL-E achieves zero-shot speech synthesis with a neural codec language model; NaturalSpeech 2 utilizes latent diffusion models to synthesize high-quality speech.
Developers can choose different models for speech synthesis according to their business needs.
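To make the FastSpeech 2 approach concrete, here is a minimal pure-Python sketch (a hypothetical helper, not Amphion's API) of the "length regulator" idea at the core of such feed-forward models: each phoneme's hidden vector is repeated according to its predicted duration so the sequence length matches the target mel-spectrogram.

```python
# Sketch of a FastSpeech 2-style length regulator: expand per-phoneme
# hidden states by integer frame durations predicted by a duration model.

def length_regulate(phoneme_states, durations):
    """Repeat each phoneme state vector `dur` times to reach frame rate."""
    expanded = []
    for state, dur in zip(phoneme_states, durations):
        expanded.extend([state] * dur)  # repeat the vector for `dur` frames
    return expanded

# Example: 3 phonemes with durations 2, 1, 3 expand to 6 frames.
states = [[0.1], [0.2], [0.3]]
frames = length_regulate(states, [2, 1, 3])
```

Because duration prediction replaces autoregressive attention alignment, the whole mel sequence can be generated in parallel, which is what makes this family of models fast.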
Voice Conversion
Amphion provides various content-based features for extracting speaker-independent representations, such as pre-trained speech features from WeNet, Whisper, and ContentVec.
It also implements multiple acoustic decoder architectures, such as methods based on diffusion models, transformers, and variational autoencoders.
In addition, built-in neural vocoders synthesize the waveform output, so developers can flexibly mix and match modules to build different voice-conversion configurations.
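The three-stage structure described above can be sketched as a simple composition. All function names here are hypothetical placeholders, not Amphion's API: a content extractor (standing in for Whisper/ContentVec features), an acoustic decoder, and a vocoder, wired so each stage can be swapped independently.

```python
# Hedged sketch of a modular voice-conversion pipeline: content extraction
# -> acoustic decoding for the target speaker -> waveform synthesis.
from typing import Callable, List

def make_vc_pipeline(extract_content: Callable, decode_acoustic: Callable,
                     vocode: Callable) -> Callable:
    """Compose the three stages into one waveform-to-waveform converter."""
    def convert(source_wav: List[float], target_speaker: str) -> List[float]:
        content = extract_content(source_wav)           # speaker-independent features
        mel = decode_acoustic(content, target_speaker)  # target-speaker acoustics
        return vocode(mel)                              # waveform synthesis
    return convert

# Toy stand-ins just to show the plumbing, not real models:
pipeline = make_vc_pipeline(
    extract_content=lambda wav: [x * 0.5 for x in wav],
    decode_acoustic=lambda feats, spk: [f + len(spk) for f in feats],
    vocode=lambda mel: mel,
)
out = pipeline([1.0, 2.0], "spk1")
```

The point of this design is that a diffusion-based decoder or a different content extractor can replace one stage without touching the others.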
Text-to-Audio Generation
For text-to-audio generation, Amphion uses a mainstream latent diffusion model. It consists of a variational autoencoder that maps spectrograms into a latent space, a T5 encoder that turns the input text into a conditioning signal, and a diffusion network that generates the final audio.
Users only need to provide audio description text to generate semantically consistent background sound effects.
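To illustrate the diffusion component (this is a NumPy sketch of the standard DDPM formulation, not Amphion code): the forward process progressively noises a clean latent, and the trained network, conditioned on the T5 text embedding, learns to reverse it. Shown here is only the closed-form forward step x_t = sqrt(ābar_t)·x₀ + sqrt(1 − ābar_t)·ε.

```python
# Forward diffusion step on a toy "latent", using a linear noise schedule.
import numpy as np

def forward_diffuse(x0, abar_t, eps):
    """Noise a clean latent x0 to timestep t, given cumulative alpha abar_t."""
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

betas = np.linspace(1e-4, 0.02, 1000)  # common linear beta schedule
abar = np.cumprod(1.0 - betas)         # cumulative product of alphas

x0 = np.ones(4)                        # toy latent vector
eps = np.zeros(4)                      # noise sample (zeros for a clean check)
xt = forward_diffuse(x0, abar[0], eps)
```

At the final timestep ābar_t is near zero, so x_t is almost pure noise; generation runs this process in reverse, starting from noise and denoising under the text condition.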
Neural Speech Codecs
Amphion provides a rich selection of codec algorithms, covering mainstream autoregressive models, flow models, generative adversarial models, diffusion models, etc.
For example, WaveNet uses dilated convolutions to achieve high-quality speech synthesis; HiFi-GAN applies multi-scale discriminators to achieve high-fidelity speech reconstruction, meeting the needs of different business scenarios.
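A quick arithmetic sketch shows why WaveNet's dilated convolutions matter: with kernel size 2 and dilations doubling each layer (1, 2, 4, …, 512), a single 10-layer stack already covers 1024 past samples, which plain convolutions would need hundreds of layers to match.

```python
# Receptive field of a stack of dilated causal convolutions (WaveNet-style).

def receptive_field(kernel_size, dilations):
    """Total number of past samples visible to the top of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 512
rf = receptive_field(2, dilations)       # 1 + (1 + 2 + ... + 512) = 1024
```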
Performance Evaluation Module
To help developers comprehensively evaluate the quality and performance of generated speech, Amphion provides a rich set of evaluation modules.
These modules evaluate pitch modeling, energy modeling, spectral distortion, intelligibility, and other speech dimensions, letting developers compare the performance of different models in a simple and intuitive way.
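As a small example of the kind of objective metric such a module computes (a NumPy sketch with hypothetical helper names, not Amphion's implementation), here is F0 RMSE between a reference and a synthesized pitch contour, computed over frames that are voiced in both:

```python
# F0 RMSE: root-mean-square error of pitch (Hz) over mutually voiced frames.
import numpy as np

def f0_rmse(f0_ref, f0_syn):
    """RMSE of F0 in Hz over frames where both contours are voiced (F0 > 0)."""
    f0_ref, f0_syn = np.asarray(f0_ref), np.asarray(f0_syn)
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2)))

ref = [220.0, 0.0, 200.0]   # 0.0 marks an unvoiced frame
syn = [210.0, 0.0, 210.0]
err = f0_rmse(ref, syn)
```

Energy RMSE and mel-cepstral distortion follow the same pattern: align frames, mask invalid ones, and aggregate a per-frame distance into a single comparable score.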
The development team stated that in the future, they will continue to update this toolkit, adding more speech-related models to make it one of the most user-friendly open-source speech toolkits.