Pretrained Models in Wespeaker

Besides speaker-related tasks, speaker embeddings can be used in many other tasks that require speaker modeling, such as

  • voice conversion

  • text-to-speech

  • speaker adaptive ASR

  • target speaker extraction

For users who want to check speaker verification (SV) performance or extract speaker embeddings for the above tasks without training a speaker embedding model themselves, we provide two types of pretrained models.

  1. Checkpoint model, with suffix .pt: the model trained and saved as a checkpoint by the WeSpeaker Python code. You can use it to reproduce our published results, or as a starting checkpoint to continue training.

  2. Runtime model, with suffix .onnx: the checkpoint model exported to ONNX format for inference with ONNX Runtime.

Model License

The pretrained models in WeSpeaker follow the license of their corresponding datasets. For example, the models pretrained on VoxCeleb follow the Creative Commons Attribution 4.0 International License, since that is the license of the VoxCeleb dataset; see https://mm.kaist.ac.kr/datasets/voxceleb/.

ONNX Inference Demo

To use a pretrained model in PyTorch format, refer directly to the run.sh in the corresponding recipe.

To extract speaker embeddings with the ONNX model, see the following toy example.

# Download the pretrained model in ONNX format and set onnx_path to its location
# wav_path is the path to your wav file (16 kHz sampling rate)
python wespeaker/bin/infer_onnx.py --onnx_path "$onnx_path" --wav_path "$wav_path"

You can easily adapt infer_onnx.py to your application. A speaker diarization example can be found in the voxconverse recipe.
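As a starting point for such an adaptation, here is a minimal sketch of running an exported model from Python and comparing two embeddings. It is not the code of infer_onnx.py. The function names `extract_embedding` and `cosine_score` are hypothetical, and the sketch assumes that the first graph input accepts features shaped (1, num_frames, feat_dim) and that the first output holds one embedding per batch item. The features must be computed the same way as in infer_onnx.py. Cosine scoring is a common way to compare embeddings, but it is an assumption here rather than something this page prescribes.

```python
import math
from typing import List, Sequence


def cosine_score(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings of equal length."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def extract_embedding(onnx_path: str, feats) -> List[float]:
    """Run an exported model on features shaped (num_frames, feat_dim).

    `feats` is assumed to be a 2-D array of acoustic features prepared the
    same way as in infer_onnx.py; this function does not compute them.
    """
    # Imported lazily so that cosine_score works without these packages.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    # Ask the graph for its input name instead of hard-coding one.
    input_name = session.get_inputs()[0].name
    # Add a batch dimension: (num_frames, feat_dim) -> (1, num_frames, feat_dim).
    batch = np.asarray(feats, dtype=np.float32)[np.newaxis, ...]
    outputs = session.run(None, {input_name: batch})
    # Assumption: the first output has shape (1, embedding_dim).
    return outputs[0][0].tolist()
```

With two embeddings extracted this way, `cosine_score(emb_a, emb_b)` gives a similarity in [-1, 1]; any decision threshold has to be tuned on your own data.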

Model List

Models with the suffix LM have been further fine-tuned using large-margin fine-tuning, which can improve performance on long audio, e.g. > 3 s.
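The duration guidance can be turned into a simple selection rule. The sketch below is only a heuristic illustration: the helper names and the hard 3-second threshold are assumptions derived from the "> 3 s" remark, not part of WeSpeaker. Some models are listed only in their LM variant, so check that the returned name is actually available.

```python
import wave

# Threshold taken from the "> 3 s" guidance; treat it as a rough heuristic.
LONG_AUDIO_SECONDS = 3.0


def wav_duration_seconds(wav_path: str) -> float:
    """Duration of a PCM wav file, computed from its header."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def pick_variant(base_name: str, duration_seconds: float) -> str:
    """Return the LM variant name for long audio, the base name otherwise."""
    if duration_seconds > LONG_AUDIO_SECONDS:
        return f"{base_name}_LM"
    return base_name
```

For example, `pick_variant("ResNet34", wav_duration_seconds(path))` returns `"ResNet34_LM"` for a 5-second recording and `"ResNet34"` for a 2-second one.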

modelscope

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|---|---|---|---|
| VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
| VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
| VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
| VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
| VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
| VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
| VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
| CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxBlink2 | Multilingual | SimAMResNet34 | SimAMResNet34 |
| VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet34 | SimAMResNet34 |
| VoxBlink2 | Multilingual | SimAMResNet100 | SimAMResNet100 |
| VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet100 | SimAMResNet100 |

huggingface

| Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
|---|---|---|---|
| VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
| VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
| VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
| VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
| VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
| VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
| VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
| VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
| CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |