Runtime for Wespeaker

Platforms Supported

The Wespeaker runtime supports the following platforms.

Onnxruntime

  • Step 1. Export your experiment model to ONNX with https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/bin/export_onnx.py

exp=exp  # Change it to your experiment dir
onnx_dir=onnx
python wespeaker/bin/export_onnx.py \
  --config $exp/config.yaml \
  --checkpoint $exp/avg_model.pt \
  --output_model $onnx_dir/final.onnx

# When it finishes, you can find `final.onnx`.
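
Optionally, before building the C++ runtime, you can sanity-check the exported model with onnxruntime in Python. This is a minimal sketch, assuming onnxruntime and numpy are installed; the input name "feats" and the B x T x 80 feature shape follow the trtexec shapes used later in this document, so adjust them if your export differs.

import numpy as np
import onnxruntime as ort

# Load the exported model and run a dummy forward pass.
session = ort.InferenceSession("onnx/final.onnx", providers=["CPUExecutionProvider"])
feats = np.random.randn(1, 200, 80).astype(np.float32)  # (batch, frames, feat_dim)
output_name = session.get_outputs()[0].name
# "feats" is assumed to be the input name; check session.get_inputs() if unsure.
embedding = session.run([output_name], {"feats": feats})[0]
print(embedding.shape)  # expected: (1, embedding_size), e.g. (1, 256)
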
  • Step 2. Build. The build requires cmake 3.14 or above, and gcc/g++ 5.4 or above.

mkdir build && cd build
# 1. no gpu
cmake -DONNX=ON ..
# 2. gpu (not supported on macOS)
# cmake -DONNX=ON -DGPU=ON ..
cmake --build .
  • Step 3. Testing.

NOTE: If using a GPU, you need to specify the CUDA path.

export PATH=/usr/local/cuda-11.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
  1. Extract embeddings: the RTF (real-time factor) is shown in the console, and the embeddings are written to the text file specified by --result.

export GLOG_logtostderr=1
export GLOG_v=2
wav_scp=your_test_wav_scp
onnx_dir=your_model_dir
embed_out=your_embedding_txt
./build/bin/extract_emb_main \
  --wav_scp $wav_scp \
  --result $embed_out \
  --speaker_model_path $onnx_dir/final.onnx \
  --embedding_size 256 \
  --samples_per_chunk  32000  # 2s

NOTE: samples_per_chunk is the number of samples in one chunk, i.e. samples_per_chunk = sample_rate * duration.

If samples_per_chunk = -1, the embedding is computed over the whole utterance; otherwise, embeddings are computed chunk by chunk and then averaged over the chunks (see the sketch below).
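
For reference, the chunking behaviour described above can be sketched in Python. This is an illustrative sketch only, not the actual C++ runtime code; extract_embedding stands for any function that maps a chunk of samples to an embedding vector.

import numpy as np

def chunked_embedding(samples, samples_per_chunk, extract_embedding):
    # samples_per_chunk == -1: embed the whole utterance in one pass
    if samples_per_chunk == -1:
        return extract_embedding(samples)
    # otherwise: embed chunk by chunk and average the chunk embeddings
    chunks = [samples[i:i + samples_per_chunk]
              for i in range(0, len(samples), samples_per_chunk)]
    embeddings = [extract_embedding(chunk) for chunk in chunks]
    return np.mean(embeddings, axis=0)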

  2. Calculate the similarity between two speech utterances.

export GLOG_logtostderr=1
export GLOG_v=2
onnx_dir=your_model_dir
./build/bin/asv_main \
    --enroll_wav wav1_path \
    --test_wav wav2_path \
    --threshold 0.5 \
    --speaker_model_path $onnx_dir/final.onnx \
    --embedding_size 256
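
The decision in asv_main is a similarity score compared against --threshold; cosine similarity, the standard choice for speaker embeddings, can be reproduced in a few lines of Python. This is a sketch assuming you already have the two embedding vectors (e.g. from extract_emb_main above).

import numpy as np

def cosine_similarity(emb1, emb2):
    # cosine similarity between two speaker embeddings
    return float(np.dot(emb1, emb2) /
                 (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-8))

enroll, test = np.random.randn(256), np.random.randn(256)  # placeholder embeddings
print("same speaker" if cosine_similarity(enroll, test) > 0.5 else "different speaker")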

Server (TensorRT GPU)

Introduction

In this project, we use models trained with wespeaker as an example to show how to convert a speaker model to TensorRT and deploy it on Triton Inference Server. If you only have CPUs, you may deploy the exported ONNX model on Triton Inference Server instead of deploying a TensorRT model on GPUs.

Step 0. Train a model

Please follow the wespeaker examples to train a model. After training, you should have several checkpoints under your exp/xxx/models/ folder. We take voxceleb as an example.

Step 1. Export model

We’ll first export our model to ONNX and then convert the ONNX model to a TensorRT engine.

# go to your example
cd wespeaker/examples/voxceleb/v2
. ./path.sh
exp_dir=exp/resnet
python3 wespeaker/bin/export_onnx.py --config=${exp_dir}/config.yaml --checkpoint=${exp_dir}/models/avg_model.pt --output_model=${exp_dir}/models/avg_model.onnx

# If you want to subtract the mean vector inside the ONNX model, simply pass the path of the .npy mean vector file via --mean_vec.
python3 wespeaker/bin/export_onnx.py --config=${exp_dir}/config.yaml --checkpoint=exp/resnet/models/avg_model.pt --output_model=exp/resnet/models/avg_model.onnx --mean_vec=${exp_dir}/embeddings/vox2_dev/mean_vec.npy
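
With --mean_vec, the exported graph subtracts the stored mean embedding from the model output. You can check this with onnxruntime; the sketch below uses hypothetical file names, assuming you exported the model once without and once with --mean_vec.

import numpy as np
import onnxruntime as ort

feats = np.random.randn(1, 200, 80).astype(np.float32)
mean_vec = np.load("exp/resnet/embeddings/vox2_dev/mean_vec.npy").astype(np.float32)

# Hypothetical file names: one export without --mean_vec, one with it.
plain = ort.InferenceSession("avg_model_plain.onnx").run(None, {"feats": feats})[0]
centered = ort.InferenceSession("avg_model_mean.onnx").run(None, {"feats": feats})[0]
print(np.allclose(centered, plain - mean_vec, atol=1e-4))  # expected: True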

If you only want to deploy the onnx model on CPU or GPU, you may skip the Tensorrt part and go to the section to construct your model repository.

Export to Tensorrt Engine

Now let’s convert our ONNX model to a TensorRT engine. We will deploy our model on Triton 22.03, so we use the TensorRT 22.03 docker image as an example to show how to convert the model. Please move your ONNX model to the target platform/GPU on which you will deploy and do the conversion there.

docker run --gpus '"device=0"' -it -v <the output onnx model directory>:/models nvcr.io/nvidia/tensorrt:22.03-py3
cd /models/
# shape = B x T x F  (batch_size, sequence_length, feature_size)
trtexec --saveEngine=b1_b128_s3000_fp16.trt  --onnx=/models/avg_model.onnx --minShapes=feats:1x200x80 --optShapes=feats:64x200x80 --maxShapes=feats:128x3000x80 --fp16

Here we get an engine with a maximum sequence length of 3000 frames and a minimum length of 200 frames. Since the frame stride is 10 ms, 200 and 3000 frames correspond to 2.02 seconds and 30.02 seconds respectively (kaldi feature extractor). Note that these numbers will differ depending on your feature extractor parameters. Also note that we’ve added --fp16; in practice, we found that this option does not affect the final accuracy while improving performance.

You may set these numbers according to your production requirements. If you only know the duration in seconds of the audio you will use and are unsure how many frames it will generate, you may try the script below:

import torchaudio.compliance.kaldi as kaldi
import torch
audio_dur_in_seconds = 2
feat_dim = 80  # please check config.yaml if you don't know
sample_rate = 16000

waveform = torch.ones(sample_rate * audio_dur_in_seconds).unsqueeze(0)
feat_tensor = kaldi.fbank(waveform,
                            num_mel_bins=feat_dim,
                            frame_shift=10,
                            frame_length=25,
                            energy_floor=0.0,
                            window_type='hamming',
                            htk_compat=True,
                            use_energy=False,
                            dither=1)
print(feat_tensor.shape) # (198, 80)

Then you can see that 198 is the actual number of frames for 2 seconds of audio.
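
Instead of running the feature extractor, you can also estimate the frame count in closed form. The sketch below assumes kaldi's default snip_edges=True convention, which matches the fbank call above.

sample_rate = 16000
frame_length_samples = int(0.025 * sample_rate)  # 25 ms window -> 400 samples
frame_shift_samples = int(0.010 * sample_rate)   # 10 ms shift  -> 160 samples
num_samples = 2 * sample_rate                    # 2 seconds of audio

# snip_edges=True: only windows that fit entirely inside the signal are counted
num_frames = 1 + (num_samples - frame_length_samples) // frame_shift_samples
print(num_frames)  # 198, matching the fbank output above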

That’s it! We have built an engine that accepts audio from 2.02 to 30.02 seconds long. If your application works with fixed-length audio segments, we suggest setting minShapes, optShapes, and maxShapes to the same shape.

Construct Model Repo

Now edit the config file under model_repo/speaker_model/config.pbtxt and replace default_model_filename:xxx with the name of your engine (e.g., b1_b128_s3000_fp16.trt) or onnx model (e.g., avg_model.onnx) and put the engine or model under model_repo/speaker_model/1/.

If you use different model settings or a different model from ours (resnet34), for example an ECAPA model whose embedding dimension is 192, you should edit model_repo/speaker_model/config.pbtxt and model_repo/speaker/config.pbtxt and set the embedding dimension accordingly (192 in this case).

If your model is an ONNX model, you should also change backend: "tensorrt" to backend: "onnxruntime" in model_repo/speaker_model/config.pbtxt.

If you want to deploy the model on CPUs, you should edit config.pbtxt under speaker and speaker_model and replace kind: KIND_GPU with kind: KIND_CPU.

Note that a TensorRT engine can only run on GPUs.

Step 2. Build server and start server

Note that we use Triton 22.03 in the dockerfile. Be sure to use a Triton version that matches your TensorRT version.

Build server:

# server
docker build . -f Dockerfile/dockerfile.server -t wespeaker:latest --network host
docker run --gpus '"device=0"' -v $PWD/model_repo:/ws/model_repo --shm-size=1g --ulimit memlock=-1 -p 8000:8000 -p 8001:8001 -p 8002:8002 --ulimit stack=67108864 -ti  wespeaker:latest
tritonserver --model-repository=/ws/model_repo

Port 8000 is for HTTP requests and port 8001 for gRPC requests.
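
To confirm the server is up before running clients, you can query Triton's standard readiness endpoint on the HTTP port. This is a small sketch; it assumes the requests package is installed.

import requests

# Triton exposes a readiness probe on the HTTP port (8000 in the command above).
response = requests.get("http://localhost:8000/v2/health/ready")
print("server ready:", response.status_code == 200)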

Step 3. Build client and start client

Build client:

# client
docker build . -f Dockerfile/dockerfile.client -t wespeaker_client:latest --network host
docker run -it -v $PWD:/ws -v <data path>:<data path> --network=host wespeaker_client

# example command
cd /ws/client/
python3 client.py --url=<ip of the server>:8001 --wavscp=/raid/dgxsa/slyne/wespeaker/examples/voxceleb/v2/data/vox1/wav.scp --output_directory=<to put the generated embeddings>

# The output directory will contain something like:
# xvector_000.ark xvector_000.scp xvector_001.scp ...
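
If you prefer to call the server from your own code instead of the provided client.py, a minimal gRPC sketch with tritonclient could look like the following. The tensor names WAV, WAV_LENS, and EMBEDDINGS are assumptions for illustration (as is the use of soundfile for reading audio); check model_repo/speaker/config.pbtxt and client/client.py for the actual names, shapes, and dtypes.

import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
wav, sr = sf.read("test.wav", dtype="float32")
samples = wav[np.newaxis, :].astype(np.float32)           # (1, num_samples)
lengths = np.array([[samples.shape[1]]], dtype=np.int32)  # (1, 1)

# Tensor names below are assumed, not taken from the repo; adjust to your config.pbtxt.
inputs = [grpcclient.InferInput("WAV", list(samples.shape), "FP32"),
          grpcclient.InferInput("WAV_LENS", list(lengths.shape), "INT32")]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(lengths)
outputs = [grpcclient.InferRequestedOutput("EMBEDDINGS")]

result = client.infer(model_name="speaker", inputs=inputs, outputs=outputs)
print(result.as_numpy("EMBEDDINGS").shape)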

Step 4. Test score

After you have extracted the embeddings, you can test them in the same way as in wespeaker. For example, you can score the extracted embeddings with:

cat embeddings/xvector_*.scp > embeddings/xvector.scp

config=conf/resnet.yaml
exp_dir=exp/resnet

mkdir -p embeddings/scores
trials_dir=data/vox1/trials
# eval_scp_path below points to the embeddings generated from our server
python -u wespeaker/bin/score.py \
    --exp_dir ${exp_dir} \
    --eval_scp_path /raid/dgxsa/slyne/wespeaker/runtime/server/x86_gpu/embeddings/xvector.scp \
    --cal_mean True \
    --cal_mean_dir ${exp_dir}/embeddings/vox2_dev \
    --p_target 0.01 \
    --c_miss 1 \
    --c_fa 1 \
    ${trials_dir}/vox1_O_cleaned.kaldi ${trials_dir}/vox1_E_cleaned.kaldi ${trials_dir}/vox1_H_cleaned.kaldi \
    2>&1 | tee /raid/dgxsa/slyne/wespeaker/runtime/server/x86_gpu/embeddings/scores/vox1_cos_result

Perf

We build our engines for 2.02-second audio only, using:

trtexec --saveEngine=resnet_b1_b128_s200_fp16.trt  --onnx=resnet/resnet_avg_model.onnx --minShapes=feats:1x200x80 --optShapes=feats:64x200x80 --maxShapes=feats:128x200x80 --fp16

trtexec --saveEngine=ecapa_b1_b128_s200_fp16.trt  --onnx=ecapa/ecapa_avg_model.onnx --minShapes=feats:1x200x80 --optShapes=feats:64x200x80 --maxShapes=feats:128x200x80 --fp16
  • GPU: T4

  • resnet: resnet34.

Engine                        Throughput (bz=64)  utter/s
resnet_b1_b128_s200_fp16.trt  39.7842             2546
ecapa_b1_b128_s200_fp16.trt   52.958              3389

Pipeline Perf

In client docker, we may test the whole pipeline performance.

cd client/
# generate test input
python3 generate_input.py --audio_file=test.wav --seconds=2.02

perf_analyzer -m speaker -b 1 --concurrency-range 200:1000:200 --input-data=input.json -u localhost:8000

Engine                        Concurrency  Throughput  Avg Latency (ms)  P99 Latency (ms)
resnet_b1_b128_s200_fp16.trt  200          2033        98                111
resnet_b1_b128_s200_fp16.trt  400          2010        198               208
ecapa_b1_b128_s200_fp16.trt   200          2647        75                111
ecapa_b1_b128_s200_fp16.trt   400          2726        147               172