moshi.cpp

A port of Kyutai's Moshi to C++ and ggml.

https://github.com/kyutai-labs/moshi

With additional support for NVIDIA's PersonaPlex.

https://github.com/nvidia/personaplex

There is a separate project for Kyutai's Pocket TTS.

https://github.com/Codes4Fun/pocket-tts.cpp

Status
Quick Start Linux
PersonaPlex Linux
Quick Start Windows
PersonaPlex Windows
Build Dependencies
Building
Models
Running Demos
Benchmarks
Design Notes

Status

The base library supports Kyutai's earlier Moshi models: speech-to-speech, text-to-speech, and speech-to-text. It also supports NVIDIA's PersonaPlex, which is based on Moshi.

For Future Release

wrap sentencepiece into its own dynamic library or externalize it.
investigate timing issue with sdl, integrate diagnosis in mimi-echo.
sync up moshi.cpp and pocket-tts.cpp code bases.
externalize utilities shared between pocket-tts.cpp and moshi.cpp
look into personaplex voice loading performance, possibly make an embedding tool.

In Current Release

There are multiple tools that demonstrate different components:

mimi-encode - demonstrates using mimi to encode different inputs to a mimi file
mimi-decode - demonstrates using mimi to decode and output different files
mimi-play - decodes mimi files and plays them through sdl
mimi-echo - realtime demo that allows you to hear mimi compression
moshi-tts - demonstrates text inputs to audio outputs
moshi-stt - demonstrates audio inputs to text outputs
moshi-sts - demonstrates audio inputs to audio (and text) outputs
personaplex - demonstrates how to use NVIDIA's advanced moshi model with voice cloning and a system prompt.

There are aria2c download scripts to make it easier to download tested models.

The tools can quantize the safetensors models and cache them as gguf files from the command line. Use -g to cache a gguf, which loads several times faster than the safetensors but consumes more drive space. Use -q q8_0 or -q q4_k to quantize. The q4_k conversion can take several minutes for some models, so it is best combined with -g to save the gguf version; the quantized models also run a bit faster. The largest models, moshika and moshiko, can run in 8 GB of VRAM with q4_k, but they may not run fast enough, though I was able to have a conversation on an RTX 2070 laptop running Linux.
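As a rough sanity check on those sizes, the storage cost of a quantized model can be estimated from ggml's block layouts: q8_0 packs 32 weights into 34 bytes (8.5 bits per weight) and q4_K packs 256 weights into 144 bytes (4.5 bits per weight). The sketch below is only an estimate. The helper names are made up for illustration, and it assumes every weight is quantized, which is usually not true of a real gguf file.

```cpp
#include <cassert>
#include <cstdint>

// Storage cost of one quantization block: bytes per block and weights per block.
struct QuantFormat {
    int block_bytes;
    int block_weights;
};

constexpr QuantFormat kBF16{2, 1};     // unquantized 16-bit weights
constexpr QuantFormat kQ8_0{34, 32};   // 2-byte scale + 32 int8 values
constexpr QuantFormat kQ4_K{144, 256}; // 256-weight super-block

// Average number of bits used to store one weight.
constexpr double bits_per_weight(QuantFormat f) {
    return 8.0 * f.block_bytes / f.block_weights;
}

// Estimated size in bytes if all n_weights were stored in format f.
// Assumption: ignores metadata and tensors that are left unquantized.
inline double estimated_bytes(std::uint64_t n_weights, QuantFormat f) {
    assert(f.block_weights > 0);
    return static_cast<double>(n_weights) * f.block_bytes / f.block_weights;
}
```

For example, a 16 GB bf16 checkpoint holds roughly 8 billion weights, so estimated_bytes(8000000000ULL, kQ4_K) gives 4.5 billion bytes, which is in the same range as the q4_k disk usage reported in this README.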

Performance and Optimizations

I added one optimization that does not exist in the original moshi: instead of generating an attention bias mask every frame, it generates a reusable pattern once at initialization and reuses it like a lookup table. This reduces the per-frame work to changing an offset into the pattern tensor, and it simplifies an implementation that originally involved boolean logic operations and handling infinities. The lookup is also done only once per transformer instead of once per transformer layer.
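The idea can be illustrated with a small sketch. This is not the moshi.cpp code; it is a simplified, hypothetical version for single-token streaming attention over a KV cache with a fixed number of slots, and the BiasPattern name and layout are assumptions made for the example. The pattern is built once, and the bias row for any frame is just a pointer offset into it.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical illustration, not the moshi.cpp implementation.
// Pattern layout: the first `capacity` entries are 0 (attend) and the
// last `capacity` entries are -inf (masked).
struct BiasPattern {
    std::size_t capacity;
    std::vector<float> data; // length 2 * capacity, filled once

    explicit BiasPattern(std::size_t cap)
        : capacity(cap), data(2 * cap, 0.0f) {
        const float neg_inf = -std::numeric_limits<float>::infinity();
        for (std::size_t i = cap; i < 2 * cap; ++i) data[i] = neg_inf;
    }

    // Returns a view of `capacity` floats in which the first `n_valid`
    // entries are 0 and the remaining ones are -inf. Nothing is
    // generated per frame; only the start offset changes.
    const float* row(std::size_t n_valid) const {
        assert(n_valid <= capacity);
        return data.data() + (capacity - n_valid);
    }
};
```

In a design like this, the row for the current frame would be looked up once and the same pointer handed to every layer of the transformer, which corresponds to the "once per transformer" part of the description above.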

Quick Start Linux

Make sure you have relatively recent drivers for linux.

Download a binary release for linux and extract somewhere.

Open a terminal to where the files are extracted.

Install some additional dependencies:

sudo apt install aria2 libsdl2-2.0-0

Download the models ( about 9.7GB ):

aria2c --disable-ipv6 -i moshi-defaults.txt

Run moshika, a hallucinating speech-to-speech model (requires a microphone). Try asking her "What are you doing?":

./moshi-sts

Run speech-to-text, requires microphone:

./moshi-stt

Run text-to-speech:

./moshi-tts "Hello World!"

PersonaPlex Linux

Follow the instructions above but download the personaplex models ( about 5.0GB ):

aria2c --disable-ipv6 -i Codes4Fun_personaplex-7b-v1-q4_k-GGUF.txt

Run it with a random voice and behavior:

./personaplex

To use one of NVidia's default voices:

./personaplex -v NATF0

See ./personaplex -h for more default voice options.

To clone a voice from a wav file and set a system prompt:

./personaplex -v adam-west.wav -p "You are batman investigating a crime."

Quick Start Windows

Make sure you have relatively recent drivers and have the latest msvc runtimes.

Download a binary release for windows and extract somewhere.

Open a command prompt (press Windows + R and run 'cmd') or PowerShell, and navigate to where the files are extracted.

Download the models ( about 9.7GB ):

.\aria2c --disable-ipv6 -i moshi-defaults.txt

Run moshika, a hallucinating speech-to-speech model (requires a microphone). Try asking her "What are you doing?":

.\moshi-sts

Run speech-to-text, requires microphone:

.\moshi-stt

Run text-to-speech:

.\moshi-tts "Hello World!"

PersonaPlex Windows

Follow the instructions above but download the personaplex models ( about 5.0GB ):

.\aria2c --disable-ipv6 -i Codes4Fun_personaplex-7b-v1-q4_k-GGUF.txt

Run it with a random voice and behavior:

.\personaplex

Under Windows 11 it's been observed that performance can slow with large contexts, so you can use -c 1000 to see if performance improves at the cost of conversation quality.

To use one of NVidia's default voices:

.\personaplex -v NATF0

See .\personaplex -h for more default voice options.

To clone a voice from a wav file and set a system prompt:

.\personaplex -v adam-west.wav -p "You are batman investigating a crime."

Build Dependencies

The moshi library depends on:

SentencePiece (tested with 0.2.0)
GGML

The tools additionally depend on:

FFmpeg (7+)
SDL2

SentencePiece

SentencePiece has only been tested using static linking built from source:

https://github.com/google/sentencepiece/releases/tag/v0.2.0

GGML

If you plan to build with Vulkan support, you should use my modified version of ggml:

https://github.com/Codes4Fun/ggml

otherwise you can use the official version:

https://github.com/ggml-org/ggml

Example build with cuda and vulkan:

git clone --branch for_moshi --single-branch https://github.com/codes4fun/ggml
cd ggml
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_CUDA=ON -DGGML_VULKAN=ON
cmake --build . --config Release

You might need to set CMAKE_CUDA_COMPILER to where nvcc is located, and Vulkan_GLSLC_EXECUTABLE to where glslc is located. Using a newer version of CMake (4.1+) can usually resolve that.

FFmpeg

FFmpeg 7 or newer is required, which is newer than what most Linux package managers provide. It can be built from source, or you can use prebuilt binaries for Linux or Windows from here:

https://github.com/BtbN/FFmpeg-Builds/releases

I've tested the ffmpeg-master-latest-*-lgpl-shared versions.

Other download options at the official site: https://ffmpeg.org/download.html

SDL2

SDL2 can be installed using standard package managers, for example on Ubuntu:

sudo apt install libsdl2-dev

The Windows SDL2 development libraries (SDL2-devel-2.30.11-VC.zip) can be downloaded here:

https://github.com/libsdl-org/SDL/releases/tag/release-2.30.11

Building

With dependencies in place you can use cmake by first cloning this repository and then creating a build directory:

git clone https://github.com/codes4fun/moshi.cpp
cd moshi.cpp
mkdir build
cd build

and then generate a build using cmake, which for example on windows would look like this (changing generation target and paths as needed):

cmake .. -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_INCLUDE_DIR=C:/repos/ggml/include -DGGML_LIBRARY_DIR=C:/repos/ggml/build/src -DSentencePiece_INCLUDE_DIR=C:/repos/sentencepiece/src -DSentencePiece_LIBRARY_DIR=C:/repos/sentencepiece/build/src -DCMAKE_PREFIX_PATH=C:\lib\SDL2-2.30.11 -DFFmpeg_DIR=C:\lib\ffmpeg-master-latest-win64-lgpl-shared

or on Ubuntu (change the paths if necessary):

cmake .. \
 -DGGML_INCLUDE_DIR=$HOME/repos/ggml/include \
 -DGGML_LIBRARY_DIR=$HOME/repos/ggml/build/src \
 -DSentencePiece_INCLUDE_DIR=$HOME/repos/sentencepiece/include \
 -DSentencePiece_LIBRARY_DIR=$HOME/repos/sentencepiece/lib \
 -DFFmpeg_DIR=$HOME/lib/ffmpeg-master-latest-linux64-lgpl-shared

And finally build it.

cmake --build .

That will create a bin directory under build. You will need to copy over ggml libraries, and if needed the ffmpeg libraries. On windows you will need to also copy over sdl2.

Models

To make downloading models easier, I have provided aria2 input files that will automatically download and verify the downloaded files. You can install aria2 either by downloading from https://github.com/aria2/aria2/releases/tag/release-1.37.0 or using a package manager like apt:

sudo apt install aria2

or pacman

sudo pacman -S aria2

For windows you can unzip the aria2c.exe into the moshi directory.

Afterwards you can run the following, which will download and verify the minimal files to run moshi-tts and moshi-stt. This requires about 9.7 GB of space:

aria2c --disable-ipv6 -i moshi-defaults.txt

If you want your models to be located in another directory, ideally set its path in an environment variable named MODEL_CACHE and then add -d to the command line, for example -d $MODEL_CACHE on Linux or -d %MODEL_CACHE% on Windows.

If you wish to download all available voices (731 MB), run the aria2c command again but change the last part from -i moshi-defaults.txt to -i kyutai_tts-voices.txt.

These are the available aria2 download scripts:

moshi-defaults.txt - 9.7GB downloads files necessary to run all demos.
kyutai_tts-voices.txt - 731 MB, all tts-1.6b and tts-0.75b voices
Codes4Fun_moshi-common.txt - files shared between models.
Codes4Fun_moshika-q4_k-GGUF.txt - Kyutai's Moshika model in quantized gguf format.
Codes4Fun_stt-1b-en_fr-GGUF.txt - Kyutai's STT 1B model in gguf format.
Codes4Fun_tts-1.6b-en_fr-GGUF.txt - Kyutai's TTS 1.6B model in gguf format.

These are additional aria2 download scripts, included for reference. They can be used but may not perform well unless converted/quantized:

kyutai_stt-1b-en_fr-candle.txt - downloaded as part of default.
kyutai_stt-2.6b-en.txt - 6 GB, large model without vad but better quality.
kyutai_tts-0.75b-en-public.txt - 2 GB, small model that uses audio files for voices.
kyutai_tts-1.6b-en_fr.txt - downloaded as part of default.
kyutai_moshika-pytorch-bf16.txt - 16 GB female model
kyutai_moshiko-pytorch-bf16.txt - 16 GB male model

Running Demos

After downloading or building moshi.cpp, you can see a list of device options with the -l option; for example, moshi-tts -l should output a list of devices. If no output shows up, make sure you have the latest msvc redistributables installed:

https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170

After downloading the default models (see the Models section), you should be able to start generating speech using the default tts model and voice:

moshi-tts "She sells sea shells by the sea shore."

If you installed the models in a different directory, you can specify the root location with the command line argument -r or by setting the environment variable MODEL_CACHE to where the models reside. For example, if the tts model is located at C:/models/kyutai/tts-1.6b-en_fr, you could use -r C:/models, or set MODEL_CACHE to C:\models and omit the command line option.
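The lookup described above can be pictured with a small sketch. This is a hypothetical helper, not code from moshi.cpp: the function names, the precedence of -r over MODEL_CACHE, and the fallback of "." are all assumptions made for illustration. Only the MODEL_CACHE variable name and the root/kyutai/tts-1.6b-en_fr layout come from this README.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the moshi.cpp implementation.
// Picks the model root: an explicit -r value if one was given, otherwise
// the MODEL_CACHE environment variable, otherwise the fallback.
// Assumption: this precedence order and the "." fallback are illustrative.
std::string resolve_model_root(const std::string& cli_root,
                               const std::string& fallback = ".") {
    if (!cli_root.empty()) return cli_root;
    const char* env = std::getenv("MODEL_CACHE");
    if (env != nullptr && env[0] != '\0') return env;
    return fallback;
}

// Joins the root with a model's relative path, such as
// "kyutai/tts-1.6b-en_fr", without doubling the separator.
std::string model_dir(const std::string& root, const std::string& relative) {
    assert(!root.empty());
    const char last = root.back();
    if (last == '/' || last == '\\') return root + relative;
    return root + "/" + relative;
}
```

With a root of C:/models, model_dir(root, "kyutai/tts-1.6b-en_fr") yields C:/models/kyutai/tts-1.6b-en_fr, matching the example path above.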

If you run into an error about PTX being compiled by an unsupported toolchain, try updating the nvidia drivers.

If for some reason SDL isn't outputting audio, or you want to generate an mp3 file or any other media file format that ffmpeg supports, you can use the output option -o like so:

moshi-tts "She sells sea shells by the sea shore." -o seashells.mp3

To demo the stt, using the microphone:

moshi-stt

or an input media file:

moshi-stt -i seashells.mp3

If you get an error, make sure the microphone is working, and if in windows make sure desktop apps can access it via the "Microphone privacy settings".

To talk to moshika (not part of the default download), if you have 20gb vram, you can use:

moshi-sts

If you have less than 20gb and at least 8gb of vram, or performance is a bit low, you can quantize the model down and cache it using this command:

moshi-sts -g -q q4_k

That will consume about 4gb of additional disk space, and takes several minutes to convert the model, but after the initial creation, starting moshi will take seconds.

If you plan to use these models multiple times, it is recommended to use the -g option; it will take up more drive space but will load several times faster. You can experiment with quantization of the other models as well, using -q q8_0 or -q q4_k.

Benchmarks

A simple way to benchmark is to first generate a wav using moshi-tts and then use that wav with moshi-stt. If you store the models in a separate directory, set the environment variable MODEL_CACHE to the root directory containing the kyutai folder to make this easier. You can use this command to benchmark text-to-speech:

./moshi-tts --bench

This defaults to the text "The quick brown fox jumped over the sleeping dog.", disables output, and sets the seed to 0 and the temperature to 0 for consistent results. To benchmark a specific device, get a list of devices via ./moshi-tts -l. To target a specific device (like CUDA0, Vulkan0, or CPU) and/or set the number of threads, modify the command like this:

./moshi-tts --bench -d CPU --threads 8

For benchmarking speech-to-text, you need an audio input file first, which I would recommend generating by adding an output file to the tts bench option:

./moshi-tts --bench -o test.wav

Then you can use test.wav to run stt.

./moshi-stt -i test.wav

For benchmarking speech-to-speech (sts), you can use the --bench option. This disables SDL audio input/output and runs the model as fast as possible for only 125 frames, which can take between 10 and 40 seconds. For the fastest speed with sts, it is recommended to use the -g -q q4_k options, which take an additional 4 GB of disk space and several minutes on the first run; after that, the model loads in seconds and consumes less than 8 GB of VRAM.

./moshi-sts --bench -g -q q4_k

These commands output frames per second. Although tts also outputs tokens per second, that is only for reference, since pronouncing a token can take a variable number of frames to compute.

Moshi operates at 12.5 frames per second, so anything below that would not work for real-time applications. In the tables below, 🟢 marks sts results at or above that rate and 🔴 marks results below it.
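To make the real-time threshold concrete, a measured frames-per-second figure can be turned into a real-time factor, and a frame count into seconds of audio. The helpers below are a minimal sketch with made-up names, not part of moshi.cpp. The only input taken from this README is the 12.5 frames per second rate.

```cpp
#include <cassert>

// Moshi's frame rate, as stated above.
constexpr double kMoshiFps = 12.5;

// Ratio of measured throughput to the real-time requirement.
// Values of 1.0 or more mean the model keeps up with live audio.
constexpr double realtime_factor(double measured_fps) {
    return measured_fps / kMoshiFps;
}

// Seconds of audio represented by `frames` frames.
constexpr double audio_seconds(int frames) {
    return frames / kMoshiFps;
}

// Wall-clock seconds needed to process `frames` at the measured rate.
inline double wall_seconds(int frames, double measured_fps) {
    assert(measured_fps > 0.0);
    return frames / measured_fps;
}
```

For instance, the 125-frame sts benchmark covers audio_seconds(125), which is 10 seconds of audio. A result of 19.49 fps gives a real-time factor of about 1.56, while 6.94 fps gives about 0.56 and needs roughly 18 seconds of wall-clock time for those 10 seconds of audio.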

CUDA benchmarks (beta2):

make	name	gb	driver	os	tts fps	stt fps	sts q4_k
NVIDIA	RTX 2070	8	CUDA	linux	20.64	93.27	🟢 19.49
NVIDIA	RTX 4060	8	CUDA	linux	19.41	76.63	🟢 17.85
NVIDIA	RTX 3060	12	CUDA	linux	17.98	78.02	🟢 17.82
NVIDIA	RTX 2070 Laptop	8	CUDA	linux	18.84	83.08	🟢 16.89
NVIDIA	RTX 2070 Laptop	8	CUDA	win10	16.96	59.56	🟢 14.75
NVIDIA	RTX 2070	8	CUDA	win11	14.71	48.46	🟢 13.77
NVIDIA	RTX 4060	8	CUDA	win11	14.14	42.37	🟢 13.44
NVIDIA	RTX 3060	12	CUDA	win11	13.80	42.44	🟢 12.79
NVIDIA	GTX 1070	8	CUDA	win11	8.72	41.81	🔴 6.94

Vulkan benchmarks (beta2):

make	name	gb	driver	os	tts fps	stt fps	sts q4_k
Intel	ARC B850	12	Vulkan	win11	31.43	63.88	🟢 22.03
AMD	Radeon RX 6700 XT	12	Vulkan	win11	22.46	56.70	🟢 19.17
AMD	Radeon RX 6700 XT	12	Vulkan	linux	20.35	58.32	🟢 17.84
Intel	ARC B850	12	Vulkan	linux	19.88	44.49	🟢 16.45
AMD	Radeon 8060S	64	Vulkan	linux	13.15	43.57	🟢 15.47
AMD	Radeon 8060S	64	Vulkan	win11	12.34	37.16	🟢 15.05
AMD	Radeon 890M HX370	16	Vulkan	linux	7.50	23.83	🔴 6.60
AMD	Radeon 890M HX370	16	Vulkan	win11	7.53	21.65	🔴 5.80

CPU benchmarks (alpha):

make	name	driver	tts fps	stt fps	threads
AMD	Ryzen AI MAX+ 395	CPU	4.24	8.36	8
AMD	Ryzen AI 9 HX370	CPU	4.18	7.48	8
AMD	Ryzen 7 8845HS	CPU	3.71	6.77	8
AMD	Ryzen 7 8840U	CPU	2.89	6.45	8
Intel	Core i7-8750H	CPU	2.73	5.03	6
Intel	Core i7-9750H	CPU	2.54	5.09	6
Intel	Core i7-6700T	CPU	1.62	3.04	4

Design Notes

I was originally looking at designing the API after gstreamer and/or potentially integrating with it, but I found gstreamer rather hard to debug when things didn't work, and they immediately didn't work. I still like the idea of pipes, but I decided to follow how FFmpeg connects decoders, resamplers, and encoders. I am not entirely set on this, as I have lots of other ideas, such as streaming to SDL while also recording to an mp3 file. In the future it may also make sense for data to stay on the GPU as long as it can, so hiding how things are connected would make sense.

Internally I tried to replicate what the original moshi did by using single header files for code, following its file hierarchy, to make it easier for anyone interested to compare the Python to the C++.

My coding style is a combination of C++ and C, largely because C++ through deep abstraction can make it hard to debug, read, maintain, and refactor code. So I try to keep abstractions shallow, mostly used for reducing code bloat with automation. There are other misc things I do primarily for readability.
