mickael-rouvier.fr

Corpus: LIA Commonvoice

The LIA Commonvoice corpus is intended for Automatic Speech Recognition (ASR) and Speaker Verification (SV). It is based on the French portion of Mozilla Common Voice, with the transcriptions and speaker labels re-annotated to better reflect the actual data.

Contents:

  • 152,361 audio segments (173 hours of audio)
  • Text and speaker re-annotation that corrects the labels provided by Mozilla
  • Dictionary with pronunciations
  • Acoustic and language models
  • Kaldi recipe for reproducing the models