Module 5: TTS front-end

We want to generate speech that is

  • Intelligible: you can clearly perceive what words are being said
  • Natural: sounds like human speech
  • Appropriate: conveys the right meaning in a specific context
The synthesis pipeline has two stages:

  • Front-end: analyze the text and derive a linguistic specification that contains the information needed to generate speech
  • Back-end: generate a waveform from the linguistic specification

Linguistic specification guides what we generate

  • Phones
  • Syllables
  • Words
  • Phrases
  • Utterances
  • Discourses
  • Pronunciation Dictionaries: Use pre-existing pronunciation dictionaries to map words to phonetic transcriptions
    • CMUDict: 1 big text file of words and their pronunciations
      • CMUDict is dialect-specific: it encodes General American pronunciations, so other accents (e.g. Scottish English) need a different lexicon
    • Unilex is an ‘accent-independent’ lexicon based on the Unisyn database
      • Classifies vowels by lexical-set keywords, e.g. FOOT vs STRUT
        • ‘Put’ → FOOT class
        • ‘Putt’ → STRUT class
      • Use this to describe phonemic variation in English dialects/accents
      • A single base lexicon encodes different accents: run it through accent-specific rules to produce accent-specific lexica
  • Phoneset choice
    • Unilex is more generalizable than CMUDict
    • Unilex more compact: 1 base lexicon + rules
      • But we need to define rules to convert from one accent to another; this leads us to revisit the concept of the phoneme
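A CMUDict-style lexicon is just one big text file mapping words to phone strings, so lookup reduces to parsing. The sketch below is minimal and the entries are illustrative (real CMUDict uses the same ARPAbet-with-stress-digits format, including the PUT/PUTT FOOT–STRUT pair mentioned above):

```python
# Minimal sketch of a CMUDict-style lexicon lookup.
# The two entries below are illustrative; a real lexicon has >100k lines.

def parse_lexicon(lines):
    """Map each word to its phone sequence (ARPAbet, with stress digits)."""
    lexicon = {}
    for line in lines:
        if not line.strip() or line.startswith(";;;"):  # skip blanks/comments
            continue
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon

entries = [
    "PUT  P UH1 T",   # FOOT vowel
    "PUTT  P AH1 T",  # STRUT vowel
]
lex = parse_lexicon(entries)
print(lex["PUT"])   # ['P', 'UH1', 'T']
```

In a Unilex-style setup, the same lookup would return keyword classes (FOOT, STRUT, …) rather than accent-specific phones, with accent rules applied afterwards.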

We use decision trees when we want to learn rules from data, e.g. letter-to-sound rules for words missing from the lexicon.
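As a toy illustration of learning a letter-to-sound rule from data, the sketch below trains a decision tree to predict whether ‘c’ is pronounced /k/ or /s/ from the letter that follows it. The examples and the single-letter feature are illustrative, not taken from a real lexicon:

```python
# Toy sketch: a decision tree learns a letter-to-sound rule from examples,
# predicting the phone for 'c' given the next letter in the word.
from sklearn.tree import DecisionTreeClassifier

# Feature: the letter after 'c' (as an integer code); label: the phone.
examples = [("cat", "a", "k"), ("cot", "o", "k"), ("cut", "u", "k"),
            ("cent", "e", "s"), ("city", "i", "s"), ("cycle", "y", "s")]
X = [[ord(next_letter)] for _, next_letter, _ in examples]
y = [phone for _, _, phone in examples]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[ord("e")]]))  # 'c' before 'e' → /s/
```

The tree recovers the familiar "soft c before e/i/y" pattern purely from the training examples, which is exactly the appeal of learning such rules rather than hand-writing them.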

Module 6: Waveform generation

Diphone database requirements

  • Clean, clear recordings of a single speaker
  • Recordings of every possible diphone in the language
  • Phone segmentation (timings) to calculate where diphones start and end
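Given the phone segmentation, diphone boundaries are conventionally placed at phone midpoints, where the signal is most stable. A small sketch with illustrative timings (in seconds):

```python
# Sketch: cut diphones from a phone segmentation. Each diphone runs from
# the midpoint of one phone to the midpoint of the next.

def diphone_boundaries(segmentation):
    """segmentation: list of (phone, start, end); returns (diphone, start, end)."""
    diphones = []
    for (p1, s1, e1), (p2, s2, _) in zip(segmentation, segmentation[1:]):
        mid1 = (s1 + e1) / 2          # midpoint of the first phone
        mid2 = (s2 + _) / 2 if False else (s2 + segmentation[segmentation.index((p2, s2, _))][2]) / 2
        diphones.append((f"{p1}-{p2}", round(mid1, 3), round(mid2, 3)))
    return diphones

seg = [("sil", 0.00, 0.10), ("k", 0.10, 0.18), ("ae", 0.18, 0.30), ("t", 0.30, 0.38)]
print(diphone_boundaries(seg))
# [('sil-k', 0.05, 0.14), ('k-ae', 0.14, 0.24), ('ae-t', 0.24, 0.34)]
```

Cutting at midpoints rather than phone boundaries means every join falls inside a phone, where the spectrum changes least, which is the core motivation for using diphones as units.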

The most common use of lexical stress marking is to determine which syllable of a word a pitch accent is placed on when that word is made prosodically prominent.
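In a CMUDict-style transcription, lexical stress is marked with digits on the vowels (0 = unstressed, 1 = primary, 2 = secondary), so finding the accent-bearing syllable amounts to locating the vowel marked ‘1’. A minimal sketch (the helper name is ours):

```python
# Sketch: locate the vowel carrying primary lexical stress (digit '1'),
# i.e. where a pitch accent would attach if the word is made prominent.

def primary_stress_index(phones):
    """Return the index of the vowel marked with primary stress, or None."""
    for i, phone in enumerate(phones):
        if phone.endswith("1"):
            return i
    return None

# 'photograph': primary stress on the first syllable
print(primary_stress_index(["F", "OW1", "T", "AH0", "G", "R", "AE2", "F"]))  # 1
```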

The Tone and Break Indices (ToBI) model of prosody aims to capture prosodic prominence (pitch accents), boundary tones, and the extent of prosodic breaks (break indices). It does not try to capture the pragmatic or affective content of speech, such as speech acts or emotions.

Spectral smoothing, as the name suggests, reduces spectral discontinuity by making the change in the spectrum across a join more gradual.
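One simple way to realize this is linear interpolation between the spectral envelopes on either side of the join over a few transition frames. The sketch below assumes toy 3-bin magnitude spectra; real systems interpolate a full envelope representation (e.g. cepstra or line spectral frequencies):

```python
# Sketch of spectral smoothing at a concatenation join: linearly
# interpolate between the last frame of unit A and the first frame of
# unit B over a few transition frames. Frame values are illustrative.
import numpy as np

def smooth_join(spec_a, spec_b, n_frames=4):
    """Return n_frames spectra bridging spec_a and spec_b."""
    weights = np.linspace(0.0, 1.0, n_frames)
    return [(1 - w) * spec_a + w * spec_b for w in weights]

last_a = np.array([1.0, 0.8, 0.2])   # toy 3-bin magnitude spectrum
first_b = np.array([0.2, 0.4, 1.0])
frames = smooth_join(last_a, first_b)
print(frames[0], frames[-1])  # starts at spec_a, ends exactly at spec_b
```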
