Diphone
Phones are not a suitable unit for waveform concatenation, so we use diphones instead, which capture coarticulation.
A diphone starts at the middle of one phone and ends at the middle of the next.
Coarticulation is the overlapping of adjacent articulations, or equivalently the influence of a target phoneme on its surrounding phonemes. Because of coarticulation, the middle of a phone is more stable in its spectral properties than the edges. Diphones are cut and joined at these stable midpoints, so concatenating them should lead to smoother joins.
Waveform concatenation
Concatenation of waveforms is a simple way of making synthetic speech, but we need to take care about how we do it.
- A discontinuity in the waveform at the join causes an audible pop
- Misaligned periodicity (pitch periods out of phase at the join) causes glitches
Overlap-add
Cross-fading between two waveforms is an effective way to avoid some of the artefacts of concatenation.
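The cross-fade described above can be sketched as follows. This is a minimal illustration, assuming waveforms are plain lists of float samples and a linear fade; the function name `crossfade_concat` and the `overlap` parameter are hypothetical and not taken from the course material.

```python
def crossfade_concat(a, b, overlap):
    """Join two sample lists, cross-fading over `overlap` samples.

    The last `overlap` samples of `a` fade out while the first
    `overlap` samples of `b` fade in (linear weights, assumed here).
    """
    if overlap <= 0:
        # No cross-fade: plain concatenation, which may click at the join.
        return list(a) + list(b)
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the waveforms")

    head = list(a[:-overlap])   # part of a that is left untouched
    tail = list(b[overlap:])    # part of b that is left untouched
    mixed = []
    for i in range(overlap):
        # Weight of b rises linearly; the two weights always sum to 1.
        w = (i + 0.5) / overlap
        mixed.append((1.0 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail


# Example: two constant waveforms with a large step between them.
a = [1.0] * 8
b = [-1.0] * 8
joined = crossfade_concat(a, b, overlap=4)
```

The output is `overlap` samples shorter than the two inputs combined, and the step between `a` and `b` is spread over the overlap region instead of happening in a single sample.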
Pitch period
This fundamental building block of speech waveforms offers a route to source-filter separation in the time domain.
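In voiced speech, each pitch period is approximately the vocal tract's impulse response to one glottal pulse, and successive responses overlap in the waveform.
To extract pitch periods, we centre a tapered window on each pitch mark. Each extracted unit spans twice the fundamental period $T_0$.
Overlap-adding these units at their original positions reconstructs a signal very similar to the original one. The whole process is called copy synthesis.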
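The extraction and overlap-add steps can be sketched as below. This is an illustration under stated assumptions: pitch marks are given as integer sample indices, and the taper is a triangular window that peaks at its own mark and reaches zero at the neighbouring marks (the notes only say "taper window", so the exact shape is my choice). The names `extract_units` and `overlap_add` are hypothetical.

```python
import math


def extract_units(signal, marks):
    """Return one tapered unit per interior pitch mark.

    Each unit runs from the previous mark to the next mark (about 2*T0)
    and is weighted by a triangular window peaking at its own mark.
    Returns a list of (offset of the mark inside the unit, samples).
    """
    units = []
    for prev, centre, nxt in zip(marks, marks[1:], marks[2:]):
        unit = []
        for n in range(prev, nxt + 1):
            if n <= centre:
                w = (n - prev) / (centre - prev)   # rising half of the taper
            else:
                w = (nxt - n) / (nxt - centre)     # falling half of the taper
            unit.append(w * signal[n])
        units.append((centre - prev, unit))
    return units


def overlap_add(units, positions, length):
    """Place each unit so that its mark lands on the given position, and sum."""
    out = [0.0] * length
    for (offset, unit), pos in zip(units, positions):
        start = pos - offset
        for k, value in enumerate(unit):
            if 0 <= start + k < length:
                out[start + k] += value
    return out


# Copy synthesis on a toy periodic signal with T0 = 20 samples.
T0 = 20
signal = [math.sin(2 * math.pi * n / T0) for n in range(200)]
marks = list(range(0, 200, T0))
units = extract_units(signal, marks)
recon = overlap_add(units, marks[1:-1], len(signal))
```

With this window, neighbouring tapers sum to one, so `recon` matches `signal` between the first and last interior pitch marks. The edges of the signal are not covered by any unit and stay at zero.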
TD-PSOLA
Applying overlap-add techniques to pitch period waveforms allows the modification of F0 and duration without changing the phone identity.
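TD-PSOLA stands for time-domain pitch-synchronous overlap-and-add.
- To raise F0: move the pitch periods closer to each other
- To lower F0: move the pitch periods farther apart from each other
- To increase duration: make a copy of one pitch period and insert it into the sequence
- To decrease duration: delete one pitch period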
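A toy version of these F0 and duration modifications might look like the sketch below. It assumes integer pitch marks, a triangular taper, and a simple "nearest analysis unit" rule for deciding which pitch period to copy or skip. The function `td_psola` and its parameters `f0_scale` and `dur_scale` are hypothetical names for illustration, not a reference implementation.

```python
import math


def td_psola(signal, marks, f0_scale=1.0, dur_scale=1.0):
    """Toy TD-PSOLA. Returns (output samples, synthesis marks, chosen unit indices)."""
    # Analysis: one triangular-windowed unit of about 2*T0 per interior mark.
    units = []
    for prev, centre, nxt in zip(marks, marks[1:], marks[2:]):
        samples = []
        for n in range(prev, nxt + 1):
            if n <= centre:
                w = (n - prev) / (centre - prev)
            else:
                w = (nxt - n) / (nxt - centre)
            samples.append(w * signal[n])
        units.append({"centre": centre, "offset": centre - prev,
                      "period": (nxt - prev) / 2.0, "samples": samples})

    out = [0.0] * int(round(len(signal) * dur_scale))
    synth_marks, chosen = [], []
    t = units[0]["centre"] * dur_scale      # current synthesis time
    end = units[-1]["centre"] * dur_scale   # last synthesis time to fill
    while t <= end + 1e-9:
        # Map synthesis time back to analysis time and take the nearest unit.
        # Stretching makes some units repeat; shrinking makes some get skipped.
        t_analysis = t / dur_scale
        i = min(range(len(units)),
                key=lambda k: abs(units[k]["centre"] - t_analysis))
        unit = units[i]
        pos = int(round(t))
        start = pos - unit["offset"]
        for k, value in enumerate(unit["samples"]):
            if 0 <= start + k < len(out):
                out[start + k] += value
        synth_marks.append(pos)
        chosen.append(i)
        # New spacing: the local period divided by the F0 scaling factor.
        t += unit["period"] / f0_scale
    return out, synth_marks, chosen


T0 = 20
signal = [math.sin(2 * math.pi * n / T0) for n in range(200)]
marks = list(range(0, 200, T0))

higher, higher_marks, _ = td_psola(signal, marks, f0_scale=2.0)   # periods closer
longer, _, longer_units = td_psola(signal, marks, dur_scale=2.0)  # periods copied
shorter, _, shorter_units = td_psola(signal, marks, dur_scale=0.5)  # periods deleted
```

With `f0_scale=2.0` the synthesis marks are half a period apart, so F0 doubles while the duration stays the same. With `dur_scale=2.0` some unit indices appear more than once in `longer_units`, and with `dur_scale=0.5` some never appear in `shorter_units`.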
Diphone synthesis:
- One recording of every diphone (small database)
- Use signal processing methods to change F0, duration, and smooth joins to match linguistic specification
- e.g. TD-PSOLA
Unit Selection
Unit selection:
- Record a large naturalistic database
- Select diphone units based on closeness to the linguistic specification
- If the database has enough variation, don’t worry (too much) about signal processing!
Choice of units to concatenate depends on:
- Target cost: how well the unit matches the linguistic specification
- Join cost: how well edges of the units match
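The trade-off between target cost and join cost can be illustrated with a small dynamic-programming search. This is a sketch only: the unit representation `(diphone name, F0 at start, F0 at end)`, the two cost functions, and all numbers are made up for the example, and real systems use richer features than F0 alone.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target, minimising target plus join costs.

    candidates[i] lists the database units available for targets[i].
    Returns (total cost, list of chosen units).
    """
    # best holds, for each candidate of the current target, the cheapest
    # path that ends in that candidate: (cost so far, units on the path).
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for u in candidates[i]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda cp: cp[0],
            )
            new_best.append((cost + target_cost(targets[i], u), path + [u]))
        best = new_best
    return min(best, key=lambda cp: cp[0])


# Hypothetical costs: compare mean F0 with the target, and F0 across the join.
def target_cost(target, unit):
    return abs(target[1] - (unit[1] + unit[2]) / 2)


def join_cost(left, right):
    return abs(left[2] - right[1])


targets = [("h-e", 120), ("e-l", 120)]   # (diphone, desired F0 in Hz)
candidates = [
    [("h-e", 118, 122), ("h-e", 100, 140)],
    [("e-l", 150, 110), ("e-l", 123, 119)],
]
total, chosen = select_units(targets, candidates, target_cost, join_cost)
```

Both `h-e` candidates have the same target cost here, so the join cost decides between them: the search prefers the pair whose F0 values meet smoothly at the join.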
Convolution
A non-mathematical illustration of the equivalence of convolution (in the time domain), multiplication of magnitude spectra, and addition of log magnitude spectra.
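The equivalence can also be checked numerically. The sketch below uses a hand-written DFT so that it needs only the standard library; the signals `excitation` and `impulse_response` are arbitrary toy values chosen so that no spectral magnitude is zero (otherwise the logarithm would be undefined).

```python
import cmath
import math


def convolve(x, h):
    """Direct (linear) convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def magnitude_spectrum(x, n_fft):
    """Magnitude of the n_fft-point DFT of x, zero-padded to n_fft samples."""
    padded = list(x) + [0.0] * (n_fft - len(x))
    spectrum = []
    for k in range(n_fft):
        bin_value = sum(v * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n, v in enumerate(padded))
        spectrum.append(abs(bin_value))
    return spectrum


excitation = [1.0, 0.5, 0.25]     # toy "source" signal
impulse_response = [1.0, -0.5]    # toy "filter" impulse response

y = convolve(excitation, impulse_response)
n_fft = len(y)                    # long enough to avoid circular wrap-around

mag_x = magnitude_spectrum(excitation, n_fft)
mag_h = magnitude_spectrum(impulse_response, n_fft)
mag_y = magnitude_spectrum(y, n_fft)

for k in range(n_fft):
    product = mag_x[k] * mag_h[k]                       # multiply magnitudes
    log_sum = math.log(mag_x[k]) + math.log(mag_h[k])   # add log magnitudes
    print(k, round(mag_y[k], 6), round(product, 6),
          round(math.log(mag_y[k]), 6), round(log_sum, 6))
```

In each printed row, the second and third columns agree (convolution in time equals multiplication of magnitude spectra), and so do the fourth and fifth (which equals addition of log magnitude spectra).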
Summary
Origin: Module 6 – Speech Synthesis – waveform generation and connected speech
Translate + Edit: YangSier (Homepage)