Concatenative synthesis is a technique for synthesising sounds by concatenating short samples of recorded sound (called units). The duration of the units is not strictly defined and may vary according to the implementation, roughly in the range of 10 milliseconds up to 1 second. It is used in speech synthesis and music sound synthesis to generate user-specified sequences of sound from a database (often called a corpus) built from recordings of other sequences.

In contrast to granular synthesis, concatenative synthesis is driven by an analysis of the source sound, in order to identify the units that best match the specified criterion.^[1]

In speech

Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.^[2] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At run time, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
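The search for the best chain of candidate units is commonly framed as minimising a sum of target costs (how well a unit matches the requested specification) and join costs (how smoothly adjacent units connect). The sketch below uses a Viterbi-style dynamic programme rather than the decision tree mentioned above. The `Unit` fields, both cost functions, and the weights are illustrative assumptions, not any particular system's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str       # phone label of the recorded unit
    pitch: float     # mean fundamental frequency in Hz (hypothetical feature)
    duration: float  # duration in seconds (hypothetical feature)


def target_cost(spec, unit, w_pitch=1.0, w_dur=100.0):
    """Mismatch between the requested spec and a candidate (assumed weights)."""
    return (w_pitch * abs(spec.pitch - unit.pitch)
            + w_dur * abs(spec.duration - unit.duration))


def join_cost(left, right):
    """Penalty for a pitch jump at the concatenation point (assumed measure)."""
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Return the lowest-cost chain of database units, one per target."""
    if not targets:
        return []
    # Candidates for each target are the database units with the same phone label.
    candidates = [[u for u in database if u.phone == t.phone] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for some target phone")
    # best[i][j] = (cost of the cheapest chain ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    return chain[::-1]
```

Here the targets reuse the `Unit` type as a specification of the desired phone, pitch, and duration. Raising the join cost relative to the target cost favours smoother chains over closer individual matches.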

Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.^[3] Unit selection algorithms have also been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.^[4] Researchers have proposed various automated methods to detect unnatural segments in unit-selection speech synthesis systems.^[5]
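The light smoothing at the concatenation point can be as simple as a short linear crossfade between the end of one unit and the start of the next. The sketch below assumes units are lists of float samples at the same sample rate. The function name and the overlap length are hypothetical, and real systems may use pitch-synchronous methods instead.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must fit inside both units")
    if overlap == 0:
        return list(left) + list(right)
    head = left[:len(left) - overlap]
    tail = right[overlap:]
    blended = []
    for i in range(overlap):
        # The weight moves from the left unit towards the right unit.
        w = (i + 1) / (overlap + 1)
        a = left[len(left) - overlap + i]
        b = right[i]
        blended.append((1.0 - w) * a + w * b)
    return list(head) + blended + list(tail)
```

The result is shorter than the two inputs combined by `overlap` samples, so a system that must preserve timing would need to account for that.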

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA,^[6] or MBROLA,^[7] or more recent techniques such as pitch modification in the source domain using the discrete cosine transform.^[8] Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations. An early example of diphone synthesis is a teaching robot, Leachim, invented by Michael J. Freeman.^[9] Leachim contained information about the class curriculum and certain biographical information about the students it was programmed to teach.^[10] It was tested in a fourth-grade classroom in the Bronx, New York.^[11]^[12]
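Because a diphone spans the transition from one sound to the next, a phone sequence of length n, padded with silence at both ends, maps to n + 1 diphones. The sketch below shows that mapping. The `_` silence symbol and the phone labels are illustrative assumptions rather than any specific inventory.

```python
def to_diphones(phones, silence="_"):
    """Map a phone sequence to the diphones needed to synthesise it."""
    # Pad with silence so the utterance-initial and utterance-final
    # transitions are covered as well.
    padded = [silence, *phones, silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]
```

Each returned label would then be looked up in the diphone database, which holds exactly one recording per label.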

Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.^[13] The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
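A talking clock illustrates how little logic such a system needs: the time is mapped to a sequence of prerecorded clip names that are played back to back. The sketch below is a hypothetical example. The clip names and the phrasing are assumptions, and audio playback is left out.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def number_clips(n):
    """Clip names for a number from 0 to 59."""
    if n < 20:
        return [ONES[n]]
    tens, ones = divmod(n, 10)
    clips = [TENS[tens * 10]]
    if ones:
        clips.append(ONES[ones])
    return clips


def time_clips(hour, minute):
    """Clip sequence for a 12-hour announcement of a 24-hour clock time."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("hour must be 0-23 and minute 0-59")
    clips = ["the_time_is"] + number_clips(hour % 12 or 12)
    if minute == 0:
        clips.append("o_clock")
    else:
        if minute < 10:
            clips.append("oh")  # e.g. "two oh five"
        clips += number_clips(minute)
    clips.append("am" if hour < 12 else "pm")
    return clips
```

The whole vocabulary here is fewer than thirty recordings, which is why such systems can sound natural within their domain yet cannot say anything outside it.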

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word begins with a vowel (e.g. "clear out" is realized as /ˌklɪəɹˈʌʊt/). Likewise in French, many final consonants are no longer silent when followed by a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
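One way to add the required context sensitivity is to record two variants of each affected word and choose between them by looking at the next word. The sketch below is a deliberately crude illustration for the linking "r" case. The variant naming scheme, the word list, and the spelling-based vowel test are all assumptions; a real system would consult a pronunciation lexicon rather than spelling.

```python
# Hypothetical set of words recorded in two variants (with and without "r").
LINKING_R_WORDS = {"clear", "far", "four"}
VOWEL_LETTERS = set("aeiou")


def choose_clips(words):
    """Pick a recorded variant for each lowercase word based on the next word."""
    clips = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else ""
        # Spelling is only a rough stand-in for "begins with a vowel sound".
        if word in LINKING_R_WORDS and nxt[:1] in VOWEL_LETTERS:
            clips.append(word + "_linking_r")
        else:
            clips.append(word)
    return clips
```

Every such rule multiplies the number of recordings needed, which is the additional complexity the paragraph above refers to.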

In music

Concatenative synthesis for music started to develop in the 2000s, in particular through the work of Schwarz^[14] and Pachet^[15] (so-called musaicing). The basic techniques are similar to those for speech, with differences that follow from the differing nature of speech and music: for example, the segmentation is not into phonetic units but often into subunits of musical notes or events.^[1]^[14]^[16]
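In its simplest form, descriptor-driven selection for music reduces to a nearest-neighbour search: each segment of a target is replaced by the corpus unit whose descriptors lie closest. The sketch below assumes each unit has already been analysed into a numeric descriptor tuple. The descriptor choice, the Euclidean distance, and the unit names are illustrative assumptions, and the sketch ignores continuity between consecutive units.

```python
import math


def nearest_unit(target, corpus):
    """Name of the corpus unit whose descriptor tuple is closest to `target`.

    `corpus` maps a unit name to its descriptor tuple. Descriptors are assumed
    to be on comparable scales; in practice they would be normalised first.
    """
    return min(corpus, key=lambda name: math.dist(corpus[name], target))


def mosaic(targets, corpus):
    """Replace each target descriptor with the closest corpus unit."""
    return [nearest_unit(t, corpus) for t in targets]
```

For example, with a corpus of three units described by a hypothetical (brightness, loudness) pair, each target segment is mapped to whichever unit sits nearest in that two-dimensional space.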

Zero Point, the first full-length album by Rob Clouth (Mesh 2020), features self-made concatenative synthesis software called the 'Reconstructor' which "chops sampled sounds into tiny pieces and rearranges them to replicate a target sound. This allowed Clouth to use and manipulate his own beatboxing, a technique used on 'Into' and 'The Vacuum State'."^[17] Clouth's concatenative synthesis algorithm was adapted from 'Let It Bee — Towards NMF-Inspired Audio Mosaicing' by Jonathan Driedger, Thomas Prätzlich, and Meinard Müller.^[18]^[19]

References

[1] Schwarz, D. (2005), "Current research in Concatenative Sound Synthesis" (PDF), Proceedings of the International Computer Music Conference (ICMC)
[2] Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002.
[3] John Kominek and Alan W. Black. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
[4] Julia Zhang. Language Generation and Speech Synthesis in Dialogues for Language Learning, masters thesis, Section 5.6 on page 54.
[5] William Yang Wang and Kallirroi Georgila. (2011). Automatic Detection of Unnatural Word-Level Segments in Unit-Selection Speech Synthesis, IEEE ASRU 2011.
[6] "Pitch-Synchronous Overlap and Add (PSOLA) Synthesis". Archived from the original on February 22, 2007. Retrieved 2008-05-28.
[7] T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. The MBROLA Project: Towards a set of high quality speech synthesizers of use for non commercial purposes. ICSLP Proceedings, 1996.
[8] Muralishankar, R; Ramakrishnan, A.G.; Prathibha, P (2004). "Modification of Pitch using DCT in the Source Domain". Speech Communication. 42 (2): 143–154. doi:10.1016/j.specom.2003.05.001.
[9] "Education: Marvel of The Bronx". Time. 1974-04-01. ISSN 0040-781X. Retrieved 2019-05-28.
[10] "1960 - Rudy the Robot - Michael Freeman (American)". cyberneticzoo.com. 2010-09-13. Retrieved 2019-05-23.
[11] LLC, New York Media (1979-07-30). New York Magazine. New York Media, LLC.
[12] The Futurist. World Future Society. 1978. pp. 359, 360, 361.
[13] L.F. Lamel, J.L. Gauvain, B. Prouts, C. Bouhier, R. Boesch. Generation and Synthesis of Broadcast Messages, Proceedings ESCA-NATO Workshop and Applications of Speech Technology, September 1993.
[14] Schwarz, Diemo (2004-01-23), Data-Driven Concatenative Sound Synthesis, retrieved 2010-01-15
[15] Zils, A.; Pachet, F. (2001), "Musical Mosaicing" (PDF), Proceedings of the COST G-6 Conference on Digital Audio Effects (DaFx-01), University of Limerick, pp. 39–44, archived from the original (PDF) on 2011-09-27, retrieved 2011-04-27
[16] Maestre, E. and Ramírez, R. and Kersten, S. and Serra, X. (2009), "Expressive Concatenative Synthesis by Reusing Samples from Real Performance Recordings", Computer Music Journal, vol. 33, no. 4, pp. 23–42, CiteSeerX 10.1.1.188.8860, doi:10.1162/comj.2009.33.4.23, S2CID 1078610
[17] "Zero Point, by Rob Clouth". Rob Clouth. Retrieved 2022-07-23.
[18] Sónar+D CCCB 2020 Talk: "Journey to the Center of the Musical Brain", retrieved 2022-07-23
[19] "AudioLabs - Let it Bee - Towards NMF-inspired Audio Mosaicing". www.audiolabs-erlangen.de. Retrieved 2022-07-23.
