Diphone synthesis is one of the most popular methods used for creating a synthetic voice from recordings or samples of a particular person; it can capture a good deal of the acoustic quality of an individual, within some limits. The rationale for using a diphone, which consists of two adjacent half-phones, is that the ``center'' of a phonetic realization is the most stable region, whereas the transition from one ``segment'' to another contains the most interesting phenomena and is thus the hardest to model. The diphone, then, cuts the units at the points of relative stability, rather than at the volatile phone-phone transition, where so-called coarticulatory effects appear.
There is clearly a simplifying assumption here: that all relevant phonetic realizations can be enumerated, and that by simply collecting all of the phone-phone transitions, any possible sequence of speech sounds in the target language can be produced. Thus, with a 44-phone inventory, one could collect a 44 * 44 = 1936 diphone inventory and create a synthesizer that could speak anything, given the imposition of appropriate prosody (intonation, duration, and shift in spectral quality) as determined by other modules in a general-purpose synthesizer.
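The counting argument can be illustrated with a minimal Python sketch. It is an illustration only, not the method of any particular synthesizer: the function names, the placeholder phone labels, the `a-b` naming of diphones, and the `pau` silence symbol used to pad the ends of an utterance are all assumptions made for this example.

```python
from itertools import product


def diphone_inventory(phones):
    """Return every ordered phone-phone pair, including a phone followed by itself."""
    return [f"{a}-{b}" for a, b in product(phones, repeat=2)]


def to_diphones(phone_sequence, silence="pau"):
    """Split a phone sequence into the diphones needed to synthesize it.

    The sequence is padded with a (hypothetical) silence symbol so that the
    first and last phones each get a transition from and to silence.
    """
    padded = [silence] + list(phone_sequence) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Placeholder labels p0..p43 stand in for a real 44-phone set.
    phones = [f"p{i}" for i in range(44)]
    print(len(diphone_inventory(phones)))  # 44 * 44 = 1936

    # Illustrative phone labels for a three-phone word.
    print(to_diphones(["k", "ae", "t"]))
```

Under these assumptions, a three-phone word is covered by four diphones, each spanning one transition and cut in the stable middle of its two phones. The full 1936-entry list is an upper bound: it assumes every pair can actually occur in the language.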