A second important aspect of synthesis is to change the basic model of when things happen. Both Flite and Festival synthesize utterance by utterance: a whole structure is built for each utterance, which takes up space and, of course, time. The time aspect is not usually a problem when multiple utterances are being generated, but it does affect how quickly the system can start speaking, which is especially important when synthesis is used in a dialog system.
There is also the issue of how much text is required before synthesis can reasonably start; that is, how much context is required to synthesize. This is an interesting question that deserves study, though at present we have taken only an engineering view of it.
As most of the work, and most of the memory requirement, lies in building the waveform from the LPC parameters, we can make that function aware of where it is to send the data. It can then use short buffers and write them directly to the audio device, or through a socket to some player elsewhere. This would reduce the memory requirement significantly, and would also make the time to first audio much less affected, in absolute terms, by the size of the utterance.
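The buffering scheme described above can be sketched as a waveform generator that flushes each short buffer to a consumer callback as soon as it is filled, rather than allocating one waveform the size of the whole utterance. This is a minimal illustration, not Flite's actual API: the `audio_consumer` type, the buffer size, and the ramp standing in for LPC resynthesis are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical consumer: called with each short buffer of samples as soon
 * as it is filled.  In a real system this would write to the audio device
 * or to a socket connected to a player elsewhere. */
typedef void (*audio_consumer)(const short *buf, size_t n, void *userdata);

#define STREAM_BUF_SAMPLES 256  /* short buffer: latency vs. overhead trade-off */

/* Generate `total` samples (a dummy ramp stands in for LPC resynthesis),
 * flushing to the consumer every STREAM_BUF_SAMPLES samples so audio can
 * start before the whole utterance has been rendered. */
static void stream_synthesis(size_t total, audio_consumer out, void *userdata)
{
    short buf[STREAM_BUF_SAMPLES];
    size_t fill = 0;
    for (size_t i = 0; i < total; i++) {
        buf[fill++] = (short)(i & 0x7fff);   /* placeholder sample value */
        if (fill == STREAM_BUF_SAMPLES) {
            out(buf, fill, userdata);        /* hand off; playback may begin */
            fill = 0;
        }
    }
    if (fill > 0)
        out(buf, fill, userdata);            /* flush the final partial buffer */
}
```

Because only one `STREAM_BUF_SAMPLES`-sized buffer is live at a time, peak memory no longer grows with utterance length, and the first buffer reaches the consumer after a fixed amount of work.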
The next obvious improvement for streaming synthesis is to synthesize prosodic phrases as chunks rather than whole utterances (which are closer to sentences); the particular application will make a difference here. In cases where speed and size are paramount, utterances are usually quite short anyway, so it may not be worth it. Longer utterances are more common in flowing text: web pages, novels, and the like.
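Phrase-level chunking can be sketched as follows. This is a deliberately crude stand-in that splits the input at punctuation; a real system would use phrase breaks predicted by the prosody module, and the function name here is illustrative, not part of Flite or Festival.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Crude stand-in for prosodic phrase detection: advance past the next
 * comma, semicolon, or sentence-final punctuation mark (and any trailing
 * spaces) so each chunk can be synthesized, and start playing, before the
 * rest of the text is processed.  Returns the index one past the chunk. */
static size_t next_phrase(const char *text, size_t start, size_t len)
{
    size_t i = start;
    while (i < len) {
        char c = text[i++];
        if (c == ',' || c == ';' || c == '.' || c == '?' || c == '!')
            break;
    }
    while (i < len && text[i] == ' ')   /* absorb whitespace after the break */
        i++;
    return i;
}
```

A driver loop would repeatedly call `next_phrase`, synthesize `text[pos..end)`, and queue the audio, overlapping synthesis of one phrase with playback of the previous one.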