Initially introduced in late 1998, MPEG-4 absorbed many features of its predecessors and expanded upon them. The set of technologies associated with MPEG-4 benefits authors by enabling them to produce content with far greater reusability, network service providers by giving them transparent information about the content, and end users by bringing higher levels of interaction with content.
MPEG-4 achieves these goals by representing units of aural, visual, or audiovisual content of natural or synthetic origin. MPEG-4 audiovisual scenes are composed of still images, video objects, and audio objects.

Please note that there is no MPEG-3 standard. As with other MPEG standards, the first three parts of MPEG-4 cover Systems, Visual, and Audio. There are currently 30 parts in total. Many of these standard documents have amendments, which make technical changes to the existing documents, and corrigenda, which correct editorial errors.
This standard has around forty amendments and four corrigenda.

Furthermore, the MPEG-4 audio tools include effects processing and 3-D localisation of sound, allowing the creation of artificial sound environments using both artificial and natural sources.
Synthetic audio is described by first defining a set of 'instrument' modules that can create and process audio signals under the control of a script or score file. An instrument is a small network of signal processing primitives that can emulate the effects of a natural acoustic instrument.
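To make this concrete, here is a rough sketch in Python rather than in MPEG-4's own Structured Audio Orchestra Language (whose syntax is quite different): a toy 'instrument' assembled from two hypothetical signal-processing primitives, a sine oscillator and a decaying envelope.

    import math

    SR = 16000  # sampling rate in Hz (arbitrary choice for this sketch)

    def sine_osc(freq_hz, n_samples):
        """Primitive 1: a fixed-frequency sine oscillator."""
        return [math.sin(2 * math.pi * freq_hz * n / SR) for n in range(n_samples)]

    def exp_env(n_samples, decay=5.0):
        """Primitive 2: an exponentially decaying amplitude envelope."""
        return [math.exp(-decay * n / n_samples) for n in range(n_samples)]

    def pluck_instrument(freq_hz, dur_s):
        """A tiny 'instrument': a small network of the two primitives above,
        crudely imitating a plucked string's attack-and-decay shape."""
        n = int(dur_s * SR)
        tone = sine_osc(freq_hz, n)
        env = exp_env(n)
        return [t * e for t, e in zip(tone, env)]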
A script or score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their output to an overall music performance. Other instruments, serving the function of effects processors (reverberators, spatialisers, mixers), can be similarly invoked to receive and process the outputs of the performing instruments.
These actions can not only realise a music composition but can also organise any other kind of audio, such as speech, sound effects and general ambience. Likewise, the audio sources can themselves be natural sounds, perhaps emanating from an audio channel decoder, thus enabling synthetic and natural sources to be merged with complete timing accuracy.
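Continuing the hypothetical Python sketch above (again, this is not MPEG-4 score syntax, only an illustration of the control flow), a score can be modelled as a list of timed commands that invoke instruments, followed by a toy effects stage and the mixing-in of a natural, already-decoded audio track on the same sample clock.

    # Score: (start time in seconds, instrument, frequency in Hz, duration in s).
    score = [
        (0.0, pluck_instrument, 440.0, 0.5),
        (0.5, pluck_instrument, 330.0, 0.5),
        (1.0, pluck_instrument, 220.0, 1.0),
    ]

    def render(score, total_s, natural_track=None):
        """Render the score into one buffer, apply a toy echo 'effect',
        and mix in a natural audio track with sample-accurate alignment."""
        out = [0.0] * int(total_s * SR)
        for start_s, instrument, freq, dur in score:
            note = instrument(freq, dur)
            offset = int(start_s * SR)          # exact sample position
            for i, s in enumerate(note):
                if offset + i < len(out):
                    out[offset + i] += s
        # Toy effects processor: a single echo 100 ms later at half gain.
        delay = int(0.1 * SR)
        for i in range(len(out) - 1, delay - 1, -1):
            out[i] += 0.5 * out[i - delay]
        # Mix in a natural source (e.g. a decoded speech buffer), if any.
        if natural_track:
            for i, s in enumerate(natural_track[:len(out)]):
                out[i] += s
        return out

    mix = render(score, total_s=2.5)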
Text-to-speech (TTS) is becoming a rather common interface and plays an important role in various multimedia application areas. For instance, by using TTS functionality, multimedia content with narration can easily be composed without recording natural speech. MPEG-4's extended TTS interface can utilize prosodic information from natural speech in addition to the input text, and can generate synthetic speech of much higher quality.
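As a purely illustrative sketch (the real MPEG-4 text-to-speech interface defines a bitstream syntax, not a Python API, and every name below is hypothetical), the structure carries the input text together with optional prosodic parameters and fills in whatever is missing by rule, which is also the idea behind the scalability described next.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class TTSRequest:
        """Hypothetical container for one utterance: the text is mandatory,
        prosody taken from natural speech is optional."""
        text: str
        phoneme_durations_ms: Optional[List[int]] = None  # per-phoneme durations
        pitch_contour_hz: Optional[List[float]] = None    # F0 values over time

    def complete_prosody(req: TTSRequest) -> TTSRequest:
        """Fill in any missing prosodic parameters by rule, so the synthesiser
        always receives a complete parameter set."""
        n = max(1, len(req.text.split())) * 4   # crude proxy for phoneme count
        if req.phoneme_durations_ms is None:
            req.phoneme_durations_ms = [90] * n           # flat default timing
        if req.pitch_contour_hz is None:
            # Simple declination rule: pitch drifts downward over the utterance.
            req.pitch_contour_hz = [120.0 - 20.0 * i / n for i in range(n)]
        return req

    req = complete_prosody(TTSRequest(text="MPEG-4 adds prosody to TTS input"))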
The interface and its bitstream format are strongly scalable; for example, if some parameters of prosodic information are not available, the missing parameters are generated by rule.

In order to achieve the highest audio quality within the full range of bitrates and at the same time provide the extra functionalities, three types of coder have been defined: parametric coding for the lowest bitrates (roughly 2 to 6 kbit/s), Code Excited Linear Prediction (CELP) coding for medium bitrates (roughly 6 to 24 kbit/s), and time/frequency (T/F) coding for bitrates starting at about 16 kbit/s. In the CELP region, two sampling rates, 8 and 16 kHz, are used to support a broader range of audio signals in addition to speech.
The audio signals in the T/F region typically have bandwidths starting at 8 kHz. A number of functionalities are provided to facilitate a wide variety of applications, ranging from intelligible speech to high-quality multichannel audio. Examples of these functionalities are speed control, pitch change, error resilience, and scalability in terms of bitrate, bandwidth, error robustness, complexity, etc.
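To make the bitrate-scalability idea concrete, here is a heavily simplified sketch (not the actual MPEG-4 scalable coding tools): a coarse base layer that can be decoded on its own, plus an enhancement layer that refines it when the extra bits are available.

    def encode_scalable(samples, base_step=0.25, enh_step=0.05):
        """Split each sample into a coarsely quantised base layer and an
        enhancement layer that encodes the remaining error."""
        base = [round(s / base_step) for s in samples]
        enh = [round((s - b * base_step) / enh_step) for s, b in zip(samples, base)]
        return base, enh

    def decode_scalable(base, enh=None, base_step=0.25, enh_step=0.05):
        """Decode from the base layer alone (lower bitrate) or from both
        layers (higher bitrate, higher fidelity)."""
        out = [b * base_step for b in base]
        if enh is not None:
            out = [o + e * enh_step for o, e in zip(out, enh)]
        return out

    signal = [0.10, 0.42, -0.37, 0.88]
    base, enh = encode_scalable(signal)
    coarse = decode_scalable(base)        # usable on its own, lower quality
    fine = decode_scalable(base, enh)     # closer to the original signal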