Abstract:
Computer-implemented methods for speech synthesis are provided. A speech synthesizer may be trained to generate synthesized audio data that corresponds to words uttered by a source speaker according to speech characteristics of a target speaker. The speech synthesizer may be trained using time-stamped phoneme sequences, pitch contour data, and speaker identification data. The speech synthesizer may include a voice modeling neural network and a conditioning neural network.
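As an illustration only, the sketch below shows one way time-stamped phoneme sequences could be expanded into per-frame labels so that they line up with frame-level pitch contour data. The abstract does not describe this step; the 10 ms frame hop, the "sil" padding label, and the names `TimedPhoneme` and `frame_labels` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TimedPhoneme:
    """Hypothetical record for one time-stamped phoneme (times in seconds)."""
    symbol: str
    start: float
    end: float


def frame_labels(phonemes, num_frames, hop=0.01, pad="sil"):
    """Expand time-stamped phonemes into one label per frame.

    `hop` is an assumed frame spacing in seconds; frames not covered
    by any phoneme keep the assumed padding label `pad`.
    """
    labels = [pad] * num_frames
    for p in phonemes:
        first = max(0, int(round(p.start / hop)))
        last = min(int(round(p.end / hop)), num_frames)
        for i in range(first, last):
            labels[i] = p.symbol
    return labels


phonemes = [TimedPhoneme("HH", 0.00, 0.02), TimedPhoneme("AY", 0.02, 0.05)]
labels = frame_labels(phonemes, num_frames=6)
```

With labels in this form, each frame could be paired with a pitch value and a speaker identifier before being passed to a model, though the abstract leaves the actual input layout unspecified.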
Abstract:
Methods, systems, and computer program products for network-based processing and distribution of multimedia content of a live performance are disclosed. In some implementations, recording devices can be configured to record a multimedia event (e.g., a musical performance). The recording devices can provide the recordings to a server while the event is ongoing. The server automatically synchronizes, mixes, and masters the recordings, using reference audio data previously captured during a rehearsal for the automatic mixing and mastering. The server streams the mastered recording to multiple end users over the Internet or another public or private network. The streaming can be live streaming.
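The abstract does not say how the server synchronizes the recordings. One common approach, shown here purely as an assumed sketch, is to estimate the sample offset between two recordings from the peak of their cross-correlation. The function names `estimate_lag` and `align` are hypothetical.

```python
import numpy as np


def estimate_lag(reference, other):
    """Return how many samples `other` lags behind `reference`.

    A positive result means the same sound appears later in `other`.
    """
    ref = np.asarray(reference, dtype=float)
    oth = np.asarray(other, dtype=float)
    # Full cross-correlation; output index i corresponds to lag i - (len(ref) - 1).
    corr = np.correlate(oth, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)


def align(reference, other):
    """Shift `other` so that it starts at the same instant as `reference`."""
    lag = estimate_lag(reference, other)
    oth = np.asarray(other, dtype=float)
    if lag >= 0:
        return oth[lag:]                           # drop the extra leading samples
    return np.concatenate([np.zeros(-lag), oth])   # pad the missing leading samples


rng = np.random.default_rng(0)
ref = rng.standard_normal(200)
late = np.concatenate([np.zeros(5), ref])  # same signal, starting 5 samples late
lag = estimate_lag(ref, late)
```

Direct cross-correlation costs time proportional to the product of the two lengths, so a real system handling long recordings would more likely use an FFT-based correlation or match on compact audio features.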
Abstract:
A spherical image of a spatial environment is received. The spherical image contains spherically arranged pixel values indexed by a time value and is represented in a content creation coordinate system in reference to a spatial position in the spatial environment. The spatial position is indexed by the time value. A spatial relationship is determined between the content creation coordinate system and a spherical image reference coordinate system. Based at least in part on the spatial relationship and the spherically arranged pixel values, spherical distributions of image metadata are determined for the spherical image.
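To make the idea of a spatial relationship between two coordinate systems concrete, the sketch below assumes that the relationship is a pure rotation and maps a viewing direction, given as latitude and longitude in radians, from the content creation coordinate system into the reference coordinate system. The abstract does not restrict the relationship to a rotation, and the function names are hypothetical.

```python
import numpy as np


def sph_to_vec(lat, lon):
    """Unit vector for a direction given by latitude and longitude (radians)."""
    return np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])


def vec_to_sph(v):
    """Latitude and longitude (radians) of a non-zero 3-vector."""
    x, y, z = v / np.linalg.norm(v)
    return float(np.arcsin(np.clip(z, -1.0, 1.0))), float(np.arctan2(y, x))


def yaw_rotation(angle):
    """Rotation matrix about the vertical (z) axis, used as an example relationship."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_reference(lat, lon, rotation):
    """Map a direction from the content creation frame into the reference frame."""
    return vec_to_sph(rotation @ sph_to_vec(lat, lon))


# A camera yawed by 0.5 rad shifts longitude by 0.5 rad and leaves latitude unchanged.
ref_lat, ref_lon = to_reference(0.2, 0.1, yaw_rotation(0.5))
```

Once every pixel direction is expressed in the reference frame, per-region statistics of the pixel values could be accumulated on the sphere, which is one possible reading of "spherical distributions of image metadata".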