Abstract:
An “adaptive audio playback controller” decodes received packets of an audio signal and reads them into a signal buffer. Samples of the decoded audio signal are then played out of the signal buffer according to the needs of a player device. Jitter control and packet loss concealment are accomplished by continuously analyzing buffer content in real time and determining, as a function of that content, whether to play the buffer contents unmodified, compress them, stretch them, or apply packet loss concealment for overly delayed or lost packets. The adaptive audio playback controller also determines where to stretch or compress particular frames or signal segments in the signal buffer, and by how much, in order to optimize perceived playback quality.
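The per-request decision described in this abstract can be illustrated with a minimal sketch. The abstract gives no thresholds or decision rule, so the names `choose_action`, `low_ms`, and `high_ms` and their default values are hypothetical, chosen only to show the shape of a buffer-driven choice between the four actions.

```python
from enum import Enum


class Action(Enum):
    PLAY = "play"          # unmodified playback
    COMPRESS = "compress"  # buffer too full: shorten audio to cut delay
    STRETCH = "stretch"    # buffer running low: lengthen audio to buy time
    CONCEAL = "conceal"    # nothing usable: synthesize for late/lost packet


def choose_action(buffered_ms: float, needed_ms: float,
                  low_ms: float = 40.0, high_ms: float = 120.0) -> Action:
    """Pick a playback action from the current buffer level.

    All thresholds are illustrative assumptions, not values from the source.
    """
    if buffered_ms < needed_ms:
        # Not enough decoded audio to satisfy the player's request.
        return Action.CONCEAL
    if buffered_ms < low_ms:
        return Action.STRETCH
    if buffered_ms > high_ms:
        return Action.COMPRESS
    return Action.PLAY
```

For example, `choose_action(30, 20)` selects stretching under these assumed thresholds, because the buffer can serve the request but is below the low-water mark.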
Abstract:
A “Client-Side Echo Canceller” provides a unique system and method for reducing Multipoint Control Unit (MCU) computational overhead in a multi-point audio conference. In general, the local audio input signal of each client is transmitted in real-time to the MCU. The MCU then combines the audio input signals of all clients to create a single composite signal that is transmitted back to all clients in real-time. Each client then locally processes the composite signal to remove each client's local contribution to the composite signal prior to local playback in order to eliminate a local echo of each client's local audio input. In various embodiments, local cancellation of the local audio input from the composite signal is performed on either a time domain or a transform domain representation of the composite signal. Further, since each client receives the same signal, MCU transmission bandwidth can be reduced via multicast transmissions.
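A time-domain version of the local cancellation step might look like the sketch below. It assumes an idealized case: the MCU mixes by plain sample-wise addition, and the client knows the round-trip `delay` (in samples) and `gain` of its own contribution. The function names and both parameters are hypothetical; a real system would also have to account for codec distortion.

```python
def mix(streams):
    """MCU side: sum equal-length client streams sample by sample."""
    return [sum(samples) for samples in zip(*streams)]


def remove_local(composite, local, delay=0, gain=1.0):
    """Client side: subtract this client's own contribution from the mix.

    `delay` and `gain` are assumed known here; estimating them is outside
    the scope of this sketch.
    """
    out = []
    for n, sample in enumerate(composite):
        k = n - delay
        own = local[k] if 0 <= k < len(local) else 0.0
        out.append(sample - gain * own)
    return out
```

Because every client receives the same composite, the MCU performs one mix in total, and each client pays only for a single subtraction pass.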
Abstract:
A decentralized computer network architecture and method gathers metadata from local and remote clients and, based on that metadata, decides locally whether to send a packet over the network. Each client listens to what other clients are doing and sends only when the total number of concurrent speakers is below a threshold. In a multi-party voice conferencing embodiment, the number of concurrent speakers is restricted to less than a certain number. The type of network topology used to connect the clients is flexible, as long as each client runs a peer-aware system that decides locally whether to send its packets. Because the architecture and method are distributed to run on each client, they suit a wide variety of network topologies (such as full-mesh, bridge-based, or a hybrid of the two).
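One way to picture the local send decision is a small gate that each client runs on its own. In this sketch, a peer counts as a concurrent speaker if a packet from it was heard within the last `hold_s` seconds. The class `SpeakerGate`, the `hold_s` window, and the use of last-heard timestamps as the "metadata" are all assumptions made for illustration.

```python
class SpeakerGate:
    """Per-client gate: send only while few enough peers are speaking."""

    def __init__(self, max_speakers: int, hold_s: float = 0.5):
        self.max_speakers = max_speakers  # threshold on concurrent speakers
        self.hold_s = hold_s              # assumed activity window, seconds
        self._last_heard = {}             # peer id -> time of last packet

    def observe(self, client_id: str, now: float) -> None:
        """Record that a voice packet from `client_id` arrived at `now`."""
        self._last_heard[client_id] = now

    def may_send(self, now: float) -> bool:
        """True if the number of recently active peers is below the limit."""
        active = sum(1 for t in self._last_heard.values()
                     if now - t <= self.hold_s)
        return active < self.max_speakers
```

Nothing in the gate depends on how packets reached the client, which is consistent with the abstract's point that the topology can be full-mesh, bridge-based, or a hybrid.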
Abstract:
A peer-aware voice stream ranking method makes decisions based on information about participants in a voice conference over a network. Whether to send a participant's own audio packet out on the network depends both on information about the participant's own voice packet and on the voice packets that the participant receives from other clients. A Voice Activity Score (VAS) is computed for each frame of a particular voice stream. The VAS includes a voiceness component, indicating the likelihood that the audio frame contains speech or voice, and an energy level component, indicating the ratio of current frame energy to the long-term average energy for the current speaker. Using the VAS from the participants, the method also ranks the client's voice stream against other clients' voice streams in the voice conference. If there are higher-ranking participants, the client's voice stream is not sent.
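The ranking step can be sketched as follows. The abstract names the two VAS components but not how they are combined, so multiplying them is purely an assumption. The `max_speakers` limit and the tie-break on client id are also hypothetical details added so the example is fully specified.

```python
def voice_activity_score(voiceness: float, frame_energy: float,
                         long_term_energy: float) -> float:
    """Combine the two VAS components (the product is an assumption)."""
    if long_term_energy <= 0:
        return 0.0
    energy_ratio = frame_energy / long_term_energy
    return voiceness * energy_ratio


def should_send(own_id: str, own_vas: float, peer_vas: dict,
                max_speakers: int) -> bool:
    """Send only if fewer than `max_speakers` peers rank above this client.

    Ties are broken by client id so all clients agree on one ordering.
    """
    higher = sum(1 for pid, vas in peer_vas.items()
                 if (vas, pid) > (own_vas, own_id))
    return higher < max_speakers
```

With `max_speakers=2`, a client whose score is below two peers' scores stays silent, while the two top-ranked clients keep sending.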
Abstract:
A method and system for modifying a digital audio signal to vary its playback speed while preserving the signal's pitch and quality. The variable speed playback (VSP) system and method mitigate artifacts remaining after processing by existing techniques. The VSP system and method produce a consistent and pleasing sound from an audio file, even while its speed is varied during playback. The VSP method includes selecting and estimating an input frame, adjusting the frame position, and overlapping and adding the adjusted frame to an output signal. The frame position adjustment is achieved using an enhanced correlation technique that finds all local maxima of a cross-correlation function. The local maximum having the highest correlation score is designated as the cut position, where the adjusted frame is cut from the input buffer. The VSP system and method use four input frames to generate one output frame.
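The cut-position search can be illustrated with a plain normalized cross-correlation. This is a minimal sketch, not the patent's "enhanced" technique, whose details the abstract does not give. The function names and the treatment of endpoints as candidate maxima are assumptions.

```python
import math


def normalized_xcorr(template, signal, lag):
    """Normalized cross-correlation of `template` against `signal` at `lag`."""
    segment = signal[lag:lag + len(template)]
    num = sum(t * s for t, s in zip(template, segment))
    den = math.sqrt(sum(t * t for t in template) *
                    sum(s * s for s in segment))
    return num / den if den > 0 else 0.0


def find_cut_position(template, signal):
    """Return the lag of the local maximum with the highest score."""
    n_lags = len(signal) - len(template) + 1
    if n_lags < 1:
        raise ValueError("signal must be at least as long as template")
    scores = [normalized_xcorr(template, signal, lag) for lag in range(n_lags)]
    # A lag is a local maximum if its score is not below either neighbor.
    maxima = [i for i, s in enumerate(scores)
              if (i == 0 or s >= scores[i - 1])
              and (i == n_lags - 1 or s >= scores[i + 1])]
    return max(maxima, key=lambda i: scores[i])
```

Collecting the local maxima explicitly, rather than taking the global maximum directly, leaves room for the additional selection criteria an enhanced method might apply to the candidates.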
Abstract:
An adaptive “temporal audio scaler” is provided for automatically stretching and compressing frames of audio signals received across a packet-based network. Prior to stretching or compressing segments of a current frame, the temporal audio scaler first computes a pitch period for each frame for sizing signal templates used for matching operations in stretching and compressing segments. Further, the temporal audio scaler also determines the type or types of segments comprising each frame. These segment types include “voiced” segments, “unvoiced” segments, and “mixed” segments which include both voiced and unvoiced portions. The stretching or compression methods applied to segments of each frame are then dependent upon the type of segments comprising each frame. Further, the amount of stretching and compression applied to particular segments is automatically variable for minimizing signal artifacts while still ensuring that an overall target stretching or compression ratio is maintained for each frame.
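The abstract does not say how the pitch period is computed. A common approach, used here purely as an assumed stand-in, is to pick the lag that maximizes the normalized autocorrelation of the frame. The function name and the `min_lag`/`max_lag` search bounds (in samples) are hypothetical.

```python
import math


def estimate_pitch_period(frame, min_lag, max_lag):
    """Estimate the pitch period (in samples) by normalized autocorrelation.

    Requires 1 <= min_lag <= max_lag < len(frame).
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        head, tail = frame[:-lag], frame[lag:]
        num = sum(x * y for x, y in zip(head, tail))
        den = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

The returned lag could then size the matching templates, and a low peak score might serve as a hint that a segment is unvoiced, although the abstract does not specify the classification rule.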