Abstract:
Described herein are audio capture systems and methods. One embodiment provides an audio capture system (1) including: microphones (9-11) positioned to capture respective audio signals from different directions or locations within an audio environment; and a mixing module (7) configured to mix the audio signals in accordance with a mixing control signal to produce an output audio mix, wherein, upon detection of vibration activity, the mixing control signal controls the mixing module (7) to selectively and temporarily modify one or more of the audio signals to reduce the presence of noise associated with the vibration activity in the output audio mix.
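A minimal sketch of the vibration-triggered ducking described above, assuming frame-based processing; the function name, the attenuation constant, and the hold time are illustrative assumptions rather than part of the embodiment.

# Vibration-aware mixing sketch (assumed frame-based processing).
import numpy as np

ATTENUATION_DB = -20.0      # assumed temporary attenuation during vibration
RELEASE_FRAMES = 10         # assumed number of frames to hold the attenuation

def mix_frames(channel_frames, vibration_flags, hold_state):
    """Mix one frame per microphone, temporarily ducking channels
    whose vibration detector has recently fired.

    channel_frames : list of 1-D numpy arrays, one per microphone
    vibration_flags: list of bools, True if vibration detected on that channel
    hold_state     : list of ints, remaining frames of attenuation per channel
    """
    out = np.zeros_like(channel_frames[0])
    lin_att = 10.0 ** (ATTENUATION_DB / 20.0)
    for i, frame in enumerate(channel_frames):
        if vibration_flags[i]:
            hold_state[i] = RELEASE_FRAMES
        gain = lin_att if hold_state[i] > 0 else 1.0
        hold_state[i] = max(0, hold_state[i] - 1)
        out += gain * frame
    return out / len(channel_frames), hold_state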
Abstract:
Systems and methods are described for determining the orientation of an external audio device in a video conference, which may be used to provide a congruent multimodal representation for a video conference. A camera of a video conferencing system may be used to detect a potential location of an external audio device within a room in which the video conferencing system is providing a video conference. Within the detected potential location, a visual pattern associated with the external audio device may be identified. Using the identified visual pattern, the video conferencing system may estimate an orientation of the external audio device, the orientation being used by the video conferencing system to provide spatial audio-video congruence to a far-end audience.
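A minimal sketch of one way an in-plane orientation could be estimated once a visual pattern has been located, assuming the pattern yields two reference keypoints in image coordinates; the keypoint detector itself is not shown, and the coordinates below are illustrative assumptions.

# Orientation estimate from two assumed pattern keypoints.
import math

def estimate_orientation_deg(keypoint_a, keypoint_b):
    """Return the in-plane orientation (degrees) of the line joining two
    pattern keypoints, measured against the image x-axis."""
    dx = keypoint_b[0] - keypoint_a[0]
    dy = keypoint_b[1] - keypoint_a[1]
    return math.degrees(math.atan2(dy, dx))

# Example: keypoints detected at (412, 300) and (530, 285) in the video frame.
print(estimate_orientation_deg((412, 300), (530, 285)))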
Abstract:
Systems and methods are described for detecting and remedying potential incongruence in a video conference. A camera of a video conferencing system may capture video images of a conference room. A processor of the video conferencing system may identify locations of a plurality of participants within an image plane of a video image. Using face and shape detection, a location of a center point of each identified participant's torso may be calculated. A region of congruence bounded by key parallax lines may be calculated, the key parallax lines being a subset of all parallax lines running through the center points of the identified participants. When the location of an audio device is not within the region of congruence, audio captured by the audio device may be adjusted to reduce the effects of incongruence when the captured audio is replayed at the far end of the video conference.
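A minimal sketch of testing whether an audio device location lies inside a region bounded by a set of lines in the image plane; how the key parallax lines are selected from the participant center points is not shown here, and the lines, reference point, and candidate location are illustrative assumptions.

# Point-in-region test against assumed bounding parallax lines.
import numpy as np

def side(line_pt, line_dir, pt):
    """Signed side of pt relative to the line through line_pt with direction line_dir."""
    d = np.asarray(line_dir, dtype=float)
    v = np.asarray(pt, dtype=float) - np.asarray(line_pt, dtype=float)
    return np.sign(d[0] * v[1] - d[1] * v[0])

def inside_region(bounding_lines, interior_ref, candidate):
    """True if candidate is on the same side of every bounding line as interior_ref."""
    return all(side(p, d, candidate) == side(p, d, interior_ref)
               for p, d in bounding_lines)

# Hypothetical key parallax lines (point, direction), an interior reference
# point inside the region of congruence, and a candidate device location.
lines = [((100, 200), (1, 0.1)), ((100, 400), (1, -0.1))]
print(inside_region(lines, interior_ref=(300, 300), candidate=(320, 310)))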
Abstract:
Example embodiments disclosed herein relate to the estimation of reverberant energy components from audio sources. A method of estimating a reverberant energy component from an active audio source (100) is disclosed. The method comprises determining a correspondence between the active audio source and a plurality of sample sources by comparing one or more spatial features of the active audio source with one or more spatial features of the plurality of sample sources, each of the sample sources being associated with an adaptive filtering model (101); obtaining an adaptive filtering model for the active audio source based on the determined correspondence (102); and estimating the reverberant energy component from the active audio source over time based on the adaptive filtering model (103). A corresponding system (800) and computer program product (900) are also disclosed.
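A minimal sketch of estimating a reverberant energy component with an adaptive filter, assuming the model obtained for the active source is a simple NLMS filter driven by the source's direct energy over time; the model selection by spatial-feature matching is not shown, and the filter length and step size are assumptions.

# NLMS-style adaptive tracking of reverberant energy (illustrative sketch).
import numpy as np

def nlms_reverb_estimate(direct_energy, observed_energy, taps=32, mu=0.1, eps=1e-8):
    """Track the part of observed_energy that can be explained as a filtered
    (delayed and decaying) version of the direct energy, per time index."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    reverb = np.zeros_like(observed_energy, dtype=float)
    for n in range(len(observed_energy)):
        buf = np.roll(buf, 1)
        buf[0] = direct_energy[n]
        est = w @ buf                                   # estimated reverberant energy
        err = observed_energy[n] - est
        w += mu * err * buf / (buf @ buf + eps)         # normalized LMS update
        reverb[n] = est
    return reverb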
Abstract:
Example embodiments disclosed herein relate to impulsive noise suppression. A method of impulsive noise suppression in an audio signal is disclosed. The method includes determining an impulsive noise related feature from a current frame of the audio signal. The method also includes detecting an impulsive noise in the current frame based on the impulsive noise related feature and, in response to detecting the impulsive noise in the current frame, applying a suppression gain to the current frame to suppress the impulsive noise. A corresponding system and computer program product for impulsive noise suppression in an audio signal are also disclosed.
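A minimal sketch of frame-wise impulsive noise suppression, assuming the impulsive-noise-related feature is simply the ratio of the current frame's energy to a slowly tracked background energy; the threshold and suppression gain values are assumptions.

# Frame-wise impulsive noise detection and suppression sketch.
import numpy as np

def suppress_impulses(frames, ratio_threshold=8.0, suppression_gain=0.1, alpha=0.95):
    """frames: 2-D array, one audio frame per row. Returns processed frames."""
    out = frames.astype(float).copy()
    background = np.mean(out[0] ** 2) + 1e-12
    for i, frame in enumerate(out):
        energy = np.mean(frame ** 2)
        if energy / background > ratio_threshold:       # impulsive noise detected
            out[i] = frame * suppression_gain           # apply suppression gain
        else:                                           # update background estimate
            background = alpha * background + (1 - alpha) * energy
    return out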
Abstract:
The present document relates to audio communication systems. In particular, the present document relates to the control of the level of audio signals within audio communication systems. A method for leveling a near-end audio signal (211) using a leveling gain (214) is described. The near-end audio signal (211) comprises a sequence of segments, wherein the sequence of segments comprises a current segment and one or more preceding segments. The method comprises determining a nuisance measure (416) which is indicative of an amount of aberrant voice activity within the sequence of segments of the near-end audio signal (211); and determining the leveling gain (214) for the current segment of the near-end audio signal (211), at least based on the leveling gain (214) for the one or more preceding segments of the near-end audio signal (211), and by taking into account, to a variable degree, an estimate of the level of the current segment of the near-end audio signal (211), wherein the variable degree is dependent on the nuisance measure (416).
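A minimal sketch of the nuisance-dependent leveling update, assuming the gain for the current segment is a smoothed step from the previous gain toward the gain implied by the current level estimate, with the step size shrunk when the nuisance measure is high; the target level, the mapping from nuisance to step size, and the constants are illustrative assumptions.

# Nuisance-dependent leveling gain update (illustrative sketch).
def update_leveling_gain(prev_gain_db, segment_level_db, nuisance,
                         target_level_db=-26.0, max_step=1.0):
    """nuisance in [0, 1]: values near 1 indicate likely aberrant voice
    activity, so the current segment's level estimate is largely ignored."""
    desired_gain_db = target_level_db - segment_level_db
    degree = max_step * (1.0 - nuisance)     # variable degree of adaptation
    return prev_gain_db + degree * (desired_gain_db - prev_gain_db)

# Example: previous gain 3 dB, segment measured at -30 dB, low nuisance.
print(update_leveling_gain(3.0, -30.0, nuisance=0.2))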
Abstract:
A system and method for initiating conference calls with external devices are disclosed. Call participants are sent a conference invitation and conference information regarding the designated conference call. This conference information is stored on each participant's external device. When a participant arrives at a conference call location having a conferencing device, the conferencing device is capable of communicating with the external device, initiating communications and exchanging conference information. If the participant is verified and/or authorized, the IP address of the conferencing device may be sent to the conference system to initiate the conference call. In one embodiment, the conferencing device uses an ultrasound acoustic communication band to initiate the call with the external device on a semi-automated basis. An acoustic signature comprising a pilot sequence for communications synchronization may be generated to facilitate the call. Audible and aesthetic acoustic protocols may also be employed.
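A minimal sketch of generating an acoustic signature containing a pilot sequence in an ultrasonic band, assuming a 48 kHz sample rate, a carrier near 20 kHz, and BPSK modulation of a fixed pilot bit pattern; all of these values are illustrative assumptions, not the protocol described in the abstract.

# Ultrasonic pilot sequence generation (illustrative sketch).
import numpy as np

FS = 48000            # assumed sample rate (Hz)
CARRIER_HZ = 20000    # assumed ultrasonic carrier
BIT_SAMPLES = 480     # 10 ms per pilot bit

def ultrasonic_pilot(bits=(1, 0, 1, 1, 0, 0, 1, 0)):
    """BPSK-modulate the pilot bits onto an ultrasonic carrier."""
    symbols = np.repeat([1.0 if b else -1.0 for b in bits], BIT_SAMPLES)
    t = np.arange(len(symbols)) / FS
    return symbols * np.sin(2 * np.pi * CARRIER_HZ * t)

pilot = ultrasonic_pilot()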
Abstract:
A method for measuring the level of speech determined by an audio signal, in a manner which corrects for and reduces the effect of modification of the signal by the addition of noise thereto and/or amplitude compression thereof, and a system configured to perform any embodiment of the method, are described. In some embodiments, the method includes steps of: generating frequency-banded, frequency-domain data indicative of an input speech signal; determining from the data a Gaussian parametric spectral model of the speech signal; determining from the parametric spectral model an estimated mean speech level and a standard deviation value for each frequency band of the data; and generating speech level data indicative of a bias-corrected mean speech level for each frequency band, including using at least one correction value to correct the estimated mean speech level for the frequency band, where each correction value has been predetermined using a reference speech model.
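A minimal sketch of applying predetermined correction values to per-band mean speech levels, assuming the Gaussian model fit has already produced a mean and standard deviation per band and that corrections indexed by the observed standard deviation were precomputed from a reference speech model; the table values below are placeholders, not measured corrections.

# Bias correction of per-band mean speech levels (illustrative sketch).
import numpy as np

def bias_corrected_levels(mean_db, std_db, correction_table):
    """mean_db, std_db: per-band values from the Gaussian parametric model.
    correction_table: maps a rounded standard deviation (dB) to a correction (dB)."""
    corrected = np.array(mean_db, dtype=float)
    for band, sigma in enumerate(std_db):
        key = int(round(sigma))
        corrected[band] += correction_table.get(key, 0.0)
    return corrected

# Placeholder table: larger observed spread -> larger upward correction.
table = {4: 0.5, 6: 1.0, 8: 2.0}
print(bias_corrected_levels([-32.0, -35.0], [4.2, 7.9], table))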
Abstract:
In some embodiments, a method for modifying noise captured at endpoints of a teleconferencing system includes steps of capturing noise at each endpoint, and modifying the captured noise to generate modified noise having a frequency-amplitude spectrum which matches a target spectrum and a spatial property set which matches a target spatial property set. In other embodiments, a teleconferencing method includes steps of: at endpoints of a teleconferencing system, determining audio frames indicative of audio captured at each endpoint, each of a subset of the frames being indicative of noise but not a significant level of speech; and at each endpoint, generating modified frames indicative of modified noise having a frequency-amplitude spectrum which matches a target spectrum and a spatial property set which matches a target spatial property set, and generating encoded audio, including by encoding the modified frames. Other aspects are systems configured to perform any embodiment of the method.
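A minimal sketch of shaping a captured noise frame toward a target frequency-amplitude spectrum, assuming single-channel processing (the spatial-property matching described in the abstract is not shown); the target spectrum used here is an assumption for illustration.

# Spectral shaping of a noise frame toward an assumed target spectrum.
import numpy as np

def shape_to_target(noise_frame, target_mag):
    """Replace the magnitude spectrum of noise_frame with target_mag while
    keeping the original phase, then return the time-domain frame."""
    spectrum = np.fft.rfft(noise_frame)
    phase = np.angle(spectrum)
    shaped = target_mag * np.exp(1j * phase)
    return np.fft.irfft(shaped, n=len(noise_frame))

frame = np.random.randn(512)
target = np.linspace(1.0, 0.1, 257)   # assumed gently sloping target spectrum
modified = shape_to_target(frame, target)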
Abstract:
Some methods may involve receiving a first content stream that includes first audio signals, rendering the first audio signals to produce first audio playback signals, generating first direct sequence spread spectrum (DSSS) signals, generating first modified audio playback signals by inserting the first DSSS signals into the first audio playback signals, and causing a loudspeaker system to play back the first modified audio playback signals to generate first audio device playback sound. The method(s) may also involve receiving microphone signals corresponding to at least the first audio device playback sound and to second through Nth audio device playback sound corresponding to second through Nth modified audio playback signals (including second through Nth DSSS signals) played back by second through Nth audio devices, extracting the second through Nth DSSS signals from the microphone signals, and estimating at least one acoustic scene metric based, at least partly, on the second through Nth DSSS signals.
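A minimal sketch of inserting a DSSS signal into playback audio and recovering it by correlation at a receiving device; the spreading sequence, insertion level, and the absence of carrier modulation are simplifying assumptions.

# DSSS insertion into playback audio and correlation-based detection (sketch).
import numpy as np

rng = np.random.default_rng(0)
CHIPS = rng.choice([-1.0, 1.0], size=1023)   # assumed spreading sequence
INSERTION_GAIN = 0.01                        # keep the DSSS signal at a low level

def insert_dsss(playback, chips=CHIPS, gain=INSERTION_GAIN):
    """Add a low-level, tiled copy of the spreading sequence to the playback signal."""
    tiled = np.tile(chips, int(np.ceil(len(playback) / len(chips))))[:len(playback)]
    return playback + gain * tiled

def detect_dsss(mic_signal, chips=CHIPS):
    """Correlate the microphone signal against the spreading sequence; a strong
    peak indicates the presence (and lag) of that device's DSSS signal."""
    corr = np.correlate(mic_signal, chips, mode='valid')
    lag = int(np.argmax(np.abs(corr)))
    return lag, corr[lag]

audio = rng.standard_normal(48000)
mic = insert_dsss(audio)                     # stand-in for the captured playback sound
print(detect_dsss(mic))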