Abstract:
A system which performs social interaction analysis for a plurality of participants includes a processor. The processor is configured to determine a similarity between a first spatially filtered output and each of a plurality of second spatially filtered outputs. The processor is configured to determine the social interaction between the participants based on the similarities between the first spatially filtered output and each of the second spatially filtered outputs and display an output that is representative of the social interaction between the participants. The first spatially filtered output is received from a fixed microphone array, and the second spatially filtered outputs are received from a plurality of steerable microphone arrays each corresponding to a different participant.
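As a rough illustration of the similarity step, the sketch below scores the fixed array's output against each participant's steerable-array output and reports the closest match. The abstract does not name a similarity measure, so the zero-lag normalized correlation used here is an assumption, as are the function names and the dictionary layout keyed by participant.

```python
import math


def normalized_correlation(x, y):
    """Zero-lag normalized correlation between two equal-length sample sequences."""
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0.0 else 0.0


def most_similar_participant(fixed_output, steerable_outputs):
    """Return (participant_id, similarity) for the best-matching steerable output.

    fixed_output: samples of the fixed array's spatially filtered output.
    steerable_outputs: dict of participant id -> samples (hypothetical layout).
    """
    scores = {
        pid: normalized_correlation(fixed_output, signal)
        for pid, signal in steerable_outputs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data: "alice" nearly matches the fixed output, "bob" is orthogonal to it.
fixed = [0.0, 1.0, 0.0, -1.0]
outputs = {"alice": [0.0, 0.9, 0.1, -1.0], "bob": [1.0, 0.0, -1.0, 0.0]}
best_id, best_score = most_similar_participant(fixed, outputs)
```

A practical system would likely compare short frames over time and allow for delay between arrays; this sketch only shows the one-to-many comparison.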
Abstract:
A method for detecting voice activity by an electronic device is described. The method includes detecting near end speech based on a near end voiced speech detector and at least one single channel voice activity detector. The near end voiced speech detector is associated with a harmonic statistic based on a speech pitch histogram.
Abstract:
A method for displaying a user interface on an electronic device is described. The method includes presenting a user interface. The user interface includes a coordinate system. The coordinate system corresponds to physical coordinates based on sensor data. The method also includes displaying at least a target audio signal and an interfering audio signal on the user interface.
Abstract:
Methods, systems and articles of manufacture for recognizing and locating one or more objects in a scene are disclosed. An image and/or video of the scene is captured. Audio recorded at the scene is used to narrow an object search of the captured scene. For example, the direction of arrival (DOA) of a sound can be determined and used to limit the search area in a captured image/video. In another example, keypoint signatures may be selected based on types of sounds identified in the recorded audio. A keypoint signature corresponds to a particular object that the system is configured to recognize. Objects in the scene may then be recognized using a scale-invariant feature transform (SIFT) analysis comparing keypoints identified in the captured scene to the selected keypoint signatures.
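The DOA example above can be sketched as a mapping from an estimated angle to a band of image columns, so that keypoint matching only runs inside that band. This is a minimal sketch under stated assumptions: the function name, the `margin_deg` parameter, and the linear angle-to-pixel mapping are all hypothetical, and the microphone array is assumed to share the camera's optical axis. A real camera model would use a tangent projection.

```python
def doa_to_column_range(doa_deg, image_width, horizontal_fov_deg, margin_deg=10.0):
    """Map a DOA to an inclusive (left, right) pixel-column range.

    doa_deg: direction of arrival in degrees; 0 is the optical axis and
             positive angles are to the right (assumed convention).
    margin_deg: half-width of the search band around the DOA (hypothetical).
    """
    half_fov = horizontal_fov_deg / 2.0

    def angle_to_column(angle):
        # Clamp to the field of view, then map linearly onto column indices.
        clamped = max(-half_fov, min(half_fov, angle))
        return round((clamped + half_fov) / horizontal_fov_deg * (image_width - 1))

    return angle_to_column(doa_deg - margin_deg), angle_to_column(doa_deg + margin_deg)


# A sound straight ahead in a 641-pixel-wide image with a 60-degree field of view.
left, right = doa_to_column_range(0.0, image_width=641, horizontal_fov_deg=60.0)
```

A source outside the field of view collapses to a single edge column, which a caller could treat as "not visible" and skip the search.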
Abstract:
Systems, methods, and apparatus are described for applying, based on angles of arrival of source components relative to the axes of different microphone pairs, a spatially directive filter to a multichannel audio signal to produce an output signal.
Abstract:
A first device includes a memory configured to store instructions and one or more processors configured to receive audio signals from multiple microphones. The one or more processors are configured to process the audio signals to generate direction-of-arrival information corresponding to one or more sources of sound represented in one or more of the audio signals. The one or more processors are also configured to send, to a second device, data based on the direction-of-arrival information and a class or embedding associated with the direction-of-arrival information.
Abstract:
A device to process speech includes a speech processing network that includes an input configured to receive audio data. The speech processing network also includes one or more network layers configured to process the audio data to generate a network output. The speech processing network includes an output configured to be coupled to multiple speech application modules to enable the network output to be provided as a common input to each of the multiple speech application modules.
Abstract:
Methods, systems, and devices for signal processing are described. Generally, as provided for by the described techniques, a wearable device may receive an input audio signal from one or more outer microphones, an input audio signal from one or more inner microphones, and a bone conduction signal from a bone conduction sensor based on the input audio signals. The wearable device may filter the bone conduction signal based on a set of frequencies of the input audio signals, such as a low frequency portion of the input audio signals. For example, the wearable device may apply a filter to the bone conduction signal that accounts for an error in the input audio signals. The wearable device may add a gain to the filtered bone conduction signal and may equalize the filtered bone conduction signal based on the gain. The wearable device may output an audio signal to a speaker.
Abstract:
A device to perform target sound detection includes one or more processors. The one or more processors include a buffer configured to store audio data and a target sound detector. The target sound detector includes a first stage and a second stage. The first stage includes a binary target sound classifier configured to process the audio data. The first stage is configured to activate the second stage in response to detection of a target sound. The second stage is configured to receive the audio data from the buffer in response to the detection of the target sound.
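The two-stage arrangement can be sketched as a cheap gate that runs on every frame and hands the buffered audio to a second stage only when it fires. This is an illustrative sketch, not the claimed implementation: the class name, the peak-amplitude gate standing in for the binary target sound classifier, and the labelling function standing in for the second stage are all assumptions.

```python
from collections import deque


class TwoStageDetector:
    """Always-on first stage that activates a second stage on detection."""

    def __init__(self, first_stage, second_stage, buffer_frames=4):
        self.first_stage = first_stage    # callable: frame -> bool
        self.second_stage = second_stage  # callable: list of frames -> result
        self.buffer = deque(maxlen=buffer_frames)  # keeps only recent frames

    def process(self, frame):
        self.buffer.append(frame)
        if not self.first_stage(frame):
            return None  # second stage stays idle
        # Detection: pass the buffered audio to the second stage.
        return self.second_stage(list(self.buffer))


def energy_gate(frame, threshold=0.5):
    """Stand-in binary classifier: fires when the peak amplitude is high."""
    return max(abs(s) for s in frame) > threshold


def label_event(frames):
    """Stand-in second stage: labels the buffered audio by its peak level."""
    peak = max(abs(s) for f in frames for s in f)
    return "loud_event" if peak > 0.9 else "soft_event"


detector = TwoStageDetector(energy_gate, label_event, buffer_frames=2)
frames = ([0.1, 0.0], [0.6, 0.2], [0.95, 0.1], [0.0, 0.0])
results = [detector.process(f) for f in frames]
```

Because the buffer keeps frames from before the detection, the second stage can see the onset of the sound rather than only the frame that triggered the gate.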