Abstract:
A system for exploiting visual information for enhancing audio signals via source separation and beamforming is disclosed. The system may obtain visual content associated with an environment of a user, and may extract, from the visual content, metadata associated with the environment. The system may determine a location of the user based on the extracted metadata. Additionally, the system may load, based on the location, an audio profile corresponding to the location of the user. The system may also load a user profile of the user that includes audio data associated with the user. Furthermore, the system may cancel, based on the audio profile and user profile, noise from the environment of the user. Moreover, the system may adjust, based on the audio profile and user profile, an audio signal generated by the user so as to enhance the audio signal during a communications session of the user.
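The flow described above — extract location metadata from visual content, then load the matching audio profile — can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the location tags, profile fields, and the `load_audio_profile` helper are all hypothetical.

```python
# Hypothetical sketch: map visual-metadata location tags to audio profiles.
# The tag names and profile contents below are illustrative assumptions.

AUDIO_PROFILES = {
    "cafe": {"noise_floor_db": 65, "suppress": ["chatter", "espresso_machine"]},
    "office": {"noise_floor_db": 45, "suppress": ["keyboard", "hvac"]},
}

def load_audio_profile(metadata_tags):
    """Return the first recognized location tag and its audio profile."""
    for tag in metadata_tags:
        if tag in AUDIO_PROFILES:
            return tag, AUDIO_PROFILES[tag]
    return None, {}

# Tags extracted from the visual content; "cafe" is the recognized location.
location, profile = load_audio_profile(["table", "cafe", "window"])
```

A real system would derive the tags from image analysis and combine this profile with the user profile before cancelling noise; the sketch only shows the lookup step.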
Abstract:
Devices, systems, methods, media, and programs for detecting an emotional state change in an audio signal are provided. A plurality of segments of the audio signal is received, with the plurality of segments being sequential. Each segment of the plurality of segments is analyzed, and, for each segment, an emotional state and a confidence score of the emotional state are determined. The emotional state and the confidence score of each segment are sequentially analyzed, and a current emotional state of the audio signal is tracked throughout each of the plurality of segments. For each segment, it is determined whether the current emotional state of the audio signal changes to another emotional state based on the emotional state and the confidence score of the segment.
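The segment-by-segment tracking described above can be sketched in a few lines. This is a hedged illustration, not the patented method: it assumes each segment has already been classified into an (emotion, confidence) pair, and the switching threshold is an invented parameter.

```python
# Hypothetical sketch of tracking the current emotional state across
# sequential segments, switching only on a sufficiently confident change.

CONFIDENCE_THRESHOLD = 0.7  # assumed switching threshold, not from the abstract

def track_state_changes(segments):
    """Return (index, old_state, new_state) for each detected state change.

    `segments` is a sequence of (emotion_label, confidence) tuples,
    one per sequential audio segment.
    """
    current_state = None
    changes = []
    for i, (emotion, confidence) in enumerate(segments):
        if current_state is None:
            current_state = emotion
        elif emotion != current_state and confidence >= CONFIDENCE_THRESHOLD:
            changes.append((i, current_state, emotion))
            current_state = emotion
    return changes

segments = [("neutral", 0.9), ("neutral", 0.8), ("angry", 0.5), ("angry", 0.85)]
print(track_state_changes(segments))  # low-confidence "angry" at index 2 is ignored
```

The confidence gate is what keeps a single noisy segment from flipping the tracked state, matching the abstract's use of per-segment confidence scores.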
Abstract:
Disclosed herein are systems, methods, and computer-readable storage devices for processing audio signals. An example system configured to practice the method receives audio at a device to be transmitted to a remote speech processing system. The system analyzes one of noise conditions, need for an enhanced speech quality, and network load to yield an analysis. Based on the analysis, the system determines to bypass user-defined options for enhancing audio for speech processing. Then, based on the analysis, the system can modify an audio transmission parameter used to transmit the audio from the device to the remote speech processing system. The audio transmission parameter can be one of an amount of coding, a chosen codec, or a number of audio channels, for example.
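The analyze-then-decide flow above can be sketched as a simple policy function. All thresholds, codec names, and field names below are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical sketch: pick audio transmission parameters from an analysis
# of noise conditions, required speech quality, and network load.

def choose_transmission_params(noise_level, needs_enhanced_quality, network_load):
    """Return (params, bypassed): the chosen parameters and whether
    user-defined audio options were bypassed. Inputs are in [0, 1]."""
    bypassed = needs_enhanced_quality or noise_level > 0.5
    if bypassed and network_load < 0.8:
        # Favor recognition accuracy: lighter coding, wideband codec, two channels.
        params = {"codec": "wideband", "coding": "light", "channels": 2}
    else:
        # Favor bandwidth: heavier coding, narrowband codec, one channel.
        params = {"codec": "narrowband", "coding": "heavy", "channels": 1}
    return params, bypassed

params, bypassed = choose_transmission_params(0.7, False, 0.3)
```

The point mirrored from the abstract is that the device, not the user, decides: high noise or a need for enhanced quality overrides the user-defined options and adjusts the amount of coding, the codec, and the channel count.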
Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for detecting voice activity in a media signal in an augmented, multi-tier classifier architecture. A system configured to practice the method can receive, from a first classifier, a first voice activity indicator detected in a first modality for a human subject. Then, the system can receive, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicators are based on the human subject at a same time, and wherein the first modality and the second modality are different. The system can concatenate, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output, and determine voice activity based on the classifier output.
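The key structural idea above — augmenting the third-tier classifier's input by concatenating the two modality-specific voice-activity indicators with the original features — can be sketched as follows. The linear decision rule is a stand-in; the abstract does not specify the third classifier's form.

```python
# Hypothetical sketch of the augmented multi-tier input: concatenate the
# audio-modality and visual-modality VAD scores with the original features.

import numpy as np

def build_third_tier_input(audio_vad_score, visual_vad_score, original_features):
    """Concatenate the first- and second-classifier outputs with raw features."""
    return np.concatenate(([audio_vad_score, visual_vad_score], original_features))

def third_tier_decision(augmented, weights, bias=0.0):
    """A stand-in linear classifier producing the final voice-activity decision."""
    return float(np.dot(weights, augmented) + bias) > 0.0

features = np.array([0.2, -0.1, 0.4])
augmented = build_third_tier_input(0.9, 0.8, features)
# augmented now holds [0.9, 0.8, 0.2, -0.1, 0.4]
```

Keeping the original features alongside the two indicators lets the third classifier learn when to trust each modality rather than simply voting between them.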
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for combining frame and segment level processing, via temporal pooling, for phonetic classification. A frame processor unit receives an input and extracts the time-dependent features from the input. A plurality of pooling interface units generates a plurality of feature vectors based on pooling the time-dependent features and selecting a plurality of time-dependent features according to a plurality of selection strategies. Next, a plurality of segmental classification units generates scores for the feature vectors. Each segmental classification unit (SCU) can be dedicated to a specific pooling interface unit (PIU) to form a PIU-SCU combination. Multiple PIU-SCU combinations can be further combined to form an ensemble of combinations, and the ensemble can be diversified by varying the pooling operations used by the PIU-SCU combinations. Based on the scores, the plurality of segmental classification units selects a class label and returns a result.
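The pooling step described above can be sketched concretely. This is a minimal illustration: each pooling strategy collapses a segment's per-frame features into one fixed-length vector, and varying the strategy across PIU-SCU combinations is what diversifies the ensemble. The strategy names and feature values are invented for the example.

```python
# Hypothetical sketch of temporal pooling: collapse per-frame features
# into one segment-level vector per pooling strategy.

import numpy as np

def mean_pool(frames):
    return frames.mean(axis=0)

def max_pool(frames):
    return frames.max(axis=0)

# One strategy per PIU-SCU combination; more strategies -> a more diverse ensemble.
POOLING_STRATEGIES = {"mean": mean_pool, "max": max_pool}

def segment_vectors(frames):
    """Generate one pooled feature vector per pooling strategy."""
    return {name: pool(frames) for name, pool in POOLING_STRATEGIES.items()}

frames = np.array([[0.1, 0.5], [0.3, 0.2], [0.2, 0.8]])  # 3 frames, 2 features
vectors = segment_vectors(frames)
```

Each pooled vector would then be scored by its dedicated segmental classification unit, and the ensemble's scores combined to select a class label.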
Abstract:
A system that incorporates teachings of the subject disclosure may include, for example, a method for controlling steering of a plurality of cameras to identify a plurality of potential sources, identifying the plurality of potential sources according to image data provided by the plurality of cameras, assigning a beam of a plurality of beams of a plurality of microphones to each of the plurality of potential sources, detecting a first command comprising a first audible cue based on signals from a portion of the plurality of microphones, a first visual cue based on image data from one of the plurality of cameras, or both, for controlling a media center, and configuring the media center according to the first command. Other embodiments are disclosed.
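The beam-assignment step above can be sketched simply: each camera-identified source gets its own microphone beam, up to the number of beams available. The source names and the one-beam-per-source policy are illustrative assumptions.

```python
# Hypothetical sketch: assign one microphone beam per camera-identified source.

def assign_beams(source_ids, num_beams):
    """Map each detected source to a beam index; extra sources go unassigned."""
    assignments = {}
    for beam, source in enumerate(source_ids[:num_beams]):
        assignments[source] = beam
    return assignments

sources = ["person_a", "person_b", "person_c"]  # from camera image data
print(assign_beams(sources, num_beams=2))  # {'person_a': 0, 'person_b': 1}
```

With a beam steered at each source, the audible cue for a command can then be picked out of the signals from just the microphones contributing to that source's beam.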
Abstract:
Television content is provided upon request. A search request for television content is received from a user on a user device. Listings for television content that meet the search request are determined based on the search request. Text describing the listings is converted to corresponding speech, and the speech describing the listings is provided audibly.
Abstract:
A pre-distortion system for improved mobile device communications via cancellation of nonlinear distortion is disclosed. The pre-distortion system may transmit an acoustic signal from a network to a device, wherein the acoustic signal includes a linear signal and a nonlinear cancellation signal that cancels at least a portion of nonlinear distortions created once a loudspeaker in the device emits the linear signal. Thus, when a loudspeaker of a mobile device is operating and nonlinear distortions are generated by the loudspeaker or adjacent components of the mobile device in close proximity to the loudspeaker, the pre-distortion system may create one or more nonlinear cancellation signals in the network. The nonlinear cancellation signal may be combined with the linear signal sent to the loudspeaker to cancel the nonlinear distortion signal created by the loudspeaker emitting acoustic sounds from the linear signal. Thus, the nonlinear cancellation signal becomes a pre-distortion signal.
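The cancellation idea above can be made concrete with a toy model. This sketch assumes a cubic loudspeaker nonlinearity with an invented coefficient `k3`; the disclosure does not specify the distortion model, so every name and value here is illustrative. Subtracting the estimated nonlinear term from the linear signal before transmission leaves the loudspeaker's own distortion to largely cancel it.

```python
# Hypothetical sketch of network-side pre-distortion: the cancellation signal
# is the negative of an estimated nonlinear term, so that the loudspeaker's
# distortion approximately cancels and the emitted sound stays near-linear.

import numpy as np

K3 = 0.05  # assumed cubic distortion coefficient, purely illustrative

def pre_distort(linear_signal):
    """Combine the linear signal with its nonlinear cancellation term."""
    return linear_signal - K3 * linear_signal ** 3

def loudspeaker_output(driven_signal):
    """Simplified loudspeaker: linear response plus cubic distortion."""
    return driven_signal + K3 * driven_signal ** 3

x = np.linspace(-1.0, 1.0, 101)          # the linear signal
plain = loudspeaker_output(x)            # distorted output without correction
corrected = loudspeaker_output(pre_distort(x))  # output with pre-distortion
# corrected tracks the ideal linear output x more closely than plain does
```

In this toy model the residual error after pre-distortion is second order in `K3`, which is why combining the cancellation signal with the linear signal in the network can suppress most of the audible distortion.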