Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for detecting voice activity in a media signal using an augmented, multi-tier classifier architecture. A system configured to practice the method can receive, from a first classifier, a first voice activity indicator detected in a first modality for a human subject. Then, the system can receive, from a second classifier, a second voice activity indicator detected in a second modality for the human subject, wherein the first voice activity indicator and the second voice activity indicator are based on the human subject at a same time, and wherein the first modality and the second modality are different. The system can concatenate, via a third classifier, the first voice activity indicator and the second voice activity indicator with original features of the human subject, to yield a classifier output, and determine voice activity based on the classifier output.
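A minimal sketch of this multi-tier arrangement is shown below, assuming scikit-learn-style classifiers and synthetic stand-in features for two modalities (the feature names and classifier choices are illustrative assumptions, not part of the disclosure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-frame features for one human subject, captured at the
# same time in two different modalities (e.g., audio and video).
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(200, 13))   # stand-in for audio features
visual_feats = rng.normal(size=(200, 8))   # stand-in for visual features
labels = rng.integers(0, 2, size=200)      # 1 = voice activity, 0 = silence

# First-tier classifiers: one per modality, each yielding a voice activity
# indicator (here, a posterior probability of speech).
clf_audio = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
clf_visual = LogisticRegression(max_iter=1000).fit(visual_feats, labels)
p_audio = clf_audio.predict_proba(audio_feats)[:, 1:]
p_visual = clf_visual.predict_proba(visual_feats)[:, 1:]

# Third classifier: concatenate both indicators with the original features
# of the subject (the "augmented" input) and decide voice activity.
augmented = np.hstack([p_audio, p_visual, audio_feats, visual_feats])
clf_fusion = LogisticRegression(max_iter=1000).fit(augmented, labels)
voice_activity = clf_fusion.predict(augmented)  # final classifier output
```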
Abstract:
Speaker content generated in an audio conference is selectively visually represented. A profile for each audience member who participates in the audio conference is obtained. Speaker content spoken during the audio conference is monitored. Words of the speaker content are classified to have different weights according to a parameter of the profile for each of the audience members. A relation between the speaker content and the profile for each of the audience members is determined. Different visual representations of the speaker content are presented to different ones of the audience members based on the determined relation.
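One way the weighting and relation steps might look, as a rough sketch with a made-up profile parameter (an `interests` keyword set) and a simple overlap score; none of these names or weights come from the abstract:

```python
from collections import Counter

# Hypothetical audience profiles; 'interests' is an assumed profile parameter.
profiles = {
    "alice": {"interests": {"budget", "revenue", "forecast"}},
    "bob":   {"interests": {"hiring", "roadmap", "design"}},
}

def weight_words(speaker_content: str, profile: dict) -> Counter:
    """Classify words of the speaker content with different weights
    according to a parameter of the audience member's profile."""
    weights = Counter()
    for word in speaker_content.lower().split():
        # Words matching the member's interests get a higher weight.
        weights[word] += 3 if word in profile["interests"] else 1
    return weights

def relation_score(weights: Counter, profile: dict) -> float:
    """Determine a relation between the speaker content and the profile as
    the fraction of total weight carried by profile-relevant words."""
    total = sum(weights.values())
    relevant = sum(w for word, w in weights.items()
                   if word in profile["interests"])
    return relevant / total if total else 0.0

content = "the revenue forecast depends on the hiring roadmap"
for member, profile in profiles.items():
    score = relation_score(weight_words(content, profile), profile)
    # Present a different visual representation per member, e.g. highlight
    # only the words that relate to that member's profile.
    highlighted = [w.upper() if w in profile["interests"] else w
                   for w in content.split()]
    print(member, round(score, 2), " ".join(highlighted))
```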
Abstract:
Speaker content generated in an audio conference is selectively visually represented. A profile for each audience member who listens to the audio conference is obtained. Speaker content from audio conference participants who speak in the audio conference is monitored. The speaker content from each of the audio conference participants is analyzed. Based on the analyzing and on the profiles for each of the audience members, visual representations of the speaker content to present to the audience members are identified. Visual representations of the speaker content are generated based on the analyzing. Different visual representations of the speaker content are presented to different audience members based on the analyzing and identifying.
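A sketch of the identification and generation steps under assumed profile fields (`detail_preference`) and representation types (keyword cloud, running transcript); these are illustrative assumptions only:

```python
# Hypothetical profiles used to select different visual representations
# of the same speaker content for different audience members.
profiles = {
    "carol": {"detail_preference": "summary"},
    "dave":  {"detail_preference": "full"},
}

def analyze(speaker_content: str) -> dict:
    """Analyze speaker content from a conference participant
    (word counts stand in for richer analysis here)."""
    words = speaker_content.split()
    return {"word_count": len(words), "words": words}

def identify_representation(analysis: dict, profile: dict) -> str:
    """Identify which visual representation to present to this member,
    based on the analysis and the member's profile."""
    if profile["detail_preference"] == "summary" or analysis["word_count"] > 50:
        return "keyword_cloud"
    return "running_transcript"

def generate_representation(kind: str, analysis: dict) -> str:
    """Generate the identified visual representation of the speaker content."""
    if kind == "keyword_cloud":
        return "CLOUD: " + " ".join(sorted(set(analysis["words"]))[:5])
    return "TRANSCRIPT: " + " ".join(analysis["words"])

content = "quarterly results exceeded the plan across all regions"
analysis = analyze(content)
for member, profile in profiles.items():
    kind = identify_representation(analysis, profile)
    print(member, "->", generate_representation(kind, analysis))
```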