Abstract:
A distributed voice recognition system and method for obtaining acoustic features and speech activity at multiple frequencies by extracting their high-frequency components on a device, such as a subscriber station, and transmitting them to a network server having multiple-stream processing capability, including cepstral feature processing, MLP nonlinear transformation processing, and multiband temporal pattern architecture processing. The features received at the network server are processed using all three streams, wherein each of the three streams provides benefits not available in the other two, thereby enhancing feature interpretation. Feature extraction and feature interpretation may operate at multiple frequencies, including but not limited to 8 kHz, 11 kHz, and 16 kHz.
Abstract:
A method is provided that includes discovering active participants and passive participants from a meeting recording; generating an active notification that includes an option to manipulate the meeting recording, and a passive notification without the option to manipulate the meeting recording; and sending the active notification and the passive notification to the active participants and the passive participants, respectively. The method can also include discovering followers from the meeting recording, generating a followers notification that omits the option to manipulate the meeting recording and includes access to a portion of the meeting recording, and sending the followers notification to the followers. Discovering the active participants and the passive participants includes running speaker segmentation and recognition algorithms on the meeting recording, discovering attendees including speakers and non-speakers, and categorizing the speakers as the active participants and the non-speakers as the passive participants.
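The categorization and notification steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the attendee list and the set of recognized speakers are taken as given (the speaker segmentation and recognition step is not shown), and the function names and dictionary fields (`categorize_attendees`, `build_notifications`, `can_manipulate`, `portion_access`) are hypothetical, not taken from the abstract.

```python
def categorize_attendees(attendees, speakers):
    """Split attendees into active (spoke) and passive (did not speak)."""
    speaker_set = set(speakers)
    active = [a for a in attendees if a in speaker_set]
    passive = [a for a in attendees if a not in speaker_set]
    return active, passive


def build_notifications(attendees, speakers, followers=()):
    """Build one notification per recipient (hypothetical record layout)."""
    active, passive = categorize_attendees(attendees, speakers)
    notifications = []
    for person in active:
        # Active participants get the option to manipulate the recording.
        notifications.append(
            {"to": person, "kind": "active", "can_manipulate": True}
        )
    for person in passive:
        # Passive participants are notified without that option.
        notifications.append(
            {"to": person, "kind": "passive", "can_manipulate": False}
        )
    for person in followers:
        # Followers cannot manipulate, but get access to a portion.
        notifications.append(
            {"to": person, "kind": "follower", "can_manipulate": False,
             "portion_access": True}
        )
    return notifications


# Hypothetical meeting: two speakers, one silent attendee, one follower.
notes = build_notifications(
    attendees=["ana", "bo", "cy"],
    speakers=["ana", "cy"],
    followers=["dee"],
)
```

In this sketch, "ana" and "cy" receive active notifications, "bo" receives a passive one, and "dee" receives a followers notification.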
Abstract:
A method is provided and includes estimating an approximate list of potential speakers in a file from one or more applications. The file (e.g., an audio file, video file, or any suitable combination thereof) includes a recording of a plurality of speakers. The method also includes segmenting the file according to the approximate list of potential speakers such that each segment corresponds to at least one speaker; and recognizing particular speakers in the file based on the approximate list of potential speakers.
Abstract:
In one embodiment, an audio stream is partitioned into a plurality of segments such that the plurality of segments are clustered into one or more clusters, each of the one or more clusters identifying a subset of the plurality of segments in the audio stream and corresponding to one of a first set of one or more speaker models, each speaker model in the first set of speaker models representing one of a first set of hypothetical speakers. The speaker models in the first set of speaker models are compared with a second set of one or more speaker models, where each speaker model in the second set of speaker models represents one of a second set of hypothetical speakers. Labels associated with one or more speaker models in the second set of speaker models are propagated to one or more speaker models in the first set of speaker models according to a result of the comparing step.
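The comparison and label propagation steps might look something like the following minimal sketch. The abstract does not say how speaker models are represented or compared, so everything concrete here is an assumption: each model is a plain feature vector, the comparison is cosine similarity, and a label is propagated only when the best match reaches a hypothetical `threshold`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(first_models, second_models, second_labels, threshold=0.9):
    """Copy labels from labeled second-set models to matching first-set models.

    first_models / second_models: dict of model name -> vector (assumed form).
    second_labels: dict of second-set model name -> label.
    Returns a dict of first-set model name -> propagated label.
    """
    propagated = {}
    for name, vec in first_models.items():
        best_name, best_sim = None, threshold
        for other, other_vec in second_models.items():
            if other not in second_labels:
                continue  # nothing to propagate from an unlabeled model
            sim = cosine(vec, other_vec)
            if sim >= best_sim:
                best_name, best_sim = other, sim
        if best_name is not None:
            propagated[name] = second_labels[best_name]
    return propagated


# Hypothetical two-dimensional speaker models.
first = {"spk_A": [1.0, 0.0], "spk_B": [0.0, 1.0]}
second = {"m1": [0.9, 0.1], "m2": [-1.0, 0.0]}
labels = propagate_labels(first, second, {"m1": "Alice"})
```

Here only `spk_A` is close enough to the labeled model `m1` to inherit its label; `spk_B` stays unlabeled.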
Abstract:
An example method is provided and includes receiving a media file that includes video data and audio data; determining an initial scene sequence in the media file; determining an initial speaker sequence in the media file; and updating a selected one of the initial scene sequence and the initial speaker sequence in order to generate an updated scene sequence and an updated speaker sequence, respectively. The initial scene sequence is updated based on the initial speaker sequence, and the initial speaker sequence is updated based on the initial scene sequence.
Abstract:
A method and apparatus for speaker recognition is provided. One embodiment of a method for determining whether a given speech signal is produced by an alleged speaker, where a plurality of statistical models (including at least one support vector machine) have been produced for the alleged speaker based on a previous speech signal received from the alleged speaker, includes: receiving the given speech signal, the speech signal representing an utterance made by a speaker claiming to be the alleged speaker; scoring the given speech signal using at least two modeling systems, where at least one of the modeling systems is a support vector machine; combining the scores produced by the modeling systems, with equal weights, to produce a final score; and determining, in accordance with the final score, whether the speaker is likely the alleged speaker.
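The equal-weight score combination described above amounts to averaging the per-system scores and comparing the result with a decision threshold. The sketch below assumes the scores are already computed and on a comparable scale; the function names, the example scores, and the threshold value are hypothetical.

```python
def fuse_scores(scores):
    """Combine per-system scores with equal weights (a plain average)."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two modeling systems")
    return sum(scores) / len(scores)


def is_alleged_speaker(scores, threshold):
    """Accept the identity claim if the fused score reaches the threshold."""
    return fuse_scores(scores) >= threshold


# Hypothetical scores: one from a support vector machine, one from
# a second modeling system of some other kind.
svm_score = 0.8
other_score = 0.2
final_score = fuse_scores([svm_score, other_score])
accepted = is_alleged_speaker([svm_score, other_score], threshold=0.4)
```

In practice the scores from different modeling systems would likely need normalization before averaging; that step is outside this sketch.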