摘要:
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable storage media for captioning a media presentation. The method includes receiving automatic speech recognition (ASR) output from a media presentation and a transcription of the media presentation. The method includes selecting via a processor a pair of anchor words in the media presentation based on the ASR output and transcription and generating captions by aligning the transcription with the ASR output between the selected pair of anchor words. The transcription can be human-generated. Selecting pairs of anchor words can be based on a similarity threshold between the ASR output and the transcription. In one variation, commonly used words on a stop list are ineligible as anchor words. The method includes outputting the media presentation with the generated captions. The presentation can be a recording of a live event.
摘要:
Disclosed herein are systems, computer-implemented methods, and computer-readable storage media for unit selection synthesis. The method causes a computing device to add a supplemental phoneset to a speech synthesizer front end having an existing phoneset, modify a unit preselection process based on the supplemental phoneset, preselect units from the supplemental phoneset and the existing phoneset based on the modified unit preselection process, and generate speech based on the preselected units. The supplemental phoneset can be a variation of the existing phoneset, can include a word boundary feature, can include a cluster feature where initial consonant clusters and some word boundaries are marked with diacritics, can include a function word feature which marks units as originating from a function word or a content word, and/or can include a pre-vocalic or post-vocalic feature. The speech synthesizer front end can incorporates the supplemental phoneset as an extra feature.
摘要:
Apparatuses, systems, methods, and media for filtering a data stream are provided. The data stream is analyzed based on an acoustic parameter to determine extraneous portions in which a first predetermined condition is satisfied. When a first extraneous portion is separated from a second extraneous portion by a non-extraneous portion in which the first predetermined condition is not satisfied, it is determined whether the first extraneous portion being separated from the second extraneous portion by the non-extraneous portion satisfies a second predetermined condition. At least one of the first extraneous portion and the second extraneous portion is deleted from the data stream to produce a filtered data stream in response to determining the second predetermined condition is satisfied.
摘要:
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable storage media for captioning a media presentation. The method includes receiving automatic speech recognition (ASR) output from a media presentation and a transcription of the media presentation. The method includes selecting via a processor a pair of anchor words in the media presentation based on the ASR output and transcription and generating captions by aligning the transcription with the ASR output between the selected pair of anchor words. The transcription can be human-generated. Selecting pairs of anchor words can be based on a similarity threshold between the ASR output and the transcription. In one variation, commonly used words on a stop list are ineligible as anchor words. The method includes outputting the media presentation with the generated captions. The presentation can be a recording of a live event.
摘要:
A method of detecting pre-determined phrases to determine compliance quality includes determining whether a precursor event has occurred based on a comparison between stored pre-determined phrases and a received communication, and determining a compliance rating based on a presence of a pre-determined phrase associated with the precursor event in the communication.
摘要:
Systems, methods, and computer-readable storage media for text-to-speech processing having an improved intonation. The system first receives text to be converted to speech, the text having a first segment and a second segment. The system then compares the text to a database of stored utterances, identifying in the database a first utterance corresponding to the first segment and determining an intonation of the first utterance. When the database does not contain a second utterance corresponding to the second segment, the system generates the speech corresponding to the text by combining the first utterance with a generated second utterance corresponding to the second segment, the generated second utterance having the intonation matching, or based on, the first utterance. These actions lead to an improved, smoother, more human-like synthetic speech output from the system.
摘要:
Disclosed herein are systems, computer-implemented methods, and tangible computer-readable storage media for captioning a media presentation. The method includes receiving automatic speech recognition (ASR) output from a media presentation and a transcription of the media presentation. The method includes selecting via a processor a pair of anchor words in the media presentation based on the ASR output and transcription and generating captions by aligning the transcription with the ASR output between the selected pair of anchor words. The transcription can be human-generated. Selecting pairs of anchor words can be based on a similarity threshold between the ASR output and the transcription. In one variation, commonly used words on a stop list are ineligible as anchor words. The method includes outputting the media presentation with the generated captions. The presentation can be a recording of a live event.
摘要:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating speech. One variation of the method is from a server side, and another variation of the method is from a client side. The server side method, as implemented by a network-based automatic speech processing system, includes first receiving, from a network client independent of knowledge of internal operations of the system, a request to generate a text-to-speech voice. The request can include speech samples, transcriptions of the speech samples, and metadata describing the speech samples. The system extracts sound units from the speech samples based on the transcriptions and generates an interactive demonstration of the text-to-speech voice based on the sound units, the transcriptions, and the metadata, wherein the interactive demonstration hides a back end processing implementation from the network client. The system provides access to the interactive demonstration to the network client.
摘要:
A method of detecting pre-determined phrases to determine compliance quality is provided. The method includes determining whether at least one of an event or a precursor event has occurred based on a comparison between pre-determined phrases and a communication between a sender and a recipient in a communications network, and rating the recipient based on the presence of the pre-determined phrases associated with the event or the presence of the pre-determined phrases associated with the precursor event in the communication.
摘要:
Apparatuses, systems, methods, and media for filtering a data stream are provided. The data stream is partitioned into a plurality of data stream segments. An acoustic parameter is measured in each of the data stream segments. It is determined whether the acoustic parameter satisfies a first predetermined condition. The first predetermined condition includes a number of variances, in which the acoustic parameter exceeds a predetermined variance threshold, exceeding a predetermined number threshold. An extraneous portion of the data stream is identified in which the first predetermined condition is satisfied. It is determined whether the extraneous portion satisfies a second predetermined condition in the data stream. The extraneous portion is deleted from the data stream to produce a filtered data stream in response to the second predetermined condition being satisfied.