摘要:
A method, computer system, and a computer program product for audio data augmentation are provided. Sets of audio data from different sources may be obtained. A respective normalization factor for at least two sources of the different sources may be calculated. The normalization factors from the at least two sources may be mixed to determine a mixed normalization factor. A first set of the sets may be normalized by using the mixed normalization factor and to obtain training data for training an acoustic model.
摘要:
A technique for data augmentation for speech data is disclosed. Original speech data including a sequence of feature frames is obtained. A partially prolonged copy of the original speech data is generated by inserting one or more new frames into the sequence of the feature frames. The partially prolonged copy is output as augmented speech data for training an acoustic model for training an acoustic model.
摘要:
A computer-implemented method, computer program product, and system are provided for separating a word in a dictionary. The method includes reading a word from the dictionary as a source word. The method also includes searching the dictionary for another word having a substring with a same surface string and a same reading as the source word. The method additionally includes splitting the another word by the source word to obtain one or more remaining substrings of the another word. The method further includes registering each of the one or more remaining substrings as a new word in the dictionary.
摘要:
A method for processing a speech signal. The method comprises obtaining a logmel feature of a speech signal. The method further includes one or more processors processing the logmel feature so that the logmel feature is normalized under a constraint that a power level of the logmel feature is kept as originally obtained. The method further includes inputting the processed logmel feature into a speech-to-text system to generate corresponding text data.
摘要:
Embodiments include methods and systems for improving an acoustic model. Aspects include acquiring a first standard deviation value by calculating standard deviation of a feature from first training data and acquiring a second standard deviation value by calculating standard deviation of a feature from second training data acquired in a different environment from an environment of the first training data. Aspects also include creating a feature adapted to an environment where the first training data is recorded, by multiplying the feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value. Aspects further include reconstructing an acoustic model constructed using training data acquired in the same environment as the environment of the first training data using the feature adapted to the environment where the first training data is recorded.
摘要:
A method for processing a speech signal. The method comprises obtaining a logmel feature of a speech signal. The method further includes one or more processors processing the logmel feature so that the logmel feature is normalized under a constraint that a power level of the logmel feature is kept as originally obtained. The method further includes inputting the processed logmel feature into a speech-to-text system to generate corresponding text data.
摘要:
A computer-implemented method, computer program product, and system are provided for separating a word in a dictionary. The method includes reading a word from the dictionary as a source word. The method also includes searching the dictionary for another word having a substring with a same surface string and a same reading as the source word. The method additionally includes splitting the another word by the source word to obtain one or more remaining substrings of the another word. The method further includes registering each of the one or more remaining substrings as a new word in the dictionary.
摘要:
A system and method for expanding a question and answer (Q&A) database. The method includes preparing a set of Q&A documents and speech recognition results of an agent's utterances in conversations between an agent and a customer, each Q&A document in the set having an identifier, and each speech recognition result having an identifier common with the identifier of a relevant Q&A document, and adding one or more repetition parts extracted from the speech recognition results of the agent's utterances to a corresponding Q&A document in the set.
摘要:
Embodiments include methods and systems for improving an acoustic model. Aspects include acquiring a first standard deviation value by calculating standard deviation of a feature from first training data and acquiring a second standard deviation value by calculating standard deviation of a feature from second training data acquired in a different environment from an environment of the first training data. Aspects also include creating a feature adapted to an environment where the first training data is recorded, by multiplying the feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value. Aspects further include reconstructing an acoustic model constructed using training data acquired in the same environment as the environment of the first training data using the feature adapted to the environment where the first training data is recorded.
摘要:
A construction method for a speech recognition model, in which a computer system includes; a step of acquiring alignment between speech of each of a plurality of speakers and a transcript of the speaker; a step of joining transcripts of the respective ones of the plurality of speakers along a time axis, creating a transcript of speech of mixed speakers obtained from synthesized speech of the speakers, and replacing predetermined transcribed portions of the plurality of speakers overlapping on the time axis with a unit which represents a simultaneous speech segment; and a step of constructing at least one of an acoustic model and a language model which make up a speech recognition model, based on the transcript of the speech of the mixed speakers.