摘要:
A speech recognition apparatus (2000) acquires source data (10) representing an audio signal including an utterance. The speech recognition apparatus (2000) converts the source data (10) into a text string (30). The speech recognition apparatus (2000) generates a concatenated text (40) representing a content of an utterance by concatenating a text (32) included in the text string (30). Herein, texts (32) adjacent to each other in the text string (30) are such that parts of associated audio signals overlap each other on a time axis. At a time of concatenating texts (32) adjacent to each other, the speech recognition apparatus (2000) eliminates a trailing portion of a preceding text (32) and a leading portion of a succeeding text (32).
摘要:
A speech recognition apparatus (100) includes: a speech reproduction unit (102) that reproduces, for each predetermined section, target speech for speech recognition being divided for each predetermined section; a speech recognition unit (104) that recognizes, for each target speech, spoken speech acquired by repeating the target speech by a user; a text information generation unit (106) that generates text information about the spoken speech, based on a recognition result of the speech recognition unit (104); and a storage processing unit (108) that stores, as learning data, identification information by the user, the spoken speech, and the recognition result corresponding to the spoken speech in association with one another, in which the speech recognition unit (104) performs recognition by using a recognition engine that learns the learning data by the user.
摘要:
An utterance end detection apparatus (2000) acquires source data 10 representing an audio signal including one or more utterances. The utterance end detection apparatus (2000) converts the source data (10) into text data (30). The utterance end detection apparatus (2000) detects a conversion unit that analyzes text data (30), acquires source data, and converts the source data into text data, and an end of each utterance included in an audio signal represented by the source data (10).
摘要:
A speech recognition apparatus (2000) acquires a plurality of pieces of audio data (20) for a source audio signal including an utterance. The speech recognition apparatus (2000) generates a candidate text group (30) for each of the plurality of pieces of audio data (20). The candidate text group (30) includes a plurality of candidate texts (32). The candidate text (32) is a candidate of a text representing a content of an utterance corresponding to the audio data (20), and represents a sentence. The speech recognition apparatus (2000) selects, based on a comparison result between the plurality of candidate text groups (30), for each of the pieces of audio data (20), a candidate text (32) representing a content of an utterance represented by the piece of audio data (20) from the candidate text group (30) generated for the piece of audio data (20).
摘要:
Provided are a noise reduction system that highly precisely estimates noise contained in an input signal and highly precisely reduces the noise contained in the input signal using the estimated noise, a speech detection system, a speech recognition system, a noise reduction method, and a noise reduction program. The noise reduction system includes: a first noise estimating unit (111) that estimates a stationary noise component contained in a first input signal; a first noise reduction unit (121) that reduces the stationary noise component from the first input signal; a second noise estimating unit (112) that re-estimates a stationary noise component contained in the first input signal; a third noise estimating unit (113) that estimates a second non-stationary noise component including a sum of a stationary noise component and a non-stationary noise component contained in the first input signal; an estimated noise combining unit (114) that estimates a stationary noise component and a second non-stationary noise component contained in the first input signal; and a second noise reduction unit (122) that reduces the stationary noise component and the second non-stationary noise component contained in the first input signal.
摘要:
The example embodiments provides a processing system (10) including: an acquisition unit (11) that acquires target speech data in which a target speech is recorded or a target feature value that indicates a feature of the target speech; an inference unit (12) that infers a language of the target speech, based on an inference model for inferring a language of a speech from speech data or a speech feature value and the target speech data or the target feature value; a result output unit (13) that outputs an inference result by the inference unit (12); a determination unit (14) that determines whether the inference result is correct; and a learning data output unit (15) that outputs the inference result determined to be correct by the determination unit (14) and the target speech data or the target feature value, as learning data for generating the inference model.
摘要:
The example embodiments provides a processing system (10) including: an acquisition unit (11) that acquires target speech data in which a target speech is recorded or a target feature value that indicates a feature of the target speech; an inference unit (12) that infers a language of the target speech, based on an inference model for inferring a language of a speech from speech data or a speech feature value and the target speech data or the target feature value; a result output unit (13) that outputs an inference result by the inference unit (12); a determination unit (14) that determines whether the inference result is correct; and a learning data output unit (15) that outputs the inference result determined to be correct by the determination unit (14) and the target speech data or the target feature value, as learning data for generating the inference model.
摘要:
The purpose of the present invention is to provide a technology which is capable of appropriately evaluating a person's conduct with respect to another person. Provided is an information processing device, comprising a recognition unit 11, a detection unit 12, and an evaluation unit 13. The recognition unit 11 recognizes an evaluation subject's conduct. The detection unit 12 detects a trigger which is a state of a person other than the evaluation subject which triggers the evaluation subject's conduct. Using the detected trigger and the result of recognition by the recognition unit 13 relating to the evaluation subject's conduct, the evaluation unit 13 evaluates the evaluation subject's conduct.
摘要:
A speech recognition apparatus 20, includes; a data acquisition unit 21 that acquires speech data and sensor data to be recognized; a speech recognition unit 22 that converts the acquired speech data into text data by applying the acquired speech data and the acquired sensor data to an acoustic model which is constructed by machine learning using an embedded vector generated from sensor data related to training data in addition to speech data to be the training data and teacher data to be the training data.
摘要:
According to an example embodiment, a display apparatus includes adjustment means for comparing a plurality of first feature points specified in a face region, which is extracted from a shot image obtained by shooting an inspection target person, of the inspection target person and a plurality of first feature points specified in a face region, which is extracted from image data on a person registered in a database, of the person and adjusting a positional relationship between the shot image of the inspection target person and a registered image of the person to be generated based on the image data, and display control means for displaying the shot image of the inspection target person and a mark representing a visually recognizable second feature point to be specified from the registered image of the person in an overlapping manner on a display device.