Abstract:
A voice retrieval apparatus executes processes of: converting a retrieval string into a phoneme string; obtaining, from a time length memory, a continuous time length for each phoneme contained in the converted phoneme string; deriving, based on the obtained continuous time lengths, a plurality of time lengths corresponding to a plurality of utterance rates as candidate utterance time lengths of voices corresponding to the retrieval string; specifying, for each of the plurality of time lengths, a plurality of likelihood obtainment segments having the derived time length within a time length of a retrieval sound signal; obtaining, for each specified likelihood obtainment segment, a likelihood showing a plausibility that the segment is a segment where the voices are uttered; and identifying, based on the likelihood obtained for each of the specified likelihood obtainment segments, an estimation segment where utterance of the voices is estimated in the retrieval sound signal.
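The derivation of candidate time lengths and the enumeration of segments can be illustrated with a minimal sketch. The duration table, the rate factors, the 10 ms shift, and the function names are all hypothetical; the abstract does not specify any of them.

```python
# Hypothetical average phoneme durations in milliseconds (illustrative values only).
PHONEME_DURATION_MS = {"k": 60, "a": 90, "t": 50}

# Hypothetical utterance-rate factors: below 1 is faster speech, above 1 is slower.
RATE_FACTORS = (0.7, 1.0, 1.3)


def candidate_lengths(phonemes, durations=PHONEME_DURATION_MS, factors=RATE_FACTORS):
    """Return one candidate utterance length (ms) per utterance-rate factor."""
    base = sum(durations[p] for p in phonemes)
    return [round(base * f) for f in factors]


def segments(signal_ms, length_ms, shift_ms=10):
    """List (start, end) windows of length_ms that fit inside the signal."""
    return [(s, s + length_ms)
            for s in range(0, signal_ms - length_ms + 1, shift_ms)]


if __name__ == "__main__":
    for length in candidate_lengths(["k", "a", "t"]):
        print(length, len(segments(300, length)))
```

Each candidate length yields its own set of windows, so a likelihood would then be obtained for every window of every length before the estimation segment is chosen.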
Abstract:
A voice processing apparatus includes a first storage unit which stores a known-word, and a processor. The processor executes a voice recognition process of extracting unknown-words from an input voice signal based on a storage content of the first storage unit, and a storage control process of controlling storage in the first storage unit. The storage control process includes storing an unknown-word in the first storage unit as a known-word when information on the number of extracted unknown-words recognized to be identical meets a predetermined condition.
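The storage control described above can be sketched as a counter with a promotion rule. This is an illustration under assumptions: the "predetermined condition" is taken to be a simple count threshold, identical unknown-words are matched by exact string equality, and the class and method names are hypothetical.

```python
from collections import Counter


class Vocabulary:
    """Known-word store that promotes repeatedly seen unknown-words."""

    def __init__(self, known_words, threshold=3):
        self.known = set(known_words)      # stands in for the first storage unit
        self.unknown_counts = Counter()    # occurrences of each unknown-word
        self.threshold = threshold         # assumed form of the condition

    def observe(self, word):
        """Record one recognized word; return True if it was just promoted."""
        if word in self.known:
            return False
        self.unknown_counts[word] += 1
        if self.unknown_counts[word] >= self.threshold:
            # The condition is met: store the unknown-word as a known-word.
            self.known.add(word)
            del self.unknown_counts[word]
            return True
        return False
```

A real recognizer would compare unknown-words acoustically or by phoneme sequence rather than by spelling; string equality is used here only to keep the sketch short.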
Abstract:
A voice retrieval apparatus executes processes of: obtaining, from a time length memory, a continuous time length for each phoneme contained in a phoneme string of a retrieval string; obtaining user-specified information on an utterance rate; changing the obtained continuous time length for each phoneme in accordance with the obtained information; deriving, based on the changed continuous time lengths, an utterance time length of voices corresponding to the retrieval string; specifying a plurality of likelihood obtainment segments of the derived utterance time length in a time length of a retrieval sound signal; obtaining a likelihood showing a plausibility that the specified likelihood obtainment segment is a segment where the voices are uttered; and identifying, based on the obtained likelihood, an estimation segment where, within the retrieval sound signal, utterance of the voices is estimated, the estimation segment being identified for each specified likelihood obtainment segment.
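The effect of the user-specified utterance rate can be shown with a short sketch. The rate labels, their scale factors, and the duration values are hypothetical; the abstract says only that the stored time lengths are changed in accordance with the user's information.

```python
# Hypothetical mapping from a user-selected rate label to a duration scale factor.
RATE_SCALE = {"fast": 0.75, "normal": 1.0, "slow": 1.5}


def scaled_durations(phonemes, durations, rate):
    """Scale each phoneme's stored duration (ms) by the user-selected rate."""
    factor = RATE_SCALE[rate]
    return [durations[p] * factor for p in phonemes]


def utterance_length_ms(phonemes, durations, rate="normal"):
    """Sum the scaled durations to get a single utterance time length."""
    return sum(scaled_durations(phonemes, durations, rate))


if __name__ == "__main__":
    table = {"k": 60, "a": 90, "t": 50}  # illustrative values only
    print(utterance_length_ms(["k", "a", "t"], table, "slow"))
```

Because the user fixes the rate in advance, only one utterance time length is derived, so only one set of likelihood obtainment segments has to be scored.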
Abstract:
An audio interval detection apparatus has a processor and a storage storing instructions that, when executed by the processor, control the processor to: detect, from a target audio signal, a specified audio interval including a specified audio signal representing a state of a phoneme of a same consonant produced continuously over a period longer than a specified time; and, by eliminating at least the detected specified audio interval from the target audio signal, detect from the target audio signal an utterance audio interval that includes a speech utterance signal representing a speech utterance uttered by a speaker.
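The detection and elimination steps can be sketched over per-frame phoneme labels. This assumes that some upstream labeler, not shown, has already assigned one label to each frame; the consonant set, the frame-count threshold, and the function names are hypothetical.

```python
from itertools import groupby

# Hypothetical set of labels treated as consonants.
CONSONANTS = {"s", "sh", "f", "h", "z"}


def sustained_consonant_runs(labels, min_frames):
    """Return (start, end) frame ranges where one consonant label repeats
    for more than min_frames consecutive frames."""
    runs = []
    start = 0
    for label, group in groupby(labels):
        n = len(list(group))
        if label in CONSONANTS and n > min_frames:
            runs.append((start, start + n))
        start += n
    return runs


def remaining_frames(labels, runs):
    """Return the indices of frames left after eliminating the detected runs."""
    excluded = set()
    for begin, end in runs:
        excluded.update(range(begin, end))
    return [i for i in range(len(labels)) if i not in excluded]
```

With a fixed frame duration, `min_frames` corresponds to the specified time divided by that duration; utterance detection would then run only on the frames returned by `remaining_frames`.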
Abstract:
In a voice search device, a processor acquires a search word, converts the search word into a phoneme sequence, acquires, for each frame, an output probability of a feature quantity of a target voice signal being output from each phoneme included in the phoneme sequence, and executes relative calculation of the output probability acquired from each phoneme, based on an output probability acquired from another phoneme included in the phoneme sequence. In addition, the processor successively designates likelihood acquisition zones, acquires a likelihood indicating how likely it is that a designated likelihood acquisition zone is a zone in which voice corresponding to the search word is spoken, and identifies from the target voice signal, based on the acquired likelihood, an estimated zone in which the voice corresponding to the search word is estimated to be spoken.
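One plausible reading of the relative calculation is sketched below: in each frame, every phoneme's log output probability is expressed relative to the largest value among the phonemes of the sequence. This normalization rule is an assumption, as are the equal-length alignment used for scoring a zone and all function names; the abstract does not state the exact formula.

```python
def relative_log_probs(frame_log_probs):
    """For each frame (a dict of phoneme -> log output probability),
    subtract the frame's largest value so the best phoneme scores 0."""
    result = []
    for frame in frame_log_probs:
        best = max(frame.values())
        result.append({p: lp - best for p, lp in frame.items()})
    return result


def zone_likelihood(rel, phonemes, start, end):
    """Score frames [start, end) by splitting the zone evenly among the
    phonemes in order and summing the relative log probabilities."""
    total = 0.0
    span = end - start
    for i in range(start, end):
        idx = (i - start) * len(phonemes) // span  # phoneme assigned to frame i
        total += rel[i][phonemes[idx]]
    return total
```

A zone whose frames each favor the expected phoneme scores close to 0, while mismatched zones score more negatively, which makes likelihoods comparable across zones with different background noise levels.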