Abstract:
Voice-controlled customized commands, including customization of the command to be performed (such as a number to be dialed to make a connection with an address of a corporate voice dialing system) and of the speech pattern or utterance which may be enrolled by a user to invoke the command, can be used by other users if authorized by the enrolling user. When a current user wants to use a customized command enrolled by another user, a preferably voice-actuated command is invoked to cause a search of a database containing a page of customized commands for each user and the return of the commands to which the current user is authorized access in accordance with aliases established by the enrolling user. The returned commands are preferably presented to the current user as a menu from which the current user can make a selection and obtain execution of the authorized command.
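The lookup described in this abstract can be sketched as follows. This is only an illustrative reading, not the patented implementation: the abstract does not define how aliases are stored, so the `allowed_aliases` field, the `CustomCommand` class, and the `authorized_commands` function are all hypothetical names, and the database is modeled as a plain dictionary mapping each enrolling user to that user's page of commands.

```python
from dataclasses import dataclass, field


@dataclass
class CustomCommand:
    """One enrolled command on a user's page (hypothetical structure)."""
    utterance: str                 # speech pattern enrolled to invoke the command
    action: str                    # e.g. a number to be dialed
    allowed_aliases: set = field(default_factory=set)  # assumed: users granted access


def authorized_commands(database, current_user):
    """Search every user's page and return (owner, command) pairs
    that the current user is authorized to use."""
    menu = []
    for owner, page in database.items():
        for command in page:
            if current_user in command.allowed_aliases:
                menu.append((owner, command))
    return menu


# Example database: one page of customized commands per enrolling user.
database = {
    "alice": [
        CustomCommand("call the lab", "dial 555-0100", {"bob"}),
        CustomCommand("call home", "dial 555-0101"),
    ],
    "carol": [
        CustomCommand("call support", "dial 555-0102", {"bob", "dave"}),
    ],
}

# The returned list plays the role of the menu presented to the current user.
menu_for_bob = authorized_commands(database, "bob")
```

Here the menu for `bob` contains only the commands whose enrolling user listed him, so a command without any alias (such as "call home") stays private to its owner.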
Abstract:
A speech recognition system is disclosed that is useful in, for example, hands-free voice telephone dialing applications. The system matches a spoken word (token) to one previously enrolled in the system. The system then synthesizes or replays the recognized word so that the speaker can confirm that the recognized word is indeed the correct word before further action is taken. In the case of voice-activated dialing, this avoids wrong numbers. The token itself is not explicitly recorded; rather, only the lefemes may be recorded, from which the token can be reconstructed for playback. This greatly reduces the amount of disk space needed for the database and provides the ability to reconstruct data in real time for synthesis use by a local name recognition machine.
Abstract:
Apparatus for preventing unauthorized use of a voice dialing system and, particularly, a call forwarding feature associated with the system whereby system users may forward a telephone number respectively associated therewith to a remote location in order to receive phone calls at the remote location, comprises: a database for pre-storing telephone numbers of system users and for pre-storing acoustic models respectively representative of speech associated with each system user, the acoustic models respectively corresponding to the telephone numbers; and a speaker identification module operatively coupled to the database for obtaining and decoding a speech sample from a potential system user during the potential user's attempt to make a telephone call, the speaker identification module comparing the decoded speech sample obtained with the pre-stored acoustic model associated with the telephone number dialed by the potential user; whereby if the decoded speech sample does not substantially match the pre-stored acoustic model, then the phone call attempted by the potential user is terminated.
Abstract:
The present invention includes a method of generating a set of substantially shift invariant acoustic features from an input speech signal which comprises the steps of: splitting the input speech signal into a plurality of input speech signals; respectively delaying a majority of the input speech signals by a successively incrementing time interval; respectively extracting a plurality of sets of acoustic features from the plurality of input speech signals; summing the plurality of sets of acoustic features to form a set of summed acoustic features; and dividing the set of summed acoustic features by a number equivalent to the number of sets of acoustic features summed in the summing step, thereby forming a set of averaged acoustic features which are substantially shift invariant. Further, the present invention may include a method for generating at least one substantially shift invariant speech recognition model from speech training data which comprises the steps of: inputting the speech training data a first time; extracting acoustic features from the speech training data input the first time; inputting the speech training data a plurality of times thereafter, each time respectively delaying the input speech training data by a successively incrementing time interval; respectively extracting acoustic features from each delayed speech training data input each time; and utilizing at least the acoustic features extracted in the extracting steps to form the at least one speech recognition model which is substantially shift invariant. Still further, the present invention may include a synchrosqueezing process in the feature extraction steps. Also, the invention contemplates implementing these processes individually, in combination with another of the processes, or in a combination of all the processes.
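The first method in this abstract (split, delay, extract, sum, divide) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented feature extractor: the abstract does not name the acoustic features, so `frame_energies` (log energy per frame) is a hypothetical stand-in, the delays are assumed to be 0, 1, 2, ... samples, and all function and parameter names are invented for the example.

```python
import math


def frame_energies(signal, frame_len=4):
    """Toy feature extractor (assumption): log energy of each
    non-overlapping frame of frame_len samples."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats


def delayed(signal, delay):
    """Delay a signal by `delay` samples: prepend zeros and keep the
    original length so every copy yields the same number of frames."""
    padded = [0.0] * delay + list(signal)
    return padded[:len(signal)]


def shift_averaged_features(signal, num_copies=4, frame_len=4):
    """Split the input into num_copies copies, delay all but the first by a
    successively incrementing interval, extract features from each copy,
    sum the feature sets, and divide by the number of sets summed."""
    copies = [delayed(signal, d) for d in range(num_copies)]
    feature_sets = [frame_energies(c, frame_len) for c in copies]
    num_frames = len(feature_sets[0])
    summed = [sum(fs[i] for fs in feature_sets) for i in range(num_frames)]
    return [s / len(feature_sets) for s in summed]


signal = [float(i % 5) for i in range(16)]
averaged = shift_averaged_features(signal, num_copies=4, frame_len=4)
```

Because each frame's averaged value pools features computed at several frame alignments, a small shift of the input changes the result less than it would change any single feature set, which is the sense of "substantially shift invariant" in the abstract.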
Abstract:
A method and an apparatus are provided for performing speech recognition on speech segments frequently input by a user. The method and the apparatus include the use of keyword scoring in connection with a speech recognition vocabulary, a temporary score, and a predetermined margin to determine an appropriate output as being representative of the input speech segment.
Abstract:
Clusters of quantized feature vectors are processed against each other using a threshold distance value to cluster mean values of sets of parameters contained in speaker-specific codebooks to form classes of speakers against which feature vectors computed from an arbitrary input speech signal can be compared to identify a speaker class. The number of codebooks considered in the comparison may thus be reduced to limit mixture elements which engender ambiguity and reduce system response speed when the speaker population becomes large. A speaker class processing model which is speaker independent within the class may be trained on one or more members of the class and selected for implementation in a speech recognition processor in accordance with the speaker class recognized, to further improve speech recognition to a level comparable to that of a speaker-dependent model. Formation of speaker classes can be supervised by identification of groups of speakers to be included in the class, with the speaker class dependent model trained on members of a respective group.
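The threshold-based formation of speaker classes can be illustrated with a simple greedy clustering. This is a sketch under simplifying assumptions rather than the method of the patent: each speaker's codebook is reduced to a single mean vector, Euclidean distance is assumed as the distance measure, and `form_speaker_classes` and `identify_class` are hypothetical names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def form_speaker_classes(codebook_means, threshold):
    """Greedily group speakers: a speaker joins the first class whose
    centroid lies within `threshold` of the speaker's mean vector,
    otherwise the speaker starts a new class."""
    classes = []  # each entry: {"centroid": [...], "members": [...]}
    for speaker, mean in codebook_means.items():
        for cls in classes:
            if euclidean(mean, cls["centroid"]) <= threshold:
                cls["members"].append(speaker)
                n = len(cls["members"])
                # Running update of the class centroid.
                cls["centroid"] = [c + (m - c) / n
                                   for c, m in zip(cls["centroid"], mean)]
                break
        else:
            classes.append({"centroid": list(mean), "members": [speaker]})
    return classes


def identify_class(feature_vector, classes):
    """Return the index of the class whose centroid is nearest the input."""
    return min(range(len(classes)),
               key=lambda i: euclidean(feature_vector, classes[i]["centroid"]))


codebook_means = {
    "spk_a": [0.0, 0.0],
    "spk_b": [0.5, 0.0],
    "spk_c": [10.0, 10.0],
    "spk_d": [10.5, 10.0],
}
classes = form_speaker_classes(codebook_means, threshold=1.0)
nearest = identify_class([9.0, 9.0], classes)
```

With these toy values the four speakers collapse into two classes, so an incoming feature vector is compared against two centroids instead of four codebooks, which mirrors the reduction in comparisons that the abstract describes for large speaker populations.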