摘要:
The system retrieves information from the internet using multiple search engines that are simultaneously launched by the search engine commander. The commander is responsive to a speech-enabled system including a speech recognizer and natural language parser. The user speaks to the system in natural language requests, and the parser extracts the semantic content from the user's speech, based on a set of goal oriented grammars. The preferred system includes a fixed grammar and an updatable or downloaded grammar, allowing the system to be used without extensive training and yet capable of being customized for a particular user's purposes. Results obtained from the search engines are filtered based on information extracted from an electronic program guide and from prestored user profile data. The results may be displayed on screen or through synthesized speech.
摘要:
The remote control unit supports multi-modal dialog with the user, through which the user can easily select programs for viewing or recording. The remote control houses a microphone into which the user can input natural language speech. The input speech is recognized and interpreted by a natural language parser that extracts the semantic content of the user's speech. The parser works in conjunction with an electronic program guide, through which the remote control system is able to ascertain what programs are available for viewing or recording and supply appropriate prompts to the user. In one embodiment, the remote control includes a touch screen display upon which the user may view prompts or make selections by pen input or tapping. Selections made on the touch screen automatically limit the context of the ongoing dialog between user and remote control, allowing the user to interact naturally with the unit. The remote control unit can control virtually any audio-video component, including those designed before the current technology. The remote control system can be packaged entirely within the remote control handheld unit, or components may be distributed in other systems attached to the user's multimedia equipment.
摘要:
Speech recognition and natural language parsing components are used to extract the meaning of the user's spoken input. The system stores a semantic representation of an electronic program guide, and the contents of the program guide can be mapped into the grammars used by the natural language parser. Thus, when the user wishes to navigate through the complex menu structure of the electronic program guide, he or she only needs to speak in natural language sentences. The system automatically filters the contents of the program guide and supplies the user with on-screen display or synthesized speech responses to the user's request.
摘要:
Speech input supplied by the user is evaluated by the speaker verification/identification module, and based on the evaluation, parameters are retrieved from a user profile database. These parameters adapt the speech models of the speech recognizer and also supply the natural language parser with customized dialog grammars. The user's speech is then interpreted by the speech recognizer and natural language parser to determine the meaning of the user's spoken input in order to control the television tuner. The parser works in conjunction with a command module that mediates the dialog with the user, providing on-screen prompts or synthesized speech queries to elicit further input from the user when needed. The system integrates with an electronic program guide, so that the natural language parser is made aware of what programs are available when conducting the synthetic dialog with the user.
摘要:
Users of the system can access the TV contents and program media recorder by speaking in natural language sentences. The user interacts with the television and with other multimedia equipment, such as media recorders and VCRs, through the unified access controller. A speaker verification/identification module determines the identity of the speaker and this information is used to control how the dialog between user and system proceeds. Speech can be input through either a microphone or over the telephone. In addition, the user can interact with the system using a suitable computer attached via the internet. Regardless of the mode of access, the unified access controller interprets the semantic content of the user's request and supplies the appropriate control signals to the television tuner and/or recorder.
摘要:
A computer-implemented method and apparatus is provided for processing a spoken request from a user. A speech recognizer converts the spoken request into a digital format. A frame data structure associates semantic components of the digitized spoken request with predetermined slots. The slots are indicative of data which are used to achieve a predetermined goal. A speech understanding module which is connected to the speech recognizer and to the frame data structure determines semantic components of the spoken request. The slots are populated based upon the determined semantic components. A dialog manager which is connected to the speech understanding module may determine at least one slot which is unpopulated based upon the determined semantic components and in a preferred embodiment may provide confirmation of the populated slots. A computer generated-request is formulated in order for the user to provide data related to the unpopulated slot. The method and apparatus are well-suited (but not limited) to use in a hand-held speech translation device.
摘要:
A reduced dimensionality eigenvoice analytical technique is used during training to develop context-dependent acoustic models for allophones. Re-estimation processes are performed to more strongly separate speaker-dependent and speaker-independent components of the speech model. The eigenvoice technique is also used during run time upon the speech of a new speaker. The technique removes individual speaker idiosyncrasies, to produce more universally applicable and robust allophone models. In one embodiment the eigenvoice technique is used to identify the centroid of each speaker, which may then be “subtracted out” of the recognition equation.
摘要:
Electronic commerce (E-commerce) and Voice commerce (V-commerce) proceeds by having the user speak into the system. The user's speech is converted by speech recognizer into a form required by the transaction processor that effects the electronic commerce operation. A dimensionality reduction processor converts the user's input speech into a reduced dimensionality set of values termed eigenvoice parameters. These parameters are compared with a set of previously stored eigenvoice parameters representing a speaker population (the eigenspace representing speaker space) and the comparison is used by the speech model adaptation system to rapidly adapt the speech recognizer to the user's speech characteristics. The user's eigenvoice parameters are also stored for subsequent use by the speaker verification and speaker identification modules.
摘要:
A set of speaker dependent models is trained upon a comparatively large number of training speakers, one model per speaker, and model parameters are extracted in a predefined order to construct a set of supervectors, one per speaker. Principle component analysis is then performed on the set of supervectors to generate a set of eigenvectors that define an eigenvoice space. If desired, the number of vectors may be reduced to achieve data compression. Thereafter, a new speaker provides adaptation data from which a supervector is constructed by constraining this supervector to be in the eigenvoice space based on a maximum likelihood estimation. The resulting coefficients in the eigenspace of this new speaker may then be used to construct a new set of model parameters from which an adapted model is constructed for that speaker. Environmental adaptation may be performed by including environmental variations in the training data.
摘要:
Decision trees are used to store a series of yes-no questions that can be used to convert spelled-word letter sequences into pronunciations. Letter-only trees, having internal nodes populated with questions about letters in the input sequence, generate one or more pronunciations based on probability data stored in the leaf nodes of the tree. The pronunciations may then be improved by processing them using mixed trees which are populated with questions about letters in the sequence and also questions about phonemes associated with those letters. The mixed tree screens out pronunciations that would not occur in natural speech, thereby greatly improving the results of the letter-to-pronunciation transformation.