Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating an acoustic model for use in speech recognition. A system configured to practice the method first receives training data and identifies non-contextual lexical-level features in the training data. Then the system infers sentence-level features from the training data and generates a set of decision trees by node-splitting based on the non-contextual lexical-level features and the sentence-level features. The system decorrelates training vectors, based on the training data, for each decision tree in the set of decision trees to approximate full-covariance Gaussian models, and then can train an acoustic model for use in speech recognition based on the training data, the set of decision trees, and the training vectors.
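The decorrelation step can be illustrated with a small sketch. The abstract does not say which transform is used; the example below assumes a Cholesky-based whitening transform as a stand-in, applied to one pool of training vectors (for instance, the vectors assigned to a single decision tree). The function names are hypothetical. After the transform the sample covariance is the identity matrix, so a diagonal-covariance Gaussian fitted in the transformed space corresponds to a full-covariance Gaussian in the original space.

```python
import math

def mean_and_covariance(vectors):
    """Return the sample mean and (biased) covariance matrix of the vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / n
            for j in range(dim)] for i in range(dim)]
    return mean, cov

def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix (matrix must be positive definite)."""
    dim = len(matrix)
    lower = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def decorrelate(vectors):
    """Whiten the vectors: center them, then solve L * y = centered for each one."""
    mean, cov = mean_and_covariance(vectors)
    lower = cholesky(cov)
    result = []
    for v in vectors:
        centered = [x - m for x, m in zip(v, mean)]
        y = []
        for i in range(len(centered)):  # forward substitution
            partial = sum(lower[i][k] * y[k] for k in range(i))
            y.append((centered[i] - partial) / lower[i][i])
        result.append(y)
    return result

# Illustrative, strongly correlated two-dimensional training vectors (made-up values).
training_vectors = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
whitened = decorrelate(training_vectors)
```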
Abstract:
Disclosed herein are systems, methods, and computer-readable storage media for a speech recognition application for directory assistance that is based on a user's spoken search query. The spoken search query is received by a portable device, and the portable device then determines its present location. Upon determining the location of the portable device, that information is incorporated into a local language model that is used to process the search query. Finally, the portable device outputs the results of the search query based on the local language model.
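One way to picture how location information might enter a local language model is the sketch below. It assumes a unigram model whose word counts are boosted for business names listed near the device; the boost weight, the listing data, and all function names are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

# Hypothetical general-purpose word counts and per-location business listings.
GENERAL_COUNTS = {"pizza": 10, "piazza": 10, "joes": 2, "jones": 2}
LISTINGS_BY_LOCATION = {"springfield": ["Joes Pizza"], "rome": ["Piazza Jones"]}

def build_local_language_model(general_counts, local_listings, local_weight=5.0):
    """Return unigram probabilities that favor words from nearby listings."""
    counts = Counter(general_counts)
    for name in local_listings:
        for word in name.lower().split():
            counts[word] += local_weight  # assumed boost for local names
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def score_hypothesis(model, hypothesis, floor=1e-6):
    """Multiply unigram probabilities; unseen words get a small floor value."""
    score = 1.0
    for word in hypothesis.lower().split():
        score *= model.get(word, floor)
    return score

def recognize(hypotheses, location):
    """Pick the candidate transcription that the local model scores highest."""
    listings = LISTINGS_BY_LOCATION.get(location, [])
    model = build_local_language_model(GENERAL_COUNTS, listings)
    return max(hypotheses, key=lambda h: score_hypothesis(model, h))

candidates = ["joes pizza", "jones piazza"]
best_in_springfield = recognize(candidates, "springfield")
best_in_rome = recognize(candidates, "rome")
```

With these made-up counts, the same pair of acoustically similar candidates resolves differently depending on where the device is located.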
Abstract:
Disclosed are a system, method and computer-readable medium for organizing images. A method aspect relates to receiving an image into a device; receiving incidental information associated with the image; organizing the image and the incidental information into a data structure such as a sparse array; classifying the received image with an image classifier and storing the classified image in an image database; and receiving a search query and responding to it by searching for and retrieving matching images in the image database based on a comparison of the search query to the data structure.
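The steps in this method can be sketched as follows. The example assumes a dictionary as the sparse data structure (only fields that are actually present are stored) and a toy brightness classifier; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ImageRecord:
    image_id: str
    label: str
    features: dict = field(default_factory=dict)  # sparse: absent fields are not stored

class ImageDatabase:
    def __init__(self, classifier):
        self.classifier = classifier
        self.records = []

    def add(self, image_id, pixels, incidental):
        """Classify the image and store it with its non-empty incidental information."""
        features = {k: v for k, v in incidental.items() if v is not None}
        label = self.classifier(pixels)
        features["label"] = label
        self.records.append(ImageRecord(image_id, label, features))

    def search(self, query):
        """Return ids of images whose stored fields match every field in the query."""
        return [r.image_id for r in self.records
                if all(r.features.get(k) == v for k, v in query.items())]

def brightness_classifier(pixels):
    """Toy stand-in for an image classifier: label by mean pixel value."""
    return "bright" if sum(pixels) / len(pixels) > 127 else "dark"

db = ImageDatabase(brightness_classifier)
db.add("a", [200, 220], {"location": "Paris", "time": None})
db.add("b", [10, 20], {"location": "Paris"})
matches = db.search({"location": "Paris", "label": "bright"})
```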
Abstract:
Disclosed herein are systems, methods, and non-transitory computer-readable storage media for building an automatic speech recognition system through an Internet API. A network-based automatic speech recognition server configured to practice the method receives feature streams, transcriptions, and parameter values as inputs from a network client independent of knowledge of internal operations of the server. The server processes the inputs to train an acoustic model and a language model, and transmits the acoustic model and the language model to the network client. The server can also generate a log describing the processing and transmit the log to the client. On the server side, a human expert can intervene to modify how the server processes the inputs. The inputs can include an additional feature stream generated from speech by algorithms in the client's proprietary feature extraction.
Abstract:
A network communication system includes a connection server that assigns a network address within a data communication network to a subscriber terminal. The connection server receives outgoing communications from the subscriber terminal and transmits them to a network access point, and receives incoming communications from the network access point and transmits them to the subscriber terminal. The connection server intercepts a tracking cookie received from a remote server in the data communication network and intended for the subscriber terminal, and stores the tracking cookie at the connection server so that the tracking cookie can be used to support a communication session between the subscriber terminal and the remote server without the tracking cookie being stored at the subscriber terminal.
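The cookie handling described here can be sketched as a small in-memory model. It assumes HTTP-style `Set-Cookie` and `Cookie` headers represented as plain dictionaries, and it treats every cookie as a tracking cookie for simplicity; the class and method names are hypothetical.

```python
class ConnectionServer:
    def __init__(self):
        # (terminal_id, remote_host) -> {cookie_name: cookie_value}
        self.cookie_store = {}

    def forward_response(self, terminal_id, remote_host, headers):
        """Strip the cookie from an incoming response and keep it at the server."""
        incoming = dict(headers)
        set_cookie = incoming.pop("Set-Cookie", None)
        if set_cookie:
            # Keep only the name=value pair, dropping attributes such as Path.
            name, _, value = set_cookie.split(";", 1)[0].strip().partition("=")
            self.cookie_store.setdefault((terminal_id, remote_host), {})[name] = value
        return incoming  # what the subscriber terminal receives

    def forward_request(self, terminal_id, remote_host, headers):
        """Attach any stored cookies to an outgoing request on the terminal's behalf."""
        outgoing = dict(headers)
        cookies = self.cookie_store.get((terminal_id, remote_host), {})
        if cookies:
            outgoing["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
        return outgoing  # what the remote server receives

server = ConnectionServer()
response = server.forward_response(
    "terminal-1", "example.com",
    {"Set-Cookie": "track=abc123; Path=/", "Content-Type": "text/html"},
)
request = server.forward_request("terminal-1", "example.com", {"Accept": "*/*"})
```

The terminal never sees the `Set-Cookie` header, yet the remote server still receives the cookie on later requests, so the session is preserved.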
Abstract:
Recognition of sound units is improved by comparing frame-pair feature vectors, which helps compensate for context variations in the pronunciation of sound units. A plurality of reference frames of reference feature vectors representing reference words are stored. A linear predictive coder (10) generates a plurality of spectral feature vectors for each frame of the speech signals. A filter bank system (12) transforms the spectral feature vectors to filter bank representations. A principal feature vector transformer (14) transforms the filter bank representations to an identity matrix of transformed input feature vectors. A concatenate frame system (16) concatenates the input feature vectors of adjacent frames to form the feature vector of a frame-pair. A transformer (18) and a comparator (20) compute the likelihood that each input feature vector for a frame-pair was produced by each reference frame. This computation is performed individually and independently for each reference frame-pair. A dynamic time warper (22) constructs an optimum time path through the input speech signals for each of the computed likelihoods. A high level decision logic (24) recognizes the input speech signals as one of the reference words in response to the computed likelihoods and the optimum time paths.
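The frame-pair concatenation and time-warping stages can be sketched as below. This is a simplified illustration, not the disclosed pipeline: it skips the linear predictive coding, filter bank, and principal feature transform stages, and it uses squared Euclidean distance as a stand-in for the likelihood computation. All function names and the toy data are hypothetical.

```python
def frame_pairs(frames):
    """Concatenate each frame's feature vector with that of the next frame."""
    return [frames[i] + frames[i + 1] for i in range(len(frames) - 1)]

def distance(a, b):
    """Squared Euclidean distance, standing in for a negative log-likelihood."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_cost(inputs, reference):
    """Cost of the best monotonic time alignment between two vector sequences."""
    inf = float("inf")
    n, m = len(inputs), len(reference)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = distance(inputs[i - 1], reference[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(input_frames, references):
    """Return the reference word whose frame-pairs align most cheaply with the input."""
    pairs = frame_pairs(input_frames)
    return min(references, key=lambda word: dtw_cost(pairs, frame_pairs(references[word])))

# Toy one-dimensional feature vectors; the input is a slower rendition of "yes".
references = {"yes": [[0.0], [1.0], [2.0]], "no": [[2.0], [1.0], [0.0]]}
recognized = recognize([[0.0], [0.0], [1.0], [2.0]], references)
```

Because each compared vector spans two adjacent frames, the comparison reflects the transition between frames as well as the frames themselves, which is the context compensation the abstract refers to.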