Abstract:
A voiced/unvoiced speech classifier (30) includes a speech segmentor (34) which segments an input digitized speech waveform into frames of speech and a band-pass filter (36) which filters the frames of speech. A relative energy generator (38) generates a relative energy value for each filtered frame of speech and a decision parameter generator (52) including an autocorrelation calculator (54) and a pitch calculator (56) generates a decision parameter based on an autocorrelation function and a pitch frequency index for the filtered frames of speech. A normalized energy calculator (46) adjusts the threshold and then normalizes the relative energy. A comparator (60) provides a signal indicative of whether a frame of speech is voiced speech or unvoiced speech depending on a comparison of the decision parameter and the normalized relative energy value for each filtered frame of speech.
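The classification pipeline described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the patented method: the band-pass filter (36) is omitted, relative energy is taken as frame energy divided by the largest frame energy, the decision parameter is the peak of the normalized autocorrelation over a pitch-lag range, and the comparator uses fixed thresholds instead of comparing against an adaptively normalized energy. All function names, parameter names, and default values are hypothetical.

```python
import math

def frames(signal, frame_len):
    """Split samples into non-overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Return (peak, lag) of the autocorrelation divided by the zero-lag energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0
    best_val, best_lag = 0.0, 0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        if r > best_val:
            best_val, best_lag = r, lag
    return best_val, best_lag

def classify_frames(signal, frame_len=240, min_lag=20, max_lag=160,
                    peak_threshold=0.5, energy_floor=0.01):
    """Label each frame "voiced" or "unvoiced" (thresholds are assumed values)."""
    fs = frames(signal, frame_len)
    energies = [sum(x * x for x in f) for f in fs]
    max_e = max(energies, default=0.0)
    labels = []
    for f, e in zip(fs, energies):
        rel = e / max_e if max_e > 0 else 0.0  # relative energy in [0, 1]
        peak, _ = normalized_autocorr_peak(f, min_lag, max_lag)
        voiced = rel >= energy_floor and peak >= peak_threshold
        labels.append("voiced" if voiced else "unvoiced")
    return labels
```

A strongly periodic frame yields a high autocorrelation peak at its pitch lag, while a noise-like frame does not, which is the property the decision parameter is meant to capture.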
Abstract:
A method and apparatus for speech recognition involves classifying (38) a digitized speech segment according to whether the speech segment comprises voiced or unvoiced speech and utilizing that classification to generate tonal feature vectors (41) of the speech segment when the speech is voiced. The tonal feature vectors are then combined (42) with other non-tonal feature vectors (40) to provide speech feature vectors. The speech feature vectors are compared (35) with previously stored models of speech feature vectors (37) for different segments of speech to determine which previously stored model is a most likely match for the segment to be recognized.
Abstract:
Techniques disclosed herein include systems and methods of automated speech recognition (ASR) for voice destination entry (VDE) that support open voice searching (natural language searching) of destinations. A first part uses a server-based automatic speech recognizer. A second part uses client-based ASR processing. The techniques thus provide a hybrid VDE solution that gives users an accurate and flexible way to use speech recognition technologies. A server-based speech recognizer executes the open-search task, while a client-based recognizer refines the results from the server to deliver an optimized result. This system and method significantly improves recognition accuracy for dictation-engine-based point-of-interest (POI) search of Mandarin Chinese input and input from other languages. Moreover, the methods herein greatly improve the user experience by allowing users to say a partial POI name, an abbreviation, or even a POI name in reversed word order.
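The client-side refinement step can be sketched as follows. This is only an illustration under assumptions: the abstract does not say how the client refines the server results, so the sketch substitutes a simple order-insensitive token overlap (Jaccard similarity) between each server hypothesis and a local POI list. The names `tokens` and `refine` are hypothetical, and whitespace tokenization is a placeholder that would not suit Mandarin text as-is.

```python
def tokens(text):
    """Lower-case whitespace tokens as a set, so word order is ignored."""
    return set(text.lower().split())

def refine(server_hypotheses, poi_names):
    """Return (best POI name, score) over all hypotheses, or (None, 0.0) if nothing overlaps."""
    best_name, best_score = None, 0.0
    for hyp in server_hypotheses:
        hyp_tokens = tokens(hyp)
        for name in poi_names:
            name_tokens = tokens(name)
            union = hyp_tokens | name_tokens
            # Jaccard similarity: shared tokens divided by all distinct tokens.
            score = len(hyp_tokens & name_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score
```

Because the comparison works on token sets, a reversed-order query such as "zoo central park" scores the same as "central park zoo", and a partial name still ranks the intended POI above unrelated entries.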
Abstract:
There is described a method (300) for open vocabulary speech recognition performed by an electronic device (100). The method (300) includes receiving an utterance waveform (320) and processing the waveform (350) to provide feature vectors representing the waveform. A comparing step (360) then compares the feature vectors with concatenated isolated word acoustic models from a concatenated isolated word acoustic model list to select a suitable concatenated isolated word acoustic model. A response step (370) then provides a response depending on the suitable concatenated isolated word acoustic model. The response is typically a control signal for activating a function of the device (100).
Abstract:
A method of estimating a confidence measure for a speech recognition system involves comparing an input speech signal with a number of predetermined models of possible speech signals. Best scores indicating the degree of similarity between the input speech signal and each of the predetermined models are then used to determine a normalized variance, which is used as the confidence measure. To determine whether the input speech signal has been correctly recognized, the confidence measure is compared to a threshold value. The threshold value is weighted according to the signal-to-noise ratio of the input speech signal and according to the number of predetermined models used.
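A minimal sketch of such a confidence test is given below. The abstract does not define the normalization or the weighting, so several points are assumptions: the normalized variance is taken as the population variance divided by the squared mean of the best scores, the two weights are plain multipliers supplied by the caller, and recognition is accepted when the confidence measure is at least the weighted threshold. The function and parameter names are hypothetical.

```python
from statistics import fmean, pvariance

def normalized_variance(best_scores):
    """Population variance of the scores divided by the squared mean (0.0 if the mean is 0)."""
    mean = fmean(best_scores)
    if mean == 0:
        return 0.0
    return pvariance(best_scores) / (mean * mean)

def is_accepted(best_scores, base_threshold, snr_weight, model_count_weight):
    """Compare the confidence measure against a threshold scaled by two caller-supplied weights.

    snr_weight and model_count_weight stand in for the SNR-dependent and
    model-count-dependent weighting; how they are derived is not specified here.
    """
    threshold = base_threshold * snr_weight * model_count_weight
    return normalized_variance(best_scores) >= threshold
```

Under these assumptions, a score list in which one model clearly stands out produces a large normalized variance, whereas near-identical scores produce a value close to zero.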
Abstract:
A system [100] includes an audio reception device [105] to receive audio from a person speaking and convert the audio to a text format. An intelligent agent [110] receives the text format and detects at least one key term in the text format based on predetermined criteria. A logic engine [115] compares the at least one key term with a listener knowledge base [125] corresponding to a listener to determine context information corresponding to the at least one key term. A search device [135] searches for multimedia content corresponding to the context information. A communication device [150] communicates display content comprising at least one of: the multimedia content, and a link to the multimedia content to an electronic display device [155] adapted to display the display content.
Abstract:
A method, apparatus, and electronic device for optimizing a media presentation to a group are disclosed. A memory may store a personal media user profile for a user. A processor may create a group media user profile from the personal media user profile and associated individual media user profiles. A network interface may send a request to a digital media content source for a set of digital media content with a digital media content profile that matches the group media user profile.