Abstract:
The electronic device includes: a memory storing a speech recognition model and first recognition information, obtained through the speech recognition model, corresponding to a first user speech, the speech recognition model comprising a first network, a second network, and a third network; and a processor that inputs speech data corresponding to a second user speech into the first network to obtain a first vector, inputs the first recognition information into the second network, which generates a vector based on first weight information, to obtain a second vector, and inputs the first vector and the second vector into the third network, which generates recognition information based on second weight information, to obtain second recognition information corresponding to the second user speech, wherein at least part of the second weight information is identical to the first weight information.
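The weight-sharing arrangement described above can be sketched as three small feed-forward maps, where part of the third network's weight matrix is literally the second network's weight matrix. All dimensions, the tanh nonlinearity, and the names `first_network`/`second_network`/`third_network` are illustrative assumptions, not details from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the abstract does not specify them.
D_AUDIO, D_INFO, D_HID = 40, 16, 32

W1 = 0.1 * rng.standard_normal((D_AUDIO, D_HID))    # first-network weights
W2 = 0.1 * rng.standard_normal((D_INFO, D_HID))     # "first weight information"
W3 = 0.1 * rng.standard_normal((2 * D_HID, D_HID))  # "second weight information"
W3[:D_INFO, :] = W2  # at least part of the second weights equals the first

def first_network(speech):
    """Speech data for the second user speech -> first vector."""
    return np.tanh(speech @ W1)

def second_network(info):
    """Stored first recognition info -> second vector, via the first weights."""
    return np.tanh(info @ W2)

def third_network(v1, v2):
    """First and second vectors -> second recognition info."""
    return np.tanh(np.concatenate([v1, v2]) @ W3)

speech2 = rng.standard_normal(D_AUDIO)  # second user speech (stand-in features)
info1 = rng.standard_normal(D_INFO)     # first recognition info from memory
info2 = third_network(first_network(speech2), second_network(info1))
```

Sharing a sub-block of `W3` with `W2` is one plausible reading of "at least part of the second weight information is identical to the first weight information"; a real model would share entire layers rather than a matrix slice.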
Abstract:
A method and system that combines voice recognition engines (104, 108, 112, 114) and resolves differences between the results of individual voice recognition engines (104, 106, 108, 112, 114) using a mapping function. Speaker independent voice recognition engine (104) and speaker-dependent voice recognition engine (106) are combined. Hidden Markov Model (HMM) engines (108, 114) and Dynamic Time Warping (DTW) engines (104, 106, 112) are combined.
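One simple way to realize a mapping function that resolves disagreements between engines is a confidence-weighted vote over their hypotheses. The weights and the three engine labels in the comment are illustrative assumptions; the abstract does not specify the mapping function's form.

```python
def combine(engine_results, weights):
    """Resolve engine disagreements with a weighted-vote mapping function.

    engine_results: list of (hypothesis, confidence) pairs, one per engine.
    weights: per-engine trust weights (e.g. favoring the speaker-dependent engine).
    """
    scores = {}
    for (word, conf), w in zip(engine_results, weights):
        scores[word] = scores.get(word, 0.0) + w * conf
    return max(scores, key=scores.get)

# Illustrative outputs from, say, an SI-HMM, an SD-DTW, and an SI-DTW engine.
results = [("yes", 0.9), ("yes", 0.6), ("no", 0.8)]
best = combine(results, weights=[1.0, 1.2, 1.0])
```

Here "yes" accumulates 1.0·0.9 + 1.2·0.6 = 1.62 against 0.8 for "no", so the combined system keeps "yes" even though one engine disagreed.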
Abstract:
Speech recognition uses a wide token builder (66), gain and noise adapter (70) and noise adapted Dynamic Time Warping (60). Wide token builder produces a padded test token expanded with at least one blank frame before and after the input test utterance. Gain and noise adapter adapts each padded reference template with noise and gain qualities producing adapted reference templates having noise frames wherever a blank frame was originally placed and noise adapted speech where speech exists. Dynamic Time Warping (DTW) is performed on the noise adapted templates.
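The three stages named above can be sketched as follows: pad the test token with blank frames, adapt each reference template with noise and gain, then score with standard DTW. The additive gain/noise model and the Euclidean frame distance are assumptions; the abstract does not give the adaptation formula.

```python
import numpy as np

def pad_token(test_token, n_blank=1):
    """Wide token builder: add blank (zero) frames before and after the utterance."""
    blank = np.zeros((n_blank, test_token.shape[1]))
    return np.vstack([blank, test_token, blank])

def adapt_template(ref_template, noise_frame, gain, n_blank=1):
    """Gain/noise adapter: noise frames where blanks were placed, and
    gain-scaled speech with additive noise where speech exists (crude model)."""
    noise = np.tile(noise_frame, (n_blank, 1))
    speech = gain * ref_template + noise_frame
    return np.vstack([noise, speech, noise])

def dtw(a, b):
    """Standard DTW distance between two frame sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(1)
ref = rng.standard_normal((5, 3))            # 5-frame reference template
noise_frame = 0.01 * rng.standard_normal(3)  # estimated ambient-noise frame
padded_test = pad_token(ref.copy())
adapted_ref = adapt_template(ref, noise_frame, gain=1.0)
score = dtw(padded_test, adapted_ref)
```

With unit gain and near-zero noise, the adapted template nearly matches the padded token, so the DTW score stays small; real noise conditions would shift both in the same way, which is the point of the adaptation.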
Abstract:
The invention concerns a device comprising: a memory containing a series of numbers and vocal prints; an acoustic transducer, for picking up a correspondent's name spoken by the user; voice recognition means, for analysing the recorded correspondent's name and transforming it into a voice print; means for selectively addressing the memory, comprising associative means, for finding in the memory a voice print information corresponding to the one supplied by the voice recognition means and, if they match, for addressing the memory on the corresponding position; and means, co-operating with the associative means, for applying to the radiotelephone circuits the addressed directory number. The voice recognition means evaluate and memorise a current sound level picked up by the transducer in the absence of a word signal; in the presence of a word signal, they subtract from the picked up signal the previously evaluated current sound level and apply on the resulting signal a DTW voice recognition algorithm with form recognition by dynamic programming adapted to the word using functions for extracting dynamic parameters, in particular a dynamic predictive algorithm with forward and/or backward and/or frequency masking.
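The noise-tracking step described above, evaluating the ambient level during silence and subtracting it when a word is present, can be sketched with a simple running estimate. The smoothing factor, the energy-domain subtraction, and the class name `NoiseGate` are illustrative assumptions.

```python
class NoiseGate:
    """Tracks ambient level during silence; subtracts it during speech,
    before the DTW recognition stage (illustrative sketch)."""

    def __init__(self, alpha=0.9):
        self.level = 0.0
        self.alpha = alpha  # smoothing factor for the running noise estimate

    def update_silence(self, frame_energy):
        """Called in the absence of a word signal: memorise the current level."""
        self.level = self.alpha * self.level + (1 - self.alpha) * frame_energy

    def denoise(self, frame_energy):
        """Called in the presence of a word: subtract the memorised level."""
        return max(frame_energy - self.level, 0.0)

gate = NoiseGate()
for _ in range(50):          # 50 silent frames with unit ambient energy
    gate.update_silence(1.0)
clean = gate.denoise(1.5)    # speech frame: ambient floor removed
```

The abstract's dynamic-parameter extraction and masking functions would operate on the cleaned signal; they are outside the scope of this sketch.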
Abstract:
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sequence modeling. One of the methods includes receiving an input sequence having a plurality of input positions; determining a plurality of blocks of consecutive input positions; processing the input sequence using a neural network to generate a latent alignment, comprising, at each of a plurality of input time steps: receiving a partial latent alignment from a previous input time step; selecting an input position in each block, wherein the token at the selected input position of the partial latent alignment in each block is a mask token; and processing the partial latent alignment and the input sequence using the neural network to generate a new latent alignment, wherein the new latent alignment comprises, at the selected input position in each block, an output token or a blank token; and generating, using the latent alignment, an output sequence.
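The blockwise decoding loop can be illustrated with a toy stand-in for the neural network: the alignment starts fully masked, one masked position per block is filled at each time step (with an output token or a blank), and blanks are collapsed at the end. The block size, the mask/blank symbols, and `toy_model` are assumptions; a real system would run a trained network in place of the stub.

```python
MASK, BLANK = "<mask>", "<blank>"

def toy_model(partial_alignment, inputs, pos):
    """Stand-in for the neural network: emits a blank for a repeated input
    token, else echoes the input token (purely illustrative behavior)."""
    if pos > 0 and inputs[pos] == inputs[pos - 1]:
        return BLANK
    return inputs[pos]

def decode(inputs, block_size=2):
    alignment = [MASK] * len(inputs)
    blocks = [range(s, min(s + block_size, len(inputs)))
              for s in range(0, len(inputs), block_size)]
    # At each input time step, one masked position per block is filled;
    # all blocks are processed in parallel by the real model.
    for _step in range(block_size):
        for blk in blocks:
            masked = [p for p in blk if alignment[p] == MASK]
            if masked:
                p = masked[0]
                alignment[p] = toy_model(alignment, inputs, p)
    return [t for t in alignment if t != BLANK]  # collapse blanks into output

out = decode(list("hello"))
```

For the input "hello", the toy rule blanks the second "l", so the collapsed output sequence is `['h', 'e', 'l', 'o']`; the structure of the loop, not the stub's rule, is the point.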
Abstract:
Reference-sample feature vectors that quantify acoustic features of different respective portions of at least one reference speech sample (44), which was produced by a subject (22) at a first time while a physiological state of the subject was known, are obtained. At least one test speech sample (56) that was produced by the subject at a second time, while the physiological state of the subject was unknown, is received. Test-sample feature vectors (60) that quantify the acoustic features of different respective portions (58) of the test speech sample are computed. The test-sample feature vectors are mapped to respective ones of the reference-sample feature vectors, under predefined constraints, such that a total distance between the test-sample feature vectors and the respective ones of the reference-sample feature vectors is minimized. In response to the mapping, an output indicating the physiological state of the subject at the second time is generated.
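The constrained minimum-distance mapping can be realized with a dynamic-programming alignment that also recovers which reference-sample vector each test-sample vector maps to. Monotonic, boundary-matching constraints (in the spirit of DTW) and the Euclidean distance are assumptions; the abstract only says the mapping is performed under predefined constraints.

```python
import numpy as np

def align(test_vecs, ref_vecs):
    """Map test-sample vectors to reference-sample vectors so that the total
    distance is minimal under monotonic alignment constraints (assumption).

    Returns (total_distance, path), where path pairs test index -> ref index.
    """
    n, m = len(test_vecs), len(ref_vecs)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = np.linalg.norm(test_vecs[i - 1] - ref_vecs[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack to recover the mapping itself, not just its cost.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

vecs = np.array([[0.0], [1.0], [2.0]])
dist, path = align(vecs, vecs)
```

The total distance from such an alignment is what a downstream classifier could threshold to indicate the subject's physiological state; that decision step is not sketched here.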
Abstract:
The present technical solution relates generally to the field of computer data processing and, in particular, to machine learning methods for building models that analyze natural-language dialogues. A computer-implemented method for creating an artificial-intelligence-based dialogue analysis model for processing user requests, executed by at least one processor and comprising the steps of: obtaining a set of primary data, the set including at least textual data of dialogues between users and operators, containing user requests and operator responses; processing the obtained data set, during which a training sample for an artificial neural network is formed containing positive and negative examples of user requests based on an analysis of the dialogue context, the positive examples containing a semantically related set of operator utterances in response to a user request; extracting and encoding a vector representation of each utterance from the positive and negative examples of the training sample mentioned in the previous step; and applying the formed training sample to train a model that determines relevant utterances from the context of user requests in dialogues.
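The training-sample construction described above can be sketched as pairing each user request with its own operator replies (positive) and with replies drawn from a different dialogue (negative), then encoding each utterance as a vector. The negative-sampling strategy, the toy bag-of-characters encoder, and all names here are illustrative assumptions; a real system would use a trained text encoder.

```python
import random

def build_training_set(dialogues, seed=0):
    """dialogues: list of (user_request, operator_replies) pairs.

    Positive example: a request with its own, semantically related replies.
    Negative example: the same request with replies from another dialogue.
    """
    rng = random.Random(seed)
    samples = []
    for i, (request, replies) in enumerate(dialogues):
        samples.append((request, replies, 1))          # positive
        j = rng.choice([k for k in range(len(dialogues)) if k != i])
        samples.append((request, dialogues[j][1], 0))  # negative
    return samples

def encode(utterance, dim=8):
    """Toy bag-of-characters vector, standing in for a learned encoder."""
    v = [0.0] * dim
    for ch in utterance.lower():
        v[ord(ch) % dim] += 1.0
    return v

dialogues = [
    ("reset my password", ["click forgot password", "enter your email"]),
    ("cancel my order", ["open your orders", "choose cancel"]),
    ("update billing info", ["go to settings", "edit payment method"]),
]
samples = build_training_set(dialogues)
```

The labeled pairs would then feed a relevance model (e.g. a binary classifier over encoded request/reply pairs), which is the final training step the abstract describes.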
Abstract:
A method and system that improves voice recognition by improving storage of voice recognition (VR) templates. The improved storage means that more VR models can be stored in memory. The more VR models that are stored in memory, the more robust the VR system and therefore the more accurate the VR system. Lossy compression techniques are used to compress VR models. In one embodiment, A-law compression and A-law expansion are used to compress and expand VR models. In another embodiment, Mu-law compression and Mu-law expansion are used to compress and expand VR models. VR models are compressed during a training process and they are expanded during voice recognition.
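μ-law companding as applied to stored template parameters can be sketched with the standard formulas: compress at training time, store at reduced precision, and expand at recognition time. The 8-bit-style quantization step is an illustrative assumption about how the saved memory is realized.

```python
import math

MU = 255.0  # standard μ-law parameter

def mu_compress(x):
    """μ-law compression of a value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """μ-law expansion, the inverse of mu_compress."""
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1.0) / MU, y)

# A template value is compressed during training, stored coarsely,
# and expanded during recognition (illustrative pipeline).
x = 0.123
stored = round(mu_compress(x) * 127) / 127  # coarse 8-bit-style quantization
recovered = mu_expand(stored)
```

Because μ-law spends its resolution near zero, where small template values cluster, the quantization error after expansion stays small; A-law companding would follow the same pipeline with a different compression curve.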