Search engine with natural language-based robust parsing for user query and relevance feedback learning
    13.
    Granted invention patent
    Status: in force

    Publication No.: US06766320B1

    Publication date: 2004-07-20

    Application No.: US09645806

    Filing date: 2000-08-24

    IPC classification: G06F17/30

    Abstract: A search engine architecture is designed to handle a full range of user queries, from complex sentence-based queries to simple keyword searches. The search engine architecture includes a natural language parser that parses a user query and extracts syntactic and semantic information. The parser is robust in the sense that it not only returns fully-parsed results (e.g., a parse tree), but is also capable of returning partially-parsed fragments in those cases where more accurate or descriptive information in the user query is unavailable. A question matcher is employed to match the fully-parsed output and the partially-parsed fragments to a set of frequently asked questions (FAQs) stored in a database. The question matcher then correlates the questions with a group of possible answers arranged in standard templates that represent possible solutions to the user query. The search engine architecture also has a keyword searcher to locate other possible answers by searching on any keywords returned from the parser. The answers returned from the question matcher and the keyword searcher are presented to the user for confirmation as to which answer best represents the user's intentions when entering the initial search query. The search engine architecture logs the queries, the answers returned to the user, and the user's confirmation feedback in a log database. The search engine has a log analyzer to evaluate the log database to glean information that improves performance of the search engine over time by training the parser and the question matcher.
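The flow described in the abstract (robust parse, FAQ matching, keyword fallback, logged confirmation) can be pictured with a small sketch. Everything named here is hypothetical and not taken from the patent: the `FAQ` table, the stop-word list, the concept-overlap score, and the in-memory `LOG` are toy stand-ins chosen only to make the control flow concrete.

```python
from dataclasses import dataclass

# Hypothetical FAQ store: concept set -> answer template (illustrative only).
FAQ = {
    frozenset({"flight", "price"}): "template: flight fares",
    frozenset({"hotel", "book"}): "template: hotel booking",
}
STOPWORDS = {"how", "do", "i", "a", "the", "to", "what", "is", "of", "for", "in"}
KNOWN_CONCEPTS = {word for key in FAQ for word in key}
LOG = []  # stand-in for the log database


@dataclass
class ParseResult:
    full: bool        # True when every content word was recognized
    fragments: list   # recognized concepts (the partial parse)
    keywords: list    # unrecognized content words, left for keyword search


def robust_parse(query):
    """Toy 'robust' parser: returns whatever fragments it can recognize."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    content = [w for w in words if w and w not in STOPWORDS]
    fragments = [w for w in content if w in KNOWN_CONCEPTS]
    keywords = [w for w in content if w not in KNOWN_CONCEPTS]
    return ParseResult(bool(content) and not keywords, fragments, keywords)


def match_faq(parse):
    """Rank FAQ templates by the share of their concepts found in the parse."""
    concepts = set(parse.fragments)
    scored = [(len(concepts & key) / len(key), answer) for key, answer in FAQ.items()]
    return [answer for score, answer in sorted(scored, reverse=True) if score > 0]


def keyword_search(keywords, documents):
    return [d for d in documents if any(k in d.lower() for k in keywords)]


def search(query, documents):
    parse = robust_parse(query)
    answers = match_faq(parse) + keyword_search(parse.keywords, documents)
    LOG.append({"query": query, "answers": answers, "confirmed": None})
    return answers


def confirm(choice):
    """Record which answer the user confirmed for the most recent query."""
    LOG[-1]["confirmed"] = choice


answers = search("What is the price of a flight to Paris?",
                 ["Paris travel guide", "Rome museums"])
confirm(answers[0])
```

A log analyzer in the sense of the abstract would read `LOG` offline to retrain the parser and matcher; that step is omitted here.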


    Handwriting signal processing front-end for handwriting recognizers
    14.
    Granted invention patent
    Status: expired

    Publication No.: US5577135A

    Publication date: 1996-11-19

    Application No.: US204031

    Filing date: 1994-03-01

    CPC classification: G06K9/00422 G06K9/6218

    Abstract: A handwriting signal processing front-end method and apparatus for a handwriting training and recognition system which includes non-uniform segmentation and feature extraction in combination with multiple vector quantization. In a training phase, digitized handwriting samples are partitioned into segments of unequal length. Features are extracted from the segments and are grouped to form feature vectors for each segment. Groups of adjacent feature vectors are then combined to form input frames. Feature-specific vectors are formed by grouping features of the same type from each of the feature vectors within a frame. Multiple vector quantization is then performed on each feature-specific vector to statistically model the distributions of the vectors for each feature by identifying clusters of the vectors and determining the mean locations of the vectors in the clusters. Each mean location is represented by a codebook symbol and this information is stored in a codebook for each feature. These codebooks are then used to train a recognition system. In the testing phase, where the recognition system is to identify handwriting, digitized test handwriting is first processed as in the training phase to generate feature-specific vectors from input frames. Multiple vector quantization is then performed on each feature-specific vector to represent the feature-specific vector using the codebook symbols that were generated for that feature during training. The resulting series of codebook symbols effects a reduced representation of the sampled handwriting data and is used for subsequent handwriting recognition.
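A minimal sketch of the "one codebook per feature type" idea follows. It makes several assumptions the abstract does not state: frames are overlapping sliding windows, codebooks are trained with plain k-means, and the two-feature sample data is invented. Function names such as `feature_specific_vectors` and `train_codebook` are illustrative.

```python
import random


def feature_specific_vectors(feature_vectors, frame_len):
    """Group adjacent segment vectors into frames, then regroup by feature type."""
    n_features = len(feature_vectors[0])
    frames = [feature_vectors[i:i + frame_len]
              for i in range(len(feature_vectors) - frame_len + 1)]
    # result[j] holds one vector per frame, built from feature j of each segment
    return [[tuple(seg[j] for seg in frame) for frame in frames]
            for j in range(n_features)]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def train_codebook(vectors, size, iters=10, seed=0):
    """Plain k-means (an assumption): cluster means become codebook entries."""
    rng = random.Random(seed)
    codebook = rng.sample(vectors, size)
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        codebook = [tuple(sum(col) / len(members) for col in zip(*members))
                    if members else old
                    for members, old in zip(clusters, codebook)]
    return codebook


def quantize(feature_vectors, codebooks, frame_len):
    """Replace each feature-specific vector by the index of its nearest codeword."""
    per_feature = feature_specific_vectors(feature_vectors, frame_len)
    return [[nearest(v, cb) for v in vecs]
            for vecs, cb in zip(per_feature, codebooks)]


# Invented per-segment feature vectors: (feature 0, feature 1).
samples = [(0.1, 5.0), (0.2, 5.5), (0.9, 1.0), (1.0, 1.2), (0.15, 5.2), (0.95, 0.9)]
per_feature = feature_specific_vectors(samples, frame_len=2)
codebooks = [train_codebook(vecs, size=2) for vecs in per_feature]
symbols = quantize(samples, codebooks, frame_len=2)  # one symbol stream per feature
```

Each feature ends up with its own stream of codebook symbols, which is the reduced representation a downstream recognizer would consume.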


    Rapid tree-based method for vector quantization
    15.
    Granted invention patent
    Status: expired

    Publication No.: US5734791A

    Publication date: 1998-03-31

    Application No.: US999354

    Filing date: 1992-12-31

    IPC classification: G10L19/02 G10L3/02

    CPC classification: G10L19/038

    Abstract: The branching decision for each node in a vector quantization (VQ) binary tree is made by a simple comparison of a pre-selected element of the candidate vector with a stored threshold, resulting in a binary decision for reaching the next lower level. Each node has a preassigned element and threshold value. Conventional centroid distance training techniques (such as LBG and k-means) are used to establish code-book indices corresponding to a set of VQ centroids. The set of training vectors is used a second time to select a vector element and threshold value at each node that splits the data approximately evenly. After processing the training vectors through the binary tree using threshold decisions, a histogram is generated for each code-book index that represents the number of times a training vector belonging to a given index set appeared at each index. The final quantization is accomplished by processing and then selecting the nearest centroid belonging to that histogram. Accuracy comparable to that achieved by conventional binary tree VQ is realized, but with almost a full order of magnitude increase in processing speed.
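The threshold-tree idea can be sketched as follows, under stated assumptions: the split element is chosen as the one with the largest variance and the threshold as its median (one way to split the data roughly evenly; the abstract does not say how they are chosen), and each leaf keeps a histogram of the code-book indices of the training vectors that reached it. The centroids and training vectors are invented.

```python
from collections import Counter
from statistics import median, pvariance


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids, candidates):
    """Full distance search, restricted to the candidate code-book indices."""
    return min(candidates, key=lambda k: sqdist(vec, centroids[k]))


def build(vectors, labels, depth):
    """Nested-dict tree; each leaf stores a histogram of code-book indices."""
    if depth == 0 or len(vectors) < 2:
        return {"hist": Counter(labels)}
    dims = range(len(vectors[0]))
    elem = max(dims, key=lambda d: pvariance([v[d] for v in vectors]))
    thr = median(v[elem] for v in vectors)  # median gives a roughly even split
    left = [(v, l) for v, l in zip(vectors, labels) if v[elem] <= thr]
    right = [(v, l) for v, l in zip(vectors, labels) if v[elem] > thr]
    if not left or not right:
        return {"hist": Counter(labels)}
    return {"elem": elem, "thr": thr,
            "lo": build(*zip(*left), depth - 1),
            "hi": build(*zip(*right), depth - 1)}


def quantize(vec, tree, centroids):
    node = tree
    while "hist" not in node:  # one scalar comparison per tree level
        node = node["lo"] if vec[node["elem"]] <= node["thr"] else node["hi"]
    # Distance computation only over centroids seen at this leaf during training.
    return nearest(vec, centroids, node["hist"].keys())


centroids = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
training = [(0.5, 0.2), (-0.3, 0.1), (0.2, 9.8), (-0.1, 10.3),
            (9.7, 0.4), (10.2, -0.2), (10.1, 9.9), (9.8, 10.4)]
labels = [nearest(v, centroids, range(len(centroids))) for v in training]
tree = build(training, labels, depth=2)
index = quantize((9.9, 0.1), tree, centroids)
```

The speed-up comes from replacing most distance computations with single-element comparisons; only the few centroids recorded at the reached leaf are compared in full.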


    System and method for automatic subcharacter unit and lexicon generation for handwriting recognition
    17.
    Granted invention patent
    Status: expired

    Publication No.: US5757964A

    Publication date: 1998-05-26

    Application No.: US901989

    Filing date: 1997-07-29

    IPC classification: G06K9/62 G06K9/72

    CPC classification: G06K9/6297 G06K9/6255

    Abstract: A system for automatic subcharacter unit and lexicon generation for handwriting recognition comprises a processing unit, a handwriting input device, and a memory wherein a segmentation unit, a subcharacter generation unit, a lexicon unit, and a modeling unit reside. The segmentation unit generates feature vectors corresponding to sample characters. The subcharacter generation unit clusters feature vectors and assigns each feature vector associated with a given cluster an identical label. The lexicon unit constructs a lexical graph for each character in a character set. The modeling unit generates a Hidden Markov Model for each set of identically-labeled feature vectors. After a first set of lexical graphs and Hidden Markov Models have been created, the subcharacter generation unit determines for each feature vector which Hidden Markov Model produces a highest likelihood value. The subcharacter generation unit relabels each feature vector according to the highest likelihood value, after which the lexicon unit and the modeling unit generate a new set of lexical graphs and a new set of Hidden Markov Models, respectively. The feature vector relabeling, lexicon generation, and Hidden Markov Model generation are performed iteratively until a convergence criterion is met. The final set of Hidden Markov Model parameters provides a set of subcharacter units for handwriting recognition, where the subcharacter units are derived from information inherent in the sample characters themselves.
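The relabel-and-retrain loop can be illustrated with a heavily simplified sketch. It is not the patented method: a diagonal Gaussian per label stands in for each Hidden Markov Model, lexical-graph generation is omitted entirely, and "no label changed" is used as the convergence criterion. All names and data are hypothetical.

```python
import math


def fit(vectors):
    """Per-dimension mean and variance (a stand-in for training one HMM)."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    variances = [max(sum((x - m) ** 2 for x in col) / n, 1e-3)
                 for col, m in zip(zip(*vectors), means)]
    return means, variances


def log_likelihood(vec, model):
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
               for x, m, s in zip(vec, means, variances))


def relabel_until_stable(vectors, labels, max_iters=50):
    models = {}
    for _ in range(max_iters):
        # Retrain one model per set of identically labeled vectors.
        groups = {}
        for v, l in zip(vectors, labels):
            groups.setdefault(l, []).append(v)
        models = {l: fit(members) for l, members in groups.items()}
        # Relabel each vector with the model giving the highest likelihood.
        new_labels = [max(models, key=lambda l: log_likelihood(v, models[l]))
                      for v in vectors]
        if new_labels == labels:  # assumed convergence criterion
            break
        labels = new_labels
    return labels, models


vectors = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,), (5.1,)]
initial = [0, 0, 1, 1, 1, 0]  # deliberately noisy initial clustering
final_labels, models = relabel_until_stable(vectors, initial)
```

Starting from a noisy labeling, the loop settles on the two natural groups; in the system described above, each iteration would also regenerate the lexical graphs from the new labels.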


    Continuous Mandarin Chinese speech recognition system having an integrated tone classifier
    18.
    Granted invention patent
    Status: expired

    Publication No.: US5602960A

    Publication date: 1997-02-11

    Application No.: US316257

    Filing date: 1994-09-30

    CPC classification: G10L15/04 G10L25/15

    Abstract: A speech recognition system for continuous Mandarin Chinese speech comprises a microphone, an A/D converter, a syllable recognition system, an integrated tone classifier, and a confidence score augmentor. The syllable recognition system generates N-best theories with initial confidence scores. The integrated tone classifier has a pitch estimator to estimate the pitch of the input once and a long-term tone analyzer to segment the estimated pitch according to the syllables of each of the N-best theories. The long-term tone analyzer performs long-term tonal analysis on the segmented, estimated pitch and generates a long-term tonal confidence signal. The confidence score augmentor receives the initial confidence scores and the long-term tonal confidence signals, modifies each initial confidence score according to the corresponding long-term tonal confidence signal, re-ranks the N-best theories according to the augmented confidence scores, and outputs the N-best theories.
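A toy sketch of the score augmentation and re-ranking step is given below. The pitch track is computed once and then segmented per theory, as in the abstract; everything else is an assumption for illustration: the slope-sign tone rule (tone 3 is not modeled), the additive combination with `weight`, and the example pitch values and theories.

```python
# Toy mapping from tone number to expected pitch movement: level, rising, falling.
EXPECTED_SLOPE = {1: 0, 2: +1, 4: -1}


def slope_sign(segment, flat_tol=5.0):
    """Classify a pitch segment (Hz per frame) as flat, rising, or falling."""
    delta = segment[-1] - segment[0]
    if abs(delta) <= flat_tol:
        return 0
    return 1 if delta > 0 else -1


def tone_confidence(pitch, syllables):
    """Fraction of syllables whose pitch movement matches the hypothesized tone."""
    hits = sum(slope_sign(pitch[start:end]) == EXPECTED_SLOPE[tone]
               for start, end, tone in syllables)
    return hits / len(syllables)


def augment_and_rerank(pitch, theories, weight=0.5):
    """Add the weighted tonal confidence to each initial score, then re-rank."""
    scored = [(name, score + weight * tone_confidence(pitch, syllables))
              for name, syllables, score in theories]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Pitch is estimated once for the whole utterance (values invented).
pitch = [200, 202, 201, 203, 150, 170, 190, 210]
# Each theory: (label, [(start_frame, end_frame, tone), ...], initial confidence).
theories = [
    ("ma1 ma4", [(0, 4, 1), (4, 8, 4)], 0.60),
    ("ma1 ma2", [(0, 4, 1), (4, 8, 2)], 0.55),
]
ranked = augment_and_rerank(pitch, theories)
```

Here the theory with the lower initial score moves to the top because its tones agree with the rising pitch in the second syllable.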


    Sub-partitioned vector quantization of probability density functions
    19.
    Granted invention patent
    Status: expired

    Publication No.: US5535305A

    Publication date: 1996-07-09

    Application No.: US999293

    Filing date: 1992-12-31

    CPC classification: G10L15/144 G06K9/6217

    Abstract: A speech recognition memory compression method and apparatus subpartitions probability density function (pdf) space along the hidden Markov model (HMM) index into packets of typically 4 to 8 log-pdf values. Vector quantization techniques are applied using a logarithmic distance metric and a probability weighted logarithmic probability space for the splitting of clusters. Experimental results indicate that a significant reduction in memory can be obtained with little increase in overall speech recognition error.
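The sub-partitioning step can be sketched as below. This covers only the packet split and the codebook lookup; codebook training with probability-weighted cluster splitting is not shown, and the codebook here is invented. Squared Euclidean distance on log values is used as one plausible reading of "logarithmic distance metric", which is an assumption.

```python
def packets(log_pdfs, size):
    """Split a row of log-pdf values (one per HMM index) into fixed-size packets."""
    return [tuple(log_pdfs[i:i + size]) for i in range(0, len(log_pdfs), size)]


def log_distance(a, b):
    """Squared Euclidean distance between packets of log-probabilities (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(log_pdfs, codebook, size=4):
    """Replace each packet by the index of its nearest codebook packet."""
    return [min(range(len(codebook)), key=lambda k: log_distance(p, codebook[k]))
            for p in packets(log_pdfs, size)]


def decode(indices, codebook):
    """Rebuild an approximate row of log-pdf values from the stored indices."""
    return [value for k in indices for value in codebook[k]]


# Invented codebook of 4-value packets and one row of eight log-pdf values.
codebook = [(-1.0, -1.0, -1.0, -1.0),
            (-5.0, -5.0, -5.0, -5.0),
            (-1.0, -1.0, -5.0, -5.0)]
log_pdfs = [-1.1, -0.9, -1.0, -1.2, -4.8, -5.1, -5.0, -4.9]
codes = encode(log_pdfs, codebook)
approx = decode(codes, codebook)
```

Storing `codes` (two small integers) in place of eight floating-point values is where the memory saving comes from; the approximation error shows up as the difference between `log_pdfs` and `approx`.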
