Cross-lingual discriminative learning of sequence models with posterior regularization

    Publication number: US09779087B2

    Publication date: 2017-10-03

    Application number: US14105973

    Filing date: 2013-12-13

    Applicant: Google Inc.

    CPC classification number: G06F17/289 G06F17/27 G06F17/2827

    Abstract: A computer-implemented method can include obtaining (i) an aligned bi-text for a source language and a target language, and (ii) a supervised sequence model for the source language. The method can include labeling a source side of the aligned bi-text using the supervised sequence model and projecting labels from the labeled source side to a target side of the aligned bi-text to obtain a labeled target side of the aligned bi-text. The method can include filtering the labeled target side based on a task of a natural language processing (NLP) system configured to utilize a sequence model for the target language to obtain a filtered target side of the aligned bi-text. The method can also include training the sequence model for the target language using posterior regularization with soft constraints on the filtered target side to obtain a trained sequence model for the target language.
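The label projection step described in the abstract can be illustrated with a minimal sketch. It assumes the word alignment is given as (source index, target index) pairs, which the abstract does not specify, and the function name `project_labels` is hypothetical. The task-based filtering and the posterior regularization training are not shown.

```python
def project_labels(source_labels, alignment, target_len, unlabeled=None):
    """Copy each source token's label to the target token aligned with it.

    `alignment` is assumed to be a list of (source_index, target_index)
    pairs; target tokens with no aligned source token stay `unlabeled`.
    """
    target_labels = [unlabeled] * target_len
    for src_idx, tgt_idx in alignment:
        target_labels[tgt_idx] = source_labels[src_idx]
    return target_labels


# Toy example: three labeled source tokens, four target tokens.
source_labels = ["DET", "NOUN", "VERB"]
alignment = [(0, 0), (1, 1), (2, 3)]  # target token 2 is unaligned
projected = project_labels(source_labels, alignment, target_len=4)
print(projected)
```

Target tokens left unlabeled here are the kind of partial supervision that a later filtering or soft-constraint step would have to handle.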

    2. SEMANTIC FRAME IDENTIFICATION WITH DISTRIBUTED WORD REPRESENTATIONS
    Type: Invention application
    Legal status: Under examination (published)

    Publication number: US20160239739A1

    Publication date: 2016-08-18

    Application number: US15008794

    Filing date: 2016-01-28

    Applicant: Google Inc.

    Abstract: A computer-implemented technique can include receiving, at a server, labeled training data including a plurality of groups of words, each group of words having a predicate word, each word having generic word embeddings. The technique can include extracting, at the server, the plurality of groups of words in a syntactic context of their predicate words. The technique can include concatenating, at the server, the generic word embeddings to create a high dimensional vector space representing features for each word. The technique can include obtaining, at the server, a model having a learned mapping from the high dimensional vector space to a low dimensional vector space and learned embeddings for each possible semantic frame in the low dimensional vector space. The technique can also include outputting, by the server, the model for storage, the model being configured to identify a specific semantic frame for an input.
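As a rough illustration of the pipeline in the abstract, the sketch below concatenates word embeddings, maps the result into a low dimensional space with a matrix, and picks the frame whose embedding scores highest. Scoring by dot product, the toy matrix, and the frame names are assumptions made for this example; the abstract does not state how a frame is selected.

```python
def concatenate(embeddings):
    """Flatten a list of word embeddings into one high dimensional vector."""
    return [x for vec in embeddings for x in vec]


def project(matrix, vector):
    """Apply a (low_dim x high_dim) mapping to a high dimensional vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def identify_frame(context_embeddings, mapping, frame_embeddings):
    """Return the frame whose embedding has the largest dot product
    with the projected context (dot-product scoring is an assumption)."""
    low_dim = project(mapping, concatenate(context_embeddings))

    def score(frame):
        return sum(a * b for a, b in zip(low_dim, frame_embeddings[frame]))

    return max(frame_embeddings, key=score)


# Toy data: two 2-d word embeddings, a 2x4 mapping, two hypothetical frames.
context_embeddings = [[1.0, 0.0], [0.0, 1.0]]
mapping = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
frame_embeddings = {"Commerce_buy": [1.0, 1.0], "Motion": [-1.0, 0.0]}
best = identify_frame(context_embeddings, mapping, frame_embeddings)
print(best)
```

In a trained model the mapping and the frame embeddings would be learned jointly from the labeled data rather than written by hand.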

    3. Training a natural language processing model with information retrieval model annotations
    Type: Granted invention patent
    Legal status: In force

    Publication number: US09536522B1

    Publication date: 2017-01-03

    Application number: US14143011

    Filing date: 2013-12-30

    Applicant: Google Inc.

    Abstract: Systems and techniques are provided for training a natural language processing model with information retrieval model annotations. A natural language processing model may be trained, through machine learning, using training examples that include part-of-speech tagging and annotations added by an information retrieval model. The natural language processing model may generate part-of-speech, parse-tree, beginning, inside, and outside label, mention chunking, and named-entity recognition predictions with confidence scores for text in the training examples. The information retrieval model annotations and part-of-speech tagging in the training example may be used to determine the accuracy of the predictions, and the natural language processing model may be adjusted. After training, the natural language processing model may be used to make predictions for novel input, such as search queries and potential search results. The search queries and potential search results may have information retrieval model annotations.

    4. Weakly supervised part-of-speech tagging with coupled token and type constraints
    Type: Granted invention patent
    Legal status: In force

    Publication number: US09311299B1

    Publication date: 2016-04-12

    Application number: US13955491

    Filing date: 2013-07-31

    Applicant: Google Inc.

    CPC classification number: G06F17/28 G06F17/271 G06F17/2785 G06F17/2827

    Abstract: A method and system are provided for a part-of-speech tagger that may be particularly useful for resource-poor languages. Use of manually constructed tag dictionaries from dictionaries via bitext can be used as type constraints to overcome the scarcity of annotated data in some instances. Additional token constraints can be projected from a resource-rich source language via word-aligned bitext. Several example models are provided to demonstrate this such as a partially observed conditional random field model, where coupled token and type constraints may provide a partial signal for training. The disclosed method achieves a significant relative error reduction over the prior state of the art.
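One way to picture how token and type constraints might be coupled is the sketch below. The coupling rule is an assumption for illustration, not taken from the abstract: a projected tag is trusted when it agrees with the tag dictionary, the dictionary wins on disagreement, and words outside the dictionary fall back to the projected tag or to the full tag set. The names `allowed_tags` and `ALL_TAGS` are hypothetical.

```python
ALL_TAGS = {"DET", "NOUN", "VERB", "ADJ"}  # toy tag set for the example


def allowed_tags(word, projected_tag, tag_dictionary, all_tags=ALL_TAGS):
    """Return the set of tags a token may take during training.

    Type constraint: tags listed for the word in `tag_dictionary`.
    Token constraint: the tag projected onto this token, or None.
    """
    type_tags = tag_dictionary.get(word)
    if type_tags:
        if projected_tag in type_tags:
            return {projected_tag}  # both constraints agree
        return set(type_tags)  # disagreement: keep the dictionary tags
    if projected_tag is not None:
        return {projected_tag}  # no dictionary entry: use the projection
    return set(all_tags)  # no signal at all: leave unconstrained


tag_dictionary = {"book": {"NOUN", "VERB"}}
print(allowed_tags("book", "VERB", tag_dictionary))
print(allowed_tags("book", "ADJ", tag_dictionary))
```

A partially observed model could then be trained so that probability mass is restricted to the allowed tags of each token.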

    5. CROSS-LINGUAL DISCRIMINATIVE LEARNING OF SEQUENCE MODELS WITH POSTERIOR REGULARIZATION
    Type: Invention application
    Legal status: In force

    Publication number: US20150169549A1

    Publication date: 2015-06-18

    Application number: US14105973

    Filing date: 2013-12-13

    Applicant: Google Inc.

    CPC classification number: G06F17/289 G06F17/27 G06F17/2827

    Abstract: A computer-implemented method can include obtaining (i) an aligned bi-text for a source language and a target language, and (ii) a supervised sequence model for the source language. The method can include labeling a source side of the aligned bi-text using the supervised sequence model and projecting labels from the labeled source side to a target side of the aligned bi-text to obtain a labeled target side of the aligned bi-text. The method can include filtering the labeled target side based on a task of a natural language processing (NLP) system configured to utilize a sequence model for the target language to obtain a filtered target side of the aligned bi-text. The method can also include training the sequence model for the target language using posterior regularization with soft constraints on the filtered target side to obtain a trained sequence model for the target language.

    6. Semantic frame identification with distributed word representations
    Type: Granted invention patent
    Legal status: In force

    Publication number: US09262406B1

    Publication date: 2016-02-16

    Application number: US14271997

    Filing date: 2014-05-07

    Applicant: Google Inc.

    Abstract: A computer-implemented technique can include receiving, at a server, labeled training data including a plurality of groups of words, each group of words having a predicate word, each word having generic word embeddings. The technique can include extracting, at the server, the plurality of groups of words in a syntactic context of their predicate words. The technique can include concatenating, at the server, the generic word embeddings to create a high dimensional vector space representing features for each word. The technique can include obtaining, at the server, a model having a learned mapping from the high dimensional vector space to a low dimensional vector space and learned embeddings for each possible semantic frame in the low dimensional vector space. The technique can also include outputting, by the server, the model for storage, the model being configured to identify a specific semantic frame for an input.
