Learning Discriminative Projections for Text Similarity Measures
    1.
    Patent application
    Learning Discriminative Projections for Text Similarity Measures (status: under examination, published)

    Publication number: US20120323968A1

    Publication date: 2012-12-20

    Application number: US13160485

    Filing date: 2011-06-14

    IPC classification: G06F17/30

    CPC classification: G06F16/31

    Abstract: A model for mapping the raw text representation of a text object to a vector space is disclosed. A function is defined for computing a similarity score given two output vectors. A loss function is defined for computing an error based on the similarity scores and the labels of pairs of vectors. The parameters of the model are tuned to minimize the loss function. The label of two vectors indicates a degree of similarity of the objects. The label may be a binary number or a real-valued number. The function for computing similarity scores may be a cosine, Jaccard, or differentiable function. The loss function may compare pairs of vectors to their labels. Each element of the output vector is a linear or non-linear function of the terms of an input vector. The text objects may be different types of documents and two different models may be trained concurrently.
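
The pipeline in this abstract (project term vectors, score pairs, compare scores to labels, tune parameters) can be illustrated with a small sketch. It is not the patented implementation: it assumes a purely linear projection, cosine similarity, a squared-error loss, and gradient descent with step halving, and every name in it (`project`, `train`, `PAIRS`, and so on) is hypothetical.

```python
import math
import random


def project(W, x):
    """Linear model: map a d-dim term vector x to the k-dim vector W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def loss(W, pairs):
    """Mean squared error between similarity scores and pair labels."""
    total = 0.0
    for p, q, y in pairs:
        total += (cosine(project(W, p), project(W, q)) - y) ** 2
    return total / len(pairs)


def gradient(W, pairs):
    """Analytic gradient of loss() with respect to every entry of W."""
    k, d = len(W), len(W[0])
    G = [[0.0] * d for _ in range(k)]
    for p, q, y in pairs:
        u, v = project(W, p), project(W, q)
        nu, nv = norm(u), norm(v)
        s = cosine(u, v)
        coef = 2.0 * (s - y) / len(pairs)
        for i in range(k):
            # Partial derivatives of the cosine score w.r.t. u[i] and v[i].
            ds_du = v[i] / (nu * nv) - s * u[i] / (nu * nu)
            ds_dv = u[i] / (nu * nv) - s * v[i] / (nv * nv)
            for j in range(d):
                G[i][j] += coef * (ds_du * p[j] + ds_dv * q[j])
    return G


def init_weights(d, k, seed=0):
    """Small random k x d projection matrix (assumed initialisation)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(k)]


def train(W, pairs, steps=100, lr=1.0):
    """Gradient descent; halve the step until the loss strictly drops."""
    current = loss(W, pairs)
    for _ in range(steps):
        G = gradient(W, pairs)
        step = lr
        while step > 1e-8:
            cand = [[w - step * g for w, g in zip(wr, gr)]
                    for wr, gr in zip(W, G)]
            cand_loss = loss(cand, pairs)
            if cand_loss < current:
                W, current = cand, cand_loss
                break
            step *= 0.5
    return W


# Toy data: (term vector, term vector, label); 1.0 = similar, 0.0 = dissimilar.
PAIRS = [
    ([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], 1.0),
    ([0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0], 1.0),
    ([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], 0.0),
]

if __name__ == "__main__":
    W0 = init_weights(d=4, k=2)
    W1 = train(W0, PAIRS)
    print("loss before:", loss(W0, PAIRS))
    print("loss after: ", loss(W1, PAIRS))
```

The step-halving rule is a design choice of this sketch: it keeps the loss from increasing between iterations without tuning a learning rate. The abstract also allows non-linear models, other similarity functions, and two models trained concurrently for different document types, none of which is shown here.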


    Locating parallel word sequences in electronic documents
    2.
    Granted patent
    Locating parallel word sequences in electronic documents (status: in force)

    Publication number: US08560297B2

    Publication date: 2013-10-15

    Application number: US12794778

    Filing date: 2010-06-07

    CPC classification: G06F17/2827 G06F17/278

    Abstract: Systems and methods for automatically extracting parallel word sequences from comparable corpora are described. Electronic documents, such as web pages belonging to a collaborative online encyclopedia, are analyzed to locate parallel word sequences between electronic documents written in different languages. These parallel word sequences are then used to train a machine translation system that can translate text from one language to another.


    UNSUPERVISED LEARNING USING GLOBAL FEATURES, INCLUDING FOR LOG-LINEAR MODEL WORD SEGMENTATION
    3.
    Patent application
    UNSUPERVISED LEARNING USING GLOBAL FEATURES, INCLUDING FOR LOG-LINEAR MODEL WORD SEGMENTATION (status: in force)

    Publication number: US20110144992A1

    Publication date: 2011-06-16

    Application number: US12637802

    Filing date: 2009-12-15

    IPC classification: G10L15/06 G10L15/04

    CPC classification: G10L15/18

    Abstract: Described is a technology for performing unsupervised learning using global features extracted from unlabeled examples. The unsupervised learning process may be used to train a log-linear model, such as for use in morphological segmentation of words. For example, segmentations of the examples are sampled based upon the global features to produce a segmented corpus and log-linear model, which are then iteratively reprocessed to produce a final segmented corpus and a log-linear model.
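
As a rough illustration of sampling a segmentation from a log-linear model, the sketch below enumerates every split of a word, scores each split as a weighted sum of features, and samples in proportion to the exponentiated score. The abstract does not say which features are used, so the per-morpheme features and the `lexicon_size` penalty (a stand-in for a global feature that depends on the current segmented corpus) are assumptions, as are all function names. The iterative reprocessing of the corpus and the weight updates are not shown.

```python
import math
import random


def segmentations(word):
    """Yield every split of word into contiguous, non-empty morphemes."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest


def score(seg, weights, lexicon):
    """Log-linear score: weighted sum of the features active for seg."""
    local = sum(weights.get(("morph", m), 0.0) for m in seg)
    # Hypothetical global feature: number of morphemes this segmentation
    # would add to the lexicon of the current segmented corpus.
    new_morphemes = sum(1 for m in set(seg) if m not in lexicon)
    return local + weights.get("lexicon_size", 0.0) * new_morphemes


def distribution(word, weights, lexicon):
    """Return all segmentations and their normalised probabilities."""
    segs = list(segmentations(word))
    scores = [score(s, weights, lexicon) for s in segs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    z = sum(exps)
    return segs, [e / z for e in exps]


def sample(word, weights, lexicon, rng):
    """Draw one segmentation with probability proportional to exp(score)."""
    segs, probs = distribution(word, weights, lexicon)
    return rng.choices(segs, weights=probs, k=1)[0]


# Assumed toy setup: a negative weight discourages growing the lexicon.
WEIGHTS = {"lexicon_size": -2.0}
LEXICON = {"walk", "ed", "talk"}

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample("walked", WEIGHTS, LEXICON, rng))
```

Exhaustive enumeration is only practical for short words, since a word of length n has 2^(n-1) segmentations.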


    Unsupervised learning using global features, including for log-linear model word segmentation
    4.
    Granted patent
    Unsupervised learning using global features, including for log-linear model word segmentation (status: in force)

    Publication number: US08909514B2

    Publication date: 2014-12-09

    Application number: US12637802

    Filing date: 2009-12-15

    IPC classification: G06F17/27 G06F15/18 G10L15/18

    CPC classification: G10L15/18

    Abstract: Described is a technology for performing unsupervised learning using global features extracted from unlabeled examples. The unsupervised learning process may be used to train a log-linear model, such as for use in morphological segmentation of words. For example, segmentations of the examples are sampled based upon the global features to produce a segmented corpus and log-linear model, which are then iteratively reprocessed to produce a final segmented corpus and a log-linear model.


    LOCATING PARALLEL WORD SEQUENCES IN ELECTRONIC DOCUMENTS
    5.
    Patent application
    LOCATING PARALLEL WORD SEQUENCES IN ELECTRONIC DOCUMENTS (status: in force)

    Publication number: US20110301935A1

    Publication date: 2011-12-08

    Application number: US12794778

    Filing date: 2010-06-07

    IPC classification: G06F17/28 G06F17/27

    CPC classification: G06F17/2827 G06F17/278

    Abstract: Systems and methods for automatically extracting parallel word sequences from comparable corpora are described. Electronic documents, such as web pages belonging to a collaborative online encyclopedia, are analyzed to locate parallel word sequences between electronic documents written in different languages. These parallel word sequences are then used to train a machine translation system that can translate text from one language to another.
