Domain dictionary creation by detection of new topic words using divergence value comparison
    1.
    Type: Granted invention patent
    Legal status: In force

    Publication No.: US08386240B2

    Publication date: 2013-02-26

    Application No.: US13158125

    Filing date: 2011-06-10

    IPC classification: G06F17/21 G06F17/20 G06F17/27

    CPC classification: G06F17/2745

    Abstract: Methods, systems, and apparatus, including computer program products, to identify topic words in a collection of documents that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on a document collection and the topic document collection is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document collection and the topic document collection. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.
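
The comparison described in this abstract can be sketched in Python. This is an illustration only, not the patented method: the abstract does not say how a divergence value is computed or how the reference value is obtained, so the sketch assumes a Kullback-Leibler style term, P_topic(w) * log(P_topic(w) / P_all(w)), and assumes the reference value is the mean divergence of a few already known topic words. The function names, the `floor` parameter, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def word_probs(docs):
    """Relative frequency of each whitespace-separated word across docs."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def divergence(word, p_topic, p_all, floor=1e-9):
    """Assumed divergence term: P_topic(w) * log(P_topic(w) / P_all(w))."""
    pt = p_topic.get(word, 0.0)
    if pt == 0.0:
        return 0.0  # a word absent from the topic documents contributes nothing
    pa = max(p_all.get(word, 0.0), floor)  # floor avoids division by zero
    return pt * math.log(pt / pa)


def find_topic_words(all_docs, topic_docs, known_topic_words, candidates):
    """Keep candidates whose divergence exceeds the reference divergence."""
    if not known_topic_words:
        raise ValueError("at least one known topic word is needed")
    p_all = word_probs(all_docs)
    p_topic = word_probs(topic_docs)
    # Assumption: reference value = mean divergence of the known topic words.
    reference = sum(
        divergence(w, p_topic, p_all) for w in known_topic_words
    ) / len(known_topic_words)
    return [w for w in candidates if divergence(w, p_topic, p_all) > reference]


# Hypothetical data: the document collection includes the topic documents.
topic_docs = ["goal striker goal offside", "striker penalty goal"]
other_docs = [
    "the market fell today",
    "the weather today is mild",
    "the goal of the policy",
]
result = find_topic_words(
    all_docs=topic_docs + other_docs,
    topic_docs=topic_docs,
    known_topic_words=["offside", "penalty"],
    candidates=["striker", "the", "today"],
)
print(result)
```

With these sample documents only "striker" is concentrated enough in the topic documents to exceed the reference value; "the" and "today" never occur there, so their divergence is zero.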

    Word Detection
    2.
    Type: Invention patent application
    Legal status: In force

    Publication No.: US20110137642A1

    Publication date: 2011-06-09

    Application No.: US13016338

    Filing date: 2011-01-28

    IPC classification: G06F17/21

    Abstract: Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.
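
A rough Python sketch of the comparison this abstract describes follows. It is an illustration under stated assumptions, not the patented method: the abstract does not define the uncertainty values, so the sketch assumes that a candidate new word is an adjacent pair of existing tokens, that each uncertainty value is a cross-entropy style term weighted by the development-corpus probability and scored with training-corpus probabilities, and that the pair counts as a new word when treating it as one unit gives the lower uncertainty. All names and the sample token sequences are hypothetical.

```python
import math
from collections import Counter


def unit_probs(tokens):
    """Relative frequencies of single tokens (str) and adjacent pairs (tuple)."""
    n = len(tokens)
    probs = {w: c / n for w, c in Counter(tokens).items()}
    pairs = Counter(zip(tokens, tokens[1:]))
    probs.update({pair: c / max(n - 1, 1) for pair, c in pairs.items()})
    return probs


def detect_new_words(train_tokens, dev_tokens):
    """Return adjacent pairs that score better as one unit than as two words."""
    p_train = unit_probs(train_tokens)  # "first word probabilities"
    p_dev = unit_probs(dev_tokens)      # "second word probabilities"
    new_words = []
    for cand, weight in p_dev.items():
        # Only pairs seen in both corpora are scored.
        if not isinstance(cand, tuple) or cand not in p_train:
            continue
        first, second = cand
        # Assumed uncertainty values (cross-entropy style terms).
        h_unit = -weight * math.log(p_train[cand])
        h_parts = -weight * (math.log(p_train[first]) + math.log(p_train[second]))
        if h_unit < h_parts:
            new_words.append(cand)
    return new_words


# Hypothetical token streams standing in for partitioned web-document data.
train_tokens = (
    "is new york is it new york it is so it is so it is so it so so so"
).split()
dev_tokens = "new york is it".split()
new_words = detect_new_words(train_tokens, dev_tokens)
print(new_words)
```

Here "new york" is kept because the pair is far more frequent in the training tokens than its parts would suggest, while "is it" is rejected because both parts are common but rarely adjacent. On corpora this small nearly any adjacent pair passes the test, so a practical system would presumably need much more data and some margin in the comparison.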

    Word detection
    3.
    Type: Granted invention patent
    Legal status: In force

    Publication No.: US07917355B2

    Publication date: 2011-03-29

    Application No.: US11844153

    Filing date: 2007-08-23

    IPC classification: G06F17/21 G06F17/27 G06F17/20

    Abstract: Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.

    Word detection
    4.
    Type: Granted invention patent
    Legal status: In force

    Publication No.: US08463598B2

    Publication date: 2013-06-11

    Application No.: US13016338

    Filing date: 2011-01-28

    IPC classification: G06F17/20 G06F17/21 G06F17/27

    Abstract: Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.

    DOMAIN DICTIONARY CREATION
    5.
    Type: Invention patent application
    Legal status: In force

    Publication No.: US20110238413A1

    Publication date: 2011-09-29

    Application No.: US13158125

    Filing date: 2011-06-10

    IPC classification: G06F17/21

    CPC classification: G06F17/2745

    Abstract: Methods, systems, and apparatus, including computer program products, to identify topic words in a collection of documents that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on a document collection and the topic document collection is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document collection and the topic document collection. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.

    Domain dictionary creation by detection of new topic words using divergence value comparison
    6.
    Type: Granted invention patent
    Legal status: In force

    Publication No.: US07983902B2

    Publication date: 2011-07-19

    Application No.: US11844067

    Filing date: 2007-08-23

    IPC classification: G06F17/21 G06F17/20 G06F17/27

    CPC classification: G06F17/2745

    Abstract: Methods, systems, and apparatus, including computer program products, to identify topic words in a document corpus that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on the document corpus and the topic document corpus is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document corpus and the topic document corpus. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.

    Domain Dictionary Creation
    7.
    Type: Invention patent application
    Legal status: In force

    Publication No.: US20090055381A1

    Publication date: 2009-02-26

    Application No.: US11844067

    Filing date: 2007-08-23

    IPC classification: G06F17/30

    CPC classification: G06F17/2745

    Abstract: Methods, systems, and apparatus, including computer program products, to identify topic words in a document corpus that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on the document corpus and the topic document corpus is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document corpus and the topic document corpus. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.

    Word Detection
    8.
    Type: Invention patent application
    Legal status: In force

    Publication No.: US20090055168A1

    Publication date: 2009-02-26

    Application No.: US11844153

    Filing date: 2007-08-23

    IPC classification: G06F17/21

    Abstract: Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison.
