Efficient stemming of semitic languages
    1.
    Granted patent
    Efficient stemming of semitic languages (in force)

    Publication No.: US08438010B2

    Publication date: 2013-05-07

    Application No.: US11951388

    Filing date: 2007-12-06

    IPC classification: G06F17/27

    CPC classification: G06F17/2755

    Abstract: A system for stemming words of Semitic languages, the system including an affix scanner configured to scan a word of a Semitic language for at least one affix according to a predefined scanning sequence and determine if at least one predefined scanning criterion is met, and a stemmer configured to remove the affix from the word if the predefined scanning criterion is met.

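    The scan-then-strip flow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the affix inventory (Latin transliterations), the scanning order, and the minimum-stem-length criterion are hypothetical and are not taken from the patent.

```python
# Hypothetical scanning sequence: (affix kind, affix) pairs tried in order.
# The affixes are illustrative Latin transliterations, not the patent's data.
SCAN_SEQUENCE = [
    ("prefix", "wa"),
    ("prefix", "al"),
    ("suffix", "at"),
    ("suffix", "ha"),
]

# Assumed scanning criterion: the remaining stem must keep at least 3 letters.
MIN_STEM_LEN = 3


def scan(word, kind, affix):
    """Affix scanner: is the affix present, and is the criterion met?"""
    if kind == "prefix":
        present = word.startswith(affix)
    else:
        present = word.endswith(affix)
    return present and len(word) - len(affix) >= MIN_STEM_LEN


def stem(word):
    """Stemmer: remove each affix whose scan succeeds, in sequence order."""
    for kind, affix in SCAN_SEQUENCE:
        if scan(word, kind, affix):
            if kind == "prefix":
                word = word[len(affix):]
            else:
                word = word[:-len(affix)]
    return word


print(stem("waalkitab"))  # both prefixes are stripped
print(stem("alat"))       # too short to strip anything, left unchanged
```

    Keeping the scanner and the stemmer as separate functions mirrors the two components named in the abstract; a real system would presumably use a much richer criterion than stem length.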

    Hybrid text segmentation using N-grams and lexical information
    2.
    Granted patent
    Hybrid text segmentation using N-grams and lexical information (in force)

    Publication No.: US07917353B2

    Publication date: 2011-03-29

    Application No.: US11693324

    Filing date: 2007-03-29

    IPC classification: G06F17/21

    CPC classification: G06F17/277

    Abstract: A hybrid n-gram/lexical analysis tokenization system including a lexicon and a hybrid tokenizer operative to perform both N-gram tokenization of a text and lexical analysis tokenization of a text using the lexicon, and to construct either of an index and a classifier from the results of both of the N-gram tokenization and the lexical analysis tokenization, where the hybrid tokenizer is implemented in at least one of computer hardware and computer software and is embodied within a computer-readable medium.

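    As a rough illustration of the hybrid idea, the sketch below tokenizes each text twice, once into character n-grams and once by lexicon lookup, and feeds both token streams into one inverted index. The greedy longest-match strategy, the function names, and the sample lexicon are assumptions made for this example; the abstract does not specify them.

```python
from collections import defaultdict


def ngram_tokens(text, n=2):
    """Overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lexical_tokens(text, lexicon):
    """Greedy longest-match lookup; characters with no match are skipped."""
    tokens, i = [], 0
    max_len = max(map(len, lexicon), default=0)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no lexicon entry starts here
    return tokens


def build_index(docs, lexicon, n=2):
    """Inverted index built from the results of both tokenizations."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in ngram_tokens(text, n) + lexical_tokens(text, lexicon):
            index[token].add(doc_id)
    return index


lexicon = {"data", "base", "database"}
docs = {1: "database", 2: "basement"}
index = build_index(docs, lexicon)
print(sorted(index["ba"]), sorted(index["base"]), sorted(index["database"]))
```

    The n-gram entries give recall for strings the lexicon does not cover, while the lexical entries give whole-word matches; the same combined token stream could feed a classifier instead of an index.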

    Unsupervised stemming schema learning and lexicon acquisition from corpora
    3.
    Granted patent
    Unsupervised stemming schema learning and lexicon acquisition from corpora (lapsed)

    Publication No.: US07912703B2

    Publication date: 2011-03-22

    Application No.: US11953572

    Filing date: 2007-12-10

    IPC classification: G06F17/27

    CPC classification: G06F17/30731

    Abstract: Illustrated embodiments provide a computer implemented method, an apparatus, and a computer program product for unsupervised stemming schema learning and lexicon acquisition from corpora. In one illustrative embodiment, the computer implemented method obtains a corpus from corpora, analyzes the corpus to deduce a set of possible stemming schema and reviews and revises the set of possible stemming schema, to create a pruned set of stemming schema. The computer implemented method further deduces a lexicon from the corpus using the pruned set of stemming schema.

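    The three stages named in this abstract (deduce candidate schema, prune them, deduce a lexicon) might look like the following toy pipeline. It is a sketch under strong assumptions: a "schema" is reduced here to a bare suffix, candidates are suffixes whose removal leaves another attested word, and pruning is a simple support threshold. None of these choices come from the patent.

```python
from collections import Counter


def candidate_suffixes(words, max_len=3):
    """Count suffixes whose removal leaves another word seen in the corpus."""
    counts = Counter()
    for word in words:
        for k in range(1, max_len + 1):
            if len(word) > k and word[:-k] in words:
                counts[word[-k:]] += 1
    return counts


def prune(counts, min_support=2):
    """Keep only suffixes supported by at least `min_support` word pairs."""
    return {suffix for suffix, n in counts.items() if n >= min_support}


def deduce_lexicon(words, suffixes):
    """Map every word to its shortest attested stem, or to itself."""
    lexicon = set()
    for word in words:
        stems = [
            word[:-len(s)]
            for s in suffixes
            if word.endswith(s) and word[:-len(s)] in words
        ]
        lexicon.add(min(stems, key=len) if stems else word)
    return lexicon


corpus = "walk walks walked talk talks talked bus sing singer"
words = set(corpus.split())
suffixes = prune(candidate_suffixes(words))
print(sorted(suffixes))
print(sorted(deduce_lexicon(words, suffixes)))
```

    In this sample the suffix "er" is seen only once and is pruned, so "singer" stays in the lexicon as its own entry, while "bus" is kept whole because "bu" is never attested.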

    Learning word segmentation from non-white space languages corpora
    4.
    Patent application
    Learning word segmentation from non-white space languages corpora (lapsed)

    Publication No.: US20090150145A1

    Publication date: 2009-06-11

    Application No.: US11953635

    Filing date: 2007-12-10

    IPC classification: G10L15/00

    CPC classification: G06F17/2863 G06F17/277

    Abstract: Illustrative embodiments provide a computer implemented method, apparatus, and computer program product for learning word segmentation from non-white space language corpora. In one illustrative embodiment, the computer implemented method receives text input characters and calculates a ratio-measure for each pair of characters in the input characters. The computer implemented method further determines whether the ratio-measure of each pair of characters is equal to a predetermined threshold value. Responsive to determining the ratio-measure is less than the predetermined threshold value, and a local-minimum value, the computer method further identifies the pair as a weak pair and breaks the weak pair of characters.

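    The weak-pair test in this abstract can be illustrated with the sketch below. The abstract does not define the ratio-measure, so the formula used here (bigram count divided by the product of the two character counts) is purely an assumption, as are the training corpus and the threshold value.

```python
from collections import Counter


def train(corpus):
    """Character and adjacent-pair counts from an unsegmented corpus."""
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    return chars, pairs


def ratio_measure(a, b, chars, pairs):
    """Assumed measure: count(ab) / (count(a) * count(b))."""
    denom = chars[a] * chars[b]
    return pairs[a + b] / denom if denom else 0.0


def segment(text, chars, pairs, threshold):
    """Break every pair that is below the threshold and a local minimum."""
    ratios = [
        ratio_measure(text[i], text[i + 1], chars, pairs)
        for i in range(len(text) - 1)
    ]
    pieces, start = [], 0
    for i, value in enumerate(ratios):
        left = ratios[i - 1] if i > 0 else float("inf")
        right = ratios[i + 1] if i + 1 < len(ratios) else float("inf")
        if value < threshold and value <= left and value <= right:
            pieces.append(text[start:i + 1])  # weak pair: cut between i and i+1
            start = i + 1
    pieces.append(text[start:])
    return pieces


chars, pairs = train("abababcdcdcd")
print(segment("abcd", chars, pairs, threshold=0.2))
print(segment("abab", chars, pairs, threshold=0.2))
```

    With this toy corpus the pair "bc" is rare relative to its neighbours and falls below the threshold, so "abcd" is cut between "b" and "c", while "abab" has no pair under the threshold and stays whole. Ties between neighbouring minima are treated as breaks here, which is a design choice of the sketch.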

    HYBRID TEXT SEGMENTATION USING N-GRAMS AND LEXICAL INFORMATION
    5.
    Patent application
    HYBRID TEXT SEGMENTATION USING N-GRAMS AND LEXICAL INFORMATION (in force)

    Publication No.: US20080243487A1

    Publication date: 2008-10-02

    Application No.: US11693324

    Filing date: 2007-03-29

    IPC classification: G06F17/21

    CPC classification: G06F17/277

    Abstract: A hybrid n-gram/lexical analysis tokenization system including a lexicon and a hybrid tokenizer operative to perform both N-gram tokenization of a text and lexical analysis tokenization of a text using the lexicon, and to construct either of an index and a classifier from the results of both of the N-gram tokenization and the lexical analysis tokenization, where the hybrid tokenizer is implemented in at least one of computer hardware and computer software and is embodied within a computer-readable medium.


    Efficient Implementation of Morphology for Agglutinative Languages
    6.
    Patent application
    Efficient Implementation of Morphology for Agglutinative Languages (in force)

    Publication No.: US20080243478A1

    Publication date: 2008-10-02

    Application No.: US11692228

    Filing date: 2007-03-28

    IPC classification: G06F17/27

    CPC classification: G06F17/2755

    Abstract: A method for constructing an automaton for automated analysis of agglutinative languages, the method including constructing an affix automaton for each of a plurality of affix types of an agglutinative language, where each of the affix types is associated with one or more affixes associated with a morphological concept, combining any of the affix automatons to form a plurality of template automatons, where each of the template automatons is patterned after any of a plurality of agglutination templates of any of the affix types for the language, and combining the template automatons into a master automaton.

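    The three-level construction in this abstract (affix automata, template automata, master automaton) can be sketched with a small nondeterministic automaton that uses epsilon transitions. The class and function names are hypothetical, and the suffix inventory is a toy set loosely modeled on Turkish suffixes that ignores vowel harmony; the patent's actual construction may differ.

```python
from itertools import count

_ids = count()  # source of fresh state identifiers


class NFA:
    """Automaton with one start state, one accept state, epsilon edges."""

    def __init__(self):
        self.start, self.accept = next(_ids), next(_ids)
        self.edges = {}  # (state, symbol or None for epsilon) -> set of states

    def add(self, src, symbol, dst):
        self.edges.setdefault((src, symbol), set()).add(dst)


def affix_automaton(affixes):
    """Accepts exactly one affix of a given affix type."""
    nfa = NFA()
    for affix in affixes:
        state = nfa.start
        for ch in affix:
            nxt = next(_ids)
            nfa.add(state, ch, nxt)
            state = nxt
        nfa.add(state, None, nfa.accept)
    return nfa


def template_automaton(affix_types):
    """Chains freshly built affix automata in the order of one template."""
    out = NFA()
    prev = out.start
    for affixes in affix_types:
        part = affix_automaton(affixes)
        out.edges.update(part.edges)
        out.add(prev, None, part.start)
        prev = part.accept
    out.add(prev, None, out.accept)
    return out


def master_automaton(templates):
    """Union of all template automata."""
    out = NFA()
    for t in templates:
        out.edges.update(t.edges)
        out.add(out.start, None, t.start)
        out.add(t.accept, None, out.accept)
    return out


def accepts(nfa, word):
    """Standard subset simulation with epsilon closure."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            for nxt in nfa.edges.get((stack.pop(), None), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    current = closure({nfa.start})
    for ch in word:
        current = closure({d for s in current for d in nfa.edges.get((s, ch), ())})
    return nfa.accept in current


# Toy affix types and agglutination templates (assumptions for illustration).
PLURAL, POSSESSIVE, CASE = ["ler", "lar"], ["im", "in"], ["de", "da"]
master = master_automaton([
    template_automaton([PLURAL]),
    template_automaton([PLURAL, CASE]),
    template_automaton([POSSESSIVE, CASE]),
])
print(accepts(master, "lerde"), accepts(master, "deler"))
```

    Each template builds its own copies of the affix automata so that states are never shared between templates; sharing them would let a path start in one template and finish in another.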

    Efficient implementation of morphology for agglutinative languages
    7.
    Granted patent
    Efficient implementation of morphology for agglutinative languages (in force)

    Publication No.: US09218336B2

    Publication date: 2015-12-22

    Application No.: US11692228

    Filing date: 2007-03-28

    IPC classification: G06F17/28 G06F17/27

    CPC classification: G06F17/2755

    Abstract: Constructing an automaton for automated analysis of agglutinative languages comprises: constructing an affix automaton for each of a plurality of affix types of an agglutinative language, where each of the affix types is associated with one or more affixes associated with a morphological concept; combining any of the affix automatons to form a plurality of template automatons, where each of the template automatons is patterned after any of a plurality of agglutination templates of any of the affix types for the language; and combining the template automatons into a master automaton.


    UNSUPERVISED STEMMING SCHEMA LEARNING AND LEXICON ACQUISITION FROM CORPORA
    8.
    Patent application
    UNSUPERVISED STEMMING SCHEMA LEARNING AND LEXICON ACQUISITION FROM CORPORA (lapsed)

    Publication No.: US20090150415A1

    Publication date: 2009-06-11

    Application No.: US11953572

    Filing date: 2007-12-10

    IPC classification: G06F17/30

    CPC classification: G06F17/30731

    Abstract: Illustrated embodiments provide a computer implemented method, an apparatus, and a computer program product for unsupervised stemming schema learning and lexicon acquisition from corpora. In one illustrative embodiment, the computer implemented method obtains a corpus from corpora, analyzes the corpus to deduce a set of possible stemming schema and reviews and revises the set of possible stemming schema, to create a pruned set of stemming schema. The computer implemented method further deduces a lexicon from the corpus using the pruned set of stemming schema.


    EFFICIENT STEMMING OF SEMITIC LANGUAGES
    9.
    Patent application
    EFFICIENT STEMMING OF SEMITIC LANGUAGES (in force)

    Publication No.: US20090150140A1

    Publication date: 2009-06-11

    Application No.: US11951388

    Filing date: 2007-12-06

    IPC classification: G06F17/28

    CPC classification: G06F17/2755

    Abstract: A system for stemming words of Semitic languages, the system including an affix scanner configured to scan a word of a Semitic language for at least one affix according to a predefined scanning sequence and determine if at least one predefined scanning criterion is met, and a stemmer configured to remove the affix from the word if the predefined scanning criterion is met.


    Learning word segmentation from non-white space languages corpora
    10.
    Granted patent
    Learning word segmentation from non-white space languages corpora (lapsed)

    Publication No.: US08165869B2

    Publication date: 2012-04-24

    Application No.: US11953635

    Filing date: 2007-12-10

    IPC classification: G06F17/27 G06F17/20

    CPC classification: G06F17/2863 G06F17/277

    Abstract: Illustrative embodiments provide a computer implemented method, apparatus, and computer program product for learning word segmentation from non-white space language corpora. In one illustrative embodiment, the computer implemented method receives text input characters and calculates a ratio-measure for each pair of characters in the input characters. The computer implemented method further determines whether the ratio-measure of each pair of characters is equal to a predetermined threshold value. Responsive to determining the ratio-measure is less than the predetermined threshold value, and a local-minimum value, the computer method further identifies the pair as a weak pair and breaks the weak pair of characters.
