Representing n-gram language models for compact storage and fast retrieval
    1.
    Type: Granted invention patent
    Status: In force

    Publication number: US08175878B1

    Publication date: 2012-05-08

    Application number: US12968108

    Filing date: 2010-12-14

    IPC classification: G10L15/18 G10L15/06 G06F17/27

    Abstract: Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.
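
The trie idea in this abstract can be illustrated with a minimal sketch. It is not the patented encoding: the nested-dictionary layout, the names `NGramTrie` and `string_log_prob`, the reading of each stored probability as the probability of the last word given the preceding words, and the `floor` value for unseen words are all assumptions made for illustration.

```python
import math


def make_node():
    # Each trie node holds an optional probability and its child words.
    return {"prob": None, "children": {}}


class NGramTrie:
    """Hypothetical trie keyed by word sequences (not the patent's layout)."""

    def __init__(self):
        self.root = make_node()

    def insert(self, ngram, prob):
        node = self.root
        for word in ngram:
            node = node["children"].setdefault(word, make_node())
        node["prob"] = prob

    def lookup(self, ngram):
        node = self.root
        for word in ngram:
            node = node["children"].get(word)
            if node is None:
                return None
        return node["prob"]


def string_log_prob(trie, words, order=2, floor=1e-6):
    """Score a word string with the longest stored n-gram at each position.

    Assumption: shorter n-grams are used unweighted when a longer one is
    missing, and `floor` stands in for words that are not stored at all.
    """
    total = 0.0
    for i in range(len(words)):
        prob = None
        for n in range(min(order, i + 1), 0, -1):
            prob = trie.lookup(tuple(words[i - n + 1 : i + 1]))
            if prob is not None:
                break
        total += math.log(prob if prob is not None else floor)
    return total
```

Shared prefixes are stored once, which is where a trie saves space compared with a flat table of n-grams.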

    Representing n-gram language models for compact storage and fast retrieval
    2.
    Type: Granted invention patent
    Status: In force

    Publication number: US07877258B1

    Publication date: 2011-01-25

    Application number: US11693613

    Filing date: 2007-03-29

    IPC classification: G10L15/18 G10L15/06 G06F17/27

    Abstract: Systems, methods, and apparatuses, including computer program products, are provided for representing language models. In some implementations, a computer-implemented method is provided. The method includes generating a compact language model including receiving a collection of n-grams from the corpus, each n-gram of the collection having a corresponding first probability of occurring in the corpus and generating a trie representing the collection of n-grams. The method also includes using the language model to identify a second probability of a particular string of words occurring.

    Identifying gibberish content in resources
    3.
    Type: Granted invention patent
    Status: In force

    Publication number: US08554769B1

    Publication date: 2013-10-08

    Application number: US12486626

    Filing date: 2009-06-17

    IPC classification: G06F17/30

    CPC classification: G06F17/30864

    Abstract: This specification describes technologies relating to providing search results. One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a network resource, the network resource including text content; generating a language model score for the resource including applying a language model to the text content of the resource; generating a query stuffing score for the reference, the query stuffing score being a function of term frequency in the resource content and a query index; calculating a gibberish score for the resource using the language model score and the query stuffing score; and using the calculated gibberish score to determine whether to modify a ranking score of the resource.
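
A rough sketch of how the two signals in this abstract might be combined is shown here. The abstract does not give the formulas, so everything specific is an assumption: the unigram model, the use of a set of query terms as the "query index", the weighted sum, and the threshold and penalty values are hypothetical choices for illustration only.

```python
import math
from collections import Counter


def language_model_score(tokens, unigram_probs, floor=1e-6):
    """Average negative log-probability per token (higher = less fluent).

    Assumption: a unigram model stands in for whatever model is applied.
    """
    if not tokens:
        return 0.0
    total = sum(-math.log(unigram_probs.get(t, floor)) for t in tokens)
    return total / len(tokens)


def query_stuffing_score(tokens, query_index):
    """Share of the text made up of terms found in the query index.

    Assumption: `query_index` is simply a set of frequent query terms.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    stuffed = sum(c for term, c in counts.items() if term in query_index)
    return stuffed / len(tokens)


def gibberish_score(tokens, unigram_probs, query_index, lm_weight=0.1):
    # Hypothetical combination: a weighted sum of the two signals.
    lm = language_model_score(tokens, unigram_probs)
    stuffing = query_stuffing_score(tokens, query_index)
    return lm_weight * lm + stuffing


def adjust_ranking(ranking_score, gibberish, threshold=1.5, penalty=0.5):
    # Demote the resource only when the gibberish score crosses a threshold.
    return ranking_score * penalty if gibberish > threshold else ranking_score
```

Keeping the two scores as separate functions makes it easy to swap in a different language model or a different combination rule.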

    Machine translation using information retrieval
    4.
    Type: Granted invention patent
    Status: In force

    Publication number: US08972432B2

    Publication date: 2015-03-03

    Application number: US12108415

    Filing date: 2008-04-23

    IPC classification: G06F17/00 G06F17/30

    CPC classification: G06F17/2827

    Abstract: Systems, methods, and apparatuses, including computer program products, are provided for machine translation using information retrieval techniques. In general, in one implementation, a method is provided. The method includes providing a received input segment as a query to a search engine, the search engine searching an index of one or more collections of documents, receiving one or more candidate segments in response to the query, determining a similarity of each candidate segment to the received input segment, and for one or more candidate segments having a determined similarity that exceeds a threshold similarity, providing a translated target segment corresponding to the respective candidate segment.
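
The retrieve, compare, and threshold flow described in this abstract can be sketched as follows. The in-memory inverted index, the Jaccard token overlap used as the similarity measure, the 0.6 threshold, and the names `SegmentIndex` and `translate` are all assumptions for illustration; the abstract does not specify any of them.

```python
from collections import defaultdict


class SegmentIndex:
    """Toy stand-in for a search engine over (source, target) segment pairs."""

    def __init__(self, pairs):
        self.pairs = list(pairs)
        self.inverted = defaultdict(set)
        for i, (source, _target) in enumerate(self.pairs):
            for token in source.lower().split():
                self.inverted[token].add(i)

    def search(self, query):
        # Return ids of every stored segment sharing at least one token.
        ids = set()
        for token in query.lower().split():
            ids |= self.inverted.get(token, set())
        return sorted(ids)


def jaccard(a, b):
    # Assumed similarity measure: token-set overlap.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def translate(segment, index, threshold=0.6):
    """Return stored translations whose source is similar enough to `segment`."""
    scored = []
    for i in index.search(segment):
        source, target = index.pairs[i]
        similarity = jaccard(segment, source)
        if similarity > threshold:
            scored.append((similarity, target))
    return [target for _sim, target in sorted(scored, reverse=True)]
```

For example, with the pairs `("good morning", "bonjour")` and `("good night", "bonne nuit")` indexed, the query `"good morning"` retrieves both candidates, but only the first exceeds the threshold.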

    Context-based filtering of search results
    5.
    Type: Granted invention patent
    Status: In force

    Publication number: US08762368B1

    Publication date: 2014-06-24

    Application number: US13459540

    Filing date: 2012-04-30

    IPC classification: G06F17/30

    Abstract: A server is configured to receive, from a client, a query and context information associated with a document; obtain search results, based on the query, that identify documents relevant to the query; analyze the context information to identify content; generate first scores for a hierarchy of topics, that correspond to measures of relevance of the topics to the content; select a topic that is most relevant to the context information when the topic is associated with a greatest first score; generate second scores for the search results that correspond to measures of relevance, of the search results, to the topic; select one or more of the search results as being most relevant to the topic when the search results are associated with one or more greatest second scores; generate a search result document that includes the selected search results; and send, to a client, the search result document.
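
The two scoring passes in this abstract (topics against the context, then results against the winning topic) can be sketched as below. The keyword-overlap relevance measure, the flattening of the topic hierarchy into path-like keys such as `"vehicles/cars"`, and the names `overlap_score` and `filter_results` are assumptions for illustration only.

```python
def overlap_score(text, keywords):
    # Assumed relevance measure: number of tokens that are topic keywords.
    return sum(1 for token in text.lower().split() if token in keywords)


def filter_results(context, results, topics, top_k=2):
    """Pick the topic that best matches the context, then rank results by it.

    `topics` maps a hypothetical topic path to a set of keywords.
    """
    # First scores: relevance of each topic to the context content.
    first_scores = {name: overlap_score(context, kws) for name, kws in topics.items()}
    best_topic = max(first_scores, key=first_scores.get)

    # Second scores: relevance of each search result to the chosen topic.
    keywords = topics[best_topic]
    ranked = sorted(results, key=lambda r: overlap_score(r, keywords), reverse=True)
    return best_topic, ranked[:top_k]
```

With a context about a "sedan engine", an ambiguous query such as "jaguar" would keep the car-related results and drop the wildlife ones.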

    Large language models in machine translation
    6.
    Type: Granted invention patent
    Status: In force

    Publication number: US08332207B2

    Publication date: 2012-12-11

    Application number: US11767436

    Filing date: 2007-06-22

    IPC classification: G06F17/27

    Abstract: Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n-1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus.
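
One simple scoring scheme of the shape described here (relative frequency when the n-gram was seen, otherwise a fixed factor times the score of the shorter n-gram) can be sketched as follows. The function names, the toy corpus handling, and the factor value 0.4 are illustrative assumptions, not details taken from the abstract.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count every n-gram of order 1..max_order in the token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts


def backoff_score(ngram, counts, total_tokens, alpha=0.4):
    """Relative frequency with a fixed backoff factor `alpha` (assumed value).

    If the n-gram was seen, return count(ngram) / count(context).
    Otherwise drop the first word, multiply by alpha, and try again.
    The result is a score, not a normalized probability.
    """
    ngram = tuple(ngram)
    factor = 1.0
    while len(ngram) > 1:
        if counts[ngram] > 0:
            return factor * counts[ngram] / counts[ngram[:-1]]
        ngram = ngram[1:]
        factor *= alpha
    return factor * counts[ngram] / total_tokens
```

Because the factor is a constant, no per-context normalization has to be computed or stored, which keeps very large models cheap to build.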

    Randomized language models
    7.
    Type: Granted invention patent
    Status: In force

    Publication number: US08209178B1

    Publication date: 2012-06-26

    Application number: US11972349

    Filing date: 2008-01-10

    IPC classification: G06F17/27 G10L15/00 G10L15/28

    Abstract: Systems, methods, and apparatuses including computer program products are provided for encoding and using a language model. In one implementation, a method is provided. The method includes generating a compact language model, including receiving a collection of n-grams, each n-gram having one or more associated parameter values, determining a fingerprint for each n-gram of the collection of n-grams, identifying locations in an array for each n-gram using a plurality of hash functions, and encoding the one or more parameter values associated with each n-gram in the identified array locations as a function of corresponding array values and the fingerprint for the n-gram.
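
The encoding described here resembles an XOR-based scheme in which a value is recovered by combining a fingerprint with the array cells that several hash functions point to. The sketch below is one way to realize that idea under stated assumptions: SHA-256 as the hash family, one array segment per hash function, small non-negative integer codes as the parameter values, and a peeling pass to find a safe order for filling cells. The class name `RandomizedLM` and all parameters are hypothetical.

```python
import hashlib


def _h(key, seed, salt):
    data = f"{seed}|{salt}|{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class RandomizedLM:
    """Stores key -> small integer code without storing the keys themselves."""

    def __init__(self, values, k=3, bits=16, max_seeds=100):
        self.k, self.mask = k, (1 << bits) - 1
        self.seg = max(4, 2 * len(values))      # cells per hash function
        self.max_value = max(values.values())
        for seed in range(max_seeds):           # retry until peeling succeeds
            self.seed = seed
            order = self._peel(list(values))
            if order is not None:
                break
        else:
            raise RuntimeError("no peeling order found; enlarge the array")
        self.array = [0] * (self.k * self.seg)
        # Fill cells in reverse peeling order so earlier cells stay valid.
        for key, slot in reversed(order):
            x = values[key] ^ self._fingerprint(key)
            for loc in self._locations(key):
                if loc != slot:
                    x ^= self.array[loc]
            self.array[slot] = x

    def _fingerprint(self, key):
        return _h(key, self.seed, "fp") & self.mask

    def _locations(self, key):
        return [i * self.seg + _h(key, self.seed, i) % self.seg for i in range(self.k)]

    def _peel(self, keys):
        users = {}
        for key in keys:
            for loc in self._locations(key):
                users.setdefault(loc, set()).add(key)
        order, remaining = [], set(keys)
        queue = [loc for loc, ks in users.items() if len(ks) == 1]
        while queue:
            loc = queue.pop()
            if len(users[loc]) != 1:
                continue                        # stale queue entry
            (key,) = users[loc]
            order.append((key, loc))
            remaining.discard(key)
            for other in self._locations(key):
                users[other].discard(key)
                if len(users[other]) == 1:
                    queue.append(other)
        return order if not remaining else None

    def lookup(self, key):
        x = self._fingerprint(key)
        for loc in self._locations(key):
            x ^= self.array[loc]
        # Out-of-range results signal a key that was (probably) never stored.
        return x if x <= self.max_value else None
```

Stored keys always decode to their value. An unseen key decodes to an effectively random number, so it is rejected unless that number happens to fall in the valid range; more fingerprint bits make such false positives rarer.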

    Semantic unit recognition
    8.
    Type: Granted invention patent
    Status: In force

    Publication number: US08140321B1

    Publication date: 2012-03-20

    Application number: US12503771

    Filing date: 2009-07-15

    IPC classification: G06F17/20

    CPC classification: G06F17/277 G06F17/2785

    Abstract: A semantic locator determines whether input sequences form semantically meaningful units. The semantic locator includes a coherence component that calculates a coherence of the terms in the sequence and a variation component that calculates the variation in terms that surround the sequence. A heuristics component may additionally refine results of the coherence component and the variation component. A decision component may make the determination of whether the sequence is a semantic unit based on the results of the coherence component, variation component, and heuristics component.
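
The four components named in this abstract can be mirrored in a small sketch. The specific measures are assumptions: coherence is approximated by a pointwise-mutual-information style ratio, variation by the number of distinct neighbouring words, the heuristic by a stopword check at the sequence edges, and the thresholds are arbitrary illustrative values.

```python
import math
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and"}  # hypothetical heuristic list


def find_occurrences(tokens, seq):
    n = len(seq)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i : i + n]) == tuple(seq)]


def coherence(tokens, seq):
    """Log ratio of the observed sequence rate to the rate expected by chance."""
    hits = find_occurrences(tokens, seq)
    if not hits:
        return float("-inf")
    total = len(tokens)
    unigram = Counter(tokens)
    p_seq = len(hits) / total
    p_independent = math.prod(unigram[t] / total for t in seq)
    return math.log(p_seq / p_independent)


def variation(tokens, seq):
    """Number of distinct neighbours, taking the smaller of the two sides."""
    hits, n = find_occurrences(tokens, seq), len(seq)
    left = {tokens[i - 1] for i in hits if i > 0}
    right = {tokens[i + n] for i in hits if i + n < len(tokens)}
    return min(len(left), len(right))


def passes_heuristics(seq):
    return seq[0] not in STOPWORDS and seq[-1] not in STOPWORDS


def is_semantic_unit(tokens, seq, min_coherence=1.0, min_variation=2):
    # Decision component: all three signals must agree.
    return (
        passes_heuristics(seq)
        and coherence(tokens, seq) >= min_coherence
        and variation(tokens, seq) >= min_variation
    )
```

The intuition is that a real unit such as "new york" sticks together (high coherence) while appearing in many different surroundings (high variation).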

    Semantic unit recognition
    9.
    Type: Granted invention patent
    Status: In force

    Publication number: US07580827B1

    Publication date: 2009-08-25

    Application number: US10748654

    Filing date: 2003-12-31

    IPC classification: G06F17/20

    CPC classification: G06F17/277 G06F17/2785

    Abstract: A semantic locator determines whether input sequences form semantically meaningful units. The semantic locator includes a coherence component that calculates a coherence of the terms in the sequence and a variation component that calculates the variation in terms that surround the sequence. A heuristics component may additionally refine results of the coherence component and the variation component. A decision component may make the determination of whether the sequence is a semantic unit based on the results of the coherence component, variation component, and heuristics component.

    Systems and methods for new event detection
    10.
    Type: Invention patent application
    Status: Lapsed

    Publication number: US20050021324A1

    Publication date: 2005-01-27

    Application number: US10626856

    Filing date: 2003-07-25

    IPC classification: G06F17/27

    CPC classification: G06F17/2785 Y10S707/99936

    Abstract: Techniques for new event detection are provided. For a new story and a corpus of stories, story-pairs based on the new story and each corpus story are determined. Adjustments to the importance of terms are determined based on story characteristics associated with each story. Story characteristics are based on direct or indirect characteristics. Direct story characteristics include authorship, language associated with a story and the like. Indirect story characteristics may include derived characteristics such as an ROI category characteristic, a same ROI characteristic, a same event-same source characteristic, an average story similarity characteristic or any other known or later developed characteristic associated with a story. Adjustments to the inter-story similarity metrics are then determined based on story characteristics and/or a weighting function. New event scores and/or new event categorizations for stories are determined based on the inter-story similarity metrics and the adjustments based on the story characteristics. Optionally new events are selected based on new event scores and a threshold value.
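
The pairwise comparison and adjustment steps in this abstract can be sketched roughly as below. The cosine similarity over raw term counts, the single "same source" characteristic with a fixed discount, and the 0.5 threshold are assumptions chosen for illustration; the abstract lists many more characteristics and does not specify the adjustment function.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def adjusted_similarity(new_story, old_story, same_source_discount=0.1):
    """Similarity of a story pair, adjusted by one story characteristic.

    Assumption: stories from the same source share house vocabulary, so
    their raw similarity is discounted by a fixed, hypothetical amount.
    """
    similarity = cosine(new_story["tokens"], old_story["tokens"])
    if new_story["source"] == old_story["source"]:
        similarity -= same_source_discount
    return max(0.0, similarity)


def new_event_score(new_story, corpus):
    # A story is "new" to the extent that no earlier story resembles it.
    if not corpus:
        return 1.0
    return 1.0 - max(adjusted_similarity(new_story, old) for old in corpus)


def is_new_event(new_story, corpus, threshold=0.5):
    return new_event_score(new_story, corpus) > threshold
```

Each story is represented here as a dictionary with hypothetical `"source"` and `"tokens"` fields; a fuller version would carry the other direct and indirect characteristics the abstract mentions.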
