SYSTEMS AND METHODS FOR STANDARDIZATION AND DE-DUPLICATION OF ADDRESSES USING TAXONOMY
    25.
    Invention application
    Legal status: in force

    Publication number: US20120047179A1

    Publication date: 2012-02-23

    Application number: US12859607

    Filing date: 2010-08-19

    CPC classification number: G06F17/30961

    Abstract: Systems and associated methods for address standardization and applications related thereto are described. Embodiments exploit a common context in a taxonomy and a given address to detect and correct deviations in the address. Embodiments establish a possible path from a root of the taxonomy to a leaf in the taxonomy that can possibly generate a given address. Given a new address, embodiments use complete addresses, and/or segments or elements thereof, to compute the representations of the elements and find a closest matching leaf in the taxonomy. Embodiments then traverse the path to a root node to detect the agreement and disagreement between the path and the address entry. Taxonomical structure is thus used to detect, segregate and standardize the expected fields.
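The leaf-matching and root-ward traversal described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy `TAXONOMY`, the function names, the comma-based segmentation, and the use of `difflib` string similarity in place of whatever element representation the embodiments compute.

```python
from difflib import SequenceMatcher

# Hypothetical toy taxonomy: each leaf maps to its path from the root.
TAXONOMY = {
    "MG Road": ["India", "Karnataka", "Bangalore", "MG Road"],
    "Brigade Road": ["India", "Karnataka", "Bangalore", "Brigade Road"],
    "Park Street": ["India", "West Bengal", "Kolkata", "Park Street"],
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in representation)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_leaf(tokens):
    """Return the leaf whose name best matches any address segment."""
    return max(TAXONOMY, key=lambda leaf: max(similarity(leaf, t) for t in tokens))


def standardize(address, min_ratio=0.7):
    """Match a leaf, then walk leaf -> root recording agreement per level."""
    tokens = [t.strip() for t in address.split(",") if t.strip()]
    leaf = closest_leaf(tokens)
    report = []
    for node in reversed(TAXONOMY[leaf]):
        best = max(tokens, key=lambda t: similarity(node, t))
        agrees = similarity(node, best) >= min_ratio
        # None marks a taxonomy level with no agreeing segment in the address.
        report.append((node, best if agrees else None))
    standardized = ", ".join(node for node, _ in report)
    return standardized, report


standardized, report = standardize("M.G. Raod, Bangalor, India")
print(standardized)
for node, matched in report:
    print(node, "<-", matched)
```

In this sketch the misspelled segments are mapped onto taxonomy nodes, and a level that the input omits (here the state) is reported with `None` and filled in from the path. The threshold `min_ratio` is an arbitrary illustrative value.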


    DYNAMICALLY DETECTING NEAR-DUPLICATE DOCUMENTS
    26.
    Invention application
    Legal status: in force

    Publication number: US20110029491A1

    Publication date: 2011-02-03

    Application number: US12511175

    Filing date: 2009-07-29

    CPC classification number: G06F17/30675

    Abstract: Techniques for detecting one or more documents that are duplicate or near-duplicate of a first document are provided. The techniques include obtaining a first document, obtaining one or more additional documents, retrieving a set of one or more document signatures for each document, and detecting one or more documents that are duplicate or near-duplicate of the first document by detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document, wherein detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document comprises dynamically using at least one of a user-configurable similarity definition and a user-configurable similarity threshold value.
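A minimal sketch of signature-overlap detection with a user-configurable similarity definition and threshold follows. The signature scheme (hashed word shingles), the function names, and the two similarity definitions are assumptions chosen for illustration; the abstract does not specify how signatures are computed.

```python
import hashlib


def signatures(text, k=3):
    """Hypothetical signature scheme: truncated hashes of k-word shingles."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()[:8] for s in shingles}


def shared_count(a, b):
    """Similarity definition 1: number of signatures in common."""
    return len(a & b)


def jaccard(a, b):
    """Similarity definition 2: shared signatures over all signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(first, others, similarity=shared_count, threshold=2):
    """Return the documents whose similarity to `first` meets the threshold."""
    first_sigs = signatures(first)
    return [doc for doc in others
            if similarity(first_sigs, signatures(doc)) >= threshold]


first = "the quick brown fox jumps over the lazy dog"
others = [
    "the quick brown fox jumps over a lazy dog",
    "completely unrelated text about patents",
]
print(near_duplicates(first, others))                           # count-based
print(near_duplicates(first, others, jaccard, threshold=0.5))   # stricter ratio
```

Passing the similarity function and threshold as arguments lets the caller change both at query time without recomputing stored signatures, which is one way to read "dynamically using" in the abstract.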


    Methods, apparatus and computer programs for characterizing web resources

    Publication number: US20060026496A1

    Publication date: 2006-02-02

    Application number: US10901275

    Filing date: 2004-07-28

    CPC classification number: G06F17/30864 G06F17/30896

    Abstract: Methods, apparatus and computer programs are provided for characterizing Web-based information resources based on their interactions. A Web-based information resource is a single Web document or a collection of related Web documents. Unlike simple text documents, Web documents contain hyperlinks and other HTML tags. Different types of interactions, including inbound hyperlinks, outbound hyperlinks and internal links associated with a Web-based information resource, are used to characterize the Web-based information resource. A DOM tree representing the tag structure of a Web-based information resource is used to identify text items likely to be useful as context for a hyperlink anchor text, and the anchor text is combined with the context to generate a representation. The representation of Web-based information resources based on interactions can be used for clustering and classification, and in Web mining applications such as query disambiguation and automatic taxonomy generation.
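The idea of combining hyperlink anchor text with context taken from the tag structure can be sketched as below. This is an illustrative assumption, not the patented method: the class name `AnchorContext` is hypothetical, the context is simply the full text of the element that directly encloses each `<a>` tag, and mismatched tags are ignored rather than repaired.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped in this sketch.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class AnchorContext(HTMLParser):
    """Collect (href, anchor text, text of the enclosing element) triples."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, outermost first
        self.results = []  # (href, anchor_text, context_text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append({"tag": tag, "href": dict(attrs).get("href"),
                           "text": [], "anchors": []})

    def handle_data(self, data):
        # Every open element accumulates the text of its whole subtree.
        for frame in self.stack:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1]["tag"] != tag:
            return  # mismatched or void tag: ignored in this sketch
        frame = self.stack.pop()
        text = " ".join("".join(frame["text"]).split())
        if tag == "a" and self.stack:
            # Remember the anchor on its parent until the parent closes.
            self.stack[-1]["anchors"].append((frame["href"], text))
        for href, anchor in frame["anchors"]:
            self.results.append((href, anchor, text))


parser = AnchorContext()
parser.feed('<ul><li>Patent search: <a href="/q">prior art</a> lookup</li></ul>')
for href, anchor, context in parser.results:
    print(href, "|", anchor, "|", context)
```

Each resulting triple pairs the anchor text with the surrounding item text, which could then feed a clustering or classification step of the kind the abstract mentions.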

    System and a method for generating semantically similar sentences for building a robust SLM
    29.
    Granted invention patent
    Legal status: in force

    Publication number: US09135237B2

    Publication date: 2015-09-15

    Application number: US13181923

    Filing date: 2011-07-13

    CPC classification number: G06F17/274 G06F17/2795 G06F17/2881 G10L15/26

    Abstract: A system and method are described for generating semantically similar sentences for a statistical language model. A semantic class generator determines for each word in an input utterance a set of corresponding semantically similar words. A sentence generator computes a set of candidate sentences each containing at most one member from each set of semantically similar words. A sentence verifier grammatically tests each candidate sentence to determine a set of grammatically correct sentences semantically similar to the input utterance. Also note that the generated semantically similar sentences are not restricted to be selected from an existing sentence database.
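The three components named in this abstract (semantic class generator, sentence generator, sentence verifier) can be sketched as a small pipeline. The lexicon `SIMILAR`, the function names, and the article-agreement check used as a stand-in grammar test are all hypothetical; a real verifier would use a proper grammar or parser.

```python
from itertools import product

# Hypothetical lexicon of semantically similar words, for illustration only.
SIMILAR = {
    "book": {"book", "reserve"},
    "a": {"a", "an"},
    "flight": {"flight", "airplane ticket"},
}


def semantic_classes(utterance):
    """Semantic class generator: one set of alternatives per input word."""
    return [sorted(SIMILAR.get(w, {w})) for w in utterance.lower().split()]


def candidates(utterance):
    """Sentence generator: pick one member from each set of alternatives."""
    return [" ".join(choice) for choice in product(*semantic_classes(utterance))]


def is_grammatical(sentence):
    """Toy verifier: checks only a/an agreement with the following word."""
    words = sentence.split()
    for word, nxt in zip(words, words[1:]):
        starts_with_vowel = nxt[0] in "aeiou"
        if word == "a" and starts_with_vowel:
            return False
        if word == "an" and not starts_with_vowel:
            return False
    return True


def similar_sentences(utterance):
    """Keep only the candidates that pass the verifier."""
    return [s for s in candidates(utterance) if is_grammatical(s)]


for sentence in similar_sentences("book a flight"):
    print(sentence)
```

Because candidates are built combinatorially and then filtered, the surviving sentences need not exist in any stored corpus, which matches the abstract's closing remark.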

