SYSTEMS AND METHODS FOR STANDARDIZATION AND DE-DUPLICATION OF ADDRESSES USING TAXONOMY
    26.
    Patent application
    Legal status: in force

    Publication number: US20120047179A1

    Publication date: 2012-02-23

    Application number: US12859607

    Filing date: 2010-08-19

    CPC classification number: G06F17/30961

    Abstract: Systems and associated methods for address standardization and applications related thereto are described. Embodiments exploit a common context in a taxonomy and a given address to detect and correct deviations in the address. Embodiments establish a possible path from a root of the taxonomy to a leaf in the taxonomy that could generate a given address. Given a new address, embodiments use complete addresses, and/or segments or elements thereof, to compute the representations of the elements and find a closest matching leaf in the taxonomy. Embodiments then traverse the path to a root node to detect the agreement and disagreement between the path and the address entry. The taxonomical structure is thus used to detect, segregate and standardize the expected fields.

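Illustrative sketch (not taken from the patent): the leaf-matching and root-ward traversal described in the abstract might look roughly like the following. The `Node` class, the comma-based tokenization, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.8 threshold, and the sample place names are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Node:
    """One taxonomy entry (hypothetical structure for this sketch)."""
    name: str
    level: str                       # e.g. "city", "area", "street"
    parent: Optional["Node"] = None


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_token(tokens, name):
    """Return (token, score) for the address token closest to a taxonomy name."""
    return max(((t, similarity(t, name)) for t in tokens), key=lambda p: p[1])


def standardize(address: str, leaves: list[Node], threshold: float = 0.8):
    """Map a free-form address onto a taxonomy path and report deviations."""
    tokens = [t.strip() for t in address.split(",") if t.strip()]
    # Step 1: pick the leaf whose name is closest to any address token.
    leaf = max(leaves, key=lambda n: best_token(tokens, n.name)[1])
    # Step 2: walk from that leaf to the root, comparing each level
    # of the path with the address entry.
    fields, report = {}, {}
    node = leaf
    while node is not None:
        _, score = best_token(tokens, node.name)
        fields[node.level] = node.name   # standardized value from the taxonomy
        report[node.level] = "agree" if score >= threshold else "disagree"
        node = node.parent
    return fields, report


# Made-up taxonomy: Springfield > Riverside > {Maple Avenue, Station Road}
city = Node("Springfield", "city")
area = Node("Riverside", "area", parent=city)
leaves = [Node("Maple Avenue", "street", parent=area),
          Node("Station Road", "street", parent=area)]

fields, report = standardize("Mapel Avenue, Riversid, Springfield", leaves)
print(fields)
print(report)
```

Here the misspelled tokens still resolve to the taxonomy's canonical names, and any level whose best token falls below the threshold is flagged as a disagreement rather than silently accepted.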

    DYNAMICALLY DETECTING NEAR-DUPLICATE DOCUMENTS
    27.
    Patent application
    Legal status: in force

    Publication number: US20110029491A1

    Publication date: 2011-02-03

    Application number: US12511175

    Filing date: 2009-07-29

    CPC classification number: G06F17/30675

    Abstract: Techniques for detecting one or more documents that are duplicate or near-duplicate of a first document are provided. The techniques include obtaining a first document, obtaining one or more additional documents, retrieving a set of one or more document signatures for each document, and detecting one or more documents that are duplicate or near-duplicate of the first document by detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document, wherein detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document comprises dynamically using at least one of a user-configurable similarity definition and a user-configurable similarity threshold value.

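Illustrative sketch (not taken from the patent): the abstract does not say how document signatures are produced, so this example assumes they are SHA-1 hashes of overlapping three-word shingles. The function names, the two similarity definitions, and the sample texts are hypothetical; the point is only to show a caller-supplied similarity definition and threshold.

```python
import hashlib


def signatures(text: str, k: int = 3) -> set[str]:
    """Hash every k-word shingle of the text (assumed signature scheme)."""
    words = text.lower().split()
    count = max(1, len(words) - k + 1)
    shingles = {" ".join(words[i:i + k]) for i in range(count)}
    return {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in shingles}


def common_count(a: set[str], b: set[str]) -> float:
    """Similarity definition 1: number of signatures in common."""
    return len(a & b)


def jaccard(a: set[str], b: set[str]) -> float:
    """Similarity definition 2: shared signatures over all signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(first: str, others: dict[str, str],
                    similarity=common_count, threshold: float = 5):
    """Return names of documents whose similarity to `first` meets the threshold."""
    first_sigs = signatures(first)
    return [name for name, text in others.items()
            if similarity(first_sigs, signatures(text)) >= threshold]


first = "the quick brown fox jumps over the lazy dog today"
others = {
    "near": "the quick brown fox jumps over the lazy dog tonight",
    "other": "completely unrelated text about patent filing procedures",
}
print(near_duplicates(first, others))                       # count-based definition
print(near_duplicates(first, others, jaccard, threshold=0.7))  # ratio-based definition
```

Passing the similarity function and threshold as arguments is one simple way to make both user-configurable at call time.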

    Methods, apparatus and computer programs for characterizing web resources

    Publication number: US20060026496A1

    Publication date: 2006-02-02

    Application number: US10901275

    Filing date: 2004-07-28

    CPC classification number: G06F17/30864 G06F17/30896

    Abstract: Methods, apparatus and computer programs are provided for characterizing Web-based information resources based on their interactions. A Web-based information resource is a single Web document or a collection of related Web documents. Unlike simple text documents, Web documents contain hyperlinks and other HTML tags. Different types of interactions, including inbound hyperlinks, outbound hyperlinks and internal links associated with a Web-based information resource, are used to characterize the Web-based information resource. A DOM tree representing the tag structure of a Web-based information resource is used to identify text items likely to be useful as context for a hyperlink anchor text, and the anchor text is combined with the context to generate a representation. The representation of Web-based information resources based on interactions can be used for clustering and classification, and in Web mining applications such as query disambiguation and automatic taxonomy generation.
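Illustrative sketch (not taken from the patent): one simple way to pair hyperlink anchor text with nearby context is to treat the text of the enclosing block element as the context. This example uses Python's standard `html.parser` as a flat stand-in for a full DOM walk; the choice of block tags, the class name, and the bag-of-words representation are assumptions for the example.

```python
from collections import Counter
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "td", "div"}   # assumed context boundaries


class AnchorContext(HTMLParser):
    """Collect href, anchor text and enclosing-block text for each link."""

    def __init__(self):
        super().__init__()
        self.block_text = []    # text chunks of the current block
        self.pending = []       # links completed inside the current block
        self.current = None     # link currently being read, if any
        self.links = []         # finished records

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self.current = {"href": dict(attrs).get("href"), "anchor": []}

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.pending.append(self.current)
            self.current = None
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.block_text.append(data)
        if self.current is not None:
            self.current["anchor"].append(data)

    def _flush(self):
        """Close the current block and attach its text to its links."""
        context = " ".join("".join(self.block_text).split())
        for link in self.pending:
            anchor = " ".join("".join(link["anchor"]).split())
            self.links.append(
                {"href": link["href"], "anchor": anchor, "context": context})
        self.block_text, self.pending = [], []

    def close(self):
        super().close()
        self._flush()


def representation(link) -> Counter:
    """Combine anchor text and context into a bag of words."""
    return Counter((link["anchor"] + " " + link["context"]).lower().split())


page = ('<p>Read the <a href="/spec">spec</a> for taxonomy generation.</p>'
        '<p>Unrelated.</p>')
parser = AnchorContext()
parser.feed(page)
parser.close()
links = parser.links
print(links)
print(representation(links[0]))
```

Because the context already contains the anchor words, they are counted twice in the combined representation, which gives the anchor text extra weight.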

    Dynamically detecting near-duplicate documents
    30.
    Granted patent
    Legal status: in force

    Publication number: US09245007B2

    Publication date: 2016-01-26

    Application number: US12511175

    Filing date: 2009-07-29

    CPC classification number: G06F17/30675

    Abstract: Techniques for detecting one or more documents that are duplicate or near-duplicate of a first document are provided. The techniques include obtaining a first document, obtaining one or more additional documents, retrieving a set of one or more document signatures for each document, and detecting one or more documents that are duplicate or near-duplicate of the first document by detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document, wherein detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document comprises dynamically using at least one of a user-configurable similarity definition and a user-configurable similarity threshold value.

