SYSTEMS AND METHODS OF UNIVERSAL RESOURCE LOCATOR NORMALIZATION
    1.
    Patent application
    Status: Pending (published)

    Publication No.: US20090164502A1

    Publication date: 2009-06-25

    Application No.: US11963925

    Filing date: 2007-12-24

    IPC classification: G06F17/30

    CPC classification: G06F16/9566

    Abstract: Disclosed herein are methods, systems, and architectures for normalizing identifiers corresponding to resources using normalization rules that can be generalized for use with different resources. By way of a non-limiting example, an identifier can be a uniform resource locator (URL), and a normalization rule can be used to normalize URLs that correspond to different resources, e.g., content. A normalization rule can be generated by generalizing two or more normalization rules corresponding to different resources, such that a content determinative component is generalized. A normalization rule can be defined to include a context portion used to determine the rule's applicability to an identifier, and a transformation portion that identifies the transformations to be applied to an applicable identifier to yield a normalized form of the URL. A generalization of two or more normalization rules can include a normalization of one or both of the context and transformation portions.
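
The rule structure described in this abstract (a context portion that decides applicability, plus a transformation portion) can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the application: the `NormalizationRule` class, the helper transforms, the `news.example.com` URL layout, and the choice of a regular expression as the context. The `\d+` wildcard stands in for a generalized content determinative component.

```python
import re
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


@dataclass
class NormalizationRule:
    """Hypothetical rule: a context test plus an ordered list of transformations."""
    context: "re.Pattern[str]"               # decides whether the rule applies
    transforms: List[Callable[[str], str]]   # applied in order to applicable URLs

    def applies_to(self, url: str) -> bool:
        return self.context.search(url) is not None

    def normalize(self, url: str) -> str:
        if not self.applies_to(url):
            return url                       # rule not applicable: leave URL as-is
        for transform in self.transforms:
            url = transform(url)
        return url


def lowercase_host(url: str) -> str:
    """Lowercase the host part only."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


def drop_params(*names: str) -> Callable[[str], str]:
    """Build a transform that removes the named query parameters."""
    def transform(url: str) -> str:
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in names]
        return urlunsplit(parts._replace(query=urlencode(kept)))
    return transform


# Generalized rule: the story id (assumed content determinative component)
# is a wildcard, so one rule covers URLs for different stories.
rule = NormalizationRule(
    context=re.compile(r"^https?://news\.example\.com/story/\d+", re.IGNORECASE),
    transforms=[lowercase_host, drop_params("sessionid", "ref")],
)
```

Under these assumptions, `rule.normalize("http://News.Example.com/story/123?sessionid=abc&page=2")` drops the session parameter and lowercases the host, while a URL that fails the context test is returned unchanged.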

    Techniques for detecting duplicate web pages
    2.
    Patent application
    Status: In force

    Publication No.: US20080263026A1

    Publication date: 2008-10-23

    Application No.: US11788505

    Filing date: 2007-04-20

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/2211

    Abstract: Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.
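
The steps in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed method: shingles are taken to be word 3-grams, a shingle's frequency is taken to be the number of pages containing it, and the final comparison uses Jaccard similarity with a hypothetical `min_similarity` cutoff, none of which the abstract specifies.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of word k-grams for one page (assumed shingle definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def modified_shingle_sets(pages: Dict[str, str], k: int = 3,
                          threshold: int = 2) -> Dict[str, Set[str]]:
    """Remove shingles whose frequency across the group exceeds the threshold."""
    per_page = {name: shingles(text, k) for name, text in pages.items()}
    # Aggregate: count how many pages contain each shingle (an assumption).
    freq = Counter(s for sset in per_page.values() for s in sset)
    frequent = {s for s, n in freq.items() if n > threshold}
    return {name: sset - frequent for name, sset in per_page.items()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_duplicates(pages: Dict[str, str], k: int = 3, threshold: int = 2,
                    min_similarity: float = 0.8) -> List[Tuple[str, str]]:
    """Pairs of page names whose modified shingle sets are similar enough."""
    modified = modified_shingle_sets(pages, k, threshold)
    names = sorted(modified)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if jaccard(modified[a], modified[b]) >= min_similarity]
```

Removing the very frequent shingles first keeps shared boilerplate such as navigation menus from inflating the similarity between pages whose actual content differs.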

    Techniques for detecting duplicate web pages
    3.
    Granted patent
    Status: In force

    Publication No.: US07698317B2

    Publication date: 2010-04-13

    Application No.: US11788505

    Filing date: 2007-04-20

    IPC classification: G06F17/00

    CPC classification: G06F17/30864 G06F17/2211

    Abstract: Techniques are disclosed for detecting web pages with duplicate content. In one embodiment, a set of shingles is computed for each page of a group of pages. An aggregate set of shingles is determined based on the sets of shingles computed for the group of pages. A first subset from the aggregate set of shingles is determined by selecting, from the aggregate set, shingles whose frequencies in the aggregate set exceed a specified threshold. A modified set of shingles is generated for each page of the group of pages by removing, from the set of shingles for that page, any shingle included in the first subset. One or more duplicate pages in the group of pages are determined based at least in part on the modified sets of shingles generated for the group of pages.

    Method and Apparatus for Identifying if Two Websites are Co-Owned
    4.
    Patent application
    Status: Pending (published)

    Publication No.: US20090228438A1

    Publication date: 2009-09-10

    Application No.: US12044339

    Filing date: 2008-03-07

    IPC classification: G06F7/06

    CPC classification: G06F21/6218

    Abstract: A method and apparatus are provided for identifying if two websites are co-owned. In one example, the method includes obtaining redirect URL (uniform resource locator) pairs from the Internet, constructing a training set using the redirect URL pairs, constructing a feature set based on the training set, and learning co-ownership decisions based on the feature set and the training set.

    METHOD AND SYSTEM FOR FAST SIMILARITY COMPUTATION IN HIGH DIMENSIONAL SPACE
    5.
    Patent application
    Status: In force

    Publication No.: US20130031059A1

    Publication date: 2013-01-31

    Application No.: US13189696

    Filing date: 2011-07-25

    IPC classification: G06F17/30

    CPC classification: G06F17/30628

    Abstract: Method, system, and programs for computing similarity. Input data is first received from one or more data sources and then analyzed to obtain an input feature vector that characterizes the input data. An index is then generated based on the input feature vector and is used to archive the input data, where the value of the index is computed based on an improved Johnson-Lindenstrauss transformation (FJLT) process. With the improved FJLT process, first, the sign of each feature in the input feature vector is randomly flipped to obtain a flipped vector. A Hadamard transformation is then applied to the flipped vector to obtain a transformed vector. An inner product between the transformed vector and a sparse vector is then computed to obtain a base vector, based on which the value of the index is determined.
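
The three steps named in this abstract (random sign flip, Hadamard transformation, inner product with a sparse vector) can be sketched in plain Python. The details below are assumptions for illustration only: the feature vector length is a power of two, the sparse vectors hold Gaussian weights at a hypothetical `density`, one sparse vector is used per output bit, and the index is taken to be the sign bits of the base vector. The abstract does not say how the index is derived from the base vector.

```python
import random
from typing import Dict, List, Sequence, Tuple


def hadamard_transform(vec: Sequence[float]) -> List[float]:
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    return out


def make_projection(dim: int, num_bits: int, density: float = 0.25,
                    seed: int = 0) -> Tuple[List[int], List[Dict[int, float]]]:
    """Draw the random signs and the sparse vectors once, so they can be reused."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(dim)]
    sparse_rows = [
        {j: rng.gauss(0.0, 1.0) for j in range(dim) if rng.random() < density}
        for _ in range(num_bits)
    ]
    return signs, sparse_rows


def fjlt_index(features: Sequence[float], signs: Sequence[int],
               sparse_rows: Sequence[Dict[int, float]]) -> Tuple[int, ...]:
    flipped = [s * x for s, x in zip(signs, features)]   # 1. random sign flip
    transformed = hadamard_transform(flipped)            # 2. Hadamard transformation
    base = [sum(w * transformed[j] for j, w in row.items())
            for row in sparse_rows]                      # 3. sparse inner products
    # Assumption: the index is the sign pattern of the base vector.
    return tuple(1 if v >= 0 else 0 for v in base)
```

Because the sign pattern ignores positive scaling, two feature vectors that differ only by a positive scale factor receive the same index in this sketch.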

    Multi-step captcha with serial time-consuming decryption of puzzles
    6.
    Granted patent
    Status: In force

    Publication No.: US08522327B2

    Publication date: 2013-08-27

    Application No.: US13206583

    Filing date: 2011-08-10

    IPC classification: H04L29/06

    Abstract: A system and method for implementing a multi-step challenge and response test includes steps or acts of: using an input/output subsystem for presenting a series of challenges to a user that require said user to correctly solve each challenge before a next challenge is revealed to the user; receiving the user's response to each challenge; and submitting a last response in the series of challenges to a server for validation. The method further includes: using a processor device configured to perform for each challenge in the series of challenges: internally validating the response by comparing the user's response to a correct response; and using the user's response, decrypting the next challenge to reveal the next challenge; wherein the next challenge remains obfuscated until a previous challenge is correctly solved.
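
One way to picture the chained decryption described in this abstract is the sketch below. It is illustrative only and not a secure design. The assumptions are: each later challenge is XOR-encrypted with a keystream derived from the previous correct answer via `hashlib.pbkdf2_hmac` (the iteration count stands in for the "time-consuming" step), internal validation compares a salted hash of the response with a stored digest, and the names `build_chain`, `solve_chain`, `SALT`, and `ITERATIONS` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Tuple

ITERATIONS = 20_000          # hypothetical work factor for the slow key derivation
SALT = b"captcha-demo"       # hypothetical fixed salt, for the sketch only


def _keystream(answer: str, length: int) -> bytes:
    """Deliberately slow key derivation from an answer."""
    return hashlib.pbkdf2_hmac("sha256", answer.encode(), SALT, ITERATIONS, dklen=length)


def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, key))


def _digest(answer: str) -> str:
    return hashlib.sha256(SALT + answer.encode()).hexdigest()


def build_chain(steps: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """steps: non-empty list of (challenge, correct_answer) pairs.

    The first challenge is stored in the clear; every later challenge is
    encrypted under the answer to the challenge before it.
    """
    chain = []
    prev_answer = None
    for challenge, answer in steps:
        blob = challenge.encode()
        if prev_answer is not None:
            blob = _xor(blob, _keystream(prev_answer, len(blob)))
        chain.append({"blob": blob, "answer_digest": _digest(answer)})
        prev_answer = answer
    return chain


def solve_chain(chain: List[Dict[str, object]],
                respond: Callable[[str], str]) -> Optional[str]:
    """Walk the chain; return the last response (to be sent to the server) or None."""
    challenge = chain[0]["blob"].decode()
    for i, step in enumerate(chain):
        response = respond(challenge)
        if i == len(chain) - 1:
            return response                      # server validates the final response
        if _digest(response) != step["answer_digest"]:
            return None                          # internal validation failed
        nxt = chain[i + 1]["blob"]
        challenge = _xor(nxt, _keystream(response, len(nxt))).decode()
    return None
```

In this sketch the next challenge stays unreadable until the previous one is answered correctly, and each step pays the key-derivation cost in sequence, which is the serial behavior the title refers to.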

    Mail compression scheme with individual message decompressability
    7.
    Granted patent
    Status: In force

    Publication No.: US07836099B2

    Publication date: 2010-11-16

    Application No.: US11831828

    Filing date: 2007-07-31

    IPC classification: G06F17/30

    CPC classification: H04L51/00 Y10S707/99942

    Abstract: Embodiments of the present invention relate to a two-pass compression scheme that achieves compression performance on par with existing methods while admitting individual message decompression. These methods provide both storage savings and lower end-user latency. They preserve the advantages of standard text compression in exploiting short-range similarities in data, while introducing a second step to take advantage of long-range similarities often present in certain types of structured data, e.g., email archival files.
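
The abstract does not spell out the two passes, so the sketch below swaps in a familiar stand-in technique rather than the patented scheme: a first pass gathers lines that recur across messages into a shared preset dictionary (long-range similarity), and a second pass compresses each message on its own with `zlib` using that dictionary (short-range similarity). Function names and the line-based dictionary heuristic are assumptions.

```python
import zlib
from collections import Counter
from typing import List


def build_shared_dictionary(messages: List[bytes], max_size: int = 32768) -> bytes:
    """Pass 1: collect lines that occur in more than one message."""
    counts: Counter = Counter()
    for msg in messages:
        counts.update(set(msg.splitlines(keepends=True)))
    shared = [line for line, n in counts.items() if n > 1]
    # Place the most common lines last, where zlib can reference them most cheaply.
    shared.sort(key=lambda line: (counts[line], line))
    return b"".join(shared)[-max_size:]


def compress_message(msg: bytes, zdict: bytes) -> bytes:
    """Pass 2: compress one message independently, primed with the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(msg) + comp.flush()


def decompress_message(blob: bytes, zdict: bytes) -> bytes:
    """Decompress a single message without touching any other message."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()
```

Each message remains individually decompressible because only the shared dictionary and that message's own compressed bytes are needed, which matches the property the title emphasizes.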

    CONSTRUCTING IMAGE CAPTCHAS UTILIZING PRIVATE INFORMATION OF THE IMAGES
    8.
    Patent application
    Status: Pending (published)

    Publication No.: US20100228804A1

    Publication date: 2010-09-09

    Application No.: US12397561

    Filing date: 2009-03-04

    IPC classification: H04L9/32 G06F17/30

    Abstract: An image CAPTCHA having one or more images, a challenge, and a correct answer to the challenge is constructed by selecting the one or more images from a plurality of candidate images based at least in part on each image's public information and private information. The private information of each of the images is accessible only to an entity responsible for constructing the CAPTCHA. Optionally, the one or more images are selected further based on the specific type of the CAPTCHA to be constructed.

    Mail Compression Scheme with Individual Message Decompressability
    9.
    Patent application
    Status: In force

    Publication No.: US20090037447A1

    Publication date: 2009-02-05

    Application No.: US11831828

    Filing date: 2007-07-31

    IPC classification: G06F17/30

    CPC classification: H04L51/00 Y10S707/99942

    Abstract: Embodiments of the present invention relate to a two-pass compression scheme that achieves compression performance on par with existing methods while admitting individual message decompression. These methods provide both storage savings and lower end-user latency. They preserve the advantages of standard text compression in exploiting short-range similarities in data, while introducing a second step to take advantage of long-range similarities often present in certain types of structured data, e.g., email archival files.

    Method and system for fast similarity computation in high dimensional space
    10.
    Granted patent
    Status: In force

    Publication No.: US08515964B2

    Publication date: 2013-08-20

    Application No.: US13189696

    Filing date: 2011-07-25

    IPC classification: G06F17/30

    CPC classification: G06F17/30628

    Abstract: Method, system, and programs for computing similarity. Input data is first received from one or more data sources and then analyzed to obtain an input feature vector that characterizes the input data. An index is then generated based on the input feature vector and is used to archive the input data, where the value of the index is computed based on an improved Johnson-Lindenstrauss transformation (FJLT) process. With the improved FJLT process, first, the sign of each feature in the input feature vector is randomly flipped to obtain a flipped vector. A Hadamard transformation is then applied to the flipped vector to obtain a transformed vector. An inner product between the transformed vector and a sparse vector is then computed to obtain a base vector, based on which the value of the index is determined.
