System and method for focused re-crawling of web sites
    1.
    Granted patent
    System and method for focused re-crawling of web sites (status: lapsed)

    Publication No.: US07882099B2

    Publication date: 2011-02-01

    Application No.: US12054482

    Filing date: 2008-03-25

    IPC classification: G06F17/30

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
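
    The pattern-discovery step in the abstract above can be illustrated with a minimal sketch. The abstract does not say what form the patterns take; this example assumes they are host-qualified URL directory prefixes, and the helper names (path_prefixes, discover_exclusion_patterns, allowed) and the min_support threshold are hypothetical, not taken from the patent.

```python
from urllib.parse import urlparse


def path_prefixes(url):
    """Yield host-qualified directory prefixes of a URL, shortest first."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    # The last segment is treated as the page itself, so it is not a prefix.
    for depth in range(1, len(segments)):
        yield parsed.netloc + "/" + "/".join(segments[:depth]) + "/"


def discover_exclusion_patterns(relevant, irrelevant, min_support=2):
    """Assumed heuristic: a prefix becomes an exclusion pattern if it covers
    at least min_support irrelevant pages and no relevant page."""
    relevant_prefixes = {p for url in relevant for p in path_prefixes(url)}
    counts = {}
    for url in irrelevant:
        for prefix in path_prefixes(url):
            counts[prefix] = counts.get(prefix, 0) + 1
    candidates = {
        p for p, n in counts.items()
        if n >= min_support and p not in relevant_prefixes
    }
    # Keep only the most general prefixes; drop any that extend another one.
    return sorted(
        p for p in candidates
        if not any(q != p and p.startswith(q) for q in candidates)
    )


def allowed(url, exclusions):
    """Return True if a re-crawl may fetch this URL under the exclusions."""
    parsed = urlparse(url)
    target = parsed.netloc + parsed.path
    return not any(target.startswith(p) for p in exclusions)
```

    With relevant pages under example.com/products/ and irrelevant pages under example.com/forum/, a re-crawl would skip every URL for which allowed() returns False. Inclusion patterns could be derived symmetrically from the relevant set.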

    System and a method for focused re-crawling of Web sites
    2.
    Granted patent
    System and a method for focused re-crawling of Web sites (status: in force)

    Publication No.: US07379932B2

    Publication date: 2008-05-27

    Application No.: US11314432

    Filing date: 2005-12-21

    IPC classification: G06F17/30

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

    SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES
    3.
    Patent application
    SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES (status: lapsed)

    Publication No.: US20080168041A1

    Publication date: 2008-07-10

    Application No.: US12054482

    Filing date: 2008-03-25

    IPC classification: G06F17/30

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

    Annotating token sequences within documents
    4.
    Patent application
    Annotating token sequences within documents (status: under examination, published)

    Publication No.: US20080072134A1

    Publication date: 2008-03-20

    Application No.: US11532977

    Filing date: 2006-09-19

    IPC classification: G06F17/00 G06F7/00

    CPC classification: G06F17/278 G06F16/313

    Abstract: Token sequences within a number of documents are annotated. First, a base inverse index for unique tokens within the documents is received. The base inverse index includes a set of the unique tokens within the documents and a set of location lists for each unique token. Second, indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
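
    As a rough illustration of the two steps in the abstract above, the sketch below builds a token-level index and then derives the locations of a token sequence from it alone. It assumes the "base inverse index" is a conventional inverted index mapping each token to a list of (document id, position) pairs; the function names and the whitespace tokenizer are hypothetical simplifications, not the patent's implementation.

```python
def build_base_index(docs):
    """Map each unique token to its list of (doc_id, position) locations."""
    index = {}
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index.setdefault(token, []).append((doc_id, pos))
    return index


def sequence_index(base_index, sequence):
    """Return the (doc_id, start_position) of every occurrence of a token
    sequence, using only the location lists of the base index."""
    tokens = [t.lower() for t in sequence]
    if not tokens or any(t not in base_index for t in tokens):
        return []
    # Location sets for the 2nd, 3rd, ... token allow constant-time lookups.
    later = [set(base_index[t]) for t in tokens[1:]]
    return [
        (doc_id, pos)
        for doc_id, pos in base_index[tokens[0]]
        if all((doc_id, pos + k + 1) in locs for k, locs in enumerate(later))
    ]


docs = {1: "new york is big", 2: "york new york city"}
base = build_base_index(docs)
starts = sequence_index(base, ["new", "york"])  # [(1, 0), (2, 1)]
```

    The sequence lookup never rereads the documents; it only checks that each later token occurs at the expected offset after the first token's position.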

    System and a method for focused re-crawling of Web sites
    5.
    Patent application
    System and a method for focused re-crawling of Web sites (status: in force)

    Publication No.: US20070143263A1

    Publication date: 2007-06-21

    Application No.: US11314432

    Filing date: 2005-12-21

    IPC classification: G06F17/30

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

    Electronic mail duplicate detection
    8.
    Granted patent
    Electronic mail duplicate detection (status: in force)

    Publication No.: US08788500B2

    Publication date: 2014-07-22

    Application No.: US12879478

    Filing date: 2010-09-10

    IPC classification: G06F17/30

    Abstract: Embodiments of the invention are related to a method and system for identifying linked electronic mails by receiving a query from a user, wherein the query comprises at least a segment of an electronic mail; and based on the segment received, rendering to the user at least one of related subsets or related supersets of electronic mails related to the received segment, wherein the related subsets and related supersets are threads of the segment received and arranged in a hierarchical manner.
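
    The subset/superset idea in the abstract above can be sketched as follows. The abstract does not specify how relatedness is computed; this example assumes plain containment of whitespace-normalized, lower-cased text, and the names normalize and related_emails are hypothetical.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def related_emails(segment, emails):
    """Split emails into supersets (they contain the queried segment) and
    subsets (they are contained in it), each ordered by closeness in size."""
    seg = normalize(segment)
    supersets, subsets = [], []
    for email_id, body in emails.items():
        norm = normalize(body)
        if seg in norm:
            supersets.append(email_id)   # exact matches land here too
        elif norm and norm in seg:
            subsets.append(email_id)
    # Assumed hierarchy: nearest superset first, largest subset first.
    supersets.sort(key=lambda i: len(normalize(emails[i])))
    subsets.sort(key=lambda i: -len(normalize(emails[i])))
    return {"supersets": supersets, "subsets": subsets}


emails = {
    "a": "Lunch at noon?",
    "b": "Sure. > Lunch at noon?",
    "c": "Moved to 1pm. > Sure. > Lunch at noon?",
}
result = related_emails("Sure. > Lunch at noon?", emails)
# result == {"supersets": ["b", "c"], "subsets": ["a"]}
```

    Here each reply quotes the message it answers, so containment recovers the thread: "a" is the original message, and "b" and "c" are successive replies that embed it.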

    System and method for extraction of factoids from textual repositories
    9.
    Granted patent
    System and method for extraction of factoids from textual repositories (status: lapsed)

    Publication No.: US08706730B2

    Publication date: 2014-04-22

    Application No.: US11321177

    Filing date: 2005-12-29

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30705

    Abstract: A method (400) of extracting factoids from text repositories is disclosed, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognize factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from the documents or said document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. It is the extracted snippets that are the factoids associated with the given factoid category.

    Cross-guided data clustering based on alignment between data domains
    10.
    Granted patent
    Cross-guided data clustering based on alignment between data domains (status: in force)

    Publication No.: US08589396B2

    Publication date: 2013-11-19

    Application No.: US12652987

    Filing date: 2010-01-06

    IPC classification: G06F17/30 G06F17/27

    CPC classification: G06K9/6222 G06K9/6224

    Abstract: A system and associated method for cross-guided data clustering by aligning target clusters in a target domain to source clusters in a source domain. The cross-guided clustering process takes the target domain and the source domain as inputs. A common word attribute shared by both the target domain and the source domain is a pivot vocabulary, and all other words in both domains are a non-pivot vocabulary. The non-pivot vocabulary is projected onto the pivot vocabulary to improve measurement of similarity between data items. Source centroids representing clusters in the source domain are created and projected to the pivot vocabulary. Target centroids representing clusters in the target domain are initially created by a conventional clustering method and then repetitively aligned to converge with the source centroids by use of a cross-domain similarity graph that measures a respective similarity of each target centroid to each source centroid.
