-
Publication No.: US07882099B2
Publication date: 2011-02-01
Application No.: US12054482
Filing date: 2008-03-25
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935
Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
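The abstract above does not say what form the exclusion patterns take or how they are discovered. As a rough sketch only, the Python below assumes a pattern is a host plus first path segment, and treats a prefix as an exclusion pattern when it appears at least `min_support` times among irrelevant pages and never among relevant ones. The names `url_prefix`, `discover_exclusion_patterns`, `allowed`, and `min_support` are hypothetical and are not taken from the patent.

```python
from urllib.parse import urlparse


def url_prefix(url):
    """Return host plus first path segment, e.g. 'example.com/ads' (assumed pattern form)."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    first = segments[0] if segments else ""
    return f"{parsed.netloc}/{first}"


def discover_exclusion_patterns(relevant, irrelevant, min_support=2):
    """Prefixes seen at least min_support times in irrelevant URLs and never in relevant ones."""
    relevant_prefixes = {url_prefix(u) for u in relevant}
    counts = {}
    for u in irrelevant:
        p = url_prefix(u)
        counts[p] = counts.get(p, 0) + 1
    return {
        p for p, n in counts.items()
        if n >= min_support and p not in relevant_prefixes
    }


def allowed(url, exclusion_patterns):
    """Return True if a later crawl step may fetch this URL."""
    return url_prefix(url) not in exclusion_patterns


relevant = ["http://example.com/docs/a", "http://example.com/docs/b"]
irrelevant = [
    "http://example.com/ads/1",
    "http://example.com/ads/2",
    "http://example.com/docs/old",
]
patterns = discover_exclusion_patterns(relevant, irrelevant)
```

With these sample URLs, only the `example.com/ads` prefix is excluded, so later crawling would skip further `/ads/` links while still following `/docs/` links. Inclusion patterns could be handled symmetrically under the same assumptions.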
-
Publication No.: US07379932B2
Publication date: 2008-05-27
Application No.: US11314432
Filing date: 2005-12-21
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935
Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
-
Publication No.: US20080168041A1
Publication date: 2008-07-10
Application No.: US12054482
Filing date: 2008-03-25
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935
Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
-
Publication No.: US20080072134A1
Publication date: 2008-03-20
Application No.: US11532977
Filing date: 2006-09-19
CPC classification: G06F17/278, G06F16/313
Abstract: Token sequences within a number of documents are annotated. First, a base inverse index for unique tokens within the documents is received. The base inverse index includes a set of the unique tokens within the documents and a set of location lists for each unique token. Second, indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
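One way to picture deriving an index for a token sequence from a base inverse index is to intersect the location lists of its tokens at consecutive offsets. The Python below is an illustrative sketch under that assumption, not the procedure claimed in the patent. The whitespace tokenizer, the `(doc_id, position)` location format, and the names `build_base_index` and `sequence_locations` are all hypothetical.

```python
def build_base_index(docs):
    """Map each unique token to its location list of (doc_id, position) pairs."""
    index = {}
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.split()):
            index.setdefault(tok, []).append((doc_id, pos))
    return index


def sequence_locations(base_index, sequence):
    """Start locations of a token sequence, computed only from the base index."""
    if not sequence:
        return []
    candidates = set(base_index.get(sequence[0], []))
    for offset, tok in enumerate(sequence[1:], start=1):
        locs = set(base_index.get(tok, []))
        # Keep a start only if the next token occurs exactly `offset` positions later.
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in locs}
    return sorted(candidates)


docs = {1: "new york city is in new york", 2: "york new"}
base = build_base_index(docs)
hits = sequence_locations(base, ["new", "york"])
```

Here `hits` lists the positions in document 1 where "new" is immediately followed by "york"; document 2 contains both tokens but in the wrong order, so it contributes nothing.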
-
Publication No.: US20070143263A1
Publication date: 2007-06-21
Application No.: US11314432
Filing date: 2005-12-21
IPC classification: G06F17/30
CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935
Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
-
Publication No.: US20050038785A1
Publication date: 2005-02-17
Application No.: US10629133
Filing date: 2003-07-29
CPC classification: G06F17/30911, G06F17/2211, G06F17/2247, Y10S707/99932, Y10S707/99933, Y10S707/99936, Y10S707/99942
Abstract: Documents are represented based on their structure, which arises from the relationship between various elements in the document. After representing documents based on their structure in vector form, a method of measuring similarity between vectors is used to obtain the measure of structural similarity between two given documents.
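The abstract leaves open both which structural features go into the vector and which similarity measure is used. As a minimal sketch, the Python below assumes the features are counts of (parent tag, child tag) pairs in a well-formed XML document and that the measure is cosine similarity. Both choices, and the names `structure_vector` and `cosine`, are assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def structure_vector(xml_text):
    """Count (parent tag, child tag) pairs; element text and attributes are ignored."""
    root = ET.fromstring(xml_text)
    vec = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            vec[(node.tag, child.tag)] += 1
            stack.append(child)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "<html><body><p>x</p><p>y</p></body></html>"
doc_b = "<html><body><p>other</p><p>text</p></body></html>"
doc_c = "<html><head><title>t</title></head></html>"

same_structure = cosine(structure_vector(doc_a), structure_vector(doc_b))
different_structure = cosine(structure_vector(doc_a), structure_vector(doc_c))
```

Because only tag relationships are counted, `doc_a` and `doc_b` score close to 1 despite different text, while `doc_a` and `doc_c` share no parent-child pairs and score 0.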
-
Publication No.: US07203679B2
Publication date: 2007-04-10
Application No.: US10629133
Filing date: 2003-07-29
IPC classification: G06F17/30
CPC classification: G06F17/30911, G06F17/2211, G06F17/2247, Y10S707/99932, Y10S707/99933, Y10S707/99936, Y10S707/99942
Abstract: Documents are represented based on their structure, which arises from the relationship between various elements in the document. After representing documents based on their structure in vector form, a method of measuring similarity between vectors is used to obtain the measure of structural similarity between two given documents.
-
Publication No.: US08788500B2
Publication date: 2014-07-22
Application No.: US12879478
Filing date: 2010-09-10
IPC classification: G06F17/30
CPC classification: G06F17/30156, G06F17/30657, G06F17/30684, G06F17/30722, H04L51/16
Abstract: Embodiments of the invention are related to a method and system for identifying linked electronic mails by receiving a query from a user, wherein the query comprises at least a segment of an electronic mail; and based on the segment received, rendering to the user at least one of related subsets or related supersets of electronic mails related to the received segment, wherein the related subsets and related supersets are threads of the segment received and arranged in a hierarchical manner.
-
Publication No.: US08706730B2
Publication date: 2014-04-22
Application No.: US11321177
Filing date: 2005-12-29
Applicant: Sachindra Joshi, Raghuram Krishnapuram, Nimit Kumar, Kiran Mehta, Sumit Negi, Ganesh Ramakrishnan, Scott R Holmes
Inventor: Sachindra Joshi, Raghuram Krishnapuram, Nimit Kumar, Kiran Mehta, Sumit Negi, Ganesh Ramakrishnan, Scott R Holmes
IPC classification: G06F17/30
CPC classification: G06F17/30864, G06F17/30705
Abstract: A method (400) is disclosed of extracting factoids from text repositories, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognize factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from the documents or said document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. It is the extracted snippets that are the factoids associated with the given factoid category.
-
Publication No.: US08589396B2
Publication date: 2013-11-19
Application No.: US12652987
Filing date: 2010-01-06
Applicant: Jeffrey M. Achtermann, Indrajit Bhattacharya, Kevin W. English, Jr., Shantanu R. Godbole, Sachindra Joshi, Ashwin Srinivasan, Ashish Verma
Inventor: Jeffrey M. Achtermann, Indrajit Bhattacharya, Kevin W. English, Jr., Shantanu R. Godbole, Sachindra Joshi, Ashwin Srinivasan, Ashish Verma
CPC classification: G06K9/6222, G06K9/6224
Abstract: A system and associated method for cross-guided data clustering by aligning target clusters in a target domain to source clusters in a source domain. The cross-guided clustering process takes the target domain and the source domain as inputs. A common word attribute shared by both the target domain and the source domain is a pivot vocabulary, and all other words in both domains are a non-pivot vocabulary. The non-pivot vocabulary is projected onto the pivot vocabulary to improve measurement of similarity between data items. Source centroids representing clusters in the source domain are created and projected to the pivot vocabulary. Target centroids representing clusters in the target domain are initially created by a conventional clustering method and then repetitively aligned to converge with the source centroids by use of a cross-domain similarity graph that measures a respective similarity of each target centroid to each source centroid.