Patent search: ap:("Neeraj Agrawal" OR "Sreeram Viswanath Balakrishnan" OR "Sachindra Joshi") AND inv:"Sachindra Joshi" (page 1)

1.

Granted patent
System and method for focused re-crawling of web sites (status: lapsed)

Publication No.: US07882099B2

Publication date: 2011-02-01

Application No.: US12054482

Filing date: 2008-03-25

Applicants: Neeraj Agrawal; Sreeram Viswanath Balakrishnan; Sachindra Joshi

Inventors: Neeraj Agrawal; Sreeram Viswanath Balakrishnan; Sachindra Joshi

IPC classification: G06F17/30

CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935

Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
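The pattern-discovery step in this abstract can be illustrated with a minimal Python sketch. The abstract does not say what form the patterns take, so this sketch assumes they are host-qualified URL path prefixes; the function names (`path_prefixes`, `discover_exclusion_patterns`, `allowed`) and the `min_support` threshold are hypothetical and are not taken from the patent.

```python
from collections import Counter
from urllib.parse import urlparse


def path_prefixes(url):
    """Yield host-qualified path prefixes of a URL, shortest first."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    for depth in range(1, len(segments) + 1):
        yield parts.netloc + "/" + "/".join(segments[:depth])


def discover_exclusion_patterns(relevant, irrelevant, min_support=2):
    """Assumed heuristic: a prefix becomes an exclusion pattern if it covers
    at least `min_support` irrelevant pages and no relevant page."""
    relevant_prefixes = {p for u in relevant for p in path_prefixes(u)}
    counts = Counter(p for u in irrelevant for p in path_prefixes(u))
    candidates = sorted(
        (p for p, n in counts.items()
         if n >= min_support and p not in relevant_prefixes),
        key=len,
    )
    kept = []
    for p in candidates:
        # Drop a pattern that is already covered by a shorter kept pattern.
        if not any(p == k or p.startswith(k + "/") for k in kept):
            kept.append(p)
    return kept


def allowed(url, exclusion_patterns):
    """Return True if a later crawl may fetch this URL."""
    parts = urlparse(url)
    key = parts.netloc + parts.path
    return not any(key == p or key.startswith(p + "/")
                   for p in exclusion_patterns)


relevant = ["http://example.com/products/a.html",
            "http://example.com/products/b.html"]
irrelevant = ["http://example.com/forum/t1",
              "http://example.com/forum/t2",
              "http://example.com/products/print/a"]
patterns = discover_exclusion_patterns(relevant, irrelevant)
```

In a re-crawl, the frontier would be filtered with `allowed(url, patterns)` before each fetch. Inclusion patterns could be derived symmetrically from prefixes that cover only relevant pages.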

2.

Granted patent
System and a method for focused re-crawling of Web sites (status: in force)

Publication No.: US07379932B2

Publication date: 2008-05-27

Application No.: US11314432

Filing date: 2005-12-21

Applicants: Neeraj Agrawal; Sreeram Viswanath Balakrishnan; Sachindra Joshi

Inventors: Neeraj Agrawal; Sreeram Viswanath Balakrishnan; Sachindra Joshi

IPC classification: G06F17/30

CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935

Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

3.

Patent application
SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES (status: lapsed)

Publication No.: US20080168041A1

Publication date: 2008-07-10

Application No.: US12054482

Filing date: 2008-03-25

Applicants: Sachindra Joshi; Neeraj Agrawal; Sreeram Viswanath Balakrishnan

Inventors: Sachindra Joshi; Neeraj Agrawal; Sreeram Viswanath Balakrishnan

IPC classification: G06F17/30

CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935

Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

4.

Patent application
Annotating token sequences within documents (status: under examination, published)

Publication No.: US20080072134A1

Publication date: 2008-03-20

Application No.: US11532977

Filing date: 2006-09-19

Applicants: Sreeram Viswanath Balakrishnan; Ganesh Ramakrishnan; Sachindra Joshi

Inventors: Sreeram Viswanath Balakrishnan; Ganesh Ramakrishnan; Sachindra Joshi

IPC classification: G06F17/00, G06F7/00

CPC classification: G06F17/278, G06F16/313

Abstract: Token sequences within a number of documents are annotated. First, a base inverse index for unique tokens within the documents is received. The base inverse index includes a set of the unique tokens within the documents and a set of location lists for each unique token. Second, indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
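One way to picture deriving a sequence index from a base inverse index is the Python sketch below. It assumes each location list holds `(doc_id, position)` pairs and that a sequence index is found by intersecting the location lists at increasing offsets; these choices and the function names are illustrative assumptions, not details confirmed by the abstract.

```python
def build_base_index(docs):
    """Map each unique token to its list of (doc_id, position) locations."""
    index = {}
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index.setdefault(token, []).append((doc_id, pos))
    return index


def sequence_index(base_index, sequence):
    """Return the sorted start locations of a token sequence, using only
    the location lists in the base index (the documents are not re-read)."""
    if not sequence:
        return []
    starts = set(base_index.get(sequence[0], []))
    for offset, token in enumerate(sequence[1:], start=1):
        locations = set(base_index.get(token, []))
        # Keep a start only if the next token sits `offset` positions later.
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in locations}
    return sorted(starts)


docs = {"d1": "new york is big", "d2": "york new york"}
base = build_base_index(docs)
hits = sequence_index(base, ["new", "york"])
```

Each entry in `hits` marks where the sequence begins, so an annotation for the sequence can be attached at those locations.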

5.

Patent application
System and a method for focused re-crawling of Web sites (status: in force)

Publication No.: US20070143263A1

Publication date: 2007-06-21

Application No.: US11314432

Filing date: 2005-12-21

Applicants: Neeraj Agrawal; Sreeram Balakrishnan; Sachindra Joshi

Inventors: Neeraj Agrawal; Sreeram Balakrishnan; Sachindra Joshi

IPC classification: G06F17/30

CPC classification: G06F17/30864, Y10S707/99931, Y10S707/99932, Y10S707/99933, Y10S707/99935

Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns are discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

6.

Patent application
Determining structural similarity in semi-structured documents (status: in force)

Publication No.: US20050038785A1

Publication date: 2005-02-17

Application No.: US10629133

Filing date: 2003-07-29

Applicants: Neeraj Agrawal; Sachindra Joshi; Raghuram Krishnapuram; Sumit Negi

Inventors: Neeraj Agrawal; Sachindra Joshi; Raghuram Krishnapuram; Sumit Negi

IPC classification: G06F17/22, G06F17/30

CPC classification: G06F17/30911, G06F17/2211, G06F17/2247, Y10S707/99932, Y10S707/99933, Y10S707/99936, Y10S707/99942

Abstract: Documents are represented based on their structure, which arises from the relationship between various elements in the document. After representing documents based on their structure in vector form, a method of measuring similarity between vectors is used to obtain the measure of structural similarity between two given documents.
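The idea of comparing documents by a structure vector can be sketched in Python as follows. The abstract does not specify the features or the similarity measure, so this sketch assumes root-to-element tag paths as features and cosine similarity as the measure; `structure_vector`, `cosine`, and `VOID_TAGS` are hypothetical names.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PathCollector(HTMLParser):
    """Count each root-to-element tag path seen in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def structure_vector(html):
    """Represent a document by the counts of its tag paths (text is ignored)."""
    parser = _PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


list_a = "<html><body><ul><li>x</li><li>y</li></ul></body></html>"
list_b = "<html><body><ul><li>p</li><li>q</li></ul></body></html>"
table = "<html><body><table><tr><td>z</td></tr></table></body></html>"
same = cosine(structure_vector(list_a), structure_vector(list_b))
different = cosine(structure_vector(list_a), structure_vector(table))
```

Two pages with the same markup skeleton but different text score close to 1, while the list page and the table page share only the `html` and `html/body` paths and score much lower.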

7.

Granted patent
Determining structural similarity in semi-structured documents (status: in force)

Publication No.: US07203679B2

Publication date: 2007-04-10

Application No.: US10629133

Filing date: 2003-07-29

Applicants: Neeraj Agrawal; Sachindra Joshi; Raghuram Krishnapuram; Sumit Negi

Inventors: Neeraj Agrawal; Sachindra Joshi; Raghuram Krishnapuram; Sumit Negi

IPC classification: G06F17/30

CPC classification: G06F17/30911, G06F17/2211, G06F17/2247, Y10S707/99932, Y10S707/99933, Y10S707/99936, Y10S707/99942

Abstract: Documents are represented based on their structure, which arises from the relationship between various elements in the document. After representing documents based on their structure in vector form, a method of measuring similarity between vectors is used to obtain the measure of structural similarity between two given documents.

8.

Granted patent
Electronic mail duplicate detection (status: in force)

Publication No.: US08788500B2

Publication date: 2014-07-22

Application No.: US12879478

Filing date: 2010-09-10

Applicants: Danish Contractor; Manjula Golla Hosurmath; Sachindra Joshi; Kenney Ng

Inventors: Danish Contractor; Manjula Golla Hosurmath; Sachindra Joshi; Kenney Ng

IPC classification: G06F17/30

CPC classification: G06F17/30156, G06F17/30657, G06F17/30684, G06F17/30722, H04L51/16

Abstract: Embodiments of the invention are related to a method and system for identifying linked electronic mails by receiving a query from a user, wherein the query comprises at least a segment of an electronic mail; and based on the segment received, rendering to the user at least one of related subsets or related supersets of electronic mails related to the received segment, wherein the related subsets and related supersets are threads of the segment received and arranged in a hierarchical manner.
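A rough Python sketch of the query-by-segment idea follows. It assumes an email is modeled as a set of normalized lines (with reply-quote markers stripped) and that the hierarchy is approximated by ordering matching emails from the smallest segment set to the largest; both are simplifying assumptions, and `segments` and `related` are hypothetical names rather than anything defined in the patent.

```python
def segments(body):
    """Normalize an email body into a set of non-empty lines,
    stripping reply-quote markers such as '>' and '>>'."""
    lines = (line.lstrip("> \t").strip().lower() for line in body.splitlines())
    return frozenset(line for line in lines if line)


def related(emails, query):
    """Return ids of emails containing every segment of the query, ordered
    from the smallest containing email to the largest."""
    wanted = segments(query)
    hits = {}
    for email_id, body in emails.items():
        found = segments(body)
        if wanted <= found:  # the query's segments are a subset of the email's
            hits[email_id] = found
    return sorted(hits, key=lambda email_id: (len(hits[email_id]), email_id))


emails = {
    "m1": "Lunch at noon?",
    "m2": "Sure.\n> Lunch at noon?",
    "m3": "See you there.\n> Sure.\n>> Lunch at noon?",
}
thread = related(emails, "lunch at noon?")
replies = related(emails, "Sure.")
```

Here the original message comes first and each later reply, which quotes everything before it, follows as a superset of the previous one.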

9.

Granted patent
System and method for extraction of factoids from textual repositories (status: lapsed)

Publication No.: US08706730B2

Publication date: 2014-04-22

Application No.: US11321177

Filing date: 2005-12-29

Applicants: Sachindra Joshi; Raghuram Krishnapuram; Nimit Kumar; Kiran Mehta; Sumit Negi; Ganesh Ramakrishnan; Scott R Holmes

Inventors: Sachindra Joshi; Raghuram Krishnapuram; Nimit Kumar; Kiran Mehta; Sumit Negi; Ganesh Ramakrishnan; Scott R Holmes

IPC classification: G06F17/30

CPC classification: G06F17/30864, G06F17/30705

Abstract: A method (400) is disclosed of extracting factoids from text repositories, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognize factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from the documents or said document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. It is the extracted snippets that are the factoids associated with the given factoid category.

10.

Granted patent
Cross-guided data clustering based on alignment between data domains (status: in force)

Publication No.: US08589396B2

Publication date: 2013-11-19

Application No.: US12652987

Filing date: 2010-01-06

Applicants: Jeffrey M. Achtermann; Indrajit Bhattacharya; Kevin W. English, Jr.; Shantanu R. Godbole; Sachindra Joshi; Ashwin Srinivasan; Ashish Verma

Inventors: Jeffrey M. Achtermann; Indrajit Bhattacharya; Kevin W. English, Jr.; Shantanu R. Godbole; Sachindra Joshi; Ashwin Srinivasan; Ashish Verma

IPC classification: G06F17/30, G06F17/27

CPC classification: G06K9/6222, G06K9/6224

Abstract: A system and associated method for cross-guided data clustering by aligning target clusters in a target domain to source clusters in a source domain. The cross-guided clustering process takes the target domain and the source domain as inputs. A common word attribute shared by both the target domain and the source domain is a pivot vocabulary, and all other words in both domains are a non-pivot vocabulary. The non-pivot vocabulary is projected onto the pivot vocabulary to improve measurement of similarity between data items. Source centroids representing clusters in the source domain are created and projected to the pivot vocabulary. Target centroids representing clusters in the target domain are initially created by a conventional clustering method and then repetitively aligned to converge with the source centroids by use of a cross-domain similarity graph that measures a respective similarity of each target centroid to each source centroid.
