Electronic mail duplicate detection
    1.
    Granted patent
    Electronic mail duplicate detection (Active)

    Publication number: US08788500B2

    Publication date: 2014-07-22

    Application number: US12879478

    Filing date: 2010-09-10

    Abstract: Embodiments of the invention are related to a method and system for identifying linked electronic mails by receiving a query from a user, wherein the query comprises at least a segment of an electronic mail; and, based on the segment received, rendering to the user at least one of related subsets or related supersets of electronic mails related to the received segment, wherein the related subsets and related supersets are threads of the segment received and are arranged in a hierarchical manner.
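
    The subset/superset relationship described in this abstract can be illustrated with a minimal sketch. It assumes each e-mail has already been split into segments (the abstract does not say how), and the function name `related_emails` and the depth-based ordering are hypothetical choices made for illustration, not the patented method.

```python
def related_emails(query_segment, emails):
    """Return (email_id, depth) pairs for e-mails containing the segment.

    Depth counts how many other matching e-mails are proper subsets of
    this one, so replies that quote more of the thread sort deeper.
    """
    # Assumption: emails maps an id to a list of already-extracted segments.
    hits = {eid: frozenset(segs) for eid, segs in emails.items()
            if query_segment in segs}
    result = []
    for eid, segs in hits.items():
        # "other < segs" is a proper-subset test on frozensets.
        depth = sum(1 for other in hits.values() if other < segs)
        result.append((eid, depth))
    return sorted(result, key=lambda pair: (pair[1], pair[0]))


# Hypothetical sample data.
emails = {
    "m1": ["Lunch on Friday?"],
    "m2": ["Sure, noon works.", "Lunch on Friday?"],
    "m3": ["See you then.", "Sure, noon works.", "Lunch on Friday?"],
    "m4": ["Unrelated note."],
}
hierarchy = related_emails("Lunch on Friday?", emails)
```

    Here `hierarchy` lists the shortest e-mail first and each longer reply after it, which is one simple way to present the matches hierarchically.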

    System and method for extraction of factoids from textual repositories
    2.
    Granted patent
    System and method for extraction of factoids from textual repositories (Lapsed)

    Publication number: US08706730B2

    Publication date: 2014-04-22

    Application number: US11321177

    Filing date: 2005-12-29

    CPC classification number: G06F17/30864 G06F17/30705

    Abstract: A method (400) of extracting factoids from text repositories is disclosed, with the factoids being associated with a given factoid category. The method (400) starts by training a classifier (230) to recognize factoids relevant to that given factoid category. Documents or document summaries relevant to the given factoid category are next collected (410) from the text repositories. Sentences having a predetermined association to the given factoid category are extracted (420) from the documents or document summaries. Those sentences are classified (440), in a noisy environment, using the classifier (230) to extract snippets containing phrases relevant to the given factoid category. The extracted snippets are the factoids associated with the given factoid category.

    Cross-guided data clustering based on alignment between data domains
    3.
    Granted patent
    Cross-guided data clustering based on alignment between data domains (Active)

    Publication number: US08589396B2

    Publication date: 2013-11-19

    Application number: US12652987

    Filing date: 2010-01-06

    CPC classification number: G06K9/6222 G06K9/6224

    Abstract: A system and associated method perform cross-guided data clustering by aligning target clusters in a target domain to source clusters in a source domain. The cross-guided clustering process takes the target domain and the source domain as inputs. The words shared by both the target domain and the source domain form a pivot vocabulary, and all other words in both domains form a non-pivot vocabulary. The non-pivot vocabulary is projected onto the pivot vocabulary to improve measurement of similarity between data items. Source centroids representing clusters in the source domain are created and projected to the pivot vocabulary. Target centroids representing clusters in the target domain are initially created by a conventional clustering method and then repeatedly aligned to converge with the source centroids by use of a cross-domain similarity graph that measures a respective similarity of each target centroid to each source centroid.
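
    The cross-domain similarity graph can be sketched in simplified form. The example below is an assumption-laden illustration: it restricts centroids to the pivot vocabulary instead of projecting non-pivot words onto it, it uses cosine similarity (the abstract names no measure), and it omits the iterative alignment step. All function names and sample data are hypothetical.

```python
import math


def pivot_vocabulary(source_docs, target_docs):
    """Words that occur in both domains."""
    src = {w for d in source_docs for w in d.split()}
    tgt = {w for d in target_docs for w in d.split()}
    return src & tgt


def centroid(docs, pivot):
    """Average term-count vector of a cluster, restricted to pivot words."""
    total = {}
    for d in docs:
        for w in d.split():
            if w in pivot:
                total[w] = total.get(w, 0.0) + 1.0
    return {w: c / len(docs) for w, c in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_graph(source_clusters, target_clusters, pivot):
    """Similarity of every target centroid to every source centroid."""
    return {(t, s): cosine(centroid(tdocs, pivot), centroid(sdocs, pivot))
            for t, tdocs in target_clusters.items()
            for s, sdocs in source_clusters.items()}


# Hypothetical clusters; in the abstract the target clusters would come
# from a conventional clustering method.
source_clusters = {"s_pay": ["card payment failed", "payment refund"],
                   "s_ship": ["late delivery", "delivery tracking"]}
target_clusters = {"t1": ["payment declined card"], "t2": ["delivery delayed"]}

all_source = [d for docs in source_clusters.values() for d in docs]
all_target = [d for docs in target_clusters.values() for d in docs]
pivot = pivot_vocabulary(all_source, all_target)
graph = similarity_graph(source_clusters, target_clusters, pivot)
```

    In this toy data, `t1` scores highest against `s_pay` and `t2` against `s_ship`, which is the kind of signal an alignment step could use to pull target centroids toward their matching source centroids.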

    ELECTRONIC MAIL DUPLICATE DETECTION
    4.
    Patent application
    ELECTRONIC MAIL DUPLICATE DETECTION (Active)

    Publication number: US20120066209A1

    Publication date: 2012-03-15

    Application number: US12879478

    Filing date: 2010-09-10

    Abstract: Embodiments of the invention are related to a method and system for identifying linked electronic mails by receiving a query from a user, wherein the query comprises at least a segment of an electronic mail; and, based on the segment received, rendering to the user at least one of related subsets or related supersets of electronic mails related to the received segment, wherein the related subsets and related supersets are threads of the segment received and are arranged in a hierarchical manner.

    SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES
    5.
    Patent application
    SYSTEM AND METHOD FOR FOCUSED RE-CRAWLING OF WEB SITES (Lapsed)

    Publication number: US20080168041A1

    Publication date: 2008-07-10

    Application number: US12054482

    Filing date: 2008-03-25

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.
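
    A minimal sketch of the pattern-discovery step follows. It assumes the relevant/irrelevant partition is already given, and it uses "host plus first path component" as the pattern shape; both the pattern shape and the function names are hypothetical simplifications, since the abstract does not specify how patterns are represented.

```python
from urllib.parse import urlparse


def first_path_segment(url):
    """Return host plus first path component, e.g. 'example.com/news'."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    return parsed.netloc + "/" + (parts[0] if parts else "")


def discover_exclusion_patterns(relevant, irrelevant):
    """Prefixes seen only among irrelevant pages become exclusion patterns."""
    good = {first_path_segment(u) for u in relevant}
    bad = {first_path_segment(u) for u in irrelevant}
    return bad - good


def allowed(url, exclusions):
    """Decide whether a later crawl should fetch this URL."""
    return first_path_segment(url) not in exclusions


# Hypothetical crawl results, already partitioned.
relevant = ["http://example.com/products/1", "http://example.com/products/2"]
irrelevant = ["http://example.com/forum/a", "http://example.com/forum/b",
              "http://example.com/products/old"]
exclusions = discover_exclusion_patterns(relevant, irrelevant)
```

    With this data only the `forum` prefix is excluded, because the `products` prefix also appears among relevant pages; a re-crawl would then skip forum URLs while still visiting product pages.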

    Annotating token sequences within documents
    6.
    Patent application
    Annotating token sequences within documents (Under examination, published)

    Publication number: US20080072134A1

    Publication date: 2008-03-20

    Application number: US11532977

    Filing date: 2006-09-19

    CPC classification number: G06F17/278 G06F16/313

    Abstract: Token sequences within a number of documents are annotated. First, a base inverse index for unique tokens within the documents is received. The base inverse index includes a set of the unique tokens within the documents and a set of location lists for each unique token. Second, indices are created for a set of the token sequences within the documents from the base inverse index, to annotate the token sequences.
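
    The idea of deriving token-sequence locations from a base index of single tokens can be sketched as below. This is an illustrative reading of the abstract, assuming whitespace tokenization and (document, position) pairs as the location lists; the function names are hypothetical.

```python
from collections import defaultdict


def build_base_index(docs):
    """Map each unique token to a list of (doc_id, position) locations."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return index


def sequence_locations(base_index, sequence):
    """Find where a token sequence starts, using only the base index."""
    if not sequence:
        return []
    # Candidate starts are the locations of the first token.
    starts = set(base_index.get(sequence[0], []))
    for offset, token in enumerate(sequence[1:], start=1):
        token_locs = set(base_index.get(token, []))
        # Keep a start only if this token appears at the expected offset.
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in token_locs}
    return sorted(starts)


# Hypothetical documents.
docs = ["new york is big", "i love new york", "york is old"]
base = build_base_index(docs)
annotations = {("new", "york"): sequence_locations(base, ["new", "york"])}
```

    The sequence index is built without rescanning the documents: each sequence entry is computed purely by intersecting the location lists of its tokens at the right offsets.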

    System and a method for focused re-crawling of Web sites
    7.
    Patent application
    System and a method for focused re-crawling of Web sites (Active)

    Publication number: US20070143263A1

    Publication date: 2007-06-21

    Application number: US11314432

    Filing date: 2005-12-21

    Abstract: A method (100) of crawling the Web (620) is disclosed. The method (100) crawls (120) Web pages on the Web starting from a given (110) set of seed Universal Resource Locators (URLs). Crawled Web pages are partitioned (140) into sets of relevant and irrelevant pages. A set of exclusion and/or inclusion patterns is discovered (150) from the sets of relevant and irrelevant pages, and subsequent crawling of the Web is restricted through the set of exclusion and/or inclusion patterns.

    E-mail thread hierarchy detection
    9.
    Granted patent
    E-mail thread hierarchy detection (Active)

    Publication number: US08898177B2

    Publication date: 2014-11-25

    Application number: US12879454

    Filing date: 2010-09-10

    CPC classification number: G06Q10/107 G06F17/30946

    Abstract: A plurality of segments in an e-mail collection is generated by parsing content of the e-mails. A corresponding segment signature is created for each segment, and a signature index is populated using the generated segment signatures. After receiving a query e-mail, a plurality of query segments is generated from content of the query e-mail, and a corresponding query segment signature is generated for each query segment. A query root segment is identified and a corresponding query root segment signature is generated. A set of root segment signatures of the signature index is identified, and the query root segment signature is compared with each root segment signature from the signature index. A subset of the signature index is identified using a match between the root segment signature and the query root segment signature. An e-mail thread hierarchy is built using the identified subset of the signature index.
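
    The pipeline in this abstract (segment, sign, index by root, match, order) can be sketched as follows. Several details are assumptions made for illustration only: segments are split on a fixed "-----Original Message-----" delimiter, signatures are SHA-1 hashes of whitespace-normalized lowercase text, the root segment is taken to be the oldest (last) segment, and the hierarchy is approximated by ordering matches by segment count. None of these choices is stated in the abstract.

```python
import hashlib

SEPARATOR = "-----Original Message-----"  # assumed reply delimiter


def segment_signatures(body):
    """Split an e-mail body into segments (newest first) and hash each one."""
    segments = [s.strip() for s in body.split(SEPARATOR) if s.strip()]
    return [hashlib.sha1(" ".join(s.lower().split()).encode("utf-8")).hexdigest()
            for s in segments]


def build_signature_index(emails):
    """Map root-segment signature -> list of (email_id, signatures)."""
    index = {}
    for email_id, body in emails.items():
        sigs = segment_signatures(body)
        if sigs:
            # Assumption: the last segment is the root of the thread.
            index.setdefault(sigs[-1], []).append((email_id, sigs))
    return index


def thread_for(query_body, index):
    """Return e-mails sharing the query's root segment, ordered by depth."""
    sigs = segment_signatures(query_body)
    if not sigs:
        return []
    matches = index.get(sigs[-1], [])
    ordered = sorted(matches, key=lambda m: (len(m[1]), m[0]))
    return [email_id for email_id, _ in ordered]


# Hypothetical e-mail collection.
emails = {
    "a": "Lunch on Friday?",
    "b": "Sure, noon works.\n-----Original Message-----\nLunch on Friday?",
    "c": "Budget report attached.",
}
index = build_signature_index(emails)
thread = thread_for(emails["b"], index)
```

    Indexing by root signature means a query only has to be compared against e-mails that start from the same original message, which is the narrowing step the abstract describes before the hierarchy is built.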

    Intent mining via analysis of utterances
    10.
    Granted patent
    Intent mining via analysis of utterances (Active)

    Publication number: US08688453B1

    Publication date: 2014-04-01

    Application number: US13037114

    Filing date: 2011-02-28

    Abstract: According to example configurations, a speech processing system can include a syntactic parser, a word extractor, word extraction rules, and an analyzer. The syntactic parser of the speech processing system parses an utterance to identify syntactic relationships amongst words in the utterance. The word extractor utilizes the word extraction rules to identify groupings of related words in the utterance that most likely represent an intended meaning of the utterance. The analyzer in the speech processing system maps each of the sets of words produced by the word extractor to a respective candidate intent value to produce a list of candidate intent values for the utterance. The analyzer is configured to select, from the list of candidate intent values (i.e., possible intended meanings) of the utterance, a particular candidate intent value as being representative of the intent (i.e., intended meaning) of the utterance.
