Filtering invalid tokens from a document using high IDF token filtering
    21.
    Type: granted invention patent
    Status: in force

    Publication number: US07908279B1

    Publication date: 2011-03-15

    Application number: US11856581

    Filing date: 2007-09-17

    CPC classification: G06F17/2211 Y10S707/917

    Abstract: Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
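The filtering step in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the choice of `title` as the selected field, the `log(N / df)` form of IDF, and the token-overlap test used for the final comparison are all hypothetical.

```python
import math

def idf_table(corpus):
    """Map each token to log(N / df), where df counts the documents containing it."""
    n = len(corpus)
    df = {}
    for tokens in corpus:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / count) for tok, count in df.items()}

def filter_high_idf(doc, idf, field="title"):
    """Drop every token whose IDF exceeds the highest IDF found in `field`.

    `doc` maps field names to token lists. Tokens missing from the IDF table
    are treated as infinitely rare (an assumption of this sketch).
    """
    anchor = doc.get(field, [])
    if not anchor:
        return dict(doc)  # nothing to compare against, so leave the document alone
    ceiling = max(idf.get(tok, math.inf) for tok in anchor)
    return {
        name: [tok for tok in tokens if idf.get(tok, math.inf) <= ceiling]
        for name, tokens in doc.items()
    }

def similar(doc_a, doc_b, min_overlap=0.5):
    """Hypothetical comparison: token-set overlap (Jaccard) across all fields."""
    a = {tok for tokens in doc_a.values() for tok in tokens}
    b = {tok for tokens in doc_b.values() for tok in tokens}
    union = a | b
    return bool(union) and len(a & b) / len(union) >= min_overlap
```

The intuition is that a token rarer than anything in a trusted field, such as a typo or a stray product code in a free-text description, is more likely noise than subject matter, so it is removed before the two documents are compared.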

    Comparison engine for identifying documents describing similar subject matter
    22.
    Type: granted invention patent
    Status: in force

    Publication number: US07904462B1

    Publication date: 2011-03-08

    Application number: US11953726

    Filing date: 2007-12-10

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06Q30/06

    Abstract: Systems and methods are described for determining whether a first document is a potential duplicate of a second document such that the two documents describe the same or substantially the same subject matter, where the first and second documents include attribute data in attribute fields. A set of rules is obtained for determining whether the first document is a potential duplicate of the second document. For each rule in the set of rules, a determination is made as to whether data in a first set of attributes of the first document is contained in a second set of attributes of the second document. According to the results of the evaluated rules in the rule set, a determination is made as to whether the first document is a potential duplicate of the second document. If so, a reference to the first document is stored in a set of potential duplicates of the second document.
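A rough sketch of the rule evaluation described in this abstract is given below. The abstract does not say how individual rule results are combined or how "contained in" is tested, so this example assumes that every rule must hold and that containment means word-level inclusion; `Rule`, `rule_holds`, and `record_if_duplicate` are illustrative names only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    first_attrs: tuple   # attribute fields read from the first document
    second_attrs: tuple  # attribute fields searched in the second document

def words_in(doc, attrs):
    """Collect the lower-cased words found in the named attribute fields."""
    words = set()
    for attr in attrs:
        words.update(str(doc.get(attr, "")).lower().split())
    return words

def rule_holds(rule, first, second):
    """True when all data in the first attribute set appears in the second."""
    needed = words_in(first, rule.first_attrs)
    return bool(needed) and needed <= words_in(second, rule.second_attrs)

def record_if_duplicate(rules, first, second, potential_duplicates):
    """Store a reference to `first` under `second` when every rule holds."""
    if rules and all(rule_holds(rule, first, second) for rule in rules):
        potential_duplicates.setdefault(second["id"], set()).add(first["id"])
        return True
    return False
```

A rule with an empty first attribute set is treated as failing, so a document with missing data is never flagged as a duplicate by default.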

    Providing artifact and configuration cohesion across disparate portal application models
    23.
    Type: granted invention patent
    Status: lapsed

    Publication number: US07877465B2

    Publication date: 2011-01-25

    Application number: US10891287

    Filing date: 2004-07-14

    IPC classification: G06F15/177

    CPC classification: G06F17/24

    Abstract: Under the present invention, a client-based editor is launched (e.g., from a web server or the like) within a client interface such as a browser. Upon being launched, initial configuration parameters are passed from a portal server to the editor. The present invention also provides a “communications tunnel” between the editor and the portal server in the form of a portlet interface on the web server. This is so that any characteristics expressed by the portal server (e.g., changes to the initial configuration parameters) can be pushed to the editor. Moreover, the portlet interface allows the editor to query the portal server to obtain any needed services (e.g., a spreadsheet computation).

    Managing web tier session state objects in a content delivery network (CDN)
    24.
    Type: invention patent application
    Status: in force

    Publication number: US20100293281A1

    Publication date: 2010-11-18

    Application number: US12843278

    Filing date: 2010-07-26

    IPC classification: G06F15/16

    Abstract: Business applications running on a content delivery network (CDN) having a distributed application framework can create, access and modify state for each client. Over time, a single client may desire to access a given application on different CDN edge servers within the same region and even across different regions. Each time, the application may need to access the latest “state” of the client even if the state was last modified by an application on a different server. A difficulty arises when a process or a machine that last modified the state dies or is temporarily or permanently unavailable. The present invention provides techniques for migrating session state data across CDN servers in a manner transparent to the user. A distributed application thus can access a latest “state” of a client even if the state was last modified by an application instance executing on a different CDN server, including a nearby (in-region) or a remote (out-of-region) server.

    Reverse associate website discovery

    Publication number: US10013699B1

    Publication date: 2018-07-03

    Application number: US13170043

    Filing date: 2011-06-27

    IPC classification: G06Q30/00 G06Q30/02

    CPC classification: G06Q30/0214 G06Q30/0211

    Abstract: Extracting content from an associate website may enable a host website to gain insight into web content that is effective at driving consumers to the host website. The content extraction may involve selecting an associate website from multiple associate websites for content extraction, with the associate website including a referral link to an item for sale on the host merchant website. Content may be obtained from one or more web pages of the associate website, and at least a part of the content may be associated with the item that is listed for sale on the host website.

    Duplicate entry detection system and method
    27.
    Type: granted invention patent
    Status: in force

    Publication number: US08046372B1

    Publication date: 2011-10-25

    Application number: US11754237

    Filing date: 2007-05-25

    IPC classification: G06F7/00 G06F17/30

    CPC classification: G06F17/30616

    Abstract: A computer system and method are described for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After receiving a first document, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents with scores corresponding to each candidate document. Each candidate document with a score above a threshold is filtered to determine whether it is a true duplicate of the first document. A set of candidate documents with a score above the threshold that were not disqualified as candidate documents is then provided.
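The pipeline in this abstract (tokenize, search a token index, threshold the scores, then filter the survivors) might look like the sketch below. The whitespace tokenizer, the shared-token scoring, and the caller-supplied `is_true_duplicate` predicate are assumptions made for illustration; the abstract does not specify the relevance function or the filters.

```python
from collections import defaultdict

def tokenize(text):
    """Hypothetical tokenizer: lower-cased whitespace-separated words."""
    return text.lower().split()

def build_index(corpus):
    """Non-fielded token index: token -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def relevance_search(tokens, index):
    """Score each candidate by the fraction of query tokens it shares."""
    query = set(tokens)
    hits = defaultdict(int)
    for tok in query:
        for doc_id in index.get(tok, ()):
            hits[doc_id] += 1
    return {doc_id: count / len(query) for doc_id, count in hits.items()}

def find_duplicates(text, corpus, index, threshold, is_true_duplicate):
    """Keep candidates that score above the threshold and pass the filter."""
    scores = relevance_search(tokenize(text), index)
    return {
        doc_id: score
        for doc_id, score in scores.items()
        if score > threshold and is_true_duplicate(text, corpus[doc_id])
    }
```

Splitting the work this way keeps the cheap index lookup broad and leaves the stricter, more expensive checks for the few candidates that clear the threshold.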

    Identifying potential duplicates of a document in a document corpus
    28.
    Type: granted invention patent
    Status: in force

    Publication number: US07895225B1

    Publication date: 2011-02-22

    Application number: US11952020

    Filing date: 2007-12-06

    IPC classification: G06F7/00 G06F17/00

    Abstract: According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to the source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents is stored or displayed.
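The position-based scoring in this abstract can be illustrated with a short sketch. The abstract only says that a score is derived from a document's ordinal position in each results set; summing reciprocal ranks and selecting by a minimum score are assumptions chosen here for simplicity, and the query execution itself is represented by already-ordered lists of document ids.

```python
from collections import defaultdict

def score_by_rank(result_sets):
    """Add 1/position for every appearance of a document in a results set."""
    scores = defaultdict(float)
    for results in result_sets:
        for position, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / position
    return dict(scores)

def select_candidates(result_sets, min_score):
    """Return ids scoring at least `min_score`, best first, ties broken by id."""
    scores = score_by_rank(result_sets)
    chosen = [doc_id for doc_id, score in scores.items() if score >= min_score]
    return sorted(chosen, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With this scheme a document that ranks near the top for several different queries outscores one that appears only once, which matches the idea of pooling evidence from multiple queries about the same source document.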

    Generating similarity scores for matching non-identical data strings
    29.
    Type: granted invention patent
    Status: in force

    Publication number: US07814107B1

    Publication date: 2010-10-12

    Application number: US11754241

    Filing date: 2007-05-25

    IPC classification: G06F7/00 G06F17/00

    CPC classification: G06F17/30011

    Abstract: A system and method for determining the likelihood of two documents describing substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describe substantially similar subject matter.
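The token-pair matrix described in this abstract can be sketched as below. Several choices here are assumptions rather than anything the abstract states: `difflib.SequenceMatcher` stands in for the per-pair similarity measure, each token is allowed into at most one matched pair, and the document score is the sum of matched pair scores divided by the size of the larger token set.

```python
from difflib import SequenceMatcher

def pair_score(a, b):
    """String similarity in [0, 1] for one token pair."""
    return SequenceMatcher(None, a, b).ratio()

def document_similarity(tokens_a, tokens_b, threshold=0.8):
    """Return (document score, matched pairs) for two token lists."""
    if not tokens_a or not tokens_b:
        return 0.0, []
    # Matrix of scores: one row per token of A, one column per token of B.
    matrix = [[pair_score(a, b) for b in tokens_b] for a in tokens_a]
    # Keep only pairs above the threshold, best scores first.
    ranked = sorted(
        (
            (matrix[i][j], i, j)
            for i in range(len(tokens_a))
            for j in range(len(tokens_b))
            if matrix[i][j] > threshold
        ),
        reverse=True,
    )
    used_a, used_b, matched = set(), set(), []
    for score, i, j in ranked:
        # Greedy one-to-one matching: skip pairs that reuse a token.
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched.append((tokens_a[i], tokens_b[j], score))
    total = sum(score for _, _, score in matched)
    return total / max(len(tokens_a), len(tokens_b)), matched
```

Normalizing by the larger token set keeps the result between 0 and 1 and penalizes a document that matches only a small part of a much longer one.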
