摘要:
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
摘要:
Systems and methods for determining whether a first document is a potential duplicate of a second document such that the two documents describe the same or substantially the same subject matter, wherein the first and second documents include attribute data in attribute fields. A set of rules is obtained for determining whether the first document is a potential duplicate of the second document. Moreover, for each rule in the set of rules, a determination is made as to whether data in a first set of attributes of the first document is contained in a second set of attributes of the second document. According to the results of the evaluated rules in the rules set, determining whether the first document is a potential duplicate of the second document. If, according to the evaluated rules in the rules set, the first document is determined to be a potential duplicate of the second document, storing a reference to the first document in a set of potential duplicates of the second document.
摘要:
Under the present invention, a client-based editor is launched (e.g., from a web server or the like) within a client interface such as a browser. Upon being launched, initial configuration parameters are passed from a portal server to the editor. The present invention also provides a “communications tunnel” between the editor and the portal server in the form of a portlet interface on the web server. This is so that any characteristics expressed by the portal server (e.g., changes to the initial configuration parameters) can be pushed to the editor. Moreover, the portlet interface allows the editor to query the portal server to obtain any needed services (e.g. a spreadsheet computation).
摘要:
Business applications running on a content delivery network (CDN) having a distributed application framework can create, access and modify state for each client. Over time, a single client may desire to access a given application on different CDN edge servers within the same region and even across different regions. Each time, the application may need to access the latest “state” of the client even if the state was last modified by an application on a different server. A difficulty arises when a process or a machine that last modified the state dies or is temporarily or permanently unavailable. The present invention provides techniques for migrating session state data across CDN servers in a manner transparent to the user. A distributed application thus can access a latest “state” of a client even if the state was last modified by an application instance executing on a different CDN server, including a nearby (in-region) or a remote (out-of-region) server.
摘要:
Extracting content from an associate website may enable a host website to gain insight into web content that are effective at driving consumers to the host website. The content extraction may involve selecting an associate website from multiple associate websites for content extraction, with the associate website including a referral link to an item for sale on the host merchant website. Content may be obtained from one or more web pages of the associate website, and at least a part of the content may be associated with the item that is listed for sale on the host website.
摘要:
Enabling network-accessible applications to be integrated into content aggregation frameworks (such as portals) and to become dynamically interactive through proxying components (such as proxying portlets), thereby providing run-time cooperation and data sharing.
摘要:
A computer system and method for determining whether the subject matter described in a received document is substantially similar to the subject matter of other documents in a document corpus, such that the received document can be considered a duplicate document. After receiving a first document, a set of tokens for the first document is generated. A non-fielded relevance search on a token index is executed. The relevance search returns a set of candidate duplicate documents with scores corresponding to each candidate document. For each candidate document with a score above a threshold, filtering is performed on each candidate document to determine whether each candidate document is a true duplicate of the first document. A set of candidate documents with a score above the threshold that were not disqualified as candidate documents is then provided.
摘要:
According to aspects of the disclosed subject matter, a method for identifying a set of documents from a document corpus that are potential duplicates of a source document is provided. A source document is obtained. A list of queries corresponding to a source document is identified. Each query in the identified list of queries is executed on the document corpus, wherein the execution of each query yields a corresponding results set identifying an ordered set of documents in the document corpus. For each document identified in each results set, a document score is generated for the identified document based on the identified document's ordinal position in its results set. A subset of the identified documents of the results set is selected according to the generated document scores that satisfy predetermined selection criteria. The selected subset of identified documents are stored or displayed.
摘要:
A system and method for determining the likelihood of two documents describing substantially similar subject matter is presented. A set of tokens for each of two documents is obtained, each set representing strings of characters found in the corresponding document. A matrix of token pairs is determined, each token pair comprising a token from each set of tokens. For each token pair in the matrix, a similarity score is determined. Those token pairs in the matrix with a similarity score above a threshold score are selected and added to a set of matched tokens. A similarity score for the two documents is determined according to the scores of the token pairs added to the set of matched tokens. The determined similarity score is provided as the likelihood that the first and second documents describing substantially similar subject matter.
摘要:
The invention is an intelligent traffic redirection system that does global load balancing. It can be used in any situation where an end-user requires access to a replicated resource. The method directs end-users to the appropriate replica so that the route to the replica is good from a network standpoint and the replica is not overloaded. The technique preferably uses a Domain Name Service (DNS) to provide IP addresses for the appropriate replica. The most common use is to direct traffic to a mirrored web site.