摘要:
A distributed collection of web-crawlers to gather information over a large portion of the cyberspace. These crawlers share the overall crawling through a cyberspace partition scheme. They also collaborate with each other through load balancing to maximally utilize the computing resources of each of the crawlers. The invention takes advantage of the hierarchical nature of the cyberspace namespace and uses the syntactic components of the URL structure as the main vehicle for dividing and assigning crawling workload to individual crawler. The partition scheme is completely distributed in which each crawler makes the partitioning decision based on its own crawling status and a globally replicated partition tree data structure.
摘要:
A system for automatically generating user interest profiles and delivering information to users learns a user's interests by monitoring the user's outbound communication streams, i.e., the information that the user produces either by typing (e.g., while a user is composing an e-mail message or editing a word processor document) or by speaking (e.g., while a user is engaged in a phone conversation or listening to a lecture). The system uses the monitored text to build (and possibly update) a user interest profile. The profile is constructed from current text generated by the user, so that the retrieved information reflects present user interests. In addition, the profile may also retain past user interests, so that the profile reflects a combination of past and present user interests. The system then automatically queries diverse databases for information relevant to the interest profile. The databases may include internet web pages, files stored on the user's local network, and other local or remote data repositories. The queries may use a combination of internet search engines, the specific selection of which may depend upon the nature and/or content of the queries. The information retrieved in response to the queries is then presented to the user. The retrieved information may contain, for example, answers to questions that the user might ask and/or data related to the user's current and continuing interests. Because a user's current speech or typed text is highly correlated with the user's current interests, the retrieved information will be relevant to the user's actual interests. The communication stream monitoring, interest profile building, data base querying, and presentation of retrieved information are all performed automatically, in real time, and in the background of current user activities.
摘要:
A system generates user interest profiles by monitoring and analyzing a user's access to a variety of hierarchical levels within a set of structured documents, e.g., documents available at a web site. Each information document has parts associated with it and the documents are classified into categories using a known taxonomy. The user interest profiles are automatically generated based on the type of content viewed by the user. The type of content is determined by the text within the parts of the documents viewed and the classifications of the documents viewed. In addition, the profiles also are generated based on other factors including the frequency and currency of visits to documents having a given classification, and/or the hierarchical depth of the levels or parts of the documents viewed. User profiles include an interest category code and an interest score to indicate a level of interest in a particular category. The profiles are updated automatically to accurately reflect the current interests of an individual, as well as past interests. A time-dependent decay factor is applied to the past interests. The system presents to the user documents or references to documents that match the current profile.
摘要:
A method and apparatus for efficiently matching a large collection of user profiles against a large volume of data in a webcasting system. The invention generally includes in one embodiment four steps to parallelize the profiles. First, an initial profile set is partitioned into several subsets also referred to as sub-partitions using various heuristic methods. Second, each sub-partition is mapped onto one or more independent processing units. Each processing unit is not required to have equal processing performance. However, for best performance results, subset data should be mapped in one embodiment where the subset with a highest cost is mapped to a fastest processor, and the next highest cost subset mapped to the next fastest processor. Where appropriate, the invention evaluates the relative subset processing speed of each processor and adjusts future subset mapping based upon these evaluations. For each information item I that needs to be matched with a profile predicate, a third and a fourth step are executed. The third step broadcasts I to all processing units, and a fourth step performs a sequential profile match on I.
摘要:
An efficient method and apparatus for regulating access to information objects stored in a database in which there are a large number of users and access groups. The invention uses a representation of a hierarchical access group structure in terms of intervals over a set of integers and a decomposition scheme that reduces any group structure to ones that have interval representation. This representation allows the problem for checking access rights to be reduced to an interval containment problem. An interval tree, a popular data structure in computational geometry, may be implemented to efficiently execute the access-right checking method.
摘要:
A According to the invention, a music search system includes a music player, music analyzer, a search engine and a sophisticated user interface that enables users to visually build complex query profiles from the structural information of one or more musical pieces. The complex query profiles are useful for performing searches for musical pieces matching the structural information in the query profile. The system allows the user to supply an existing piece of music, or some components thereof, as query arguments, and lets the music search engine find music that is similar to the given sample by certain similarity measurement.
摘要:
The claimed subject matter relates to an architecture that can identify, store, and/or output local contributions to a rank of a vertex in a directed graph. The architecture can receive a directed graph and a parameter, and examine a local subset of vertices (e.g., local to a given vertex) in order to determine a local supporting set. The local supporting set can include a local set of vertices that each contributes a minimum fraction of the parameter to a rank of the vertex. The local supporting set can be the basis for an estimate of the supporting set and/or rank of the vertex for the entire graph and can be employed as a means for detecting link or web spam as well as other influence-based social network applications.
摘要:
The claimed subject matter provides a system and/or a method that facilitates reducing spam in search results. An interface can obtain web graph information that represents a web of pages. A spam detection component can determines one or more features based at least in part on the web graph information. The one or more features can provide indications that a particular page of the web graph is spam. In addition, a robust rank component is provided that limits amount of contribution a single page can provide to the target page.
摘要:
A method in a computer network for identifying users with similar interests is described. The method includes accepting a first query statement from a first user and storing a first item of information related to the first query. The method further includes accepting a second query from a second user and storing a second item of information related to the second query. The method further includes computing a measure of similarity of the first query and the second query by using the first item of information and the second item of information. A system and computer readable medium for carrying out the above method is also described.
摘要:
An optimizing compilation process generates executable code which defines the computation and communication actions that are to be taken by each individual processor of a computer having a distributed memory, parallel processor architecture to run a program written in a data-parallel language. To this end, local memory layouts of the one-dimensional and multidimensional arrays that are used in the program are derived from one-level and two-level data mappings consisting of alignment and distribution, so that array elements are laid out in canonical order and local memory space is conserved. Executable code then is generated to produce at program run time, a set of tables for each individual processor for each computation requiring access to a regular section of an array, so that the entries of these tables specify the spacing between successive elements of said regular section resident in the local memory of said processor, and so that all the elements of said regular section can be located in a single pass through local memory using said tables. Further executable code is generated to produce at program run time, another set of tables for each individual processor for each communication action requiring a given processor to transfer array data to another processor, so that the entries of these tables specify the identity of a destination processor to which the array data must be transferred and the location in said destination processor's local memory at which the array data must be stored, and so that all of said array data can be located in a single pass through local memory using these communication tables. And, executable node code is generated for each individual processor that uses the foregoing tables at program run time to perform the necessary computation and communication actions on each individual processor of the parallel computer.