摘要:
The subject disclosure is directed towards simulating query execution to provide incremental visualization for a global data set. A data store may be configured for searching at least a portion of a global data set being stored at an enterprise-level data store. In response to a user-issued query, partial query results are provided to a front-end interface for display to the user. The front-end interface also provides statistical information corresponding to the partial query results in relation to the global data set, which may be used to determine when a current set of query results becomes acceptable as a true/accurate estimate.
摘要:
The subject disclosure is directed towards simulating query execution to provide incremental visualization for a global data set. A data store may be configured for searching at least a portion of a global data set being stored at an enterprise-level data store. In response to a user-issued query, partial query results are provided to a front-end interface for display to the user. The front-end interface also provides statistical information corresponding to the partial query results in relation to the global data set, which may be used to determine when a current set of query results becomes acceptable as a true/accurate estimate.
摘要:
Incremental query results and confidence interval values associated with respective incremental query results may be obtained. Visualization shape objects indicating uncertainty values may be determined, based on mapping values of respective incremental query results and confidence interval values to points in the associated visualization shape objects, the uncertainty values visualized based on proportional shapes of the visualization shape objects. At least one visualization comparison object representing a comparison of a plurality of distributions associated with the obtained incremental query results and confidence interval values may be determined. Display of the plurality of visualization shape objects and the at least one visualization comparison object may be initiated.
摘要:
A unique multi-stage classification system and method that facilitates reducing human resources or costs associated with text classification while still obtaining a desired level of accuracy is provided. The multi-stage classification system and method involve a pattern-based classifier and a machine learning classifier. The pattern-based classifier is trained on discriminative patterns as identified by humans rather than machines which allow a smaller training set to be employed. Given humans' superior abilities to reason over text, discriminative patterns can be more accurately and more readily identified by them. Unlabeled items can be initially processed by the pattern-based classifier and if no pattern match exists, then the unlabeled data can be processed by the machine learning classifier. By employing the classifiers in this manner, less human involvement is required in the classification process. Even more, classification accuracy is maintained and/or improved.
摘要:
Similarity is determined between documents based on a method for identifying documents that are likely to be based on another document. The method can include the determination of a containment coefficient, which can indicate when a template document is a subset or substantially a subset of another document. Based on this determination, an appropriate document management action can be taken, such as implementing a security policy or modifying the display of messages from a user interface.
摘要:
Each of a plurality of documents is divided into samples. Small bit-strings are generated for selected samples from each of the documents and used to create a sketch for each document. Because the bit-strings are small (e.g., only one, two, or three bits in length), the generated sketches are smaller than the sketches generated using previous methods for generating sketches, and therefore use less storage space. The generated sketches are compared to determine documents that are near-duplicates of one another.
摘要:
A system that facilitates selecting advertisements that match a search query is described herein. The system includes a search query receiver component that receives a search query including keywords. The system also includes a match component that uses an associative data structure to identify in the associative data structure one or more data nodes that are associated in the associative data structure with respective unique keys corresponding to respective one or more hashes of combinations of the keywords in the search query. For each identified data node, the match component selects advertisements associated with bid phrases stored in the identified data node that respectively only include keywords included in the search query.
摘要:
Hash values corresponding to a file are processed in windows to determine a minimum hash value for each window. Each window may begin at a minimum hash value determined for a previous window and end after a fixed number of hash values. If a hash value is less than a threshold hash value, it is added to a buffer that is used to store the hash values in sorted order for a current window. If a hash value is greater than the threshold, it is added to another buffer whose hash values are not stored in sorted order. At the end of the current window, the minimum hash value in the first buffer is selected as the landmark for the window. If the first buffer is empty, then the hash values in the other buffer are sorted and the minimum hash value is selected as the landmark for the window.
摘要:
Each of a plurality of documents is divided into samples. Small bit-strings are generated for selected samples from each of the documents and used to create a sketch for each document. Because the bit-strings are small (e.g., only one, two, or three bits in length), the generated sketches are smaller than the sketches generated using previous methods for generating sketches, and therefore use less storage space. The generated sketches are compared to determine documents that are near-duplicates of one another.
摘要:
Hash values corresponding to a file are processed in windows to determine a minimum hash value for each window. Each window may begin at a minimum hash value determined for a previous window and end after a fixed number of hash values. If a hash value is less than a threshold hash value, it is added to a buffer that is used to store the hash values in sorted order for a current window. If a hash value is greater than the threshold, it is added to another buffer whose hash values are not stored in sorted order. At the end of the current window, the minimum hash value in the first buffer is selected as the landmark for the window. If the first buffer is empty, then the hash values in the other buffer are sorted and the minimum hash value is selected as the landmark for the window.