摘要:
A computer-implemented method for scan sharing across multiple cores in a business intelligence (BI) query. The method includes receiving a plurality of BI queries, storing a block of data in a first cache, scanning the block of data in the first cache against a first batch of queries on a first processor core, and scanning the block of data against a second batch of queries on a second processor core. The first cache is associated with a first processor core. The block of data includes a subset of data stored in an in-memory database (IMDB). The first batch of queries includes two or more of the BI queries. The second batch of queries includes one or more of the BI queries that are not included in the first batch of queries.
摘要:
A system is described for creating compact aggregation working areas for efficient grouping and aggregation using multi-core CPUs. The system implements operations including computing a running aggregate for a group within a business intelligence (BI) query, and identifying a location to store running aggregate information within an aggregation working area of a cache. The aggregation working area includes first and second data structures. The first data structure stores running aggregate information that is associated with a group that is accessed frequently relative to a threshold. The second data structure stores running aggregate information that is associated with a group that is accessed infrequently relative to the threshold. The operations also include storing the running aggregate information in either the first or second data structure of the aggregation working area based on a characterization of the group as a frequently or infrequently accessed group.
摘要:
A method, system, and article are provided for employment of a hybrid layout of representation of data objects in computer memory. Columns of the database are separated based upon a classification of the columns. A vertical partition in the form of a bank is provided to receive an assignment of one or more data objects identified in the columns. Each bank is sized to be a divisor of a size of an associated hardware register. Assignment of data objects to banks organizes the data in a manner that support efficient query processing that mitigates the quantity of banks required to respond to the query.
摘要:
A method, system, and article are provided for employment of a hybrid layout of representation of data objects in computer memory. Columns of the database are separated based upon a classification of the columns. A vertical partition in the form of a bank is provided to receive an assignment of one or more data objects identified in the columns. Each bank is sized to be a divisor of a size of an associated hardware register. Assignment of data objects to banks organizes the data in a manner that support efficient query processing that mitigates the quantity of banks required to respond to the query.
摘要:
A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results.
摘要:
A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results.
摘要:
A method and system for automatically refining information extraction (IE) rules. A provenance graph for IE rules on a set of test documents is determined. The provenance graph indicates a sequence of evaluations of the IE rules that generates an output of each operator of the IE rules. Based on the provenance graph, high-level rule changes (HLCs) of the IE rules are determined. Low-level rule changes (LLCs) of the IE rules are determined to specify how to implement the HLCs. Each LLC specifies changing an operator's structure or inserting a new operator in between two operators. Based on how the LLCs affect the IE rules and previously received correct results of applying the rules on the test documents, a ranked list of the LLCs is determined. The IE rules are refined based on the ranked list.
摘要:
A data mashup system having information extraction capabilities for receiving multiple streams of textual data, at least one of which contains unstructured textual data. A repository stores annotators that describe how to analyze the streams of textual data for specified unstructured data components. The annotators are applied to the data streams to identify and extract the specified data components according to the annotators. The extracted data components are tagged to generate structured data components and the specified unstructured data components in the input data streams are replaced with the tagged data components. The system then combines the tagged data from the multiple streams to form a mashup output data stream.
摘要:
Systems, methods and computer program products for an algebraic approach to rule-based information extraction. Exemplary embodiments include a method for rule-based information extraction, the method including specifying an annotator using algebraic operators, wherein each algebraic operator describes annotations identification from text documents.
摘要:
A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results.