INFORMATION EXTRACTION FROM DOCUMENT CORPORA

    Publication No.: US20230132061A1

    Publication date: 2023-04-27

    Application No.: US17508117

    Filing date: 2021-10-22

    Abstract: Information extraction systems and computer-implemented methods produce a searchable representation of information contained in a corpus of documents. For each document, a document structure graph is generated that indicates a structural hierarchy of document items in that document, based on a predefined hierarchy of predetermined item-types, and links document items to a parent document item in the structural hierarchy. A knowledge graph is generated that includes first nodes, representing document items in the corpus, and second nodes, representing language items identified in those document items. The first and second nodes are interconnected by edges, each representing a defined relation between the items represented by the nodes it connects. The knowledge graph is stored in a knowledge graph database, and the searchable representation is produced by traversing edges of the graph in response to input search queries.
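
    The steps in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the item-type ranking `ITEM_RANK`, the relation names (`child_of`, `mentions`, and so on), the use of lowercase words as "language items", and the in-memory dictionary standing in for a knowledge graph database are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical predefined hierarchy of item-types (lower rank = higher level).
ITEM_RANK = {"document": 0, "section": 1, "paragraph": 2}


def build_structure(items):
    """Return, for each (item_type, text) item, the index of its parent item.

    The parent is the nearest preceding item whose type ranks higher in
    ITEM_RANK; the root item has parent None.
    """
    parents = []
    stack = []  # indices of currently open ancestor items
    for index, (item_type, _text) in enumerate(items):
        rank = ITEM_RANK[item_type]
        # Close every open item that is at the same level or deeper.
        while stack and ITEM_RANK[items[stack[-1]][0]] >= rank:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(index)
    return parents


def build_knowledge_graph(doc_id, items, parents):
    """Build an adjacency map: node -> set of (relation, neighbour node)."""
    edges = defaultdict(set)
    for index, (_item_type, text) in enumerate(items):
        node = ("item", doc_id, index)  # "first node": a document item
        if parents[index] is not None:
            parent = ("item", doc_id, parents[index])
            edges[node].add(("child_of", parent))
            edges[parent].add(("parent_of", node))
        for word in text.lower().split():
            word = word.strip(".,;:")
            if not word:
                continue
            term = ("term", word)  # "second node": a language item
            edges[node].add(("mentions", term))
            edges[term].add(("mentioned_in", node))
    return edges


def search(edges, word):
    """For each item mentioning `word`, return the path from it up to the root."""
    results = []
    for relation, item in sorted(edges.get(("term", word.lower()), ())):
        if relation != "mentioned_in":
            continue
        path = [item]
        current = item
        while True:
            ups = [n for r, n in edges.get(current, ()) if r == "child_of"]
            if not ups:
                break
            current = ups[0]
            path.append(current)
        results.append(path)
    return results


items = [
    ("document", "Annual report"),
    ("section", "Revenue"),
    ("paragraph", "Revenue grew in 2021."),
    ("section", "Risks"),
    ("paragraph", "Currency risks remain."),
]
parents = build_structure(items)
graph = build_knowledge_graph("doc1", items, parents)
hits = search(graph, "revenue")
```

    Each hit is a path from the matching item up through its ancestors, so a query can return not just the paragraph that mentions a term but also the section and document that contain it.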

    DATASET MANAGEMENT IN MACHINE LEARNING

    Publication No.: US20210350274A1

    Publication date: 2021-11-11

    Application No.: US16868565

    Filing date: 2020-05-07

    IPC classification: G06N20/00 G06F11/34

    Abstract: A method, a computer system, and a computer program product are provided for managing a dataset of training samples, labeled by class, during training of a machine learning model. Embodiments of the present invention may include training the model on a sequence of sets of the training samples of increasing size and testing the performance of the model after training with each set, to obtain class-specific performance metrics corresponding to each set size. Embodiments may include generating class-specific learning curves from the performance metrics for the plurality of classes, extrapolating the learning curves to obtain predicted performance metrics, and optimizing a function of the predicted performance metrics to identify a set of augmentation actions to augment the dataset for further training of the model. Embodiments may include providing an output indicative of the set of augmentation actions.
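
    As a rough illustration of the learning-curve idea, the sketch below fits a curve per class, extrapolates it, and picks augmentation actions. Several choices are assumptions not stated in the abstract: the performance metric is taken to be a per-class error rate, the curve is modelled as a power law error = a * n^(-b), an augmentation action is "add `step` more labeled samples to one class", and the optimization is a simple greedy allocation of a sample budget.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def predict_error(params, n):
    """Extrapolate the fitted learning curve to a set size of n samples."""
    a, b = params
    return a * n ** (-b)


def plan_augmentation(curves, current_sizes, budget, step):
    """Greedily assign `budget` new samples, `step` at a time, to the class
    whose extrapolated error is predicted to drop the most."""
    sizes = dict(current_sizes)
    plan = {cls: 0 for cls in curves}
    for _ in range(budget // step):
        def gain(cls):
            before = predict_error(curves[cls], sizes[cls])
            after = predict_error(curves[cls], sizes[cls] + step)
            return before - after
        best = max(curves, key=gain)
        sizes[best] += step
        plan[best] += step
    return plan


# Synthetic per-class error rates measured at increasing training-set sizes.
set_sizes = [100, 200, 400, 800]
measured = {
    "cat": [2.0 * n ** -0.5 for n in set_sizes],
    "dog": [0.5 * n ** -0.5 for n in set_sizes],
}
curves = {cls: fit_power_law(set_sizes, errs) for cls, errs in measured.items()}
plan = plan_augmentation(curves, {"cat": 800, "dog": 800}, budget=400, step=100)
```

    With this synthetic data the class with the higher error curve receives the whole budget. A real system would use measured metrics and could weigh other action types, such as the cost of acquiring samples per class.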

    BUILDING KNOWLEDGE GRAPHS BASED ON PARTIAL TOPOLOGIES FORMULATED BY USERS

    Publication No.: US20230252309A1

    Publication date: 2023-08-10

    Application No.: US17650086

    Filing date: 2022-02-07

    IPC classification: G06N5/02 G06F40/279

    CPC classification: G06N5/022 G06F40/279

    Abstract: A computer-implemented method, a computer program product, and a computer system for building a knowledge graph are provided. The computer system converts user inputs describing a partial topology of the knowledge graph that a user wants to build into one or more initial nodes corresponding to respective natural language descriptions. The computer system interprets the respective natural language descriptions using natural language processing to match the one or more initial nodes against reference data. Based on the matched reference data, the computer system obtains a valid topology of nodes and edges, wherein the nodes and edges are mapped onto the matched reference data. Based on the valid topology, the computer system generates a data flow linking to the matched reference data via associations between the nodes and edges and the matched reference data. The computer system then builds an executable knowledge graph from the data flow.
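
    The matching and validation steps can be sketched as follows. This is a deliberately simplified stand-in: word-overlap (Jaccard) similarity replaces real natural language processing, the reference data is a hypothetical dictionary of dataset names and descriptions, and a "valid topology" is taken to mean the user's nodes and edges whose endpoints all matched some reference entry. The data-flow and execution steps are not covered.

```python
def tokens(text):
    """Lowercase bag of words; a crude substitute for NLP interpretation."""
    return set(text.lower().split())


def match_node(description, reference):
    """Return the reference key with the highest Jaccard overlap, or None."""
    best_key, best_score = None, 0.0
    words = tokens(description)
    for key, ref_description in reference.items():
        ref_words = tokens(ref_description)
        union = words | ref_words
        score = len(words & ref_words) / len(union) if union else 0.0
        if score > best_score:
            best_key, best_score = key, score
    return best_key


def build_topology(node_descriptions, user_edges, reference):
    """Keep only nodes that matched reference data and edges between them."""
    matched = {
        node: match_node(text, reference)
        for node, text in node_descriptions.items()
    }
    nodes = {node: ref for node, ref in matched.items() if ref is not None}
    edges = [
        (src, relation, dst)
        for src, relation, dst in user_edges
        if src in nodes and dst in nodes
    ]
    return nodes, edges


# Hypothetical reference data: dataset name -> short description.
reference = {
    "hr.employees": "employees of the company",
    "crm.customers": "customers and their orders",
}
# Partial topology sketched by a user: nodes with free-text descriptions.
node_descriptions = {
    "n1": "company employees",
    "n2": "customer orders",
    "n3": "weather forecasts",
}
user_edges = [("n1", "serves", "n2"), ("n2", "affected_by", "n3")]

nodes, edges = build_topology(node_descriptions, user_edges, reference)
```

    Here the node with no counterpart in the reference data is dropped together with the edge that touches it, leaving a topology in which every node maps onto a reference entry.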

    GENERATING A STRUCTURE OF A PDF-DOCUMENT

    Publication No.: US11687700B1

    Publication date: 2023-06-27

    Application No.: US17649597

    Filing date: 2022-02-01

    Abstract: The present disclosure relates to a method for generating a structure of a PDF-document, wherein the PDF-document comprises elements. The method comprises detecting document cells of the PDF-document dependent on commands of a page description language for printing the elements of the PDF-document. The method further comprises determining parts of the PDF-document by a machine learning module, wherein determining a respective part comprises associating a respective portion of the elements of the PDF-document with that part. A respective label may be assigned to each part. The method may further comprise using a symbolic artificial intelligence (AI) module, wherein rules of the symbolic AI module may be applied to reconcile the document cells with the parts. The elements of the structure of the PDF-document may be generated and labelled dependent on a result of the reconciling and on the label assigned to the respective part.
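
    The reconciliation step can be pictured with a small sketch. It assumes that document cells and machine-learning-predicted parts are both available as axis-aligned bounding boxes, and it uses a single hypothetical rule: each cell takes the label of the part it overlaps most, falling back to a default label when it overlaps none. The abstract does not specify the actual rules, so this is only one plausible instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a, b):
    """Area of the intersection of two boxes (0.0 if they do not overlap)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def reconcile(cells, parts, default_label="text"):
    """Label each cell with the label of the part it overlaps most.

    cells: list of (Box, text) detected from page-description commands.
    parts: list of (Box, label) predicted by a machine learning module.
    Returns a list of (label, text) pairs in cell order.
    """
    structure = []
    for cell_box, text in cells:
        best_label, best_area = default_label, 0.0
        for part_box, label in parts:
            area = overlap_area(cell_box, part_box)
            if area > best_area:
                best_label, best_area = label, area
        structure.append((best_label, text))
    return structure


cells = [
    (Box(0, 0, 100, 20), "Quarterly Report"),
    (Box(0, 30, 100, 80), "Revenue rose in the third quarter."),
    (Box(0, 90, 100, 100), "Page 1"),
]
parts = [
    (Box(0, 0, 100, 25), "title"),
    (Box(0, 25, 100, 85), "paragraph"),
]
structure = reconcile(cells, parts)
```

    The third cell lies outside every predicted part, so it keeps the default label. A fuller rule set might instead merge cells, split parts, or resolve conflicts between overlapping predictions.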