DETERMINATION OF DENSE EMBEDDING TENSORS FOR LOG DATA USING BLOCKWISE RECURRENT NEURAL NETWORKS

    Publication No.: US20240346286A1

    Publication Date: 2024-10-17

    Application No.: US18301102

    Filing Date: 2023-04-14

    IPC Classification: G06N3/0442

    CPC Classification: G06N3/0442

    Abstract: In some implementations, a device may receive information associated with a software log corpus. The device may identify alphanumeric blocks in the software log corpus. The device may encode the blocks to generate numeric encoded blocks. The device may generate a set of input sequences and a set of target sequences based on the encoded blocks and a statistical block length associated with the blocks, wherein the set of target sequences are shifted versions of the set of input sequences. The device may generate a training dataset for embedding computation based on combining the set of input sequences and the set of target sequences into a tuple, partitioning the tuple into batches, and shuffling the batches to obtain the training dataset. The device may generate a set of dense embedding tensors using the training dataset and the encoded blocks.
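The dataset-construction steps in the abstract (encode blocks, build shifted input/target sequences, combine into tuples, batch, shuffle) can be sketched as follows. This is a minimal illustration, not the patented method: the function name, the character-level code-point encoding, and the fixed `seq_len` (in place of the abstract's statistical block length) are all assumptions made for the example.

```python
import random

def build_training_dataset(corpus_blocks, seq_len=4, batch_size=2, seed=0):
    """Sketch of the abstract's pipeline: encode alphanumeric blocks,
    derive shifted input/target sequences, batch, and shuffle.
    The encoding scheme here (character code points) is illustrative only.
    """
    # Encode each alphanumeric block numerically (toy per-character encoding).
    encoded = [ord(ch) for block in corpus_blocks for ch in block]

    # Input sequences, and target sequences shifted one step ahead of them.
    inputs = [encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)]
    targets = [encoded[i + 1:i + 1 + seq_len]
               for i in range(len(encoded) - seq_len)]

    # Combine inputs and targets into tuples, partition into batches,
    # then shuffle the batches to obtain the training dataset.
    pairs = list(zip(inputs, targets))
    batches = [pairs[i:i + batch_size]
               for i in range(0, len(pairs), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches
```

The resulting batches could then feed a recurrent model whose learned weights yield the dense embedding tensors; that training step is omitted here.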

    System and Method for Optimized Training of a Neural Network Model for Data Extraction

    Publication No.: US20240281664A1

    Publication Date: 2024-08-22

    Application No.: US18136985

    Filing Date: 2023-04-20

    IPC Classification: G06N3/09 G06N3/0442

    CPC Classification: G06N3/09 G06N3/0442

    Abstract: A system and method for optimized training of a neural network model for data extraction is provided. The present invention provides for generating a pre-determined format type of the input document by extracting words from the input document along with the coordinates corresponding to each word. Further, N-grams are generated by analyzing the neighboring words associated with entity text present in the pre-determined format type of document based on a threshold measurement criterion and combining the extracted neighboring words in a pre-defined order. Further, the generated N-grams are compared with the coordinates corresponding to the words for labelling the N-grams with a field name. Further, each word in an N-gram identified by the field name is tokenized in accordance with the location of each of the words relative to the named entity (NE) for assigning a token marker. Lastly, the neural network model is trained based on the tokenized words in the N-grams identified by the token markers. The trained neural network model is implemented for extracting data from documents.
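The N-gram labelling and token-marker steps described above can be sketched as below. This is an illustration under stated assumptions, not the patented procedure: the function name, the word-index `window` standing in for the abstract's threshold measurement criterion, and the BIO-style `B-`/`I-` markers are all hypothetical choices made for the example.

```python
def label_ngrams(words, entity, field_name, window=1):
    """Sketch: for each occurrence of the entity word, combine its
    neighbouring words (within an assumed window) into an N-gram,
    label the N-gram with the field name, and assign each word a
    token marker based on its position relative to the named entity.
    `words` is a list of (word, coordinates) pairs."""
    results = []
    for i, (word, _coords) in enumerate(words):
        if word != entity:
            continue
        # Combine the extracted neighbouring words in order.
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        ngram = [w for w, _ in words[lo:hi]]
        # Token markers (illustrative BIO-like scheme): B- for the
        # entity word itself, I- for its neighbours.
        markers = [("B-" if j == i else "I-") + field_name
                   for j in range(lo, hi)]
        results.append({"ngram": ngram, "field": field_name,
                        "markers": markers})
    return results
```

The labelled, marker-tagged N-grams would then serve as training examples for the neural network model; the training loop itself is outside the scope of this sketch.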