LOG SOURCETYPE INFERENCE MODEL TRAINING FOR A DATA INTAKE AND QUERY SYSTEM

    Publication No.: US20220036002A1

    Publication Date: 2022-02-03

    Application No.: US16945448

    Application Date: 2020-07-31

    Applicant: Splunk Inc.

    Abstract: Systems and methods are described for training an artificial intelligence model to infer a log sourcetype of a log. For example, logs may have different log sourcetypes, and logs having the same log sourcetypes may have different messagetypes. The artificial intelligence model may be a machine learning model, and can be trained using training data that includes logs with known log sourcetypes. Each log can be tokenized, filtered, converted into a vector, and applied to a machine learning model as an input to perform the training. The machine learning model may output an inferred log sourcetype, which can be compared with the known log sourcetype to update model parameters to improve the machine learning model accuracy. The trained machine learning model may be trained to infer a log sourcetype of a log regardless of the messagetype of the log.
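
The training loop in this abstract (tokenize, filter, vectorize, infer, compare with the known sourcetype, update parameters) can be sketched roughly as follows. This is a minimal illustration, not the patented method: the multiclass perceptron stands in for the unspecified machine learning model, and the number-collapsing filter, the bag-of-words vectors, and all function names are assumptions.

```python
import re

def tokenize(log):
    """Split a raw log line into alphanumeric tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", log) if t]

def filter_tokens(tokens):
    """Assumed filter: lowercase tokens and collapse numbers into one placeholder."""
    return ["<num>" if t.isdigit() else t.lower() for t in tokens]

def vectorize(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary; unknown tokens are dropped."""
    vec = [0.0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1.0
    return vec

def predict(weights, vec):
    """Return the label whose weight vector scores highest against vec."""
    scores = {label: sum(w * x for w, x in zip(ws, vec))
              for label, ws in weights.items()}
    return max(scores, key=scores.get)

def train(labeled_logs, epochs=10):
    """Train on (log, known_sourcetype) pairs with a multiclass perceptron."""
    prepared = [(filter_tokens(tokenize(log)), label) for log, label in labeled_logs]
    vocab = {}
    for tokens, _ in prepared:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    labels = sorted({label for _, label in prepared})
    weights = {label: [0.0] * len(vocab) for label in labels}
    for _ in range(epochs):
        for tokens, known in prepared:
            vec = vectorize(tokens, vocab)
            inferred = predict(weights, vec)
            if inferred != known:
                # Inferred and known sourcetypes disagree: update the parameters.
                for i, x in enumerate(vec):
                    weights[known][i] += x
                    weights[inferred][i] -= x
    return vocab, weights

def infer_sourcetype(model, log):
    """Apply the trained model to a log whose sourcetype is unknown."""
    vocab, weights = model
    return predict(weights, vectorize(filter_tokens(tokenize(log)), vocab))

# Hypothetical training data: two sourcetypes, two messagetypes each.
TRAINING_LOGS = [
    ("GET /index.html 200 1024", "access_log"),
    ("POST /login 302 512", "access_log"),
    ("sshd[811]: Failed password for root", "auth_log"),
    ("sshd[902]: Accepted password for admin", "auth_log"),
]
```

Calling `infer_sourcetype(train(TRAINING_LOGS), some_log)` returns one of the labels seen in training. Collapsing numbers into a placeholder is one way to make the vector depend on the log's layout rather than on specific values.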

    DATA FIELD EXTRACTION MODEL TRAINING FOR A DATA INTAKE AND QUERY SYSTEM

    Publication No.: US20220035775A1

    Publication Date: 2022-02-03

    Application No.: US16945229

    Application Date: 2020-07-31

    Applicant: Splunk Inc.

    Abstract: Systems and methods are described for training an artificial intelligence model to extract one or more data fields from a log. For example, the artificial intelligence model may be a neural network. The neural network may be trained using training data obtained by iterating through a plurality of logs using active learning, and selecting a subset of the logs in the plurality to be labeled by a user. For example, the selected subset of logs may be logs that are not similar to other logs already labeled by a user. The user may be prompted to label the selected subset of logs to identify one or more data fields to extract. Once the selected subset of logs are labeled, these labeled logs can be used as the training data to train the neural network.
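
The selection step in this abstract, picking logs that are not similar to logs a user has already labeled, might look like the sketch below. Jaccard similarity over token sets, the `threshold` and `budget` parameters, and the function names are assumptions chosen for illustration; the abstract does not say which similarity measure is used.

```python
import re

def token_set(log):
    """Lowercased token set, with numbers collapsed into one placeholder."""
    return {"<num>" if t.isdigit() else t.lower()
            for t in re.split(r"[^A-Za-z0-9]+", log) if t}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def select_logs_to_label(logs, labeled_logs, threshold=0.5, budget=10):
    """Iterate through logs and pick those unlike anything labeled so far."""
    covered = [token_set(log) for log in labeled_logs]
    selected = []
    for log in logs:
        if len(selected) >= budget:
            break
        tokens = token_set(log)
        if all(jaccard(tokens, seen) < threshold for seen in covered):
            selected.append(log)
            # Treat the selected log as covered, since the user will label it.
            covered.append(tokens)
    return selected
```

The returned logs are the ones a user would be prompted to label; the labeled results would then serve as training data for the neural network.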

    Selection of a representative data subset of a set of unstructured data

    Publication No.: US11232124B2

    Publication Date: 2022-01-25

    Application No.: US16751063

    Application Date: 2020-01-23

    Applicant: SPLUNK INC.

    Abstract: Embodiments are directed towards generating a representative sampling as a subset from a larger dataset that includes unstructured data. A graphical user interface enables a user to provide various data selection parameters, including specifying a data source and one or more subset types desired, including one or more of latest records, earliest records, diverse records, outlier records, and/or random records. Diverse and/or outlier subset types may be obtained by generating clusters from an initial selection of records obtained from the larger dataset. An iteration analysis is performed to determine whether a sufficient number of clusters and/or cluster types have been generated that exceed at least one threshold and when not exceeded, additional clustering is performed on additional records. From the resultant clusters, and/or other subtype results, a subset of records is obtained as the representative sampling subset.
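
As a rough illustration of the subset types and the iterative clustering check described in this abstract, consider the sketch below. It assumes records arrive ordered oldest to newest, clusters them by a simple layout signature, and keeps clustering further batches until the cluster count exceeds a threshold. The signature function, parameter names, and batch logic are all hypothetical.

```python
import random
import re

def signature(record):
    """Layout signature: digit runs become '0', letter runs become 'w'."""
    return re.sub(r"[A-Za-z]+", "w", re.sub(r"\d+", "0", record))

def cluster_batches(records, batch_size, cluster_threshold):
    """Cluster records batch by batch until the cluster count exceeds the threshold."""
    clusters = {}
    for start in range(0, len(records), batch_size):
        for record in records[start:start + batch_size]:
            clusters.setdefault(signature(record), []).append(record)
        if len(clusters) > cluster_threshold:
            break  # enough clusters; no additional clustering needed
    return clusters

def representative_subset(records, subset_type, count=2,
                          batch_size=3, cluster_threshold=2, seed=0):
    """Return a subset of records for one of the subset types in the abstract."""
    if subset_type == "latest":
        return records[-count:]
    if subset_type == "earliest":
        return records[:count]
    if subset_type == "random":
        return random.Random(seed).sample(records, min(count, len(records)))
    clusters = cluster_batches(records, batch_size, cluster_threshold)
    if subset_type == "diverse":
        # One record from each cluster.
        return [members[0] for members in clusters.values()]
    if subset_type == "outlier":
        # Records from the least populated clusters.
        smallest = min(len(members) for members in clusters.values())
        return [m[0] for m in clusters.values() if len(m) == smallest]
    raise ValueError(f"unknown subset type: {subset_type}")
```

A caller could combine several subset types by concatenating the results, which matches the abstract's "one or more subset types" wording.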

    Automated data-generation for event-based system

    Publication No.: US11227208B2

    Publication Date: 2022-01-18

    Application No.: US15224489

    Application Date: 2016-07-29

    Applicant: Splunk, Inc.

    Abstract: Described herein is a technology that facilitates the production of and the use of automated datagens for event-based systems. A datagen (i.e., data-generator or data generation system) is a component, module, or subsystem of computer systems that searches, monitors, and analyzes machine data. A datagen produces events that are further processed in various ways for subsequent use (such as searching, monitoring, and analysis).

    Field extraction rules from clustered data samples

    Publication No.: US11216491B2

    Publication Date: 2022-01-04

    Application No.: US15143563

    Application Date: 2016-04-30

    Applicant: Splunk Inc.

    Abstract: The operation of an automatic data input and query system is controlled by well-defined control data. Certain control data may relate to data schemas and direct operations performed by the system to extract fields from machine data. Automatic methods may determine proper field extraction control information by analyzing a sample of data from a source, breaking the sample data into event segments, classifying the segments into groups based on a measure of similarity, determining an operable extraction rule for each group, and storing the resulting extraction model. Data patterns known by the system can be leveraged to perform the event breaking and field identification for the classifying. Embodiments may provide a user interface to view, interact with, and approve the computer-generated extraction model.
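
The pipeline in this abstract (break a sample into event segments, group the segments by similarity, derive an extraction rule per group, store the resulting model) can be sketched as below. The sketch assumes one event per line, uses a layout "shape" as the similarity measure, and emits a regular expression per group in which constant tokens stay literal and varying tokens become named capture groups. The field names `field_N` and all function names are hypothetical.

```python
import re
from collections import defaultdict

def break_events(sample):
    """Assumed event breaking: one event per non-empty line."""
    return [line.strip() for line in sample.splitlines() if line.strip()]

def shape(event):
    """Similarity key: collapse every alphanumeric run so only the layout remains."""
    return re.sub(r"[A-Za-z0-9]+", "x", event)

def build_rule(events):
    """Build a regex for one group: literals where tokens agree, fields where they vary."""
    parts = []
    for i, column in enumerate(zip(*(e.split() for e in events))):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))
        else:
            parts.append(rf"(?P<field_{i}>\S+)")
    return r"\s+".join(parts)

def build_extraction_model(sample):
    """Group events by shape and store one extraction rule per group."""
    groups = defaultdict(list)
    for event in break_events(sample):
        groups[shape(event)].append(event)
    return {key: build_rule(events) for key, events in groups.items()}

def extract_fields(model, event):
    """Apply the stored rule for the event's group; None if no rule matches."""
    event = event.strip()
    rule = model.get(shape(event))
    if rule is None:
        return None
    match = re.fullmatch(rule, event)
    return match.groupdict() if match else None
```

The returned dictionary plays the role of the stored extraction model; in the system the abstract describes, a user interface would let someone review and approve these generated rules before use.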

    Generating augmented process models for process analytics

    Publication No.: US11210622B2

    Publication Date: 2021-12-28

    Application No.: US15339787

    Application Date: 2016-10-31

    Applicant: Splunk Inc.

    Abstract: Embodiments of the present invention are directed to generating augmented process models for use in process analytics. In one embodiment, a process model, search indicators, composite attributes, and relationship indicators are received. The process model defines a process and includes a plurality of components of the process. Search indicators indicate a search that, when executed, provides data related to the corresponding component. Composite attributes indicate data to be captured by machine data searches associated with the corresponding component. Relationship indicators indicate relationships between components of the process. An augmented process model is generated based on the process model, the search indicators, the composite attributes, and the relationship indicators, wherein the augmented process model is used to manage process instances associated with the process.
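
One way to picture the inputs named in this abstract is as a small data structure that merges a process model's components with their search indicators, composite attributes, and relationship indicators. The class, field names, and dictionary-based inputs below are assumptions for illustration only; the abstract does not specify a representation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class AugmentedComponent:
    """One process component plus the data attached to it during augmentation."""
    name: str
    search: Optional[str] = None                          # search indicator
    attributes: List[str] = field(default_factory=list)   # composite attributes
    next_components: List[str] = field(default_factory=list)  # relationships

def augment_process_model(
    components: List[str],
    search_indicators: Dict[str, str],
    composite_attributes: Dict[str, List[str]],
    relationship_indicators: List[Tuple[str, str]],
) -> Dict[str, AugmentedComponent]:
    """Combine the four inputs into one augmented model keyed by component name."""
    augmented = {
        name: AugmentedComponent(
            name=name,
            search=search_indicators.get(name),
            attributes=list(composite_attributes.get(name, [])),
        )
        for name in components
    }
    for source, target in relationship_indicators:
        if source not in augmented or target not in augmented:
            raise KeyError(f"relationship refers to unknown component: {source} -> {target}")
        augmented[source].next_components.append(target)
    return augmented
```

Each entry then carries everything needed to run the component's search and follow it to the next step, which is the kind of information a process-instance manager would consume.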
