-
Publication No.: US10223437B2
Publication date: 2019-03-05
Application No.: US14634199
Application date: 2015-02-27
Applicant: Oracle International Corporation
Inventor: Boris Klots, Vikas Aggarwal, Nipun Agarwal, John Kowtko, Felix Schmidt, Kantikiran Pasupuleti
Abstract: A method and apparatus for adaptive data repartitioning and adaptive data replication is provided. A data set stored in a distributed data processing system is partitioned by a first partitioning key. A live workload comprising a plurality of data processing commands is processed. While processing the live workload, statistical properties of the live workload are maintained. Based on the statistical properties of the live workload with respect to the data set, it is determined to replicate and/or repartition the data set by a second partitioning key. The replicated and/or repartitioned data set is partitioned by the second partitioning key.
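The decision loop in this abstract can be illustrated with a minimal sketch. It assumes the "statistical properties" are simple counts of how often each column is used as a join or filter key; the class `WorkloadStats`, the function `maybe_repartition`, and the `min_ratio` rule are hypothetical names and choices made for illustration, not taken from the patent.

```python
from collections import Counter


class WorkloadStats:
    """Counts how often each column is used as a key by the live workload."""

    def __init__(self):
        self.key_usage = Counter()

    def observe(self, command_keys):
        # command_keys: columns that one data processing command joins or filters on
        self.key_usage.update(command_keys)

    def preferred_key(self):
        # Most frequently used column so far, or None before any command is seen
        return self.key_usage.most_common(1)[0][0] if self.key_usage else None


def partition(rows, key, num_partitions):
    """Hash-partition rows (dicts) by the value stored under `key`."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts


def maybe_repartition(rows, current_key, stats, num_partitions, min_ratio=2.0):
    """Return (key, partitions); partitions is None when nothing changes.

    Assumed rule: switch keys only when the candidate key is used at least
    `min_ratio` times as often as the current partitioning key.
    """
    candidate = stats.preferred_key()
    if candidate is None or candidate == current_key:
        return current_key, None
    current_uses = max(stats.key_usage[current_key], 1)
    if stats.key_usage[candidate] >= min_ratio * current_uses:
        return candidate, partition(rows, candidate, num_partitions)
    return current_key, None
```

A real system would weigh the cost of moving data against the expected savings; the fixed ratio here only stands in for that decision.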
-
Publication No.: US12217136B2
Publication date: 2025-02-04
Application No.: US16935313
Application date: 2020-07-22
Applicant: Oracle International Corporation
Inventor: Felix Schmidt, Yasha Pushak, Stuart Wray
IPC: G06N20/00, G06F16/901, G06N5/04
Abstract: Techniques are described that extend supervised machine-learning algorithms for use with semi-supervised training. Random labels are assigned to unlabeled training data, and the data is split into k partitions. During a label-training iteration, each of these k partitions is combined with the labeled training data, and the combination is used to train a single instance of the machine-learning model. Each of these trained models is then used to predict labels for data points in the k−1 partitions of previously-unlabeled training data that were not used to train the model. Thus, every data point in the previously-unlabeled training data obtains k−1 predicted labels. For each data point, these labels are aggregated to obtain a composite label prediction for the data point. After the labels are determined via one or more label-training iterations, a machine-learning model is trained on data with the resulting composite label predictions and on the labeled data set.
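The label-training iteration can be sketched as follows. This is an illustration under stated assumptions: the supervised learner is a toy nearest-centroid classifier on one-dimensional features, the k−1 predictions are aggregated by majority vote, and all function names are hypothetical. The abstract does not specify the model or the aggregation rule.

```python
import random
from collections import Counter


def train_centroids(points, labels):
    """Toy supervised learner: the mean feature value of each class."""
    sums, counts = {}, Counter()
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def label_training(labeled_x, labeled_y, unlabeled_x, k=3, iterations=3, seed=0):
    """Return composite label predictions for unlabeled_x (requires k >= 2)."""
    rng = random.Random(seed)
    classes = sorted(set(labeled_y))
    # Assign random labels to the unlabeled data
    guess = [rng.choice(classes) for _ in unlabeled_x]
    # Split the unlabeled data (by index) into k partitions
    idx = list(range(len(unlabeled_x)))
    rng.shuffle(idx)
    parts = [idx[i::k] for i in range(k)]
    for _ in range(iterations):
        votes = [Counter() for _ in unlabeled_x]
        for p, part in enumerate(parts):
            # Train one model on the labeled data plus this partition
            xs = labeled_x + [unlabeled_x[i] for i in part]
            ys = labeled_y + [guess[i] for i in part]
            model = train_centroids(xs, ys)
            # Predict labels for the other k-1 partitions
            for q, other in enumerate(parts):
                if q != p:
                    for i in other:
                        votes[i][predict(model, unlabeled_x[i])] += 1
        # Aggregate the k-1 predictions per point (assumed: majority vote)
        guess = [v.most_common(1)[0][0] if v else g for v, g in zip(votes, guess)]
    return guess
```

The returned labels would then be combined with the labeled set to train the final model, as the last sentence of the abstract describes.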
-
Publication No.: US12182122B2
Publication date: 2024-12-31
Application No.: US17964084
Application date: 2022-10-12
Applicant: Oracle International Corporation
Inventor: Felix Schmidt, Matteo Casserini, Milos Vasic, Marija Nikolic
IPC: G06F16/00, G06F16/2453, G06F16/2458
Abstract: A method and one or more non-transitory storage media are provided to train and implement a one-hot encoder. During a training phase, computation of an encoder state is performed by executing a set of relational statements to extract unique categories in a first training data set, associate each unique category with a unique index, and generate a one-hot encoding for each unique category. The set of relational statements are executed by a query optimization engine. Execution of the set of relational statements is postponed until a result of each relational statement is needed, and the query optimization engine implements one or more optimizations when executing the set of relational statements. During an encoding phase, a set of categorical features in a second training data set are encoded based on the encoder state to form a set of encoded categorical features.
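The idea of postponing relational statements until their result is needed can be sketched with Python's built-in `sqlite3` module. This is only an illustration: SQLite stands in for the query optimization engine, the class `LazyOneHotEncoder` is hypothetical, and the category-to-index mapping is built in Python rather than by a further relational statement.

```python
import sqlite3


class LazyOneHotEncoder:
    """One-hot encoder whose state is computed by SQL only on first use."""

    def __init__(self, conn, table, column):
        self._conn = conn
        # Identifiers are interpolated directly; this sketch assumes trusted names.
        self._statement = f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1'
        self._state = None  # category -> index, filled lazily

    def fit(self):
        # Training only records the statement; nothing is executed yet.
        return self

    @property
    def state(self):
        if self._state is None:
            rows = self._conn.execute(self._statement).fetchall()
            self._state = {cat: idx for idx, (cat,) in enumerate(rows)}
        return self._state

    def encode(self, values):
        """Encode categories as one-hot lists; unseen categories become all zeros."""
        width = len(self.state)
        encoded = []
        for value in values:
            row = [0] * width
            if value in self.state:
                row[self.state[value]] = 1
            encoded.append(row)
        return encoded
```

For example, after `fit()` on a table whose column holds `red`, `green`, and `blue`, the first call to `encode` triggers the `SELECT DISTINCT` and assigns indices in sorted order.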
-
Publication No.: US20240403153A1
Publication date: 2024-12-05
Application No.: US18205076
Application date: 2023-06-02
Applicant: Oracle International Corporation
Inventor: Arno Schneuwly, Aneesh Dahiya, Felix Schmidt
Abstract: In an embodiment, a computer generates a multi-sequence vector that contains a plurality of distinct sequences of distinct nodes of a parse tree of source logic. Based on the multi-sequence vector, the computer trains a logic encoder. After training and in a production environment, the logic encoder infers a fixed-size encoded logic from new source logic. Based on the fixed-size encoded logic, the new source logic is detected as anomalous by an anomaly detector. Both of the logic encoder and the anomaly detector are machine learning models and, herein, they may be separately trained. In an embodiment, the logic encoder is based on a natural language processing (NLP) language model architecture such as bidirectional encoder representations from transformers (BERT), or novel training herein may be self-supervised according to skip-gram for use with an unlabeled training corpus.
-
Publication No.: US20240345811A1
Publication date: 2024-10-17
Application No.: US18202756
Application date: 2023-05-26
Applicant: Oracle International Corporation
Inventor: Arno Schneuwly, Saeid Allahdadian, Pritam Dash, Matteo Casserini, Felix Schmidt, Eric Sedlar
IPC: G06F8/36, G06F16/955, G06F40/40
CPC classification number: G06F8/36, G06F16/955, G06F40/40
Abstract: Herein for each source logic in a corpus, a computer stores an identifier of the source logic and operates a logic encoder that infers a distinct fixed-size encoded logic that represents the variable-size source logic. At build time, a multidimensional index is generated and populated based on the encoded logics that represent the source logics in the corpus. At runtime, a user may edit and select a new source logic such as in a text editor or an integrated development environment (IDE). The logic encoder infers a new encoded logic that represents the new source logic. The multidimensional index accepts the new encoded logic as a lookup key and automatically selects and returns a result subset of encoded logics that represent similar source logics in the corpus. For display, the multidimensional index may select and return only encoded logics that are the few nearest neighbors to the new encoded logic.
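The lookup step can be sketched with a brute-force nearest-neighbor search. This is a stand-in, not the multidimensional index the abstract describes: a production system would likely use a spatial or approximate index, and the class `EncodedLogicIndex` and its methods are hypothetical. The encoded logics are assumed to be fixed-size numeric vectors already produced by the logic encoder.

```python
import heapq
import math


class EncodedLogicIndex:
    """Stores (identifier, encoded logic) pairs and returns the nearest ones."""

    def __init__(self):
        self._entries = []

    def add(self, identifier, vector):
        # Build time: register the encoded logic of one source logic in the corpus
        self._entries.append((identifier, tuple(vector)))

    def nearest(self, query, k=3):
        # Runtime: identifiers of the k entries closest to the query (Euclidean distance)
        closest = heapq.nsmallest(
            k, self._entries, key=lambda entry: math.dist(entry[1], query)
        )
        return [identifier for identifier, _ in closest]
```

Each query scans the whole corpus, which is acceptable for a sketch but is exactly the cost a real multidimensional index is meant to avoid.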
-
Publication No.: US12026631B2
Publication date: 2024-07-02
Application No.: US17131944
Application date: 2020-12-23
Applicant: Oracle International Corporation
Inventor: Arno Schneuwly, Nikola Milojkovic, Felix Schmidt, Nipun Agarwal
IPC: G06F9/44, G06F8/41, G06F16/242, G06F16/2455, G06N5/025, G06N5/04
CPC classification number: G06N5/04, G06F8/427, G06F8/43, G06F16/2433, G06F16/24564, G06N5/025
Abstract: Herein is resource-constrained feature enrichment for analysis of parse trees such as suspicious database queries. In an embodiment, a computer receives a parse tree that contains many tree nodes. Each tree node is associated with a respective production rule that was used to generate the tree node. Extracted from the parse tree are many sequences of production rules having respective sequence lengths that satisfy a length constraint that accepts at least one fixed length that is greater than two. Each extracted sequence of production rules consists of respective production rules of a sequence of tree nodes in a respective directed tree path of the parse tree having a path length that satisfies that same length constraint. Based on the extracted sequences of production rules, a machine learning model generates an inference. In a bag of rules data structure, the extracted sequences of production rules are aggregated by distinct sequence and duplicates are counted.
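The extraction of production-rule sequences and the "bag of rules" can be sketched as follows. The `Node` class, the function names, and the default fixed length of 3 are assumptions for illustration; the abstract only requires that the length constraint accept at least one fixed length greater than two.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    rule: str  # production rule that generated this tree node
    children: list = field(default_factory=list)


def rule_sequences(root, lengths=(3,)):
    """Yield rule tuples along downward tree paths whose length is in `lengths`."""
    max_len = max(lengths)

    def walk(node, ancestors):
        # Keep only as many trailing rules as the longest accepted length
        path = (ancestors + (node.rule,))[-max_len:]
        for n in lengths:
            if len(path) >= n:
                yield path[-n:]  # the path of n nodes that ends at this node
        for child in node.children:
            yield from walk(child, path)

    yield from walk(root, ())


def bag_of_rules(root, lengths=(3,)):
    """Aggregate extracted sequences by distinct sequence, counting duplicates."""
    return Counter(rule_sequences(root, lengths))
```

Because every directed path is identified by its last node, each path is emitted exactly once. The resulting counts could then be turned into a feature vector for the machine learning model.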
-
Publication No.: US20230421528A1
Publication date: 2023-12-28
Application No.: US18237853
Application date: 2023-08-24
Applicant: Oracle International Corporation
Inventor: Renata Khasanova, Felix Schmidt, Stuart Wray, Craig Schelp, Nipun Agarwal, Matteo Casserini
IPC: H04L61/4511, G06N20/00, H04L41/16
CPC classification number: H04L61/4511, G06F40/30, H04L41/16, G06N20/00
Abstract: Techniques are described herein for using machine learning to learn vector representations of DNS requests such that the resulting embeddings represent the semantics of the DNS requests as a whole. Techniques described herein perform pre-processing of tokenized DNS request strings in which hashes, which are long and relatively random strings of characters, are detected in DNS request strings and each detected hash token is replaced with a placeholder token. A vectorizing ML model is trained using the pre-processed training dataset in which hash tokens have been replaced. Embeddings for the DNS tokens are derived from an intermediate layer of the vectorizing ML model. The encoding application creates final vector representations for each DNS request string by generating a weighted summation of the embeddings of all of the tokens in the DNS request string. Because of hash replacement, the resulting DNS request embeddings reflect semantics of the hashes as a group.
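The pre-processing and final summation steps can be sketched as below. The hash detector is a heuristic assumption (long alphanumeric token with high character entropy); the abstract does not say how hashes are detected. The token embeddings and weights are taken as given inputs, whereas the described technique derives embeddings from an intermediate layer of a trained model. All names are hypothetical.

```python
import math
import re
from collections import Counter

HASH_PLACEHOLDER = "<HASH>"


def shannon_entropy(token):
    """Character-level Shannon entropy of a token, in bits."""
    counts = Counter(token)
    return -sum(c / len(token) * math.log2(c / len(token)) for c in counts.values())


def looks_like_hash(token, min_len=16, min_entropy=3.0):
    # Assumed heuristic: long, alphanumeric, and close to random
    return (
        len(token) >= min_len
        and re.fullmatch(r"[0-9a-z]+", token) is not None
        and shannon_entropy(token) >= min_entropy
    )


def preprocess(dns_request):
    """Tokenize on dots and replace each detected hash token with a placeholder."""
    tokens = dns_request.lower().strip(".").split(".")
    return [HASH_PLACEHOLDER if looks_like_hash(t) else t for t in tokens]


def embed_request(tokens, embeddings, weights):
    """Weighted sum of token embeddings; tokens without an embedding are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        weight = weights.get(token, 1.0)  # default weight of 1.0 is an assumption
        for i, value in enumerate(vector):
            total[i] += weight * value
    return total
```

Since every hash maps to the same placeholder, all hash tokens share one embedding, which is how the request embedding comes to reflect hashes as a group.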
-
Publication No.: US20230419169A1
Publication date: 2023-12-28
Application No.: US17851120
Application date: 2022-06-28
Applicant: Oracle International Corporation
Inventor: Kenyu Kobayashi, Arno Schneuwly, Renata Khasanova, Matteo Casserini, Felix Schmidt
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Herein are machine learning (ML) explainability (MLX) techniques that perturb a non-anomalous tuple to generate an anomalous tuple as adversarial input to any explainer that is based on feature attribution. In an embodiment, a computer generates, from a non-anomalous tuple, an anomalous tuple that contains a perturbed value of a perturbed feature. In the anomalous tuple, the perturbed value of the perturbed feature is modified to cause a change in reconstruction error for the anomalous tuple. The change in reconstruction error includes a decrease in reconstruction error of the perturbed feature and/or an increase in a sum of reconstruction error of all features that are not the perturbed feature. After modifying the perturbed value, an attribution-based explainer automatically generates an explanation that identifies an identified feature as a cause of the anomalous tuple being anomalous. Whether the identified feature of the explanation is or is not the perturbed feature is detected.
-
Publication No.: US20230368054A1
Publication date: 2023-11-16
Application No.: US17745103
Application date: 2022-05-16
Applicant: Oracle International Corporation
Inventor: Marija Nikolic, Matteo Casserini, Arno Schneuwly, Nikola Milojkovic, Milos Vasic, Renata Khasanova, Felix Schmidt
Abstract: The present invention relates to threshold estimation and calibration for anomaly detection. Herein are machine learning (ML) and extreme value theory (EVT) techniques for normalizing and thresholding anomaly scores without presuming a values distribution. In an embodiment, a computer receives many unnormalized anomaly scores and, according to peak over threshold (POT), selects a highest subset of the unnormalized anomaly scores that exceed a tail threshold. Based on the highest subset of the unnormalized anomaly scores, parameters of a probability density function are trained according to EVT. After training and in a production environment, a normalized anomaly score is generated based on an unnormalized anomaly score and the trained parameters of the probability density function. Anomaly detection compares the normalized anomaly score to an optimized anomaly threshold.
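The peak-over-threshold pipeline can be sketched as follows. Several choices here are assumptions, not taken from the abstract: the tail distribution is a generalized Pareto distribution (the usual choice in POT), its parameters are fitted by the method of moments rather than by whatever training the patent uses, the tail threshold is an empirical quantile, and scores at or below the tail threshold all map to the same floor value.

```python
import math
from statistics import mean, variance


def fit_pot(scores, tail_quantile=0.9):
    """Fit generalized Pareto parameters to the scores above a tail threshold."""
    ordered = sorted(scores)
    threshold = ordered[int(tail_quantile * (len(ordered) - 1))]
    excesses = [s - threshold for s in ordered if s > threshold]
    if len(excesses) < 2:
        raise ValueError("need at least two scores above the tail threshold")
    m, v = mean(excesses), variance(excesses)
    if v == 0:
        raise ValueError("excesses must not all be equal")
    ratio = m * m / v
    return {
        "threshold": threshold,
        "shape": 0.5 * (1.0 - ratio),     # method-of-moments estimate of xi
        "scale": 0.5 * m * (1.0 + ratio), # method-of-moments estimate of sigma
        "tail_fraction": len(excesses) / len(ordered),
    }


def normalize(score, params):
    """Map a raw score to [0, 1] as an estimate of P(S <= score)."""
    excess = score - params["threshold"]
    if excess <= 0:
        # Scores below the tail threshold are not modeled; use a common floor
        return 1.0 - params["tail_fraction"]
    shape, scale = params["shape"], params["scale"]
    if abs(shape) < 1e-12:
        survival = math.exp(-excess / scale)
    else:
        base = 1.0 + shape * excess / scale
        # base <= 0 means the score lies beyond the fitted upper endpoint
        survival = base ** (-1.0 / shape) if base > 0 else 0.0
    return 1.0 - params["tail_fraction"] * survival
```

Anomaly detection would then compare `normalize(score, params)` against an anomaly threshold; how that threshold is optimized is outside this sketch.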
-
Publication No.: US20220351023A1
Publication date: 2022-11-03
Application No.: US17867552
Application date: 2022-07-18
Applicant: Oracle International Corporation
Inventor: Pravin Shinde, Felix Schmidt, Onur Kocberber
Abstract: Embodiments use a hierarchy of machine learning models to predict datacenter behavior at multiple hardware levels of a datacenter without accessing operating system generated hardware utilization information. The accuracy of higher-level models in the hierarchy of models is increased by including, as input to the higher-level models, hardware utilization predictions from lower-level models. The hierarchy of models includes: server utilization models and workload/OS prediction models that produce predictions at a server device-level of a datacenter; and also top-of-rack switch models and backbone switch models that produce predictions at higher levels of the datacenter. These models receive, as input, hardware utilization information from non-OS sources. Based on datacenter-level network utilization predictions from the hierarchy of models, the datacenter automatically configures its hardware to avoid any predicted over-utilization of hardware in the datacenter. Also, the predictions from the hierarchy of models can be used to detect anomalies of datacenter hardware behavior.