Abstract:
In one embodiment, a method includes obtaining a set of samples, each of the set of samples including sample values for each of a plurality of variables in a variable space. The method includes receiving, for each of an initial subset of the set of samples, a label for the sample as being either malicious or legitimate; identifying one or more boundaries in the variable space based on the labels and sample values for each of the initial subset; selecting an incremental subset of the unlabeled samples of the set of samples, wherein the incremental subset includes at least one unlabeled sample including sample values further from any of the one or more boundaries than an unlabeled sample that is not included in the incremental subset; and receiving, for each of the incremental subset, a label for the sample as being either malicious or legitimate.
Abstract:
Techniques are presented to identify malware communication with domain generation algorithm (DGA) generated domains. Sample domain names are obtained and labeled as DGA domains, non-DGA domains or suspicious domains. A classifier is trained in a first stage based on the sample domain names. Sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains are obtained to train the classifier in a second stage based on the plurality of sample domain names and the plurality of sample proxy logs. Live traffic proxy logs are obtained and the classifier is tested by classifying the live traffic proxy logs as DGA proxy logs, and the classifier is forwarded to a second computing device to identify network communication of a third computing device as malware network communication with DGA domains via a network interface unit of the third computing device based on the trained and tested classifier.
Abstract:
In one embodiment, a method includes receiving packet flow data at a feature extraction hierarchy comprising a plurality of levels, each of the levels comprising a set of feature extraction functions, computing a first set of feature vectors for the packet flow data at a first level of the feature extraction hierarchy, inputting the first set of feature vectors from the first level of the feature extraction hierarchy into a second level of the feature extraction hierarchy to compute a second set of feature vectors, and transmitting a final feature vector to a classifier to identify malicious traffic. An apparatus and logic are also disclosed herein.
Abstract:
In one embodiment, a security device identifies, from monitored network traffic of one or more users, one or more suspicious domain names as candidate domains, the one or more suspicious domain names identified based on an occurrence of linguistic units used in discovered domain names within the monitored network traffic. The security device may then determine one or more features of the candidate domains, and confirms certain domains of the candidate domains as malicious domains using a parameterized classifier against the one or more features.
Abstract:
A computer-implemented data processing method comprises: executing a recurrent neural network (RNN) comprising nodes each implemented as a Long Short-Term Memory (LSTM) cell and comprising links between nodes that represent outputs of LSTM cells and inputs to LSTM cells, wherein each LSTM cell implements an input layer, hidden layer and output layer of the RNN; receiving network traffic data associated with networked computers; extracting feature data representing features of the network traffic data and providing the feature data to the RNN; classifying individual Uniform Resource Locators (URLs) as malicious or legitimate using LSTM cells of the input layer, wherein inputs to the LSTM cells are individual characters of the URLs, and wherein the LSTM cells generate feature representation; based on the feature representation, generating signals to a firewall device specifying either admitting or denying the URLs.
Abstract:
In an embodiment, a method, performed by processors of a computing device for creating and storing clusters of incident data records based on behavioral characteristic values in the records and origin characteristic values in the records, the method comprising: receiving a plurality of input incident data records comprising sets of attribute values; identifying two or more first incident data records that have a particular behavioral characteristic value; using a malicious incident behavioral data table that maps sets of behavioral characteristic values to identifiers of malicious acts in the network, and a plurality of comparison operations using the malicious incident behavioral data table and the two or more first incident data records, determining whether any of the two or more first incident data records are malicious; and if so, creating a similarity behavioral cluster record that includes the two or more first incident data records.
Abstract:
In one embodiment, a learning machine device initializes thresholds of a data representation of one or more data features, the thresholds specifying a first number of pre-defined bins (e.g., uniform and equidistant bins). Next, adjacent bins of the pre-defined bins having substantially similar weights may be reciprocally merged, the merging resulting in a second number of refined bins that is less than the first number. Notably, while merging, the device also learns weights of a linear decision rule associated with the one or more data features. Accordingly, a data-driven representation for a data-driven classifier may be established based on the refined bins and learned weights.
Abstract:
Techniques are presented to identify malware communication with domain generation algorithm (DGA) generated domains. Sample domain names are obtained and labeled as DGA domains, non-DGA domains or suspicious domains. A classifier is trained in a first stage based on the sample domain names. Sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains are obtained to train the classifier in a second stage based on the plurality of sample domain names and the plurality of sample proxy logs. Live traffic proxy logs are obtained and the classifier is tested by classifying the live traffic proxy logs as DGA proxy logs, and the classifier is forwarded to a second computing device to identify network communication of a third computing device as malware network communication with DGA domains via a network interface unit of the third computing device based on the trained and tested classifier.
Abstract:
Systems and methods of the present disclosure provide technology to identify when network-connected devices are likely infected with malware. Network communications are be monitored during a specific time window and a graph is created for a conditional random field (CRF) model. Vertices of the graph represent devices connected to the network and an edge between two vertices indicates that one or more network communications occurred between two devices represented by the two vertices during the time window. Network devices can report observations about network behavior during the time window and the observations can be used as input for the CRF model. The CRF model can then be used to determine infection-status values for the network devices.
Abstract:
Techniques are presented that identify malware network communications between a computing device and a server utilizing a detector process. Network traffic records are classified as either malware or legitimate network traffic records and divided into groups of classified network traffic records associated with network communications between the computing device and the server for a predetermined period of time. A group of classified network traffic records is labeled as malicious when at least one of the classified network traffic records in the group is malicious and as legitimate when none of the classified network traffic records in the group is malicious to obtain a labeled group of classified network traffic records. A detector process is trained on individual classified network traffic records in the labeled group of classified network traffic records and network communication between the computing device and the server is identified as malware network communication utilizing the detector process.