Abstract:
Email spam that originates from botnets may be identified and prevented by finding similarities in host properties and behavior patterns using a set of labeled data. Clustering models of host properties may be learned from previously identified and appropriately tagged botnet hosts. Given labeled data, each botnet may be examined individually and a clustering model learned over a set of selected host properties. Once a model has been learned for every botnet, the models may be used to look for traffic whose host properties fit a botnet's profile. Such traffic can be either discarded or tagged for subsequent analysis, and can also be used to profile botnets, preventing them from launching other attacks. In addition, models of individual botnets can be further clustered to form superclusters, which can help in understanding botnet behavior and detecting future attacks.
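The per-botnet profiling idea can be illustrated with a minimal sketch. The abstract does not name a clustering algorithm or the host properties used, so everything here is an assumption: each host is a tuple of hypothetical numeric features, and a centroid with a maximum radius stands in for the learned clustering model.

```python
import math

def learn_profile(hosts):
    """Learn a stand-in 'model' for one botnet: centroid plus max radius.

    hosts: list of equal-length numeric feature tuples (hypothetical
    host properties; the abstract does not specify them).
    """
    dims = len(hosts[0])
    centroid = [sum(h[d] for h in hosts) / len(hosts) for d in range(dims)]
    radius = max(math.dist(h, centroid) for h in hosts)
    return centroid, radius

def matching_botnets(profiles, host):
    """Return the names of all botnets whose profile the host fits."""
    return [name for name, (centroid, radius) in profiles.items()
            if math.dist(host, centroid) <= radius]

# Labeled data: botnet name -> feature vectors of its known hosts (made up).
labeled = {
    "botnet_a": [(0.9, 0.1), (1.0, 0.2), (0.8, 0.15)],
    "botnet_b": [(0.1, 0.9), (0.2, 1.0)],
}
profiles = {name: learn_profile(hosts) for name, hosts in labeled.items()}

suspect = matching_botnets(profiles, (0.95, 0.15))
unrelated = matching_botnets(profiles, (0.5, 0.5))
```

A host that matches a profile could then be discarded or tagged for later analysis, as the abstract describes. A real system would likely use a richer model than a single centroid per botnet.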
Abstract:
Dynamic IP addresses may be automatically identified and their dynamics patterns may be analyzed. Multi-user IP address blocks are determined as candidates for further analysis. An entropy score is determined for each IP address in every candidate block to distinguish between a dynamic IP and a static IP shared by multiple users. IP addresses with high entropy scores are grouped, and then analyzed, and may be used in various applications, such as spam filtering.
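The abstract does not say how the entropy score is computed, so the following is only a sketch under stated assumptions. It assumes a hypothetical log of `(ip, user_id)` events for one candidate block and scores each IP by the average Shannon entropy of its users' spread across the block's addresses. Users of a dynamic address tend to appear at many addresses (high entropy), while users behind a shared static address keep appearing at the same one (entropy near zero).

```python
import math
from collections import Counter, defaultdict

def entropy_bits(counts):
    """Shannon entropy in bits of a collection of occurrence counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts)

def ip_entropy_scores(events):
    """Score each IP in a candidate block (assumed scoring rule).

    events: iterable of (ip, user_id) pairs, a hypothetical log format.
    """
    ips_per_user = defaultdict(Counter)
    users_per_ip = defaultdict(set)
    for ip, user in events:
        ips_per_user[user][ip] += 1
        users_per_ip[ip].add(user)
    # How widely each user's activity is spread over the block's IPs.
    user_entropy = {u: entropy_bits(c.values()) for u, c in ips_per_user.items()}
    # An IP's score is the mean entropy of the users observed on it.
    return {ip: sum(user_entropy[u] for u in users) / len(users)
            for ip, users in users_per_ip.items()}

events = [
    ("10.0.0.1", "u1"), ("10.0.0.2", "u1"),   # u1 roams across addresses
    ("10.0.0.1", "u2"), ("10.0.0.2", "u2"),   # so does u2
    ("10.0.0.9", "p1"), ("10.0.0.9", "p2"), ("10.0.0.9", "p3"),  # shared static IP
]
scores = ip_entropy_scores(events)
dynamic_candidates = sorted(ip for ip, s in scores.items() if s >= 0.5)
```

The 0.5-bit threshold is arbitrary and chosen only for the toy data; the abstract gives no threshold.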
Abstract:
A framework may be used for generating URL signatures to identify botnet spam and membership. The framework may take a set of unlabeled emails as input that are grouped based on URLs contained within the emails. The framework may return a set of spam URL signatures and a list of corresponding botnet host IP addresses by analyzing the URLs within the emails that are contained within the groups. Each URL signature may be in the form of either a complete URL string or a URL regular expression. The signatures may be used to identify spam emails launched from botnets, while the knowledge of botnet host identities can help filter other spam emails also sent by them.
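A rough sketch of the grouping-and-signature flow follows. The details are assumptions rather than the framework's actual procedure: emails are reduced to hypothetical `(sender_ip, url)` pairs, grouped by URL domain, and a group only yields a signature if at least `min_senders` distinct IPs sent it. A group with one distinct URL produces a complete URL string; otherwise a simple common-prefix regular expression is generated.

```python
import re
from collections import defaultdict
from os.path import commonprefix
from urllib.parse import urlparse

def build_signatures(emails, min_senders=2):
    """Return a list of (signature, is_regex, sender_ips) tuples.

    emails: iterable of (sender_ip, url) pairs (hypothetical input format).
    min_senders: assumed threshold for treating a group as botnet-like.
    """
    groups = defaultdict(lambda: {"urls": set(), "ips": set()})
    for sender_ip, url in emails:
        domain = urlparse(url).netloc
        groups[domain]["urls"].add(url)
        groups[domain]["ips"].add(sender_ip)

    result = []
    for group in groups.values():
        if len(group["ips"]) < min_senders:
            continue
        urls = sorted(group["urls"])
        if len(urls) == 1:
            signature, is_regex = urls[0], False          # complete URL string
        else:
            # Escaped common prefix followed by a wildcard tail.
            signature, is_regex = re.escape(commonprefix(urls)) + r".*", True
        result.append((signature, is_regex, sorted(group["ips"])))
    return result

emails = [
    ("1.1.1.1", "http://pills.example/buy?id=17"),
    ("2.2.2.2", "http://pills.example/buy?id=93"),
    ("3.3.3.3", "http://news.example/story"),
]
signatures = build_signatures(emails)
```

The returned IP lists correspond to the botnet host identities mentioned in the abstract. A production system would need far stricter regular expressions than a prefix wildcard to avoid matching legitimate URLs.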
Abstract:
Mechanisms are disclosed for incorporating prototype information into probabilistic models for automated information processing, mining, and knowledge discovery. Examples of these models include Hidden Markov Models (HMMs), Latent Dirichlet Allocation (LDA) models, and the like. The prototype information injects prior knowledge into such models, thereby rendering them more accurate, effective, and efficient. For instance, in the context of automated word labeling, additional knowledge is encoded into the models by providing a small set of prototypical words for each possible label. The net result is that words in a given corpus are labeled and are therefore ready to be summarized, identified, classified, clustered, and the like.
Abstract:
A method for combining multiple probability-of-click models in an online advertising system into a combined predictive model. The method commences by receiving a feature set slice (e.g., corresponding to demographics, taxonomies, or clusters) and using the sliced data to train multiple slice-wise predictive models. The trained slice-wise predictive models are combined by overlaying a weighted distribution model over them. The combined predictive model is then used to predict the probability of a click given a query-advertisement pair in online advertising. The method can flexibly receive slice specifications, and can overlay any one or more of a variety of distribution models, such as a linear combination or a log-linear combination. Using an appropriate weighted distribution model, the combined predictive model reliably yields predictive estimates of the occurrence of click events that are at least as good as those of the best predictive model in the slice-wise predictive model set.
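The two combination schemes named in the abstract can be sketched as follows. This is an illustration under assumptions, not the patented method: the slice-wise probabilities and weights are made-up values, the weights are assumed to be non-negative and sum to 1, and the log-linear form shown is a normalized weighted geometric mean, which is one common way to define it.

```python
import math

def linear_combination(probs, weights):
    """Weighted arithmetic mean of slice-wise click probabilities."""
    return sum(w * p for w, p in zip(weights, probs))

def log_linear_combination(probs, weights):
    """Weighted geometric mean of click and no-click probabilities,
    renormalized so that the two outcomes sum to 1.
    Requires every probability to lie strictly between 0 and 1.
    """
    click = math.exp(sum(w * math.log(p) for w, p in zip(weights, probs)))
    no_click = math.exp(sum(w * math.log(1 - p) for w, p in zip(weights, probs)))
    return click / (click + no_click)

# Hypothetical outputs of three slice-wise models for one query-ad pair.
slice_probs = [0.02, 0.05, 0.03]
weights = [0.5, 0.3, 0.2]   # assumed to be learned elsewhere

p_linear = linear_combination(slice_probs, weights)
p_log_linear = log_linear_combination(slice_probs, weights)
```

With weights that sum to 1, both combinations stay between the smallest and largest slice-wise estimate. How the weights are fitted is outside the scope of this sketch.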
Abstract:
A computer-implemented method and system for selecting a subject advertisement in a sponsored search system based on a user's commercial intent (pertaining to the subject advertisement), using techniques for determining intent-driven clicks from a historical database. The method includes steps for aggregating a training model dataset that contains a selected history of clicks. The method then selects from the training model dataset a clicked slate (a further selection of clicks) comprising a set of clicked ads, and calculates an intent-driven click feedback value for the subject advertisement. The method includes techniques for selecting a clicked slate using features corresponding to clicks received within a particular time period (the time period being determined statically or dynamically). A system for implementing the method aggregates data from a historical database using selectors such as a position selector, a click feature selector, an impression-advertiser-campaign-creative selector, and a commercial intent selector.