Abstract:
A hierarchical document classification system is disclosed. The system includes a text-based document classifier model for classifying an input electronic document into one of a set of predefined document categories. The system further includes an image-based metadata identification model for classifying electronic documents of a particular document category into a set of metadata categories. The system further includes a fuzzy text matcher for supplementing classification accuracy of the image-based metadata identification model to obtain a metadata category for the input electronic document.
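The role of the fuzzy text matcher can be illustrated with a minimal sketch. It assumes the image-based model returns a probability per metadata category and that OCR text of the document is available. The keyword table, the `weight` parameter, and the function names are hypothetical and are not taken from the disclosure. Python's standard `difflib.SequenceMatcher` stands in for whatever fuzzy matching technique the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical metadata categories and keywords (illustrative only).
METADATA_KEYWORDS = {
    "invoice_vendor_a": ["acme corporation", "invoice number"],
    "invoice_vendor_b": ["globex industries", "bill to"],
}


def fuzzy_score(keyword: str, text: str) -> float:
    """Best similarity between the keyword and any same-length word window of the text."""
    words = text.lower().split()
    n = len(keyword.split())
    windows = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return max(SequenceMatcher(None, keyword, w).ratio() for w in windows)


def refine_metadata_category(image_probs: dict, ocr_text: str, weight: float = 0.5) -> str:
    """Blend image-model probabilities with fuzzy keyword evidence (assumed scheme)."""
    combined = {}
    for category, prob in image_probs.items():
        keywords = METADATA_KEYWORDS.get(category, [])
        text_score = max((fuzzy_score(k, ocr_text) for k in keywords), default=0.0)
        combined[category] = (1 - weight) * prob + weight * text_score
    return max(combined, key=combined.get)
```

In this sketch, a noisy OCR string such as "Acrne Corporaton Invoice Nurnber 1234" still matches the keywords of the first category closely enough to override a slightly higher image-model probability for the second.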
Abstract:
Conventional methods of analyzing social media content involve performing sentiment analysis to understand the sentiment related to events and the effects of those events on communities. However, such analysis may not be completely accurate and is prone to errors. The present disclosure provides a system and method that identify and analyze risk events from data collected from various sources. Key phrases obtained from the sources are received, pre-processed, and clustered. The clustering is performed based on the frequency of incoming words. The clustered dataset is classified into one or more categories based on a polarity score. The dataset of a specific category (e.g., the negative category dataset) is analyzed to identify events and topics, which are then grouped using an associated label to obtain grouped entities. Each entity is then ranked and assigned a risk score to identify high-risk events, which are then analyzed using simulation and optimization technique(s), and an explainability text for the analyzed risk events is generated.
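The polarity classification and risk ranking steps might be sketched as follows. This is only an illustration under stated assumptions: the toy lexicon, the labels, and the scoring rule (summing the magnitude of negative polarity per label) are hypothetical, and the frequency-based clustering, simulation, and explainability steps are not covered.

```python
from collections import defaultdict

# Hypothetical toy lexicon; a real system would likely use a trained sentiment model.
POLARITY = {"flood": -0.8, "strike": -0.6, "shortage": -0.7, "festival": 0.6, "award": 0.7}


def polarity_score(phrase: str) -> float:
    """Mean lexicon polarity of the known words in a phrase (0.0 if none are known)."""
    scores = [POLARITY[w] for w in phrase.lower().split() if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def rank_risk_events(labeled_phrases):
    """Take (label, phrase) pairs and return [(label, risk_score)], highest risk first."""
    totals = defaultdict(float)
    for label, phrase in labeled_phrases:
        score = polarity_score(phrase)
        if score < 0:  # keep only the negative-category phrases
            totals[label] += -score  # assumed risk score: accumulated negativity
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Phrases with a non-negative polarity score drop out before grouping, so only labels backed by negative content receive a risk score.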
Abstract:
Systems and methods are disclosed for identifying associations between binary samples, such as e-mail files and their attachments or a document and an executable program associated with the document. In one implementation, the method includes receiving a plurality of binary samples, and extracting metadata from the plurality of binary samples. The metadata for a binary sample from the plurality of binary samples includes a set of attributes of the binary sample. The method further includes identifying a set of associations between the plurality of binary samples based on the extracted metadata. Each association is characterized by at least one attribute the associated binary samples have in common, and each association has a confidence level indicative of a strength of the association. The method also includes identifying associations with a confidence level that exceeds a predefined threshold.
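A minimal sketch of the association step is given below, assuming each sample's metadata has already been extracted into a dictionary of attributes. The confidence formula (shared attribute values divided by the number of distinct attribute names in the pair) and all names are assumptions made for illustration. The disclosure does not state how confidence is computed.

```python
from itertools import combinations


def find_associations(samples: dict, threshold: float = 0.5) -> list:
    """Return (name_a, name_b, shared_attributes, confidence) for pairs above the threshold.

    `samples` maps a sample name to a dict of extracted metadata attributes.
    """
    results = []
    for (name_a, attrs_a), (name_b, attrs_b) in combinations(samples.items(), 2):
        # Attributes present in both samples with the same value.
        shared = {k for k in attrs_a.keys() & attrs_b.keys() if attrs_a[k] == attrs_b[k]}
        if not shared:
            continue  # an association needs at least one common attribute
        all_keys = attrs_a.keys() | attrs_b.keys()
        confidence = len(shared) / len(all_keys)  # assumed confidence measure
        if confidence > threshold:
            results.append((name_a, name_b, shared, confidence))
    return results
```

Comparing every pair is quadratic in the number of samples, which is acceptable for a sketch but would likely need indexing by attribute value at scale.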
Abstract:
A device for categorizing data sets obtained from a number of sources comprises a symbol frequency determining unit (24) that determines the frequency of appearance of symbols in a first collection of data sets and the frequency of appearance of symbols in a second collection of data sets, a significance determining unit (26) that determines the most significant symbols for the second collection based on the frequency of appearance in the first collection and the frequency of appearance in the second collection, a grouping unit (28) that groups the most significant symbols into groups according to their appearance in the same data set and a ranking unit (30) that ranks the data sets in relation to the symbol groups according to a ranking scheme.
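The four units can be approximated in a short sketch, assuming each data set is a list of symbols (for example, words). The significance measure (ratio of relative frequencies with a small smoothing floor), the grouping rule, and the ranking scheme (count of group symbols contained in a data set) are illustrative assumptions, since the abstract does not specify them.

```python
from collections import Counter


def symbol_frequencies(collection):
    """Relative frequency of each symbol across a collection of data sets."""
    counts = Counter(sym for data_set in collection for sym in data_set)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}


def most_significant(first, second, top_n=3):
    """Symbols over-represented in `second` relative to `first` (assumed measure)."""
    f1, f2 = symbol_frequencies(first), symbol_frequencies(second)
    floor = 1e-6  # smoothing for symbols absent from the first collection
    ratio = {sym: f / max(f1.get(sym, 0.0), floor) for sym, f in f2.items()}
    return sorted(ratio, key=lambda s: (-ratio[s], s))[:top_n]


def group_symbols(second, significant):
    """Group significant symbols that appear together in the same data set."""
    groups = set()
    for data_set in second:
        present = frozenset(s for s in significant if s in data_set)
        if len(present) > 1:
            groups.add(present)
    return groups


def rank_data_sets(second, group):
    """Indices of data sets ordered by how many symbols of the group they contain."""
    return sorted(range(len(second)), key=lambda i: -len(group & set(second[i])))
```

Here the first collection acts as a background reference, so common symbols that occur at a similar rate in both collections score low and are not selected as significant.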