Abstract:
A data analysis system is proposed for providing fine-grained low latency access to high volume input data from possibly multiple heterogeneous input data sources. The input data is parsed, optionally transformed, indexed, and stored in a horizontally-scalable key-value data repository where it may be accessed using low latency searches. The input data may be compressed into blocks before being stored to minimize storage requirements. The results of searches present input data in its original form. The input data may include access logs, call data records (CDRs), e-mail messages, etc. The system allows a data analyst to efficiently identify information of interest in a very large dynamic data set up to multiple petabytes in size. Once information of interest has been identified, that subset of the large data set can be imported into a dedicated or specialized data analysis system for an additional in-depth investigation and contextual analysis.
Abstract:
Systems and methods are provided for analyzing entity performance. In one implementation, a method is provided that includes accessing a data structure comprising a plurality of interactions associated with multiple entities. The method also includes evaluating one or more interactions of the plurality of interactions associated with a consuming entity of the multiple entities. The method further includes determining whether the one or more interactions associated with the consuming entity comprise an identified location information of the consuming entity.
Abstract:
Computer implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a respective entity by grouping the records into pairs, analyzing the respective pairs to determine a probability that both members of the pair relate to a common entity, and identifying a cluster of overlapping pairs to generate a collection of records relating to a common entity. Clusters may further be analyzed to determine canonical names or other properties for the respective entities by analyzing record fields and identifying similarities.