Abstract:
Techniques for optimizing program code through property merging are described. In an embodiment, a compiler identifies, from a plurality of properties of a particular data object that are referenced by the program code, one or more candidate sets of properties that are eligible for merging. For a respective candidate set of the one or more candidate sets of properties, the compiler determines whether to merge different properties of the particular data object that belong to the respective candidate set. After determining to merge the different properties, a particular data structure is generated, within the memory of a computing device, that stores the different properties of the particular data object that belong to the respective candidate set.
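The identify/decide/generate steps can be sketched as follows. This is only an illustration under stated assumptions, not the claimed method: the eligibility rule (properties referenced at exactly the same access sites), the `min_sites` heuristic, and all function names are hypothetical, and properties are modeled as per-node columns.

```python
def find_candidate_sets(access_sites):
    """Group properties that are referenced at exactly the same access sites.

    access_sites: list of sets of property names, one set per loop body
    (a hypothetical stand-in for what a compiler would extract from the code).
    """
    sites_by_prop = {}
    for index, props in enumerate(access_sites):
        for prop in props:
            sites_by_prop.setdefault(prop, set()).add(index)
    groups = {}
    for prop, sites in sites_by_prop.items():
        groups.setdefault(frozenset(sites), []).append(prop)
    # Only groups with at least two properties are worth merging.
    return [sorted(group) for group in groups.values() if len(group) > 1]


def should_merge(candidate, access_sites, min_sites=2):
    """Assumed heuristic: merge when enough sites touch every property in the set."""
    shared = sum(1 for props in access_sites if set(candidate) <= props)
    return shared >= min_sites


def merge_properties(columns, candidate):
    """Build one array of tuples from the separate per-property arrays."""
    return list(zip(*(columns[prop] for prop in candidate)))


access_sites = [{"rank", "degree"}, {"rank", "degree"}, {"label"}]
columns = {"rank": [0.1, 0.2], "degree": [3, 5], "label": ["a", "b"]}
for candidate in find_candidate_sets(access_sites):
    if should_merge(candidate, access_sites):
        merged = merge_properties(columns, candidate)
```

Storing the merged properties side by side means that a loop reading both `rank` and `degree` touches one array instead of two, which is the usual motivation for this kind of layout change.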
Abstract:
Techniques are provided for a graph database system that accepts custom graph analytic programs that are written in a high-level graph-specific programming language and compiles the programs into executables that, when executed, directly access graph data of a graph that is stored in the graph database. In this way, a low-level data-access API is avoided. Also, a graph analytic program, which provides only an abstract description of an algorithm, does not include any details regarding data access. In one technique, a user is not required to include explicit parallelization in a graph analytic program in order for the graph analytic program to take advantage of parallelization. A compiler of the graph database system identifies portions of the graph analytic program that can benefit from parallelization and, in response, generates parallelized executable code that corresponds to those portions.
Abstract:
Techniques for identifying common neighbors of two nodes in a graph are provided. One technique involves performing a binary split search and/or a linear search. Another technique involves creating a segmenting index for a first neighbor list. A second neighbor list is scanned and, for each node indicated in the second neighbor list, the segmenting index is used to determine whether the node is also indicated in the first neighbor list. Techniques are also provided for counting the number of triangles. One technique involves pruning nodes from neighbor lists based on the node values of the nodes whose neighbor lists are being pruned. Another technique involves sorting the nodes in a node array (and, thus, their respective neighbor lists) based on the nodes' respective degrees prior to identifying common neighbors. In this way, when pruning the neighbor lists, the neighbor lists of the highly connected nodes are significantly reduced.
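A minimal sketch of two of these ideas, the linear search over sorted neighbor lists and triangle counting with degree-based ordering and pruning, is shown below. It assumes an undirected graph given as a dictionary of neighbor lists. The function names and the choice of ascending degree order are assumptions made for illustration, and the segmenting index and binary split search are not shown.

```python
def common_neighbors(first, second):
    """Linear scan over two sorted neighbor lists, returning shared entries."""
    i = j = 0
    shared = []
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            shared.append(first[i])
            i += 1
            j += 1
        elif first[i] < second[j]:
            i += 1
        else:
            j += 1
    return shared


def count_triangles(adj):
    """Count triangles in an undirected graph {node: [neighbors]}."""
    # Renumber nodes by ascending degree so that highly connected nodes
    # receive the largest values (ties broken by original node ID).
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: r for r, v in enumerate(order)}
    # Prune: each node keeps only neighbors with a larger value, which
    # shrinks the lists of high-degree nodes the most.
    pruned = {
        rank[v]: sorted(rank[u] for u in adj[v] if rank[u] > rank[v])
        for v in adj
    }
    total = 0
    for v, neighbors in pruned.items():
        for u in neighbors:
            total += len(common_neighbors(neighbors, pruned[u]))
    return total
```

Because every pruned list contains only larger-valued nodes, each triangle is found exactly once, from its lowest-valued corner.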
Abstract:
Embodiments perform real-time vertex connectivity checks in graph data representations via a multi-phase search process. This process includes an efficient first search phase using landmark connectivity data that is generated during a preprocessing phase. Landmark connectivity data maps the connectivity of a set of identified landmarks in a graph to other vertices in the graph. Upon determining that the subject vertices are not closely related via landmarks, embodiments implement a second search phase that performs a brute-force search for connectivity, between the subject vertices, among the graph's non-landmark vertices. This brute-force search prevents exploration of cyclical paths by recording the vertices on a currently-explored path in a stack data structure. The second search phase is automatically aborted upon detecting that the non-landmark vertices in the graph are over a threshold density. In this case, embodiments perform a third search phase involving either a modified breadth-first search or modified bidirectional search.
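The first two phases can be sketched as follows, under simplifying assumptions that are not taken from the abstract: the graph is directed and given as a dictionary of successor lists, the preprocessing records the full forward and backward reachability of every landmark, and the density-triggered third phase is omitted. All names are hypothetical.

```python
from collections import deque


def reach(adj, start):
    """Return the set of vertices reachable from start, including start."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen


def preprocess(adj, landmarks):
    """Record what each landmark can reach and what can reach each landmark."""
    reverse = {}
    for v, successors in adj.items():
        for u in successors:
            reverse.setdefault(u, []).append(v)
    forward = {l: reach(adj, l) for l in landmarks}
    backward = {l: reach(reverse, l) for l in landmarks}
    return forward, backward


def connected(adj, landmarks, forward, backward, src, dst):
    landmarks = set(landmarks)
    if src == dst:
        return True
    # Phase 1: src reaches some landmark that in turn reaches dst.
    for l in landmarks:
        if src in backward[l] and dst in forward[l]:
            return True
    # Phase 2: brute-force search among non-landmark vertices only. The
    # stack holds the currently explored path, so cycles are never followed.
    path = [src]
    on_path = {src}

    def dfs(v):
        for u in adj.get(v, ()):
            if u == dst:
                return True
            if u in landmarks or u in on_path:
                continue
            path.append(u)
            on_path.add(u)
            if dfs(u):
                return True
            on_path.discard(path.pop())
        return False

    return dfs(src)
```

With full landmark reachability, a failed first phase rules out every path through a landmark, which is why the second phase can skip landmarks entirely in this sketch.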
Abstract:
Techniques are described herein for automatically generating, from a high-level graph processing language, a multi-source breadth-first search (MS-BFS) that can be executed in a distributed computing environment. In an embodiment, a method involves a computer analyzing original software instructions. The original software instructions are configured to perform multiple breadth-first searches to determine a particular result. Each breadth-first search originates at a respective vertex of a subset of vertices of a graph. Each breadth-first search is encoded for independent execution. Based on the analyzing, the computer generates transformed software instructions configured to perform an MS-BFS to determine the particular result. Each of the subset of vertices is a source of the MS-BFS. In an embodiment, the second plurality of software instructions comprises a node iteration loop and a neighbor iteration loop, and the plurality of vertices of the distributed graph comprise active vertices and neighbor vertices. The node iteration loop is configured to iterate once per active vertex of the plurality of vertices of the distributed graph, and the node iteration loop is configured to determine the particular result. The neighbor iteration loop is configured to iterate once per active vertex of the plurality of vertices of the distributed graph, and each iteration of the neighbor iteration loop is configured to activate one or more neighbor vertices of the plurality of vertices for the following iteration of the neighbor iteration loop.
Abstract:
Techniques are described herein for automatic generation of multi-source breadth-first search (MS-BFS) from a high-level graph processing language. In an embodiment, a method involves a computer analyzing original software instructions. The original software instructions are configured to perform multiple breadth-first searches to determine a particular result. Each breadth-first search originates at a respective vertex of a subset of vertices of a graph. Each breadth-first search is encoded for independent execution. Based on the analyzing, the computer generates transformed software instructions configured to perform an MS-BFS to determine the particular result. Each of the subset of vertices is a source of the MS-BFS. In an embodiment, parallel execution of the MS-BFS is regulated with batches of vertices. In an embodiment, the original software instructions are expressed in the Green-Marl graph analysis language. In an embodiment, the transformed software instructions are expressed in a general purpose programming language such as C, C++, Python, or Java.
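To make the target of the transformation concrete, the sketch below shows what a hand-written MS-BFS might look like. It is not the generated code described above: it assumes the common bitmask formulation in which bit i of a vertex's mask means "source i has visited this vertex", it assumes every vertex appears as a key of `adj`, and `ms_bfs`, `ms_bfs_batched`, and `batch_size` are hypothetical names.

```python
def ms_bfs(adj, sources):
    """Return a list of {vertex: hop distance} dicts, one per source."""
    seen = {v: 0 for v in adj}          # per-vertex bitmask of visiting sources
    frontier = {}                       # vertex -> sources that just reached it
    for i, s in enumerate(sources):
        seen[s] |= 1 << i
        frontier[s] = frontier.get(s, 0) | (1 << i)
    dist = [{s: 0} for s in sources]
    level = 0
    while frontier:
        level += 1
        next_frontier = {}
        for v, mask in frontier.items():
            for u in adj[v]:
                new = mask & ~seen[u]   # sources reaching u for the first time
                if new:
                    next_frontier[u] = next_frontier.get(u, 0) | new
        for u, mask in next_frontier.items():
            seen[u] |= mask
            i = 0
            while mask:
                if mask & 1:
                    dist[i][u] = level
                mask >>= 1
                i += 1
        frontier = next_frontier
    return dist


def ms_bfs_batched(adj, sources, batch_size=64):
    """Run MS-BFS over the sources one batch at a time to bound mask width."""
    dist = []
    for start in range(0, len(sources), batch_size):
        dist.extend(ms_bfs(adj, sources[start:start + batch_size]))
    return dist
```

The benefit over independent searches is that an edge is scanned once per level for all sources that share the frontier vertex, rather than once per source.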
Abstract:
Techniques herein are for navigation data structures for graph traversal. In an embodiment, navigation data structures that a computer stores include: a source vertex array of vertices; a neighbor array of dense identifiers of target vertices terminating edges; a bidirectional map associating, for each vertex, a sparse identifier of the vertex with a dense identifier of the vertex; and a vertex array containing, when a dense identifier of a source vertex is used as an offset, a pair of offsets defining an offset range, for use with the neighbor array. The source vertex array, using the dense identifier of a particular vertex as an offset, contains an offset, into the neighbor array, of a target vertex terminating an edge originating at the particular vertex. The neighbor array contiguously stores dense identifiers of target vertices terminating edges originating from a same source vertex.
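These structures resemble a compressed sparse row (CSR) layout, and a minimal sketch under that assumption follows. The names `build_csr` and `out_neighbors` are hypothetical, the bidirectional map is modeled as a dictionary plus a list, and a single `offsets` array stands in for the vertex array: the pair `offsets[d]`, `offsets[d + 1]` is the offset range for dense identifier `d`.

```python
def build_csr(edges):
    """Build navigation arrays from (sparse_source, sparse_target) edge pairs."""
    sparse_to_dense = {}
    dense_to_sparse = []

    def dense(sparse_id):
        # Assign dense identifiers 0, 1, 2, ... in order of first appearance.
        if sparse_id not in sparse_to_dense:
            sparse_to_dense[sparse_id] = len(dense_to_sparse)
            dense_to_sparse.append(sparse_id)
        return sparse_to_dense[sparse_id]

    dense_edges = [(dense(s), dense(t)) for s, t in edges]
    n = len(dense_to_sparse)
    # offsets[d] .. offsets[d + 1] bounds the neighbors of dense vertex d.
    offsets = [0] * (n + 1)
    for s, _ in dense_edges:
        offsets[s + 1] += 1
    for d in range(n):
        offsets[d + 1] += offsets[d]
    # Targets of the same source are stored contiguously.
    neighbors = [0] * len(dense_edges)
    cursor = offsets[:-1]
    for s, t in dense_edges:
        neighbors[cursor[s]] = t
        cursor[s] += 1
    return sparse_to_dense, dense_to_sparse, offsets, neighbors


def out_neighbors(csr, sparse_id):
    """Return the sparse identifiers of the targets of one source vertex."""
    sparse_to_dense, dense_to_sparse, offsets, neighbors = csr
    d = sparse_to_dense[sparse_id]
    return [dense_to_sparse[t] for t in neighbors[offsets[d]:offsets[d + 1]]]
```

For example, `out_neighbors(build_csr([(1000, 42), (1000, 7), (42, 7)]), 1000)` walks one contiguous slice of the neighbor array and maps each dense identifier back to its sparse form.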
Abstract:
An analyzer (such as a compiler) searches for a program portion that matches a pattern that may suffer from workload imbalance due to nodes with high degrees (i.e., relatively many edges). Such a pattern involves iteration over at least a subset (or all) of the nodes in a graph. If a program portion that matches the pattern is found, then the analyzer determines whether the body of the iteration contains an iteration over edges or neighbors of each node in the subset. If so, then the analyzer transforms the graph analytic program by adding code and, optionally, modifying existing code so that high-degree nodes are processed differently than low-degree nodes. High-degree nodes are processed sequentially while low-degree nodes are processed in parallel. Conversely, edges of high-degree nodes are processed in parallel while edges of low-degree nodes are processed sequentially.
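The shape of the transformed program can be illustrated with a small example that sums a value over each node's neighbors. This is a hand-written sketch, not analyzer output: `neighbor_sums`, `threshold`, `workers`, and `chunk` are hypothetical names, neighbor lists are assumed to be Python lists, and a thread pool is used only to show the schedule (CPython threads will not speed up this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def neighbor_sums(adj, value, threshold=4, workers=4, chunk=2):
    """For each node, sum value[u] over its neighbors u."""
    low = [v for v in adj if len(adj[v]) <= threshold]
    high = [v for v in adj if len(adj[v]) > threshold]
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Low-degree nodes: nodes run in parallel, and each task walks its
        # own short edge list sequentially.
        sums = pool.map(lambda v: sum(value[u] for u in adj[v]), low)
        result.update(zip(low, sums))
        # High-degree nodes: nodes are visited one at a time, and each long
        # edge list is split into chunks that run in parallel.
        for v in high:
            neighbors = adj[v]
            chunks = [neighbors[i:i + chunk]
                      for i in range(0, len(neighbors), chunk)]
            partial = pool.map(lambda c: sum(value[u] for u in c), chunks)
            result[v] = sum(partial)
    return result
```

Without the split, a single task would own the whole edge list of a hub node and the remaining workers would sit idle while it finished.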
Abstract:
Techniques for efficiently loading graph data into memory are provided. A plurality of node ID lists are retrieved from storage. Each node ID list is ordered based on one or more order criteria, such as node ID, and is read into memory. A new list of node IDs is created in memory and is initially empty. From among the plurality of node ID lists, a particular node ID is selected based on the one or more order criteria, removed from the node ID list where the particular node ID originates, and added to the new list. This process of selecting, removing, and adding continues until no more than one node ID list exists, other than the new list. In this way, the retrieval of the plurality of node ID lists from storage may be performed in parallel while the selecting and adding are performed sequentially.
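A minimal sketch of this load-then-merge pattern follows. It assumes each list is sorted by node ID, uses a caller-supplied `read_list` function as a stand-in for storage access, and uses a heap to pick the next node ID. All of those, along with the name `load_and_merge`, are illustrative choices rather than details from the description. Instead of physically removing IDs from the lists, the sketch advances a position per list.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def load_and_merge(read_list, list_keys):
    """Read sorted node ID lists in parallel, then merge them sequentially."""
    # Retrieval from storage runs in parallel, one task per list.
    with ThreadPoolExecutor() as pool:
        lists = list(pool.map(read_list, list_keys))
    # Each heap entry is (next node ID, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    # Select, remove, and add until at most one source list has IDs left.
    while len(heap) > 1:
        node_id, i, pos = heapq.heappop(heap)
        merged.append(node_id)
        if pos + 1 < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    # The remainder of the last list is already in order.
    if heap:
        _, i, pos = heap[0]
        merged.extend(lists[i][pos:])
    return merged
```

Stopping when a single list remains avoids per-element comparisons for its tail, which can simply be appended to the new list.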