G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph
- Resource Type
- Conference
- Authors
- Yan, Da; Guo, Guimu; Chowdhury, Md Mashiur Rahman; Özsu, M. Tamer; Ku, Wei-Shinn; Lui, John C. S.
- Source
- 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1369-1380, Apr 2020
- Subject
- Computing and Processing
Task analysis
Data mining
Time complexity
Programming
Scheduling
Throughput
Social network services
graph mining
subgraph-centric
CPU-bound
compute-intensive
clique
triangle
subgraph matching
- Language
- ISSN
- 2375-026X
Mining subgraphs that satisfy certain conditions from a big graph is useful in many applications, such as community detection and subgraph matching. These problems have high time complexity, but existing systems that scale them are all IO-bound in execution. We propose G-thinker, the first truly CPU-bound distributed framework, which adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs effectively overlap communication with computation to minimize CPU idle time. Extensive experiments demonstrate that G-thinker achieves orders-of-magnitude speedups even over the fastest existing subgraph-centric system, and that it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
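
The subgraph-centric, vertex-pulling idea in the abstract can be illustrated with a minimal single-process sketch. This is not G-thinker's actual API (the real framework is distributed and its interface is described in its own documentation); the names `VertexCache`, `TriangleTask`, `pull`, and `count_triangles` are hypothetical and chosen only for illustration. Each task starts from one seed vertex, pulls the adjacency lists it needs through a shared cache, and counts the triangles whose smallest vertex is the seed.

```python
from collections import deque

# Toy undirected graph as adjacency sets; it stands in for a partitioned big graph.
GRAPH = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}


class VertexCache:
    """Hypothetical stand-in for a vertex cache: adjacency lists are fetched on demand."""

    def __init__(self, graph):
        self._graph = graph
        self._cached = {}
        self.fetches = 0  # number of cache misses

    def pull(self, v):
        if v not in self._cached:
            # In a distributed setting this miss would be a remote request.
            self.fetches += 1
            self._cached[v] = self._graph[v]
        return self._cached[v]


class TriangleTask:
    """One task per seed vertex; counts triangles whose smallest vertex is the seed."""

    def __init__(self, seed):
        self.seed = seed

    def compute(self, cache):
        # Only consider neighbours larger than the seed so each triangle is counted once.
        larger = {u for u in cache.pull(self.seed) if u > self.seed}
        count = 0
        for u in larger:
            # Pull u's adjacency list and look for a third vertex w > u adjacent to the seed.
            count += sum(1 for w in cache.pull(u) if w > u and w in larger)
        return count


def count_triangles(graph):
    """Run every task from a simple FIFO queue and return (triangles, cache misses)."""
    cache = VertexCache(graph)
    queue = deque(TriangleTask(v) for v in sorted(graph))
    total = 0
    while queue:
        total += queue.popleft().compute(cache)
    return total, cache.fetches


if __name__ == "__main__":
    print(count_triangles(GRAPH))
```

In this sketch the tasks run one after another; the point is only that a task's logic is written against the vertices it pulls, while the cache decides whether a pull needs a fetch. A real system would run many such tasks concurrently so that pending fetches overlap with computation.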