图随机游走作为图数据分析的一个重要工具和基础性组件,在图计算、图挖掘、图嵌入等领域得到了广泛应用。随着图随机游走成为新的研究热点,许多复杂的动态随机游走算法相继被提出,这些算法在针对不同的应用场景实现更为灵活多变的游走策略的同时,其动态的特性也极大地增加了计算的难度。对于这些动态随机游走极高的算法复杂度,目前缺乏一个通用且高效的计算方案。同时,面对信息时代大规模的图数据及随之而来的巨量计算任务,业界缺乏一个能将随机游走计算任务有效地扩展到分布式环境中的通用随机游走系统。此外,图随机游走作为一个数据密集型的计算任务,其游走的随机特性带来的大量不规则的内存访问,以及这些访问之间的强数据依赖性,让内存访问成为了计算的一大瓶颈。基于上述观察,本文从大规模图随机游走的算法复杂度、可扩展性和访存效率三个方向开展研究。本文的主要创新成果包括: (1)针对大图上动态随机游走算法复杂度过高的问题,本文提出了统一的边转移概率定义,继而从该定义出发设计了一个基于拒绝采样的通用算法框架,将动态随机游走中每步游走的算法复杂度从线性降为接近于常数,同时支持精确的采样。实验表明,本文的算法框架可以将常用的动态随机游走node2vec的每步所需计算转移概率的平均边数从数百至十万降低为1左右。 (2)针对大规模图随机游走的可扩展性问题,本文设计了支持分布式环境下图随机游走高效计算的通用计算框架KnightKing。针对大规模静态的图数据和大量不断动态迁移的游走者,本文提出了以游走者为中心的编程模型和一系列系统优化技术,在融合了上文提到的高效算法框架的同时,将复杂的算法优化和系统实现的细节隐藏起来,让用户能轻松地实现已有的或新的随机游走算法。实验表明,相比于基于现有算法在图计算系统Gemini上的实现,KnightKing达到了最多四个数量级的性能提升。 (3)针对大图上随机游走的访存效率问题,本文设计了图随机游走系统FlashMob,通过数据的分片、重排和批量处理,尽可能地挖掘随机游走计算中的时间和空间局部性。通过将数据访问模式变得更为顺序和规则,FlashMob实现了对cache和DRAM的充分利用。实验表明,在每步游走速度上,FlashMob处理58GB的真实世界大图的效率超过了现有系统处理能放进L2 cache的小图的效率。
As an important tool and a fundamental component of graph data analysis, graph random walk has been widely used in graph computation, graph mining, and graph embedding, attracting increasing interest from both industry and academia. Recently, many sophisticated dynamic random walk algorithms have been proposed to achieve more flexible, application-specific walks, at the cost of computation difficulty. Existing solutions deal with these random walk algorithms individually, suffering from the high computational complexity brought by their dynamic nature. Meanwhile, there lacks a general solution that can effectively scale the random walk to a distributed environment, so that the huge graphs and their massive random walks that are common in the information age can be processed within reasonable time. Moreover, as a data-intensive application, random walk incurs a lot of memory accesses. Their irregularity and strong data dependency, coming from the random nature of walking on large graphs, makes the memory accesses a major bottleneck in local computing. Based on the above observations, this paper improves random walk computing from three aspects: computational complexity, scalability, and memory access efficiency. The main innovations of this paper are listed below. (1) To address the high computational complexity of dynamic random walk, this paper proposes a unified edge transition probability definition that applies across popular known algorithms and a novel rejection-based algorithmic framework that dramatically reduces the cost of expensive dynamic random walk algorithms. Unlike previous solutions with linear complexity or approximation, it can achieve near ?(1) complexity in exact edge sampling. Our evaluation on node2vec, a popular dynamic random walk, demonstrates that with our algorithmic framework, the sampling cost, measured by the average number of calculating per-edge transition probabilities needed for walking one step, can be reduced from hundreds or even more than 100 thousand to about 1. (2) To address the scalability issue, this paper proposes KnightKing, the first generalpurpose distributed graph random walk engine. Centering around KnightKing is a walkercentric programming model that targets the unique interaction between a large static graph and many dynamic walkers and incorporates the efficient algorithmic framework mentioned above. The complicated details of algorithm optimization and system implementation are hidden underneath the intuitive APIs so that users can easily specify existing or new random walk algorithms. Our evaluation confirms that KnightKing brings up to 4 orders of magnitude improvement compared to an implementation based on a graph processing system, Gemini, with existing sampling algorithms. (3) To improve the memory access efficiency, this paper proposes FlashMob, a system that harvests the spatial and temporal locality of random walk by careful partitioning, rearranging, and batching of operations. FlashMob improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. Our evaluation shows that FlashMob processes a 58GB real world graph at a higher per-step speed than the existing system on a toy graph fitting in the L2 cache.