在这篇论文中,我们研究社交网络中最重要的问题之一:社交网络结构的形成。我们分别从微观层面和宏观层面对社交网络结构的形成进行分析和挖掘,如社会关系的形成,社团核心和结构洞节点的发现。从微观层面出发,我们研究单项关系变成双向关系(reciprocal)的过程,并进一步预测三合合成体(triadic closure)的形成。双向关系和三合合成体都是社交网络结构的基本组成部分。我们提出一个学习模型来分析双向关系和三合合成体的预测问题,模型能够把许多社会学理论结合到机器学习模型中。然后,我们进一步扩展该模型用于社会关系的预测。社会关系预测的重要意义在于,不同类型的社会关系对人与人之间之间会造成完全不同的影响。我们的模型能够在多个异构网络中进行传递学习。%我们在Twitter网络上进行了测试,试验结果表明,提出的模型能够准确地推断出90%的双向关系。对于三合合成体的预测,该模型相比其他方法也得到了很大的性能提升(+20-30%的F1-Measure)。我们使用了5个异构的网络来测试模型在异构网络上对社会关系的预测。例如,通过借助Coauthor网络上导师与学生的关系信息,在企业电子邮件网络中,对于上下级关系的预测能够达到90%(与其它方法相比提高8-28%)的精度。从宏观层面出发,我们定义了一个新问题:社团核心的发现,以在大型社交网络中发现社团结构。我们发现,有影响力的用户倾向于更密切地关注与自己相似的用户,从而表现成许多不同的社团核心。我们提出了两个高效算法来发现大型社交网络中的社团核心。我们在三个社交网络(Twitter,Wikipeida和Coauthor)上进行了实验,实验结果表明提出的算法超过当今最好算法15%--50%,同时,我们的算法达到平均6--2,000倍的提速。我们进一步研究社交网络中的结构洞,结构洞对不同社团之间的信息传播起着关键性作用。我们形式化定义了如何在大型社交网络中发现最重要的k个结构洞节点的问题,并定义了估价函数。我们实例化了两个计算模型。第一个模型从节点的重要性出发,把结构洞和节点的重要性相互递归定义。我们提出了一个精确算法解决这个问题,并证明了其收敛性。第二个模型从信息传播出发,我们证明了最优化问题是NP难题,并设计了一个高效的近似算法。我们在三个社交网络(Coauthor,Twitter和Inventor)上测试了提出的两个模型。实验结果表明,我们的算法明显优于已有算法。
In this thesis, we study the formation of social network structure, which is one major topic in analyzing of social networks. In particular, we focus on both micro-level and macro-level structural formations of social networks, such as mining links, community kernels and structural holes. For micro-level works, we study the extent to which the formation of a reciprocal (two-way) links, as the basis relationship in social networks, are developed from a parasocial (one-way) relationship, and how the relationships further develop into triadic closure, which is one of the fundamental processes of link formation. We propose a learning framework to formulate the problems of predicting reciprocity and triadic closure into a graphical model. The framework incorporates social theories into a machine learning model. We further generalize the model for inferring social ties over multiple heterogeneous networks, since different types of social ties have essentially different influence on people. Employing Twitter as a source for our experimental data, we demonstrate that it is possible to accurately infer 90% of reciprocal relationships in a dynamic network. For predicting the triadic closure formation, the proposed model also achieves better performance (+20-30% in terms of F1) than several alternative methods. For the social ties prediction problem, our empirical study on five different genres of networks validates the effectiveness of the proposed framework. For example, by leveraging information from a coauthor network with labeled advisor-advisee relationships, the proposed framework is able to obtain an F1-score of 90% (8-28% improvements over alternative methods) for inferring manager-subordinate relationships in an enterprise email network.For macro-level works, we define and explore a novel problem called community kernel detection in order to uncover the hidden community structure in large social networks. We discover that influential users pay closer attention to those who are more similar to them, which leads to a natural partition into different community kernels . We propose two efficient algorithms for finding community kernels in large social networks. We conduct experiments on three large social networks: Twitter, Wikipedia, and Coauthor, which show that the proposed algorithm achieves an average 15%--50% performance improvement over the other state-of-the-art algorithms. Meanwhile our algorithm is on average 6--2,000 times faster in detecting community kernels. We further mining structural holes in social networks, which play a key role in the information diffusion between different communities. We precisely define the problem of mining top-$k$ structural holes in large-scale network and provide an objective (quality) function to formalize the problem. Two instantiation models have been developed to implement the objective function. The first model considers nodes' importance. The structural hole degree and importance are defined in terms of one another in a mutual recursion. We present an exact algorithm to solve it and prove its convergence. The second model is based on the theory of information flow. The optimization is proved to be NP-hard, and we design an efficient algorithm with provable approximation guarantees. We test the proposed models on three different genres of networks: Coauthor, Twitter, and Inventor. Experiments show that our proposed algorithms clearly outperform several alternative algorithms. Our study also provides empirical evidences for the theory of structural holes, e.g., 1% of Twitter users who span structural holes control 25% of the information forwarding. Experiments also demonstrate that the detected structural holes can help other social network applications, such as kernel communities detection.