Traditional databases support range queries, point queries, and fuzzy-match queries over tables, but these only support exact matching. Similarity queries retrieve data from a dataset according to a specified similarity function (e.g., Jaccard similarity), and fall into two categories, threshold-based queries and TopK queries, each of which includes the two common operators of similarity selection and similarity join. Because similarity queries are widely applied in massive similar-text search, similar-image search, structured entity deduplication, and multi-source data integration, efficient similarity query processing has recently become a major research focus both in China and abroad. Focusing on the key techniques of similarity query processing, the main research goals and contributions of this paper are as follows. 1. Similarity query processing based on distributed in-memory indexes: this paper presents Dima, a distributed in-memory similarity query processing system. Dima extends the SQL syntax to support four core similarity query operations, so that users can invoke them in complex data-analysis tasks. The paper proposes a load-balance-aware distributed index over similarity signatures to avoid expensive data transfer and mitigate the long-tail effect, thereby improving overall similarity query performance. Since Spark is a widely used distributed in-memory computing system, Dima is seamlessly integrated into the Spark core. Dima is the first full-fledged distributed in-memory system that supports complex similarity queries over large datasets. Experimental results show that Dima outperforms state-of-the-art methods by 1-3 orders of magnitude. 2. Learned cardinality estimation for similarity queries: the quality of query optimization in traditional databases depends heavily on the accuracy of cardinality estimates for intermediate query results. In a similarity query system, cardinality estimation is likewise crucial for ordering conjunctive predicates and selecting similarity-join orders. However, histogram techniques cannot be applied to similarity-query cardinality estimation, and sampling incurs large errors in high-dimensional settings. This paper proposes to solve similarity-query cardinality estimation with neural networks, and introduces two strategies, query segmentation and data segmentation, to improve estimation accuracy while reducing the size of the training set. Experiments show that the proposed method efficiently learns the distance distribution of high-dimensional data and produces accurate cardinality estimates for similarity queries. 3. Rule generation for similar-entity integration: as an important application of similarity queries, entity integration over multi-source structured data has been widely studied. Its key steps include blocking, entity matching, and entity consolidation, which rely on similarity features between entity pairs and on blocking rules; human involvement is indispensable, e.g., for generating training sets for entity-matching models and determining data-transformation rules. This paper designs several kinds of interactive entity-integration questions and proposes a question-scheduling framework that, based on each kind's benefit/cost ratio, interleaves different kinds of questions to improve the accuracy of entity consolidation.
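The two query categories above (threshold-based and TopK) combine with the selection operator as follows; this is a minimal illustrative sketch using Jaccard similarity over token sets, not the paper's actual implementation, and the function names are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def sim_select(records, query, tau):
    """Threshold-based similarity select: records with similarity >= tau."""
    return [r for r in records if jaccard(r, query) >= tau]

def sim_topk(records, query, k):
    """TopK similarity select: the k records most similar to the query."""
    return sorted(records, key=lambda r: -jaccard(r, query))[:k]
```

The corresponding join operators pair every record of one dataset with the qualifying records of another; a naive nested loop is quadratic, which is why indexing and filtering techniques such as those in Dima matter at scale.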
Traditional databases only support exact queries, including range queries, point queries, and LIKE predicates. However, there is a surge of demand for similarity queries, which search a given dataset according to a similarity function (e.g., Jaccard similarity, edit distance, or cosine similarity). Similarity queries can be classified into threshold-based queries and \topk queries, and each of them includes both a select operator and a join operator. Similarity queries are widely used in similar-text search, similar-image search, entity deduplication, and data integration. This paper focuses on the key techniques of similarity query processing; its contributions are as follows.

\textbf{Distributed In-memory Similarity Query Processing. } This paper proposes \Dima, a distributed in-memory similarity-based query processing system. \Dima supports four core similarity-based queries: threshold-based similarity select, threshold-based similarity join, \topk-based similarity select, and \topk-based similarity join. \Dima extends SQL so that users can invoke these similarity query interfaces to conduct complex data analysis. For the distributed environment, the paper proposes a balance-aware signature-based similarity index that avoids expensive data transmission and mitigates the long-tail effect. Since Spark is one of the most popular distributed in-memory systems, \Dima is seamlessly integrated into the Spark core and adopts effective query optimization techniques. To the best of our knowledge, \Dima is the first full-fledged distributed in-memory similarity-based query processing system. Experimental results show that \Dima outperforms state-of-the-art approaches by 1-3 orders of magnitude and scales well.

\textbf{Learned Similarity-based Cardinality Estimation. } Query optimization in traditional databases relies on the accuracy of cardinality estimates for intermediate results.
For a system that supports complex similarity-based query processing, cardinality estimation is likewise vital for ordering conjunctive predicates and selecting join orders. However, similarity-based cardinality estimation cannot use histograms, and sampling methods produce large errors on high-dimensional data. To solve this problem, this paper proposes a novel deep learning method. A naive fully-connected neural network requires a large amount of training data, and its training overhead is high; the paper therefore proposes two strategies, query segmentation and data segmentation, to improve the accuracy of cardinality estimation while reducing the training overhead. Experiments show that segment-based learned cardinality estimation captures the distribution of high-dimensional datasets effectively and produces accurate cardinality estimates.

\textbf{Similarity-based Rule Generation for Entity Integration. } The key phases of entity integration are blocking and entity matching, which rely on similarity features of entity pairs and on blocking rules. Another phase is entity consolidation, i.e., transforming each record in an entity cluster into a representative form. Throughout the life cycle of data integration, human involvement is indispensable, e.g., for training-set generation and data-transformation rule generation. This paper designs interactive data integration questions and proposes a question-scheduling framework that improves consolidation quality by interleaving different kinds of questions according to their benefit/cost ratios.
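The benefit/cost-driven scheduling of interactive questions can be sketched as a greedy selection under a human-effort budget. This is a toy illustration only, not the paper's actual framework; the question kinds, benefit and cost numbers, and function names are all invented:

```python
# Hypothetical question types with an estimated benefit (expected gain in
# consolidation accuracy) and a cost (human effort). The benefit/cost ratio
# drives which kind of question is asked next.
questions = [
    {"kind": "match-label",    "benefit": 0.9, "cost": 3.0},
    {"kind": "transform-rule", "benefit": 0.5, "cost": 1.0},
    {"kind": "block-check",    "benefit": 0.2, "cost": 0.5},
]

def schedule(questions, budget):
    """Greedily pick questions by descending benefit/cost ratio
    until the effort budget is exhausted."""
    order = sorted(questions, key=lambda q: q["benefit"] / q["cost"],
                   reverse=True)
    picked, spent = [], 0.0
    for q in order:
        if spent + q["cost"] <= budget:
            picked.append(q["kind"])
            spent += q["cost"]
    return picked
```

In a real interactive loop the benefit estimates would be updated after each user answer, so the ratio-based ranking naturally interleaves different kinds of questions rather than exhausting one kind first.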