众包数据库关键技术研究

Key Techniques of Crowd-powered Database

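Author: Chai Chengliang (柴成亮)
  • Student ID
    2015******
  • Degree
    Doctoral
  • Email
    cha******.cn
  • Defense date
    2020.05.21
  • Advisor
    Li Guoliang (李国良)
  • Discipline
    Computer Science and Technology
  • Pages
    118
  • Confidentiality level
    Public
  • Department
    024 Department of Computer Science
  • Chinese keywords
    众包数据库,人机方法,代价控制,任务分配,质量控制
  • English keywords
    Crowd-powered Database, Human-Machine Method, Cost Control, Task Assignment, Quality Control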

Abstract

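Crowdsourcing completes tasks that machines alone struggle to handle by combining computers with an unknown crowd on the Internet. The crowdsourcing workflow involves three main parties: the task requester, the crowdsourcing platform, and the crowd workers. In traditional crowdsourcing, the interactions among these three parties are overly complex, so requesters cannot manage their crowdsourcing tasks well. Crowdsourcing databases emerged in response: they integrate the complex interactions among the three parties at the system level, so that requesters can easily use crowd workers to manipulate data through a declarative language. The main research content and contributions of this dissertation are as follows.

1. Crowdsourcing database CDB. To address the problems that crowdsourcing platforms are hard to use, crowdsourcing tasks are hard to optimize, and worker quality is hard to control, the task-processing workflow needs to be encapsulated using database ideas. Unlike in traditional databases, the difficulty lies not only in single-objective optimization (optimizing cost alone) but, more importantly, in building a fine-grained query optimization model that achieves multi-objective optimization over cost, quality, and latency. This dissertation proposes a new crowdsourcing database system, CDB. First, unlike the traditional tree-based optimization model, CDB uses a graph model to provide fine-grained optimization. Second, CDB builds a unified framework on this model to perform multi-objective optimization. The value of this work is that it helps users process data with the crowd at high quality, low cost, and high efficiency. However, CDB supports only the commonly used selection and join operations well. For complex join operations (record-based joins or self-joins) and for collection operations, this dissertation proposes the following two algorithmic frameworks.

2. Crowd-based join operation. Solving the complex join problem on dirty real-world data requires a crowd-based join operation (entity matching). The difficulty is its high cost, and lower-cost alternatives usually reduce quality. To this end, this dissertation proposes a low-cost crowdsourced entity matching framework, Power, which greatly reduces cost while maintaining high quality. It first defines a partial order over the record pairs to be matched, then uses this order to infer answers from the crowd workers' responses, and then asks questions iteratively until the answers for all entity pairs have been inferred. The value of this work is that users can perform high-quality data joins at 100 times lower cost than with other methods.

3. Crowd-based collection operation. To give the database its open-world characteristic, a crowdsourcing database needs a collection operation, which aims to collect entities missing from the database. The difficulties are how to guarantee the correctness of the collected entities, how to collect as many of the entities in the relevant domain as possible, and how to reduce the number of duplicate entities in order to reduce cost. To this end, this dissertation proposes an incentive-based crowdsourced entity collection framework, CrowdEC, which uses incentives to encourage workers to provide non-duplicate entities and thereby lowers cost. The value of this work is that users can achieve low-cost, high-quality, high-coverage collection.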

Crowdsourcing aims to leverage human intelligence to tackle tasks that are hard for machines. Three parties are involved in the crowdsourcing workflow: the requester, the crowdsourcing platform, and the workers. The technical difficulty of crowdsourcing is the complexity of the interactions among these three parties, which makes it hard for requesters to use crowdsourcing and manage their tasks. As a result, many researchers have proposed crowdsourcing database systems, which encapsulate this complexity at the system level and make it easy for requesters to leverage workers to manipulate data through a declarative language. The challenges include how to make crowdsourcing platforms easy to use, how to design query optimization models that optimize crowdsourcing cost, quality, and latency, and how to support complex crowdsourcing operations. The main contributions of this dissertation are as follows.

1. Crowd-powered database system. Crowdsourcing database systems apply the ideas of traditional databases to encapsulate the complexity of interacting with the crowd, addressing the problems that crowdsourcing platforms are hard to use, tasks are hard to optimize, and worker quality is hard to control. CDB differs fundamentally from existing systems: it not only solves the single-goal optimization problem but also provides more fine-grained query optimization (including quality, cost, and latency control). To this end, we propose a novel crowdsourcing database system, CDB. First, unlike the traditional tree model, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform multi-goal optimization based on the graph model. The contribution of this work is to help users leverage the crowd to process data efficiently and effectively.
However, CDB supports only commonly used operations such as join and selection well. For complicated join operations and collection operations, we propose the following two frameworks.

2. Crowd-powered join operation. To address the problem of joining dirty data, a crowdsourced join operation, which can also be regarded as crowdsourced entity resolution, is incorporated into the crowdsourcing database system. The challenge is that saving cost usually lowers quality. To address this, we propose a cost-effective crowdsourced entity resolution framework, Power, which significantly reduces the monetary cost while keeping quality high. We first define a partial order on the pairs of records. Then we select a pair as a question and ask the crowd to check whether the records in the pair refer to the same entity. After getting the answer for this pair, we infer the answers of other pairs based on the partial order. Next, we iteratively select unanswered pairs to ask, applying a quality control methodology, until we have the answers for all pairs. The contribution of this work is to reduce the cost by a factor of 100 compared with baseline methods without sacrificing quality.

3. Crowd-powered collection operation. To achieve the open-world characteristic in a database system, a crowdsourced collection operation is incorporated into the crowdsourcing database system. It aims to leverage the crowd to collect entities that are missing from a database. The challenges include how to guarantee the correctness of the collected entities, how to collect as complete a set of entities as possible in the given domain, and how to save cost by reducing duplicate entities. To this end, we propose an incentive-based crowdsourcing collection framework, CrowdEC, which uses an incentive strategy to encourage workers to provide more distinct items.
Considering both quality and latency, CrowdEC also designs a worker elimination method to block unqualified workers, which addresses the open-world quality control problem. The contribution of this work is to help users leverage the crowd to collect complete, high-quality data at low cost.
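
The ask-and-infer loop described for the crowd-powered join operation can be illustrated with a minimal Python sketch. It is not the Power implementation. It assumes a specific partial order that the abstract does not spell out: each record pair has a vector of per-attribute similarities, and one pair dominates another when it is at least as similar on every attribute. The names `dominates`, `resolve`, and `ask_crowd`, and the median-based question selection, are hypothetical. Crowd answers are assumed to be correct, so the quality control step is omitted.

```python
from typing import Callable, Dict, Hashable, Tuple

SimVector = Tuple[float, ...]


def dominates(a: SimVector, b: SimVector) -> bool:
    """Assumed partial order: a is at least as similar as b on every attribute."""
    return all(x >= y for x, y in zip(a, b))


def resolve(
    sims: Dict[Hashable, SimVector],
    ask_crowd: Callable[[Hashable], bool],
) -> Tuple[Dict[Hashable, bool], int]:
    """Label every pair as match/non-match, asking the crowd as rarely as possible."""
    answers: Dict[Hashable, bool] = {}
    questions = 0
    while len(answers) < len(sims):
        # Hypothetical selection rule: ask the unanswered pair with the
        # median total similarity, so either answer can prune many pairs.
        pending = sorted(
            (p for p in sims if p not in answers),
            key=lambda p: sum(sims[p]),
        )
        pair = pending[len(pending) // 2]
        is_match = ask_crowd(pair)
        questions += 1
        answers[pair] = is_match
        for other in pending:
            if other in answers:
                continue
            if is_match and dominates(sims[other], sims[pair]):
                # A pair that dominates a matching pair is inferred to match.
                answers[other] = True
            elif not is_match and dominates(sims[pair], sims[other]):
                # A pair dominated by a non-matching pair is inferred not to match.
                answers[other] = False
    return answers, questions
```

With four pairs whose similarity vectors form a chain under this order, two crowd questions are enough to label all of them in this sketch; pairs that are incomparable under the order each need their own question.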