Key Techniques for Automatic Visualization of Structured Data

Original title: 结构化数据自动可视化关键技术研究

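Author: 骆昱宇 (Luo Yuyu)
  • Student ID
    2021******
  • Degree
    Doctorate
  • Email
    luo******.cn
  • Defense date
    2023.05.15
  • Advisor
    李国良 (Li Guoliang)
  • Discipline
    Computer Science and Technology
  • Pages
    177
  • Confidentiality level
    Public
  • Department
    024 Department of Computer Science
  • Chinese keywords
    结构化数据, 自动可视化, 问答式可视化, 渐进式可视化, 智能可视分析
  • English keywords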
    Structured Data, Automatic Visualization, Question-Answering Visualization, Progressive Visualization, Intelligent Visual Analytics

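Abstract (translated from the Chinese original)

Visualization maps complex data to intuitive charts and draws on human visual perception to capture the patterns in the data efficiently; it has become an important method for big data analysis. However, existing visualization systems for structured data still have three problems: (1) a high barrier to visual analysis: existing systems rely heavily on users actively understanding both the dataset and visualization, which places high demands on users; (2) difficulty in expressing user intent: existing systems struggle to let ordinary users express their visualization intent accurately, so the results often miss the question that was asked; (3) imprecise analysis results: existing systems tend to overlook the effect of data errors on visualization results, which can mislead users. To address these problems, the main contributions of this dissertation are as follows.

1. Domain knowledge-guided fully automatic visualization. To address the heavy dependence of existing interactive visualization systems on users' professional skills, this dissertation proposes AutoVis, a domain knowledge-guided fully automatic visualization framework. Drawing on domain knowledge, the framework automatically generates and selects a set of high-quality visualizations that effectively convey the patterns in the data. The dissertation uses partial orders to model and organize visualization domain knowledge and selects the top-k visualizations based on the resulting partial-order graph. It proves that top-k visualization selection with diversity is NP-hard and proposes an efficient heuristic algorithm. Experiments show that AutoVis outperforms existing methods in both effectiveness and efficiency on visualization tasks over real-world datasets, and that it requires no user intervention in the visualization process, "mastering complexity through simplicity".

2. Natural language-driven question-answering visualization. To address the inability of existing visualization systems to help ordinary users express their visualization intent accurately, this dissertation proposes ncNet, a natural language-driven question-answering visualization model that automatically and accurately generates visualizations matching the user's intent from a natural language query. To advance the field, the dissertation also proposes a framework for constructing benchmark datasets for question-answering visualization, which builds large-scale, high-quality benchmarks at low cost through human-machine collaboration. Using this framework, the dissertation builds nvBench, the first public large-scale benchmark dataset for question-answering visualization.

3. Data quality-aware progressive visualization. To mitigate the negative effect of data errors on the accuracy of visualization results, this dissertation proposes VisClean, a data quality-aware progressive visualization framework. Through interactive data cleaning, VisClean first cleans the data subsets that most severely degrade visualization quality and improves the visualization step by step, thereby "pinpointing the crux". Its advantage is that it improves visualization quality dynamically during the visual analysis cycle, without cleaning the entire dataset in advance. In addition, the dissertation proposes composite questions, which give users richer information during interaction, proves that selecting the optimal composite question is NP-hard, and proposes a heuristic algorithm for selecting composite questions efficiently. Experiments show that VisClean significantly improves visualization quality with few user interactions and outperforms existing methods.

Based on these results, the dissertation develops the intelligent data visualization system DeepEye, which provides fully automatic visualization, question-answering visualization, progressive visualization, visualization retrieval, and other functions. DeepEye has served more than 2.3 million visualization requests and has achieved good results in practical applications at Huawei, State Grid (Zhejiang), and ByteDance.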
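
The abstract does not spell out how AutoVis ranks or selects charts, so the following is only a minimal sketch of the general idea in contribution 1: rank candidates through a partial order, then pick a diversified top-k greedily. Everything here is an assumption made for illustration, not the dissertation's algorithm: the `Vis` record, the two quality factors, the dominance-count score, the `similarity` function, and the `trade_off` weight are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vis:
    name: str
    chart: str            # chart type, e.g. "bar" or "line"
    columns: frozenset    # columns the chart draws on
    factors: tuple        # hypothetical quality factors, higher is better


def dominates(a: Vis, b: Vis) -> bool:
    """Partial order: a is at least as good on every factor and differs on one."""
    return all(x >= y for x, y in zip(a.factors, b.factors)) and a.factors != b.factors


def similarity(a: Vis, b: Vis) -> float:
    """Assumed similarity: half from column overlap, half from matching chart type."""
    union = a.columns | b.columns
    jaccard = len(a.columns & b.columns) / len(union) if union else 0.0
    return 0.5 * jaccard + (0.5 if a.chart == b.chart else 0.0)


def select_top_k(cands, k, trade_off=0.5):
    """Greedy heuristic: favor charts that dominate many others, penalize near-duplicates."""
    n = max(len(cands) - 1, 1)
    quality = {v: sum(dominates(v, u) for u in cands) / n for v in cands}
    chosen, remaining = [], list(cands)
    while remaining and len(chosen) < k:
        def gain(v):
            penalty = max((similarity(v, c) for c in chosen), default=0.0)
            return quality[v] - trade_off * penalty
        best = max(remaining, key=gain)
        chosen.append(best)
        remaining.remove(best)
    return chosen


cands = [
    Vis("bar_region_sales", "bar", frozenset({"region", "sales"}), (0.9, 0.8)),
    Vis("bar_region_sales_alt", "bar", frozenset({"region", "sales"}), (0.8, 0.7)),
    Vis("line_month_profit", "line", frozenset({"month", "profit"}), (0.7, 0.6)),
    Vis("pie_region_sales", "pie", frozenset({"region", "sales"}), (0.3, 0.2)),
]
print([v.name for v in select_top_k(cands, k=2)])
```

With `trade_off=0.5` the second pick is the line chart rather than the near-duplicate bar chart; with `trade_off=0.0` the selection falls back to pure quality ranking. A greedy pass like this gives no optimality guarantee, which is consistent with the abstract's point that the exact problem is NP-hard.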

Data visualization is an effective way to derive behind-the-scenes insights and empowers informed decision-making in modern business intelligence and data science. However, existing visualization systems for structured data still face the following three main challenges: (1) high barrier to visual analysis: existing systems heavily rely on users’ visualization skills and understanding of the dataset to create visualizations; (2) difficulty in expressing user intent: existing systems struggle to support ordinary users in accurately expressing their visualization intentions; (3) inaccurate analysis results: existing systems tend to overlook the impact of data errors on visualization, which may lead to erroneous analysis conclusions. To address these problems, the main contributions of this dissertation are summarized as follows.

1. Domain Knowledge-Guided Fully Automatic Visualization. Creating effective visualizations can be a time-consuming task that requires both domain-specific and data-specific expertise. To lower the barrier to exploring good visualizations, we first propose a fully automatic visualization framework called AutoVis that can create and recommend high-quality visualizations (i.e., visualizations that effectively convey insights) from a dataset without any human involvement. To achieve this, we formulate the search space of visualizations, based on which we propose a new visualization language for effectively enumerating candidate visualizations. We devise a machine learning-based visualization recognition technique that decides which visualizations are good by capturing human perception from a large number of examples. We present partial orders so that experts can declaratively specify their domain knowledge for ranking visualizations. We present a graph-based approach and rule-based optimizations to effectively and efficiently compute the top-k good (and diversified) visualizations from a vast search space. Extensive experiments using real-life data and use cases verify the power of AutoVis.

2. Natural Language-Driven Question-Answering Visualization. Because existing visualization systems do not effectively help ordinary users express their visualization intentions accurately, we propose a natural language-driven question-answering visualization model called ncNet, which can automatically and accurately generate visualizations that meet users’ intentions based on their natural language queries. Furthermore, to advance the field of question-answering visualization, we present a benchmark dataset construction framework that achieves low-cost construction of large-scale, high-quality benchmark datasets through effective human-machine collaboration. Based on the framework, we produce nvBench, the first publicly available large-scale benchmark dataset for question-answering visualization, with good coverage, diversity, and quality. We conduct both quantitative evaluations and case studies to demonstrate that ncNet significantly outperforms state-of-the-art methods on both benchmark datasets and real-world cases.

3. Data Quality-Aware Progressive Visualization. Real-life data is often dirty, containing duplicates, outliers, and missing values. When such data is used to create visualizations, the resulting visualizations may not accurately represent the underlying data and can mislead users. It is therefore essential to improve the quality of visualizations generated from dirty data. We propose the VisClean framework, which leverages sophisticated data cleaning techniques to progressively improve the quality of poor visualizations with minimal human involvement. We first characterize different types of data errors and propose a distance-based method to quantify the impact of data errors on visualizations. We then model various data errors and their potential repairs as an undirected weighted graph called ERG and propose novel composite questions (i.e., sub-graphs of ERG) that give users more context during interaction. Next, we propose an estimation-based benefit model to quantify the visualization quality improvement from each user interaction. We prove that selecting the optimal composite questions from ERG is NP-hard and design an effective question selection algorithm to select the most beneficial composite questions in each iteration. Experiments on real-world datasets verify that composite questions are more effective than asking single questions in isolation w.r.t. the human cost.

We have developed an end-to-end, open-source intelligent data visualization system called DeepEye to assist users, particularly novices, in generating meaningful visualizations from large and complex datasets with ease. DeepEye offers a wide range of visualization services and has been adopted as an open-source tool by enterprises such as Huawei, State Grid (Zhejiang), and ByteDance.
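
Contribution 3 mentions a distance-based method for quantifying how data errors affect a visualization, but the abstract does not say which distance is used. As a rough sketch under stated assumptions, one way to think about it is to render the same chart from the dirty data and from a repaired copy, then measure how far apart the two charts are. The normalized L1 distance, the `bar_chart` and `chart_distance` helpers, and the toy `city`/`sales` rows below are all hypothetical choices for illustration, not VisClean's definitions.

```python
from collections import defaultdict


def bar_chart(rows, x, y):
    """Bar heights for a simple chart: SUM(y) GROUP BY x, skipping missing y."""
    bars = defaultdict(float)
    for row in rows:
        if row[y] is not None:
            bars[row[x]] += row[y]
    return dict(bars)


def chart_distance(a, b):
    """Normalized L1 distance between two bar charts; 0 means identical, 1 is the maximum."""
    total = sum(abs(v) for v in a.values()) + sum(abs(v) for v in b.values())
    if total == 0:
        return 0.0
    diff = sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))
    return diff / total


# Toy data (assumed): "NYC" duplicates "New York", and one sales value is missing.
dirty = [
    {"city": "New York", "sales": 100.0},
    {"city": "NYC", "sales": 50.0},
    {"city": "Boston", "sales": 80.0},
    {"city": "Boston", "sales": None},
]
repair = {"NYC": "New York"}  # a candidate repair, e.g. one confirmed by the user
cleaned = [dict(r, city=repair.get(r["city"], r["city"])) for r in dirty]

impact = chart_distance(bar_chart(dirty, "city", "sales"),
                        bar_chart(cleaned, "city", "sales"))
print(f"impact of the duplicate-city error: {impact:.3f}")
```

A score like this could be used to order candidate repairs so that the ones that change the chart most are reviewed first, which matches the abstract's description of prioritizing errors by their effect on the visualization.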