Data is among an organization's most valuable assets, and data quality is closely tied to business performance. High-quality data allows the business value of data assets to be fully realized, so ensuring data quality is a prerequisite for, and the key to, the effective application of big data. Data quality problems seriously impair the usability of big data and have become a major obstacle to its application. How to ensure the quality of big data has therefore become an important research topic in big data governance and application, in both academic research and commercial practice. This paper studies the key technologies of rule-based data quality management, providing theoretical guidance and technical support for solving data quality problems and improving data quality in the big data environment. The work has both theoretical and practical significance.
Building on an analysis of the current state of data quality management at home and abroad and of the related theory, this paper accomplishes the following work.
First, an overall framework for a data quality management system based on total data quality management (TDQM) in the big data environment is proposed. From the perspectives of management systems and mechanisms and of the data life cycle, the paper outlines how to build an all-round, proactive data quality management system driven by data quality goals. Drawing on the DMAIC method of Six Sigma management, it describes the steps of data quality assessment and improvement in detail.
Second, based on the characteristics of data and the data quality requirements in the big data environment, a general-purpose data quality indicator system applicable to multiple domains is proposed, together with algorithms for obtaining the measurement results of the metrics, which provide guidance and a basis for data quality measurement and improvement.
This paper builds a rule-based data quality assessment model, gives a formal description of the model, and presents the basic design ideas and the specific algorithms for data quality assessment. To address the shortcomings of the fuzzy comprehensive evaluation method and the analytic hierarchy process (AHP) in multi-indicator comprehensive assessment, a data quality assessment method based on the fuzzy analytic hierarchy process (FAHP) is proposed. It makes the indicator weights at each level of the data quality indicator system more scientific, objective, and reasonable, and provides theoretical guidance and technical support for quantitative, multi-indicator comprehensive assessment of data quality.
Third, the data quality assessment model and method are applied to rule-based data quality management in the big data environment. A rule-based data quality management system is designed according to the IT requirements of data quality management, providing tool support for data quality management in the big data environment.
Finally, rule-based data quality management is applied to the master data quality management of a large enterprise group. Assessing and improving the master data quality verifies the effectiveness of the research results proposed in this paper, and good application results are obtained.
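As a rough illustration of the rule-based assessment and FAHP weighting summarized above, the following Python sketch measures each indicator as the share of records that satisfy a rule and then combines the indicator scores with weights derived from a fuzzy complementary judgment matrix. The sample records, the three rules, the judgment matrix, and the choice of weight formula are all illustrative assumptions and are not taken from the paper, whose actual indicator system and algorithms may differ.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]


def pass_rate(records: List[Record], rule: Rule) -> float:
    """Share of records satisfying a rule (1.0 for an empty data set)."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)


def fahp_weights(matrix: List[List[float]]) -> List[float]:
    """Weights from a fuzzy complementary judgment matrix.

    Assumes matrix[i][j] + matrix[j][i] == 1 and matrix[i][i] == 0.5.
    Uses w_i = (sum_j r_ij + n/2 - 1) / (n * (n - 1)), one commonly used
    FAHP weight formula (an assumption; the paper may use another variant).
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two indicators")
    return [(sum(row) + n / 2 - 1) / (n * (n - 1)) for row in matrix]


# Hypothetical master data records.
records: List[Record] = [
    {"id": 1, "name": "Acme", "tax_code": "123456789"},
    {"id": 2, "name": "", "tax_code": "987654321"},
    {"id": 3, "name": "Globex", "tax_code": "12AB"},
    {"id": 4, "name": "Initech", "tax_code": "555555555"},
]

# Hypothetical rules, one per indicator.
rules: Dict[str, Rule] = {
    "completeness": lambda r: bool(r.get("name")),
    "validity": lambda r: str(r.get("tax_code", "")).isdigit()
    and len(str(r.get("tax_code", ""))) == 9,
    "consistency": lambda r: isinstance(r.get("id"), int) and r["id"] > 0,
}

# Hypothetical pairwise judgments, in the same order as the rules above.
judgments = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]

scores = {name: pass_rate(records, rule) for name, rule in rules.items()}
weights = fahp_weights(judgments)

# Overall quality score: weighted sum of the indicator scores.
overall = sum(w * scores[name] for w, name in zip(weights, rules))

print(scores)
print(weights)
print(round(overall, 4))
```

Keeping the rules as plain predicates separates rule definition from measurement, so new indicators can be added without changing the scoring code. A real system would likely load the rules from a rule repository rather than hard-code them.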