时间序列数据库被广泛地应用于大规模时间序列数据的管理,然而时间序列的对齐问题可能会影响时间序列数据库的预处理、存储、查询等模块,并对下游的数据分析、异常检测、机器学习等任务造成严重影响。现有的时间序列对齐方法主要集中于较为简单的场景和较为有限的应用。然而,随着物联网的发展,时间序列源源不断地产生,也带来了更复杂、规模更大的时间序列对齐问题。上述问题影响了时间序列数据库的性能以及时序数据的应用效果,也带来了全新的挑战。本文面向复杂的时间序列对齐问题及其对时间序列数据库所造成的影响,系统地研究时间序列对齐技术。具体来说,时间序列对齐问题及其影响主要包括:(1)单序列时间戳不对齐导致数据质量低;(2)多序列不对齐导致数据存储效率低;(3)多序列不对齐导致数据查询效果差。本文针对上述问题开展以下研究:第一,在时序数据进入数据库前,本文针对数据库中的预处理模块,设计单序列时间对齐修复方法,通过动态规划的方式对数据进行对齐匹配,并证明了相应方法所满足的界限,用界限进行剪枝并提高执行效率。本文进一步设计了更为高效的近似算法,并证明了该近似算法的近似比。上述方法不仅能够提高写入数据库的时序数据的质量,也可以提高编码压缩方法和下游分析方法的效果。第二,在时序数据被写入数据库后,本文针对数据库中的存储模块,提出列组存储的方案,并形式化了多序列自动对齐存储问题。基于该问题,本文设计了相应的分组算法,该方法基于列组存储的文件格式,根据自底向上的对齐策略,自动对齐多时间序列。本文进一步设计了近似估计方法,通过提取时间列的特征来加速算法。上述方法能够优化时间序列的存储空间,提高数据库的存储效率。第三,当用户对时序数据库中存储的多序列数据进行查询时,本文针对数据库中的查询模块,形式化了多序列相似对齐查询问题。本文进一步提出多序列相似对齐查询方法,基于时间约束和模型约束对多序列数据进行高效准确的相似对齐。上述方法能够提高数据库的相似对齐查询效果,并提高下游学习任务的准确率。最后,本文将提出的方法和时序数据库紧密地结合,在时序数据库ApacheIoTDB中进行了相应的系统实现,分别在数据预处理、存储引擎、查询引擎等相应模块进行应用,提高了整个数据库中的数据质量、数据存储效率与数据查询效果,并通过具体案例阐述了方法的应用效果。
Time series databases are widely used to manage large-scale time series data. However, time series alignment problems can affect the preprocessing, storage, and query modules of a time series database, and can severely degrade downstream tasks such as data analysis, anomaly detection, and machine learning. Existing time series alignment methods mainly target relatively simple scenarios and a limited range of applications. With the development of the Internet of Things, time series are generated continuously, which gives rise to more complex and larger-scale alignment problems. These problems affect both the performance of time series databases and the effectiveness of time series applications, and they pose new challenges. This thesis therefore systematically studies time series alignment techniques, focusing on complex alignment problems and their impact on time series databases. Specifically, the alignment problems and their effects are: (1) misaligned timestamps within a single series lead to low data quality; (2) misalignment across multiple series leads to low storage efficiency; (3) misalignment across multiple series leads to poor query results. This thesis addresses these problems through the following research.
First, before time series data enter the database, this thesis designs a single-series timestamp alignment repair method for the preprocessing module. The method matches and aligns data points by dynamic programming. The thesis proves bounds that the method satisfies and uses them for pruning to improve execution efficiency. It further designs a more efficient approximation algorithm and proves its approximation ratio. These methods improve the quality of the time series data written into the database, and also improve the effectiveness of encoding and compression methods and of downstream analysis methods.
Second, after the data have been written into the database, this thesis proposes a column-group storage scheme for the storage module and formalizes the problem of automatic alignment storage for multiple series. For this problem, it designs a grouping algorithm that builds on the column-group storage file format and follows a bottom-up alignment strategy to align multiple time series automatically. The thesis further designs an approximate estimation method that accelerates the algorithm by extracting features from the time columns. These methods reduce the storage space of the time series and improve the storage efficiency of the database.
Third, for queries over the multi-series data stored in the database, this thesis formalizes the multi-series similarity alignment query problem for the query module. It proposes a multi-series similarity alignment query method that aligns multi-series data efficiently and accurately based on time constraints and model constraints. This method improves the results of similarity alignment queries in the database and increases the accuracy of downstream learning tasks.
Finally, the proposed methods are tightly integrated with a time series database. They are implemented in the time series database Apache IoTDB and applied in its data preprocessing, storage engine, and query engine modules, improving data quality, storage efficiency, and query results across the database. The thesis also presents concrete case studies to illustrate the effectiveness of the methods in practice.
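To give a rough intuition for the bottom-up alignment strategy used with column-group storage, the following is a minimal Python sketch, not the algorithm of this thesis. It assumes a hypothetical cost model in which a group stores one shared time column plus one value column per series, and in which a missing value still occupies a cell. The names `storage_cost` and `bottom_up_grouping` are illustrative only and do not correspond to any Apache IoTDB API.

```python
from itertools import combinations


def storage_cost(group, series):
    """Hypothetical cost: cells in one shared time column plus one value column per series.

    Missing values are assumed to occupy a cell, so poorly aligned series are penalized.
    """
    shared_times = set()
    for name in group:
        shared_times.update(series[name])
    return len(shared_times) * (1 + len(group))


def bottom_up_grouping(series):
    """Greedily merge the pair of groups that saves the most cells until no merge helps."""
    # Start from singleton groups, one per series (the "bottom" of the hierarchy).
    groups = [frozenset([name]) for name in sorted(series)]
    while len(groups) > 1:
        best_saving, best_pair = 0, None
        for g1, g2 in combinations(groups, 2):
            saving = (storage_cost(g1, series) + storage_cost(g2, series)
                      - storage_cost(g1 | g2, series))
            if saving > best_saving:
                best_saving, best_pair = saving, (g1, g2)
        if best_pair is None:
            break  # no merge reduces the assumed storage cost
        g1, g2 = best_pair
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return sorted(sorted(g) for g in groups)


if __name__ == "__main__":
    # Timestamps only; values are irrelevant to this cost model.
    example = {
        "s1": [1, 2, 3, 4],
        "s2": [1, 2, 3, 4],      # perfectly aligned with s1
        "s3": [10, 11, 12, 13],  # shares no timestamps with the others
    }
    print(bottom_up_grouping(example))
```

In this toy example the two series with identical timestamps end up in one group, because sharing a time column saves cells, while the series with disjoint timestamps stays alone, because merging it would only add empty cells. A real system would use an actual encoded-size estimate rather than this cell count.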