In recent years, big data analysis has developed rapidly in fields such as commerce and industry, and methods for extracting information from data have become one of its research focuses. Sequential data are data arranged in chronological order, including unstructured text sequences in the commercial domain and time series produced by sensors on industrial equipment. Sequence learning refers to the process of learning from, and extracting information from, such time-attributed sequential data. In sequence learning, extracting the contextual information that precedes and follows each element of a sequence is particularly important. As a type of recurrent neural network, the Long Short-Term Memory (LSTM) network has significant advantages in learning and retaining long-term information in sequences and in capturing long-range dependencies. Building on the LSTM network, this thesis focuses on relation extraction from text sequences, proposes improved algorithms to enhance sequence learning, and applies an LSTM-based time-series analysis method to anomaly detection. The main contributions are as follows:

1. To address the difficulty of extracting long-distance relations in text sequence learning, a joint entity and relation extraction algorithm based on an LSTM sequence labeling model is proposed, with the LSTM network as the basic framework. A self-attention mechanism is introduced into the model to capture long-range dependencies between entities, and the model is then trained in a multi-task manner to make better use of entity information, which improves performance on long-distance relation extraction. Compared with existing methods, the method achieves better results on two standard benchmark data sets and reduces the recognition error for long-distance relations.

2. To address the extraction of multiple relations, a joint multiple relation extraction algorithm based on a position-aware sequence tagging scheme is proposed, with LSTM as the text encoder. First, a new tagging scheme represents entity types and multiple relations simultaneously. Then, a position-based attention mechanism takes a query position and produces a sentence representation specific to that position, and each such representation is used to decode a separate tag sequence, which enables joint extraction of multiple relations. The algorithm is validated on standard benchmark data sets and, compared with existing methods, performs well on the harder long-distance relations.

3. For time-series data generated by industrial sensors, a time-series reconstruction method based on a two-layer LSTM network is adopted and applied to anomaly detection for production equipment. The lower LSTM network extracts information from the data and represents it as a fixed-length vector, and the upper LSTM network then reconstructs the data in reverse order. From the absolute errors between the reconstructed and actual values, maximum likelihood estimation is used to estimate the probability that the sequence data are anomalous, which yields the anomaly detector. Simulation experiments on standard test data and real production data show that the algorithm detects anomalies in time-series data effectively.
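
The scoring step of the third contribution can be illustrated with a small sketch. It is not the thesis implementation: the abstract does not state which distribution is fitted to the reconstruction errors, so a univariate Gaussian is assumed here, and the function names (`absolute_errors`, `fit_gaussian_mle`, `anomaly_score`, `tail_probability`) are hypothetical. The LSTM encoder and decoder are left out; the reconstructed values are taken as given.

```python
import math


def absolute_errors(actual, reconstructed):
    """Element-wise absolute error between actual and reconstructed values."""
    return [abs(a - r) for a, r in zip(actual, reconstructed)]


def fit_gaussian_mle(errors):
    """Maximum-likelihood estimates (mean, variance) of a univariate Gaussian.

    Assumption: errors on normal data are roughly Gaussian.
    """
    n = len(errors)
    if n == 0:
        raise ValueError("need at least one error value")
    mu = sum(errors) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((e - mu) ** 2 for e in errors) / n
    return mu, var


def anomaly_score(error, mu, var, eps=1e-12):
    """Squared standardized distance of an error from the fitted mean."""
    return (error - mu) ** 2 / (var + eps)


def tail_probability(error, mu, var, eps=1e-12):
    """Two-sided Gaussian tail probability; small values suggest an anomaly."""
    z = abs(error - mu) / math.sqrt(var + eps)
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Made-up errors from sequences assumed to be normal (illustrative only).
    mu, var = fit_gaussian_mle([0.1, 0.2, 0.1, 0.1, 0.2])
    for e in absolute_errors([1.0, 5.0], [1.15, 4.1]):
        print(e, anomaly_score(e, mu, var), tail_probability(e, mu, var))
```

In this sketch the Gaussian is fitted only on errors from data believed to be normal, and a new point is flagged when its score exceeds a threshold chosen on held-out data. The thesis may use a different distribution or a multivariate error vector.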