基于深度学习的红外光谱数据建模及应用研究

Deep Learning-Based Data Modeling and Applications for Infrared Spectral Analysis

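Author: 王天怡
  • Student ID
    2021******
  • Degree
    Master's
  • Email
    wan******.cn
  • Defense date
    2024.05.09
  • Advisor
    谭春燕
  • Discipline
    Biomedical Engineering
  • Pages
    83
  • Confidentiality level
    Public
  • Department
    599 International Graduate School (国际研究生院)
  • Chinese keywords
    红外光谱;深度学习;有机官能团;模型解释
  • English keywords
    Infrared spectroscopy; Deep learning; Organic functional groups; Model interpretation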

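Abstract

Infrared spectroscopy is commonly used in chemical research for molecular structure elucidation. Measuring the intensity and wavenumber at which a molecule absorbs infrared light yields an infrared spectrum that corresponds to the chemical bonds in the molecule. Because it covers a wide measurement range, is fast, and is simple to operate, infrared spectroscopy is widely applied in chemistry, food, medicine, materials, environmental science, and other fields. However, interpreting infrared spectra still relies mainly on manual experience and database searching, which is time-consuming and labor-intensive and cannot handle unknown molecular structures. Most machine learning methods in current use require complex preprocessing before a spectrum can be analyzed, which raises the cost of trial and error. Moreover, because infrared spectra span a wide range at high resolution and therefore contain many spectral features, machine learning analysis becomes highly complex. Researchers therefore usually select specific regions of the spectrum or extract a subset of features by dimensionality reduction, which may lose information from the original spectrum. This thesis aims to build a new infrared spectral analysis method that uses deep learning to reduce the complexity of spectral interpretation and to fully exploit the intrinsic correlations among spectral features, so that interpretation is both efficient and accurate. To this end, the main work is as follows: (1) Explore two-dimensional feature representations suited to infrared spectra. Hierarchical clustering and uniform manifold approximation and projection (UMAP) are used to convert raw infrared spectral data into multi-channel two-dimensional feature maps, making full use of the intrinsic correlations among spectral features. Visual comparisons clarify the conversion strategy and the correspondence between the feature maps and the infrared spectra. (2) Accurately identify organic functional groups from infrared spectra. Functional groups are described with the chemical structure descriptor SMARTS, and the functional-group labels of all infrared spectra are generated in batch programmatically. A classification network built on a convolutional neural network performs multi-label identification of up to 21 common functional groups from the spectral feature maps, reaching an average accuracy of 0.969 and an average F1 score of 0.874. These results exceed or are close to those of published studies. In addition, an interpretability method is used to extract the important spectral features the network attends to when identifying each functional group, and the results are very close to human expert experience. (3) Analyze infrared spectra of complex samples, carrying out several prediction tasks that include distinguishing bacterial genera and food origin, composition, and type. The model built in this study takes raw infrared spectra directly as input, avoiding complex spectral preprocessing and feature selection. Its predictions exceed or approach the results reported in the original papers, showing good generality.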

Infrared spectroscopy is an essential tool in chemical research for determining molecular structures. It measures the wavenumber and intensity of infrared light absorbed by molecules, producing spectra that reflect the chemical bonds present. Because of its broad measurement range, fast analysis, and simple operation, infrared spectroscopy is widely used in chemistry, food science, pharmaceuticals, materials, and environmental studies. However, interpreting infrared spectra still relies mainly on manual expertise and database searches, which are time-consuming and labor-intensive and cannot handle unknown molecular structures. Most machine learning techniques require complex preprocessing before spectral analysis, which raises the cost of trial and error. Infrared spectra also contain a large number of spectral features, which makes machine learning analysis highly complex. As a result, researchers often focus on specific spectral regions or reduce the dimensionality of the data, which can discard valuable information from the original spectrum.

This study aims to develop a new method for analyzing infrared spectra that uses deep learning to simplify the analysis and exploits the intrinsic correlations among spectral features for efficient and accurate interpretation. The study covers the following tasks:

(1) Investigating two-dimensional feature representations suitable for infrared spectra. Raw infrared spectral data are transformed into multi-channel, two-dimensional feature maps using hierarchical clustering and uniform manifold approximation and projection (UMAP), so that the intrinsic correlations among spectral features are fully exploited. Visual comparisons clarify the conversion strategy and the correspondence between the feature maps and the original spectra.

(2) Accurately identifying organic functional groups from infrared spectra. Functional groups are described with the chemical structure descriptor SMARTS, and the functional-group labels for all infrared spectra are generated in batch programmatically. A classification network based on convolutional neural networks performs multi-label identification of up to 21 common functional groups from the spectral feature maps. It achieves an average accuracy of 0.969 and an average F1 score of 0.874, results that are comparable or superior to those in the published literature. In addition, an interpretability method is applied to extract the key spectral features the network focuses on when identifying each functional group, and the outcomes closely match manual expertise.

(3) Analyzing the infrared spectra of complex samples to perform a range of prediction tasks, such as distinguishing bacterial genera and food origins, ingredients, and types. The model developed in this study takes raw infrared spectra directly as input, eliminating the need for intricate spectral preprocessing and feature selection. Its predictions are comparable to or better than those reported in the original literature, demonstrating good generality.
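
To make the two reported figures concrete, the sketch below shows one common way an "average accuracy" and "average F1" can be computed for a multi-label classifier: score each functional-group label separately, then take the unweighted mean across labels. The abstract does not state which averaging scheme was used, so the per-label (macro) averaging here is an assumption, and the function names `per_label_scores` and `macro_average` and the toy label matrices are hypothetical illustrations, not the thesis code or data.

```python
from typing import Dict, List, Sequence


def per_label_scores(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> List[Dict[str, float]]:
    """Compute accuracy and F1 separately for each label column.

    Rows are samples (spectra); columns are binary labels
    (1 = functional group present, 0 = absent).
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = fp = fn = tn = 0
        for i in range(n_samples):
            t, p = y_true[i][j], y_pred[i][j]
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
        accuracy = (tp + tn) / n_samples
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
        f1 = 2 * tp / denom if denom else 0.0
        scores.append({"accuracy": accuracy, "f1": f1})
    return scores


def macro_average(scores: List[Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric across labels (assumed averaging scheme)."""
    return sum(s[key] for s in scores) / len(scores)


# Toy example: 4 spectra, 3 hypothetical functional-group labels.
Y_TRUE = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

SCORES = per_label_scores(Y_TRUE, Y_PRED)
MEAN_ACC = macro_average(SCORES, "accuracy")
MEAN_F1 = macro_average(SCORES, "f1")
print(f"mean accuracy = {MEAN_ACC:.3f}, mean F1 = {MEAN_F1:.3f}")
```

Under this scheme, accuracy counts true negatives while F1 does not, so a label that is absent from most spectra can score high accuracy with a noticeably lower F1. That may be one reason the two averages reported above differ.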