Automated Report Generation for Multi-modal Medical Images Guided by Disease Classification

Author: 伍诗彬
  • Student ID
    2020******
  • Degree
    Master's
  • Email
    wus******.cn
  • Defense Date
    2024.05.15
  • Advisor
    王好谦
  • Discipline
    Electronic Information
  • Pages
    88
  • Confidentiality
    Public
  • Affiliation
    599 International Graduate School
  • Keywords
    Medical Imaging; Disease Classification; Report Generation; Retrieval-Augmented Generation; Multi-modal Foundation Models

Abstract

With the rise of visual encoding networks, generative large models, and multi-modal learning, artificial intelligence offers new opportunities for intelligent analysis and diagnosis based on medical imaging, and has shown great potential in tasks such as pathological classification, segmentation, and cross-modal understanding of medical images. Nevertheless, many challenges remain in the research and clinical deployment of intelligent diagnostic technology, including the scarcity of medical imaging data, the diversity of modalities, the complexity of anatomical structures and lesion features, and the cross-modal alignment and hallucination problems that arise in text generation. This thesis addresses these problems along two directions: disease detection and classification across multiple types of medical images, and automatic generation of diagnostic reports.

For disease diagnosis from medical images, this thesis proposes a fast, lightweight, and accurate disease detection and classification algorithm. The method fuses local image features with global perception: on top of a Transformer backbone that shares global attention, it extracts local features with a residual convolution module and introduces a spatial attention mechanism to improve adaptability to limited datasets. Experiments on a gastrointestinal endoscopy image dataset show that the proposed method surpasses nearly all current models in disease classification accuracy while processing 16,400 images per second.
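To make the hybrid design concrete, the following is a minimal, hypothetical PyTorch sketch of a block that combines a global self-attention path with a residual convolution branch and a spatial attention gate. The module names (`HybridBlock`, `SpatialAttention`) and the exact composition are illustrative assumptions, not the architecture proposed in the thesis.

```python
# Hypothetical sketch: global attention + residual local convolution
# + spatial attention gate, in the spirit of the method described above.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Re-weights feature maps with a learned per-pixel gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class HybridBlock(nn.Module):
    """Global multi-head self-attention fused with a residual conv branch."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.local = nn.Sequential(           # residual convolution module
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.gate = SpatialAttention(channels)

    def forward(self, x):                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2) # (B, H*W, C)
        t = self.norm(tokens)
        g, _ = self.attn(t, t, t)             # global attention over all pixels
        global_feat = (tokens + g).transpose(1, 2).reshape(b, c, h, w)
        local_feat = x + self.local(x)        # residual local features
        return self.gate(global_feat + local_feat)  # spatially gated fusion

x = torch.randn(2, 64, 32, 32)
print(HybridBlock(64)(x).shape)               # torch.Size([2, 64, 32, 32])
```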
For the task of automated medical report generation, this thesis builds on multi-modal pre-training architectures and large language models. To mitigate the hallucinations common in large language models, it integrates image-based disease classification and text generation into a unified framework, proposing a report generation method guided by disease categories and combined with a multi-scale retrieval augmentation strategy. Concretely, the method adopts a two-stage training pipeline: a pre-training stage that learns multi-granularity aligned medical vision-text representations, and a cross-modal generation stage that introduces a disease-classification-guided, multi-scale cross-modal retrieval augmentation scheme. Validated on the large-scale radiology image-report dataset MIMIC-CXR, the method not only improves multi-label classification performance but also generates diagnostic reports of higher quality than existing methods, with fewer hallucinations.
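The disease-guided retrieval idea can be sketched schematically: predicted disease labels first filter a report corpus, embedding similarity then ranks the remaining candidates, and the top hits are prepended to the generation prompt as reference context. The corpus, embeddings, and the `guided_retrieve` helper below are toy stand-ins, not the thesis's multi-scale scheme.

```python
# Toy sketch of label-guided retrieval augmentation (not the thesis code):
# filter reports by label overlap, rank by cosine similarity, build a prompt.
import torch
import torch.nn.functional as F

def guided_retrieve(image_emb, pred_labels, corpus_embs,
                    corpus_labels, corpus_texts, k=2):
    """Retrieve reports whose labels overlap the prediction, ranked by similarity."""
    keep = [i for i, labs in enumerate(corpus_labels) if labs & pred_labels]
    if not keep:                               # fall back to pure similarity
        keep = list(range(len(corpus_texts)))
    cand = corpus_embs[keep]                   # (M, D)
    sims = F.cosine_similarity(image_emb.unsqueeze(0), cand)  # (M,)
    top = sims.topk(min(k, len(keep))).indices.tolist()
    return [corpus_texts[keep[i]] for i in top]

# toy corpus: embeddings, label sets, and report snippets
corpus_embs = F.normalize(torch.randn(4, 128), dim=-1)
corpus_labels = [{"effusion"}, {"cardiomegaly"}, {"effusion", "edema"}, set()]
corpus_texts = ["small left effusion ...", "enlarged heart ...",
                "effusion with edema ...", "no acute findings ..."]

image_emb = F.normalize(torch.randn(128), dim=-1)
context = guided_retrieve(image_emb, {"effusion"}, corpus_embs,
                          corpus_labels, corpus_texts)
prompt = "Reference reports:\n" + "\n".join(context) + "\nGenerate the report:"
print(prompt)
```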
Finally, for report generation from medical images of multiple modality types, this thesis proposes a medical adaptive knowledge-enhancement network. The method applies a parameter-efficient fine-tuning strategy to the image encoder to improve training efficiency, and introduces a medical knowledge enhancement strategy together with a concept-based implicit supervision method that strengthens the model's understanding of relevant medical terminology, enabling it to generate more targeted, professional medical descriptions. Experiments on the multi-modal medical image-text dataset ImageCLEFmedical show that the proposed method surpasses several comparable works and achieves the best average results on natural-language evaluation metrics, demonstrating the coherence and professionalism of the generated reports and their consistency with the medical images. This provides an effective solution for advancing automated medical report generation and intelligent assisted diagnosis in clinical settings.
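Parameter-efficient fine-tuning of an image encoder can be illustrated with a generic LoRA-style wrapper that freezes a pretrained linear projection and trains only a low-rank update. The thesis does not specify this exact scheme; the rank, scaling, and placeholder layer below are assumptions for illustration only.

```python
# Generic LoRA-style sketch of parameter-efficient fine-tuning (an assumed
# example, not the thesis's exact strategy): the base weight is frozen and
# only the low-rank factors A and B receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = nn.Linear(768, 768)                    # stands in for a ViT projection
peft_layer = LoRALinear(layer)
trainable = sum(p.numel() for p in peft_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_layer.parameters())
print(f"trainable: {trainable}/{total}")       # only the low-rank factors train
```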