In natural environments, sounds often exist in mixed form, and people frequently need to discern and extract specific sound sources from background noise when communicating and gathering information. The human auditory system can focus on a particular target source, such as listening to someone's conversation in a noisy crowd. This work aims to endow machines with a similar capability, enabling them to separate specific speech or audio signals from a mixture of multiple sound sources. This technology is of significant value for improving the quality of remote meetings and for music editing; it also contributes to the advancement of Artificial Intelligence Generated Content (AIGC), facilitating the creation of more customized and personalized audio content. The main work and contributions of this work are as follows:

Firstly, building on mainstream time-domain methods, a frequency-domain correction-based speech separation framework, TFCNet, is proposed to address the noise and speaker-silencing problems of time-domain separation methods. DPRNN, one of the best-performing time-domain models of recent years, has shown promising results. However, analysis of its separated speech reveals amplitude and phase errors in the spectrogram, which are audible as noise and speaker silencing. To this end, this work introduces a correction module that operates in the frequency domain: a dual-path network attends to local and global features simultaneously and accurately repairs these errors. Experimental results show that TFCNet substantially improves speech separation quality, achieving a 2.8 dB gain in Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi) over the baseline time-domain method.
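The amplitude and phase errors that motivate the frequency-domain correction module can be made concrete with a toy measurement. The sketch below is illustrative only: the STFT parameters, the sinusoidal "reference" source, and the additive-noise model of a time-domain separator's output are assumptions, not details taken from TFCNet.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive STFT: frame the signal, apply a Hann window, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # shape: (n_frames, n_fft // 2 + 1)

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
reference = np.sin(2 * np.pi * 440 * t)                    # clean target source
# Stand-in for a time-domain separator's output: reference plus residual noise.
estimate = reference + 0.1 * rng.standard_normal(len(t))

R, E = stft(reference), stft(estimate)
mag_err = np.abs(np.abs(E) - np.abs(R))       # amplitude error per TF bin
phase_err = np.abs(np.angle(E * np.conj(R)))  # phase error per TF bin, in [0, pi]

print("mean amplitude error:", round(float(mag_err.mean()), 4))
print("mean phase error (rad):", round(float(phase_err.mean()), 4))
```

A correction module in the spirit of the text would take the estimate's spectrogram as input and learn to shrink exactly these per-bin magnitude and phase deviations.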
Secondly, motivated by the sparsity of music spectrograms, a sparsity-compression-based music source separation framework, SCNet, is proposed. Music is an ultra-wideband signal whose energy is typically concentrated in the low-frequency range; modeling all frequency bands uniformly therefore leads to redundant computation. This work proposes a novel sparsity compression method that progressively compresses the spectrogram to reduce redundant information; to limit the loss of critical information during compression, skip connections and fusion layers are introduced. Experimental results show that, with fewer parameters and faster inference, SCNet improves the average Signal-to-Distortion Ratio (SDR) by 0.9 dB over previous work.

Finally, a language-queried universal audio source separation framework, ASSNet, is proposed to address the difficulty, in audio-queried methods, of obtaining query audio of the same class as the target. This work uses the pre-trained language model T5 to extract text features, which are fed into the separation network together with the audio mixture to extract the target source; compared with methods that use audio as the cue, text is far easier to obtain. Moreover, general audio signals, like music signals, are wideband; encouraged by the effectiveness of the sparsity compression framework in music source separation, this work adopts a similar separation network and introduces a cross-attention mechanism to fuse features from the two modalities. Experimental results show that ASSNet significantly improves separation accuracy, with an SDR gain of 0.84 dB over the baseline model.
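The band-wise sparsity compression idea behind SCNet (and reused in ASSNet's separation network) can be sketched as follows. The band split points and downsampling ratios below are hypothetical choices for illustration, and simple average pooling stands in for the learned compression layers; the skip connections and fusion layers of the actual framework are omitted.

```python
import numpy as np

def band_compress(spec, splits=(0.2, 0.5), ratios=(1, 2, 4)):
    """Split the frequency axis into low/mid/high bands and downsample each
    band by its own ratio, keeping full resolution where energy is densest
    (the low band) and compressing the sparser high band most aggressively."""
    n_freq = spec.shape[0]
    edges = [0] + [int(s * n_freq) for s in splits] + [n_freq]
    out = []
    for (lo, hi), r in zip(zip(edges, edges[1:]), ratios):
        band = spec[lo:hi]
        trim = (hi - lo) // r * r  # drop the remainder so frames pool evenly
        pooled = band[:trim].reshape(-1, r, band.shape[1]).mean(axis=1)
        out.append(pooled)
    return np.concatenate(out, axis=0)

# Magnitude spectrogram stand-in: 512 frequency bins, 100 time frames.
spec = np.abs(np.random.default_rng(1).standard_normal((512, 100)))
compressed = band_compress(spec)
print(spec.shape, "->", compressed.shape)  # frequency axis shrinks, time axis kept
```

Stacking such stages progressively shrinks the frequency axis, which is where the claimed savings in parameters and inference time would come from.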
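The cross-attention fusion of text and audio features in ASSNet can be sketched in its simplest form: each audio frame attends over the text-query token embeddings. This is a single-head toy version with no learned projections; the feature dimensions and the use of T5-style token embeddings are assumptions for illustration, not the actual ASSNet design.

```python
import numpy as np

def cross_attention(audio_feats, text_feats):
    """Single-head cross-attention: audio frames (queries) attend to
    text tokens (keys/values), yielding one text-context vector per frame."""
    d_k = text_feats.shape[-1]
    scores = audio_feats @ text_feats.T / np.sqrt(d_k)       # (T_audio, T_text)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over tokens
    return weights @ text_feats                              # (T_audio, d)

rng = np.random.default_rng(2)
audio = rng.standard_normal((50, 64))  # 50 audio frames, 64-dim features
text = rng.standard_normal((8, 64))    # 8 query-token embeddings (e.g. from T5)
fused = cross_attention(audio, text)
print(fused.shape)
```

In the full model, the fused text context would condition the separation network so that it extracts only the source described by the query.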