Financial fraud disrupts the normal operation of enterprises, seriously harms investors' interests, and undermines the order of the securities market, so it should be punished promptly. However, the China Securities Regulatory Commission (CSRC) investigates and penalizes financial fraud with a considerable lag, which creates a strong practical need for automatic methods of identifying financial fraud. Existing methods rely mainly on structured data such as financial data and make little use of textual information. Their datasets are also small, so the resulting models generalize poorly. To address these problems, this paper combines the financial information and textual information of listed companies to build financial fraud identification models based on basic machine learning models and on deep learning models. The main work and results of this paper are as follows:
1. A labeled fraud dataset and a quantitative feature index system are constructed. Based on the Guotai'an violation-handling database and the definition of financial fraud, a labeled dataset of A-share financial fraud is built by manual screening, and a quantitative feature index system for fraud identification is built according to fraud risk factor theory and fraud triangle theory. After the class imbalance problem is alleviated by combining random undersampling with SMOTE oversampling, six financial fraud identification models based on basic machine learning are compared, and SVM performs best overall. A further comparison of oversampling methods shows that the SVM-based model performs best with SMOTE-ENN oversampling.
2. Statistical text features are extracted and shown to be effective. Statistical features are built from both the Management Discussion and Analysis (MD&A) section of annual reports and the Q&A text of annual performance briefings, and these features are fused with the quantitative features to build feature-fusion fraud identification models based on basic machine learning. Experiments show that the feature-fusion models all achieve higher classification accuracy than the models that use quantitative features alone. This indicates that textual information such as statistical text features provides additional information that improves fraud identification, and that the statistical text features are effective.
3. A deep learning-based financial fraud identification model is proposed. Text features are extracted from the MD&A section of annual reports using HAN, a hierarchical deep learning model, and BERT, a pre-trained model. These text features are fused with the quantitative feature index system and the statistical text features from the preceding work to build the multi-feature fusion models Fin_HAN and Fin_BERT. Fin_BERT is further optimized with RoBERTa, DeBERTa, and a feature aggregation strategy. Finally, the results of these models are combined into the financial fraud identification model Fin_DEEP, which achieves an F2 value of 0.8568. Because there is little existing research on deep learning-based financial fraud identification, this work has a degree of novelty. Fin_DEEP exceeds the F2 value of 0.8361 reported in comparable research and can capture financial fraud fairly accurately.
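The F2 values quoted above are not defined in this abstract. Assuming they follow the standard F-beta definition with beta = 2, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), the metric weights recall more heavily than precision, which suits a setting where missing a fraudulent firm is costlier than raising a false alarm. The sketch below only illustrates that calculation. The function name `f_beta` and the confusion-matrix counts are hypothetical and are not taken from the paper's experiments.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score for the positive (fraud) class from confusion-matrix counts.

    tp: fraud cases correctly flagged
    fp: non-fraud cases wrongly flagged
    fn: fraud cases that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was correctly flagged
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the calculation.
tp, fp, fn = 80, 40, 20
f1 = f_beta(tp, fp, fn, beta=1.0)
f2 = f_beta(tp, fp, fn, beta=2.0)
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")
```

With these illustrative counts recall (0.80) is higher than precision (about 0.67), so F2 comes out above F1. A model that trades some precision for recall is rewarded under F2.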