基于深度学习的互联网金融个人信贷逾期风险的预测研究

Research on Individual Credit Overdue Risk Prediction in Internet Finance Based on Deep Learning

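Author: 王海宁
  • Student ID
    2019******
  • Degree
    Master's
  • Email
    car******com
  • Defense date
    2023.12.07
  • Advisor
    周杰
  • Discipline
    Engineering Management
  • Pages
    96
  • Confidentiality level
    Public
  • Department
    025 Department of Automation
  • Chinese keywords
    逾期风险预测, 个人信贷, 深度学习, 过采样, 模型组合
  • English keywords
    Overdue Risk Prediction, Personal Credit Loan, Deep Learning, Oversampling, Model Combination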

Abstract

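Predicting the overdue risk of personal credit loans in internet finance is a major problem for financial technology companies. In recent years, statistical machine learning has been widely applied to personal credit evaluation. This thesis applies deep learning algorithms to credit overdue risk prediction, using a self-attention mechanism and a capsule network to learn high-order feature crosses automatically, and compares the predictive ability of deep learning algorithms with that of statistical learning algorithms in experiments. It then combines the intermediate outputs of statistical machine learning models with the deep learning model and analyzes how well the combined model predicts overdue risk.

The thesis addresses four topics.

First, descriptive statistical analysis, data cleaning, and feature mining are carried out on two real-world datasets of personal credit overdue performance, LendingClub from the United States and Paipaidai from China. Feature variables are selected with filter methods, and statistical machine learning and deep learning models are then built to explore the feasibility of the deep learning approach.

Second, because personal credit datasets in internet finance are imbalanced and their variable dimensions are correlated, Mahalanobis distance replaces Euclidean distance when selecting KNN neighbors. The resulting improved algorithm, SMOTE-ENC$_m$, is applied to imbalanced datasets that mix continuous and discrete variables.

Third, because the datasets have many variable dimensions, most of them heterogeneous scalars, a capsule network and a self-attention mechanism are used to build the AutoInt\_C model. The model automatically combines low-dimensional features into high-order ones and uses the capsule network to extract information directly at the data level. Experiments compare it with statistical machine learning algorithms and related deep learning algorithms.

Fourth, XGBoost, LightGBM, and AutoInt\_C are combined to build the AutoInt\_CXL model. The experiments show that the combined model reaches the highest AUC and AP on both datasets, which indicates that XGBoost and LightGBM leaf-node features are effective features that can significantly improve the predictive ability of the deep learning model.

The main conclusions are as follows. First, a deep learning model built on a capsule network and a self-attention mechanism learns automatic feature combinations well, captures implicit information, and can be applied to identifying and predicting the overdue risk of personal credit loans in internet finance. Second, combining decision tree models with a deep learning model improves the predictive ability of the deep learning model. Third, for personal credit datasets in internet finance that are imbalanced and mix continuous and discrete variables, Mahalanobis distance is better suited than Euclidean distance to oversampling.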
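The neighbor search behind the second topic can be illustrated with a short sketch. This is not the thesis's SMOTE-ENC$_m$ algorithm, whose details the abstract does not give: it covers only continuous columns, omits the categorical encoding step of SMOTE-ENC, and estimates the covariance from the minority class alone, which is an assumption. The function names `mahalanobis_knn` and `oversample` are hypothetical.

```python
import numpy as np

def mahalanobis_knn(X_min, k):
    """Indices of the k nearest minority-class neighbours of every row,
    ranked by Mahalanobis distance (covariance estimated from X_min)."""
    cov = np.atleast_2d(np.cov(X_min, rowvar=False))
    vi = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X_min[:, None, :] - X_min[None, :, :]       # shape (n, n, d)
    d2 = np.einsum("ijk,kl,ijl->ij", diff, vi, diff)   # squared distances
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

def oversample(X_min, n_new, k=5, seed=0):
    """SMOTE-style interpolation between a sample and one of its neighbours."""
    rng = np.random.default_rng(seed)
    nbrs = mahalanobis_knn(X_min, k)
    base = rng.integers(0, len(X_min), size=n_new)      # random base samples
    pick = nbrs[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                        # interpolation weight
    return X_min[base] + gap * (X_min[pick] - X_min[base])
```

Because the inverse covariance rescales and decorrelates the feature space, two samples that differ along a direction of strong correlation count as closer than under Euclidean distance. This is one plausible reading of why the thesis prefers Mahalanobis distance for correlated credit variables.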

Predicting the overdue risk of personal loans in internet finance is a major issue facing financial technology companies. In recent years, statistical machine learning has been widely used in personal credit evaluation. This thesis applies deep learning algorithms to credit overdue risk prediction, using a self-attention mechanism and a capsule network to learn high-order feature interactions automatically, and compares the performance of deep learning and statistical machine learning algorithms. The intermediate output of the statistical machine learning models is then combined with the deep learning model to analyze the combined model's ability to predict overdue risk.

The main work consists of four parts.

1. Two real-world datasets of personal credit loan overdue performance, LendingClub in the United States and Paipaidai in China, are analyzed. Descriptive statistical analysis, data cleaning, and feature mining are performed, filter methods are used to select relevant features, and statistical machine learning and deep learning models are built to explore the feasibility of deep learning for overdue risk prediction.

2. Because personal credit loan datasets are imbalanced and their variable dimensions are correlated, Mahalanobis distance is used instead of Euclidean distance to select KNN neighbors, and the SMOTE-ENC$_m$ algorithm is proposed to rebalance imbalanced datasets that mix continuous and discrete variables.

3. Because the datasets contain many variables, most of them heterogeneous scalars, a self-attention mechanism and a capsule network are used to build the AutoInt\_C model, which automatically forms high-order combinations of low-dimensional features. The capsule network captures information directly at the data level. Experiments compare the model with statistical machine learning algorithms and related deep learning algorithms.

4. The AutoInt\_CXL model is constructed by combining the XGBoost, LightGBM, and AutoInt\_C models. The experimental results show that the combined model achieves the highest AUC and AP scores on both datasets, indicating that XGBoost and LightGBM leaf-node features are effective and can significantly improve the performance of deep learning models.

The conclusions are as follows. First, a model built on a self-attention mechanism and a capsule network combines features automatically and captures representations at the data level, so deep learning can be applied to overdue risk prediction. Second, combining decision tree models with a deep learning model improves the predictive performance of the deep learning model. Third, personal loan datasets in internet finance contain both continuous and discrete variables and are also imbalanced; for oversampling such datasets, Mahalanobis distance is more suitable than Euclidean distance.
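The leaf-node features mentioned in the fourth part can be illustrated with a toy example. The abstract does not say how the thesis encodes leaf indices or how they enter AutoInt\_C, so the one-hot encoding below is an assumption, and the two hand-written trees stand in for trained XGBoost and LightGBM trees. All names (`TREES`, `leaf_id`, `leaf_features`) are hypothetical.

```python
# A toy tree is a nested tuple (feature_index, threshold, left, right);
# a leaf is an int giving its id within that tree. Real boosted trees
# would be learned from data, these are written by hand for illustration.
TREES = [
    (0, 0.5, 0, (1, 2.0, 1, 2)),  # tree with 3 leaves
    (1, 1.0, 0, 1),               # tree with 2 leaves
]
LEAVES_PER_TREE = [3, 2]

def leaf_id(tree, x):
    """Walk from the root to a leaf and return the leaf id."""
    while not isinstance(tree, int):
        feat, thr, left, right = tree
        tree = left if x[feat] < thr else right
    return tree

def leaf_features(x):
    """Concatenate one one-hot block per tree marking the leaf x lands in."""
    out = []
    for tree, n_leaves in zip(TREES, LEAVES_PER_TREE):
        block = [0] * n_leaves
        block[leaf_id(tree, x)] = 1
        out.extend(block)
    return out

print(leaf_features([0.2, 3.0]))  # [1, 0, 0, 0, 1]
print(leaf_features([0.9, 1.5]))  # [0, 1, 0, 0, 1]
```

Each tree partitions the input space, so the leaf a sample reaches acts as a learned categorical feature summarizing several raw variables at once. In a combined model, vectors like these would be fed to the deep network alongside the original features.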