The rapid development of the Internet and information technology has brought people great convenience, but it has also created the problem of information overload. Recommender systems help online platforms provide personalized services and products to users, thereby effectively alleviating information overload. In recent years, the sequential recommendation problem has received increasing attention. Traditional and deep-learning-based sequential recommendation algorithms treat it as a prediction problem, training models to improve the accuracy of predicting the correct label. However, a user's interaction with a recommender system is essentially a sequential decision-making process: although these methods can capture the dynamic changes in users' interests, they struggle to optimize long-term returns. Reinforcement learning is a class of machine learning algorithms for solving Markov decision processes, in which an agent interacts with an environment to learn a policy that maximizes long-term return. Owing to the strong performance of reinforcement learning on sequential decision problems, applying it to sequential recommendation has become a research hotspot, which is not only of scientific significance for academic research but also of practical value for real recommendation scenarios. However, unlike other typical reinforcement learning environments, directly training an agent with reinforcement learning in a live recommender system incurs risks and can cause real losses.

To address this problem, this thesis designs a pairwise-sample offline reinforcement learning framework for sequential recommendation, which trains the agent to learn from an offline dataset and optimize the recommendation policy. Because recommender systems contain large amounts of implicit feedback, from which users' true likes and dislikes are hard to ascertain, an action-value-based dynamic negative sampling strategy is designed for the framework: it selects negative samples from the dataset according to the action-value function and computes a Bayesian Personalized Ranking (BPR) loss at the level of positive-negative sample pairs. Four state-of-the-art deep learning sequential recommendation models are integrated with the framework and trained and evaluated offline on two real-world e-commerce datasets. The experimental results show that the framework achieves leading results on the relevant metrics, generalizes well across models, and effectively improves the performance of the sequential recommendation models.
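To make the sampling-and-loss mechanism concrete, the following is a minimal sketch rather than the thesis's exact formulation: it assumes \( Q(s_t, a) \) denotes the learned action value of recommending item \( a \) in state \( s_t \), \( a_t^+ \) the logged positive item, \( \mathcal{N}_t \) a candidate set of non-interacted items, and \( \sigma \) the logistic sigmoid.

\[
a_t^- = \arg\max_{a \in \mathcal{N}_t} Q(s_t, a),
\qquad
\mathcal{L}_{\mathrm{BPR}} = -\ln \sigma\!\bigl( Q(s_t, a_t^+) - Q(s_t, a_t^-) \bigr)
\]

Selecting the negative with the highest current action value yields hard negatives that change as the value function is updated, which is what makes the sampling dynamic; the pairwise term then pushes the positive item's value above that of the selected negative.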
For the explicit feedback scenario, this thesis further improves the proposed framework: the user's feedback sequence is incorporated into the state representation, an in-sequence negative sampling strategy is designed on top of the dynamic negative sampling strategy (an illustrative sketch is given below), and a new ranking loss function is designed for this sampling mechanism. A state-of-the-art deep learning sequential recommendation model is integrated with the improved framework and trained and evaluated offline on two real-world rating datasets. The experimental results show that the improved framework delivers excellent recommendation performance, verifying the effectiveness of the improvements for explicit feedback data.

In summary, the pairwise-sample offline reinforcement learning sequential recommendation framework proposed in this thesis trains recommendation models on user-item interaction sequences, combines the advantages of reinforcement learning and negative sampling strategies, and effectively optimizes long-term returns while capturing the dynamic changes in user interests.
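As one purely illustrative reading of the in-sequence negative sampling strategy (the exact form is defined in the thesis body, not this abstract), items in the user's own history whose ratings \( r_i \) fall below a threshold \( \tau \) could serve as additional negatives alongside the dynamically sampled ones:

\[
\mathcal{N}_t^{\mathrm{seq}} = \{\, a_i \mid i < t,\; r_i < \tau \,\},
\qquad
\mathcal{L} = -\ln \sigma\!\bigl( Q(s_t, a_t^+) - Q(s_t, a_t^{\mathrm{dyn}}) \bigr)
- \ln \sigma\!\bigl( Q(s_t, a_t^+) - Q(s_t, a_t^{\mathrm{seq}}) \bigr)
\]

where \( a_t^{\mathrm{dyn}} \) and \( a_t^{\mathrm{seq}} \) are negatives drawn by the dynamic and in-sequence strategies, respectively. Under explicit feedback, low-rated items are known dislikes, so sampling them from within the user's own sequence provides more reliable negatives than non-interacted items alone.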