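This thesis attempts to show that posts on Eastmoney Guba (东方财富网股吧) have predictive power for the returns of individual stocks and of the index. This predictive power covers cross-sectional stock selection and time-series market timing: the former refers to the cross-sectional correlation between signals extracted from the text and individual stock returns, the latter to the time-series correlation between those signals and the returns of individual stocks or the index. To support this claim, the thesis uses information from post text to construct stock-selection and timing factors and tests their idiosyncratic predictive power; it also builds strategies from these factors and evaluates their performance. The main objects of study are the number of posts and post sentiment, together with the intraday high-frequency distribution of the two variables and their correlations with common technical indicators. A web crawler is used to collect more than 5 million posts, and sentiment-dictionary matching is then used to analyze the sentiment of the post text, with a syntactic-dependency correction applied to improve labeling accuracy. The results show that users post mainly during trading hours; text sentiment is approximately normally distributed; and among post types, comment posts have the lowest sentiment scores and research-report posts the highest. The main innovations are: (1) constructing new factors from post text, including intraday time-series correlations between variables, cross-sectional correlations between variables, and sentiment dispersion; (2) constructing factors from information in different time windows and comparing their predictive performance, which offers more angles for studying how the factors work; (3) neutralizing the stock-selection factors more thoroughly in order to test their idiosyncratic stock-selection effect; (4) analyzing the mechanisms behind the post-count and sentiment factors in greater depth; (5) combining factors into strategies to show the overall stock-selection effect of text information. The main conclusions are: (1) stocks with an excessively high number of posts become relatively overvalued in the short term and then fall back quickly, which is consistent with over-attention theory; (2) stocks with more positive post sentiment subsequently perform relatively better, but the effect lasts less than one day; (3) the smaller the disagreement in post sentiment, the better a stock's subsequent performance, although this result depends on the time of trading; (4) the higher the intraday time-series correlation between a stock's post count and its turnover, the worse its subsequent performance; (5) in the time series, the higher the total number of overnight posts across the full sample, the lower the index's overnight return the next day; (6) an unusually low cross-sectional correlation between post counts and turnover may be a leading signal of an index decline; (7) stock-selection strategies built on text information deliver significant alpha beyond common technical information. In sum, the thesis shows that online text information can be used to construct additional factors with idiosyncratic stock-selection and timing effects, and that, given their mechanisms, the effects of these factors may differ across times of the day. Combined, these factors provide additional alpha on top of common technical factors, and this alpha remains significant after risk adjustment.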
This thesis attempts to show that the text of posts on Guba can predict the returns of individual stocks and of the index. Predictive ability here means the cross-sectional and time-series correlation between signals extracted from the text and subsequent returns. To support this claim, the thesis constructs factors from post text and tests their idiosyncratic predictive ability; the factors are then used to build strategies. The text information studied includes the number of posts, the sentiment of posts, the intraday high-frequency distribution of these variables, and the correlations between variables.
The thesis uses a web crawler to collect more than 5 million posts and then labels the sentiment of each post with a sentiment dictionary, using syntactic dependency analysis to improve labeling accuracy. The data show that users post mainly during trading hours; that text sentiment is approximately normally distributed; and that, among post types, ‘comment’ posts have the lowest sentiment and ‘report’ posts the highest.
Innovations include: (1) proposing new factors, such as time-series and cross-sectional correlations between variables; (2) using information from different time periods and testing the difference in performance; (3) neutralizing factors to test their idiosyncratic effect; (4) further analyzing the prediction mechanism of the post-count and sentiment factors; (5) constructing strategies to illustrate the overall stock-selection effect of text information.
Conclusions include: (1) an excessive number of posts signals short-term overvaluation followed by a pullback, in line with over-attention theory; (2) more positive posts are followed by higher returns, but the effect lasts less than one day; (3) at certain trading times, smaller dispersion of post sentiment is followed by higher returns; (4) a higher intraday time-series correlation between the number of posts and turnover is followed by lower returns; (5) a larger total number of overnight posts is followed by a lower overnight return of the index the next day; (6) a low cross-sectional correlation between the number of posts and turnover may be a leading signal of an index decline; (7) text information delivers significant alpha beyond technical information.
In sum, online text information can be used to construct additional factors with idiosyncratic predictive effects, and the effects of these factors may differ across times of the day. Combined, these factors provide additional alpha on top of common technical factors, and this alpha remains significant after risk adjustment.
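As a rough illustration of how one of the factors named above might be computed, the sketch below builds the intraday time-series correlation between post counts and turnover for each stock and then neutralizes it cross-sectionally by taking regression residuals. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: the bar data, the stock labels, the function names `pearson` and `neutralize`, and the choice of log market capitalization as the single control variable are all hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def neutralize(factor, control):
    """Residuals of a univariate OLS regression of factor on control (with intercept)."""
    mf, mc = mean(factor), mean(control)
    var_c = sum((c - mc) ** 2 for c in control)
    if var_c == 0:
        beta = 0.0
    else:
        beta = sum((c - mc) * (f - mf) for c, f in zip(control, factor)) / var_c
    return [f - mf - beta * (c - mc) for f, c in zip(factor, control)]


# Hypothetical intraday bars for three stocks: (post counts, turnover) per bar.
intraday = {
    "A": ([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4]),
    "B": ([4, 3, 2, 1], [0.1, 0.2, 0.3, 0.4]),
    "C": ([1, 3, 2, 4], [0.2, 0.1, 0.4, 0.3]),
}
# Hypothetical control variable (assumption: log market capitalization).
log_cap = {"A": 10.0, "B": 11.0, "C": 12.0}

names = sorted(intraday)
# Raw factor: per-stock intraday correlation between posts and turnover.
raw = [pearson(*intraday[n]) for n in names]
# Neutralized factor: the part of the raw factor not explained by the control.
resid = neutralize(raw, [log_cap[n] for n in names])
```

By construction the residuals have zero cross-sectional mean and zero correlation with the control, which is the sense in which a neutralized factor isolates an idiosyncratic signal. A fuller treatment would regress on several controls at once, for example industry dummies and common technical factors, but which controls the thesis uses is not stated in this abstract.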