[1] Guo Ka, Wang Fang. Semi-Supervised Self-Training Sentiment Classification Algorithm Based on TS-Aug Architecture[J]. Journal of Nanjing Normal University (Engineering and Technology), 2024, 24(01): 45-52. [doi:10.3969/j.issn.1672-1292.2024.01.007]

Semi-Supervised Self-Training Sentiment Classification Algorithm Based on TS-Aug Architecture

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1672-1292]

Volume:
24
Issue:
2024, No. 01
Pages:
45-52
Column:
Computer Science and Technology
Publication Date:
2024-03-15

Article Info

Title:
Semi-Supervised Self-Training Sentiment Classification Algorithm Based on TS-Aug Architecture
Article ID:
1672-1292(2024)01-0045-08
Author(s):
Guo Ka, Wang Fang
(School of Information and Mathematics, Anhui International Studies University, Hefei 231200, China)
Keywords:
few-shot learning; semi-supervised training; data augmentation; sentiment classification
CLC Number:
TP18; TP391
DOI:
10.3969/j.issn.1672-1292.2024.01.007
Document Code:
A
Abstract:
With the spread of online teaching resources, the volume of evaluation text data has grown steadily. Traditional supervised text classification depends heavily on labeled data, requiring a sufficient amount of high-quality samples to achieve good results. Because labeled evaluation texts for online teaching resources are difficult to obtain and uneven in quality, this task has become increasingly hard. To address this difficulty, this paper proposes TS-Aug, a semi-supervised self-training scheme: by adding unlabeled data and training on pseudo-labels, the sample set can be greatly expanded under aggressive data augmentation while mitigating the overfitting risk that augmentation introduces. Training proceeds in three stages: initial supervised training on the labeled data with a weak augmentation strategy, semi-supervised training on the unlabeled data with a strong augmentation strategy, and a final supervised fine-tuning pass on the labeled data. On a self-built dataset of online course comments, the scheme raises the classification F1-score from 0.88 to 0.95, indicating that TS-Aug has good application prospects in text classification tasks.
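To make the three-stage procedure concrete, below is a minimal Python sketch of the pipeline the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: the HashingVectorizer-plus-SGDClassifier model, the weak_augment and strong_augment functions, the 0.9 confidence threshold, and the epoch count are all placeholders chosen to keep the example self-contained (the abstract does not specify the underlying classifier or the exact augmentation operations).

```python
# Minimal sketch of the three-stage TS-Aug training scheme (assumptions noted
# inline). Requires scikit-learn >= 1.1 for loss="log_loss".
import random

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

CONF_THRESHOLD = 0.9  # assumed pseudo-label confidence cutoff, not from the paper
EPOCHS = 5            # assumed number of passes per stage

vec = HashingVectorizer(n_features=2**18)             # stateless: features stay fixed across stages
clf = SGDClassifier(loss="log_loss", random_state=0)  # log loss enables predict_proba


def weak_augment(text):
    """Weak augmentation placeholder: drop one random word."""
    words = text.split()
    if len(words) > 3:
        words.pop(random.randrange(len(words)))
    return " ".join(words)


def strong_augment(text):
    """Stand-in for the paper's strong augmentation (e.g. EDA-style edits or
    back-translation, refs [12-13]); shuffling plus word dropout keeps the
    sketch dependency-free."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words[: max(3, (2 * len(words)) // 3)])


def fit_epochs(texts, labels, classes):
    """Continue incremental training of the single shared model."""
    for _ in range(EPOCHS):
        clf.partial_fit(vec.transform(texts), labels, classes=classes)


def ts_aug_train(labeled_texts, labels, unlabeled_texts):
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Stage 1: supervised initialization on labeled data plus weak augmentation.
    stage1_x = list(labeled_texts) + [weak_augment(t) for t in labeled_texts]
    stage1_y = np.concatenate([labels, labels])
    fit_epochs(stage1_x, stage1_y, classes)

    # Stage 2: pseudo-label confident unlabeled samples and train on their
    # strongly augmented versions -- the greatly expanded sample set.
    proba = clf.predict_proba(vec.transform(unlabeled_texts))
    keep = proba.max(axis=1) >= CONF_THRESHOLD
    pseudo_x = [strong_augment(t) for t, k in zip(unlabeled_texts, keep) if k]
    pseudo_y = clf.classes_[proba.argmax(axis=1)][keep]
    if len(pseudo_x) > 0:
        fit_epochs(pseudo_x, pseudo_y, classes)

    # Stage 3: fine-tune on the original clean labels to correct drift
    # introduced by noisy pseudo-labels.
    fit_epochs(list(labeled_texts), labels, classes)
    return clf
```

With a BERT-style neural classifier [17], the same three stages would continue gradient updates on a single set of weights, which is closer to what the abstract describes; the staging logic itself is unchanged.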

References:

[1] YU Y, FENG L, WANG G G, et al. A semi-supervised few-shot learning model based on pseudo-labels[J]. Acta Electronica Sinica, 2019, 47(11): 2284-2291.
[2] LEE D H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks[C]//Proceedings of the ICML 2013 Workshop on Challenges in Representation Learning. Atlanta, USA: ICML, 2013.
[3] FINI E, ASTOLFI P, ALAHARI K, et al. Semi-supervised learning made simple with self-supervised clustering[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE, 2023.
[4] CHEN B X, JIANG J G, WANG X M, et al. Debiased self-training for semi-supervised learning[J/OL]. arXiv Preprint arXiv:2202.07136, 2022.
[5] BAO Z Q, WANG L H. Semi-supervised deep subspace clustering based on pseudo-label correction[J]. Journal of Yantai University (Natural Science and Engineering Edition), 2023, 36(4): 442-450.
[6] YANG H F. Contrastive self-supervised learning as a strong baseline for unsupervised hashing[C]//Proceedings of the 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP). Shanghai, China: IEEE, 2022.
[7] DUAN Y, QI L, WANG L, et al. RDA: Reciprocal distribution alignment for robust semi-supervised learning[C]//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: ECCV, 2022.
[8] LIAO L X, FENG L, LIU X L, et al. Semi-supervised few-shot learning method based on information alignment[J]. Computer Engineering and Design, 2023, 44(2): 582-589.
[9] SOHN K, BERTHELOT D, LI C L, et al. FixMatch: Simplifying semi-supervised learning with consistency and confidence[J]. Advances in Neural Information Processing Systems, 2020, 33: 596-608.
[10] SONG Y, XIAO Y Z, SONG X L. Unsupervised feature selection algorithm based on pseudo-label regression and manifold regularization[J]. Journal of Nanjing University (Natural Science), 2023, 59(2): 263-272.
[11] XIE Q Z, DAI Z H, HOVY E, et al. Unsupervised data augmentation for consistency training[J]. Advances in Neural Information Processing Systems, 2020, 33: 6256-6268.
[12] WEI J, ZOU K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[J/OL]. arXiv Preprint arXiv:1901.11196, 2019.
[13] SUGIYAMA A, YOSHINAGA N. Data augmentation using back-translation for context-aware neural machine translation[C]//Proceedings of the 4th Workshop on Discourse in Machine Translation (DiscoMT 2019). Hong Kong, China: DiscoMT, 2019.
[14] TARVAINEN A, VALPOLA H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: NIPS, 2017.
[15] REN Z Z, YEH R A, SCHWING A G. Not all unlabeled data are equal: Learning to weight data in semi-supervised learning[J/OL]. arXiv Preprint arXiv:2007.01293, 2020.
[16] SUN Z J, FAN C, SUN X F, et al. Neural semi-supervised learning for text classification under large-scale pretraining[J/OL]. arXiv Preprint arXiv:2011.08626, 2020.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J/OL]. arXiv Preprint arXiv:1810.04805, 2018.
Memo:
Received: 2023-07-01.
Funding: Natural Science Research Project of Anhui Universities (KJ2021A1197), Anhui Provincial Quality Engineering Curriculum-Based Ideological and Political Teaching Team Project (2020kcszjxtd34), and Anhui International Studies University Quality Engineering Teaching Innovation Team Project (aw2023jxcxtd06).
Corresponding author: Guo Ka, lecturer; research interests: deep learning and artificial intelligence. E-mail: 409337713@qq.com
Last Update: 2024-03-15