[1] Li Xiaopeng, Yin Chuanhuan, Chao Meng. Study on Log Anomaly Detection Based on RoBERTa and Hypersphere Space[J]. Journal of Nanjing Normal University (Engineering and Technology), 2024, 24(04): 17-27. [doi:10.3969/j.issn.1672-1292.2024.04.002]

Study on Log Anomaly Detection Based on RoBERTa and Hypersphere Space

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
24
Issue:
2024, No. 04
Pages:
17-27
Section:
Computer Science and Technology
Publication date:
2024-12-15

Article Info

Title:
Study on Log Anomaly Detection Based on RoBERTa and Hypersphere Space
Article ID:
1672-1292(2024)04-0017-11
Author(s):
Li Xiaopeng 1,2; Yin Chuanhuan 1,2; Chao Meng 3
(1. School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China)
(2. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing 100044, China)
(3. China Life Insurance Company Shanghai Data Center, Shanghai 201201, China)
Keywords:
log anomaly detection; RoBERTa; Transformer; hypersphere space
CLC number:
TP391
DOI:
10.3969/j.issn.1672-1292.2024.04.002
Document code:
A
Abstract:
By monitoring and analyzing large volumes of log data, log anomaly detection can promptly identify abnormal behaviors such as intrusion attacks and malicious operations, making it a key tool for modern system administrators. To address the scarcity of labeled data, this paper proposes an unsupervised log anomaly detection algorithm based on RoBERTa and hypersphere space. First, to fully learn the semantic features of log text, a multi-level semantic extraction network is proposed that learns the contextual information of logs at multiple levels: the robustly optimized BERT pretraining approach (RoBERTa) is first pretrained on a log corpus, and then RoBERTa and a Transformer encoder extract the semantic features of log entries at the word level and the sentence level, respectively. Second, to increase class separation and capture the normal patterns of logs, a hypersphere loss is introduced in the feature space. As the model is optimized using only normal samples for training, the feature representations of normal samples gather around the center of the hypersphere space while anomalous samples lie far from it, which separates the anomalies. Finally, the model achieves F1 scores of 0.94 and 0.93 on the HDFS and BGL log datasets, respectively, verifying its effectiveness.
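
The hypersphere idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact form of the loss, so the code below assumes a Deep SVDD-style one-class objective: the mean squared distance of feature vectors to a fixed center, with the squared distance reused as the anomaly score. The function names, the toy two-dimensional features, and the 0.95 quantile threshold are all hypothetical and are not taken from the paper; in the proposed model the features would come from the RoBERTa and Transformer encoder stages.

```python
import math

def center_of(features):
    """Mean of the normal-sample features, used here as the hypersphere center (assumption)."""
    n, dim = len(features), len(features[0])
    return [sum(vec[i] for vec in features) / n for i in range(dim)]

def squared_distance(vec, center):
    """Squared Euclidean distance from one feature vector to the center."""
    return sum((v - c) ** 2 for v, c in zip(vec, center))

def hypersphere_loss(features, center):
    """Assumed one-class objective: mean squared distance to the center."""
    return sum(squared_distance(v, center) for v in features) / len(features)

def anomaly_scores(features, center):
    """Score each sample by its squared distance to the center."""
    return [squared_distance(v, center) for v in features]

def threshold_from_normal(scores, quantile=0.95):
    """Pick a squared radius as a quantile of normal-sample scores (hypothetical rule)."""
    ordered = sorted(scores)
    idx = max(0, min(len(ordered) - 1, math.ceil(quantile * len(ordered)) - 1))
    return ordered[idx]

# Toy features standing in for encoder outputs (hypothetical values).
normal = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [1.0, 1.2]]
test = [[1.05, 0.95], [4.0, -2.0]]

center = center_of(normal)
radius_sq = threshold_from_normal(anomaly_scores(normal, center))
# A sample is flagged as anomalous when it falls outside the hypersphere.
flags = [score > radius_sq for score in anomaly_scores(test, center)]
print(flags)
```

In this toy setting the first test vector lies inside the sphere fitted to the normal samples and the second lies far outside it. A trainable version would minimize `hypersphere_loss` with respect to the encoder parameters rather than only measuring it.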

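Memo:
Received: 2024-05-12.
Funding: National Natural Science Foundation of China (U23B2062).
Corresponding author: Yin Chuanhuan, Ph.D., Associate Professor; research interests: deep learning, network security, anomaly detection, data mining. E-mail: chyin@bjtu.edu.cn
Last Update: 2024-12-15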