[1] LIN Qiaomin, XU Jianzhen, XU Dihua, et al. Research on Bayes-Based Spam Filtering[J]. Journal of Nanjing Normal University (Engineering and Technology), 2005, 05(04): 061-64.

Research on Bayes-Based Spam Filtering

Journal of Nanjing Normal University (Engineering and Technology) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
05
Issue:
2005, No. 04
Pages:
061-64
Publication date:
2005-12-30

Article Info

Title:
Research on Bayes-Based Spam Filtering
Author(s):
LIN Qiaomin(1), XU Jianzhen(1), XU Dihua(1), WANG Cheng(2)
1. Campus Network Center, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China;
2. Department of Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China
Keywords:
spam; text categorization; vector space model; Bayes algorithm
CLC number:
TP393.098
Abstract:
E-mail communication between people has been greatly affected by the spam problem. This paper derives and analyzes the Naive Bayes categorization algorithm and validates its application in spam filtering experiments. First, the paper introduces text categorization techniques, including the commonly used vector space model (VSM) for representing text and feature extraction methods such as the information gain and document frequency based methods; the behavior of the information gain method in the experiments is also explained. Second, it derives and analyzes Naive Bayes under the assumption that features are mutually independent. Then, using previously collected mails as the corpus and k-fold cross-validation, it applies Naive Bayes in experiments: the class prior probabilities and the class-conditional probabilities of the feature terms are computed from the training corpus, and the mails in the test corpus are categorized on that basis. Finally, the experimental results are reported using two indicators, precision and recall.
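The procedure summarized in the abstract (estimate class priors and class-conditional term probabilities from a training set, classify held-out mails, and report precision and recall under k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the multinomial event model, the add-one (Laplace) smoothing, the interleaved fold assignment, the function names `train_nb`, `classify`, and `k_fold_eval`, and the toy corpus are all assumptions made for illustration, and the feature selection step (information gain or document frequency) is omitted.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors P(c) and smoothed term probabilities P(t|c)."""
    classes = sorted(set(labels))
    vocab = {t for doc in docs for t in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        counts = Counter(t for doc, y in zip(docs, labels) if y == c for t in doc)
        total = sum(counts.values())
        # Add-one (Laplace) smoothing: an assumption of this sketch.
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond

def classify(model, doc):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    prior, cond = model
    def log_posterior(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in doc if t in cond[c])
    return max(prior, key=log_posterior)

def k_fold_eval(docs, labels, k, positive="spam"):
    """Pool predictions over k folds; return (precision, recall) for `positive`."""
    tp = fp = fn = 0
    for fold in range(k):
        # Interleaved split: every k-th document goes to the current test fold.
        test_idx = set(range(fold, len(docs), k))
        train_idx = [i for i in range(len(docs)) if i not in test_idx]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            pred = classify(model, docs[i])
            if pred == positive and labels[i] == positive:
                tp += 1
            elif pred == positive:
                fp += 1
            elif labels[i] == positive:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy corpus (invented for illustration): each mail is a list of tokens.
DOCS = [
    ["free", "money", "win"], ["meeting", "report", "monday"],
    ["win", "free", "prize"], ["lunch", "meeting", "team"],
    ["money", "prize", "free"], ["report", "team", "lunch"],
    ["free", "win", "money"], ["meeting", "lunch", "report"],
]
LABELS = ["spam", "ham"] * 4

precision, recall = k_fold_eval(DOCS, LABELS, k=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and pooling the counts across folds before computing precision and recall is one of several reasonable ways to aggregate cross-validation results.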



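Memo:
Funding: Natural Science Foundation of Jiangsu Province (01KJD520005).
Author biography: LIN Qiaomin (1979-), teaching assistant, mainly engaged in research on computer network security and embedded configurable real-time operating systems. E-mail: qmlin@njupt.edu.cn
Last Update: 2013-04-29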