Research on Bayes-Based Spam Filtering
LIN Qiaomin(1), XU Jianzhen(1), XU Dihua(1), WANG Cheng(2)
1. Campus Network Center, Nanjing University of Posts and Telecommunications, Jiangsu Nanjing 210003, China; 2. Department of Information Engineering, Nanjing University of Posts and Telecommunications, Jiangsu Nanjing 210003, China
spam; text categorization; vector space model; Bayes algorithm
E-mail communication has been greatly affected by the spam problem. In this paper, the Naive Bayesian categorization algorithm is derived and analyzed, and its application to spam filtering is validated experimentally. First, the paper introduces text categorization techniques, including the commonly used vector space model for representing text and feature extraction methods such as information gain and the document-frequency-based method. The behavior of the information gain method in the experiments is also explained. Second, the paper derives and analyzes the Naive Bayesian classifier under the premise that features are independent of one another. Then, using previously collected mails as the corpus, it applies k-fold cross-validation and runs the Naive Bayesian classifier in experiments. Based on the category probabilities and the probabilities of terms belonging to each category, both obtained from the training corpus, the paper categorizes each mail in the test corpus. Finally, the experimental results are reported using two indicators, precision and recall.
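The classification and evaluation steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial term-count model with add-one (Laplace) smoothing, which the abstract does not specify, and it omits feature selection and k-fold splitting. The function names (`train`, `classify`, `precision_recall`), the labels "spam" and "ham", and the toy mails are all hypothetical.

```python
import math
from collections import Counter


def train(docs):
    """Estimate category priors and per-category term counts.

    docs is a list of (tokens, label) pairs from the training corpus.
    """
    label_counts = Counter(label for _, label in docs)
    term_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    priors = {label: n / len(docs) for label, n in label_counts.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the category with the highest log-posterior score.

    Terms are treated as independent given the category (the naive
    Bayes premise). Add-one smoothing is an assumption of this sketch.
    """
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(term_counts[label].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                count = term_counts[label][t]
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall(predicted, actual, positive="spam"):
    """Compute precision and recall for the positive category."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy corpus, invented for illustration only.
train_docs = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
test_docs = [
    (["win", "cheap", "money"], "spam"),
    (["project", "schedule"], "ham"),
]

model = train(train_docs)
predicted = [classify(tokens, *model) for tokens, _ in test_docs]
actual = [label for _, label in test_docs]
precision, recall = precision_recall(predicted, actual)
```

In a k-fold setting, `train` and `classify` would be run once per fold, with each fold serving as the test corpus in turn, and precision and recall averaged across folds.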


