[1] Shen Zhibin, Bai Qingyuan. Improvement of Feature Weighting Algorithm in Text Classification[J]. Journal of Nanjing Normal University (Engineering and Technology), 2008, 08(04): 095-98.

Improvement of Feature Weighting Algorithm in Text Classification

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume: 08
Issue: 2008, No. 04
Pages: 095-98
Publication date: 2008-12-30

Article Info

Title:
Improvement of Feature Weighting Algorithm in Text Classification
Author(s):
Shen Zhibin (沈志斌), Bai Qingyuan (白清源)
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002, China
Keywords:
text classification; feature weight; TFIDF; class difference; BOR-TFIDF
CLC number:
TP391.1
Abstract:
TFIDF is a commonly used method for weighting the feature terms of a document. The method is simple, but it ignores how a feature term is distributed across classes, so it cannot truly reflect the term's contribution to distinguishing each class. To address this shortcoming, we propose BOR-TFIDF, which readjusts each feature term's ability to discriminate among classes, i.e., modifies each term's weight, and we use a classifier to verify its validity. The method outperforms the original TFIDF algorithm, and the experiments show that the improved strategy is feasible.
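
The shortcoming described in the abstract can be illustrated with a minimal sketch of standard TF-IDF. This is not the paper's BOR-TFIDF, whose formula the abstract does not give. The toy corpus, the class labels, and the helper names `idf`, `tfidf`, and `class_doc_freq` are all hypothetical, and the IDF variant used (log(N / df) with raw term counts) is one common choice assumed for illustration.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: log(N / df), df = number of docs containing term."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, tokens, docs):
    """Raw term count in one document times the corpus-wide IDF."""
    return Counter(tokens)[term] * idf(term, docs)

def class_doc_freq(term, docs, labels):
    """Number of documents in each class that contain the term."""
    counts = Counter()
    for tokens, label in zip(docs, labels):
        if term in tokens:
            counts[label] += 1
    return dict(counts)

# Hypothetical toy corpus: four tokenized documents in two classes.
docs = [
    ["goal", "match", "team"],      # sports
    ["goal", "team", "report"],     # sports
    ["market", "stock", "report"],  # finance
    ["market", "bank", "loan"],     # finance
]
labels = ["sports", "sports", "finance", "finance"]

# "goal" occurs only in sports documents; "report" occurs once in each class.
# Both appear in two of four documents, so plain IDF cannot tell them apart.
idf_goal = idf("goal", docs)
idf_report = idf("report", docs)
goal_by_class = class_doc_freq("goal", docs, labels)
report_by_class = class_doc_freq("report", docs, labels)
```

Here `idf_goal` and `idf_report` are equal even though `goal_by_class` shows a term concentrated in one class and `report_by_class` shows a term spread evenly over both. A class-aware weighting scheme would aim to separate these two cases.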



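Memo:
Funding: Supported by the Scientific Research Foundation for Returned Overseas Chinese Scholars of the Ministry of Education, the Open Project Fund of the Institute of Software, Chinese Academy of Sciences (SYSKF0701), the Science and Technology Development Fund of Fuzhou University (2005-XQ-13), and the Fund of the Education Department of Fujian Province (JB06023).
Corresponding author: Bai Qingyuan, professor; research interests: database technology and data mining. E-mail: baiqy@fzu.edu.cn
Last Update: 2013-04-24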