[1] Sun Chunhong, Yang Ming. A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information[J]. Journal of Nanjing Normal University (Engineering and Technology), 2008, 08(03): 066-70.

A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information

Journal of Nanjing Normal University (Engineering and Technology) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
Vol. 08
Issue:
2008, No. 03
Pages:
066-70
Column:
Publication date:
2008-09-30

Article Info

Title:
A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information
Author(s):
Sun Chunhong; Yang Ming
School of Mathematics and Computer Science, Nanjing Normal University, Nanjing 210097, Jiangsu, China
Keywords:
similarity measurement of Web pages; VSM; distribution information; Web page categorization
CLC number:
TP391.1
Abstract:
Similarity measurement between Web pages is a key issue in Web page categorization, and an effective similarity measure can improve categorization accuracy. The traditional Vector Space Model (VSM) uses only the frequency of each selected word in a page and makes no effective use of distribution information, such as the average position of a word and its spread, which limits categorization accuracy. In this paper, the mean and variance of each word's positions in a page are computed and introduced into the similarity computation, yielding a novel similarity measure for Web pages that directly embeds distribution information. Because it uses both word frequency and word distribution, the method improves and extends the classical similarity measurement strategies for Web pages. Experimental results show that the proposed method is effective and feasible.
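The idea summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the paper's formula, so everything specific here is an assumption made for illustration: the function names, the scaling of positions to [0, 1], and the exp(-alpha * gap) weighting are hypothetical and are not the authors' measure. The sketch only shows how per-word position means and variances could be folded into a frequency-based cosine similarity.

```python
import math
from collections import defaultdict


def term_stats(tokens):
    """Map each term to (count, mean position, position variance).

    Positions are scaled to [0, 1] so pages of different lengths are
    comparable (an assumption of this sketch, not taken from the paper).
    """
    positions = defaultdict(list)
    n = len(tokens)
    for i, tok in enumerate(tokens):
        positions[tok].append(i / (n - 1) if n > 1 else 0.0)
    stats = {}
    for term, pos in positions.items():
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        stats[term] = (len(pos), mean, var)
    return stats


def _norm(stats):
    """Euclidean norm of the term-frequency vector."""
    return math.sqrt(sum(count * count for count, _, _ in stats.values()))


def cosine_tf(a, b):
    """Classical VSM cosine similarity using term frequency only."""
    dot = sum(a[t][0] * b[t][0] for t in set(a) & set(b))
    na, nb = _norm(a), _norm(b)
    return dot / (na * nb) if na and nb else 0.0


def distribution_aware_similarity(a, b, alpha=1.0):
    """Cosine similarity whose per-term products are down-weighted when the
    term's mean position or spread differs between the two pages.

    The exp(-alpha * gap) weight is a hypothetical choice for illustration.
    """
    dot = 0.0
    for t in set(a) & set(b):
        ca, mean_a, var_a = a[t]
        cb, mean_b, var_b = b[t]
        gap = abs(mean_a - mean_b) + abs(math.sqrt(var_a) - math.sqrt(var_b))
        dot += ca * cb * math.exp(-alpha * gap)
    na, nb = _norm(a), _norm(b)
    return dot / (na * nb) if na and nb else 0.0


# Two toy pages with identical word counts but different word placement.
page_a = term_stats("web page web text".split())
page_b = term_stats("text page web web".split())
tf_only = cosine_tf(page_a, page_b)
with_distribution = distribution_aware_similarity(page_a, page_b)
```

In this toy example the two pages contain the same words with the same counts, so the frequency-only cosine cannot tell them apart, while the distribution-aware score is lower because the words occur at different positions. With this particular weighting the score never exceeds the frequency-only cosine, and a page compared with itself keeps a score of 1.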

Memo:
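Funding: Supported by the National Natural Science Foundation of China (40771163).
Corresponding author: Yang Ming, Professor, Ph.D. Research interests: data mining, machine learning, and rough set theory and its applications. E-mail: myang@njnu.edu.cn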
Last Update: 2013-04-24