A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information
Sun Chunhong, Yang Ming
School of Mathematics and Computer Science, Nanjing Normal University, Nanjing 210097, China
similarity measurement of Web pages; VSM; distribution information; Web page categorization
Similarity measurement for Web pages is a key issue in Web page categorization. Effective similarity measurement strategies can improve the accuracy of Web page classification. The traditional Vector Space Model (VSM) uses only the frequency of each selected word in the pages and makes no effective use of distribution information such as the average position and bias of the word, which has a considerable adverse impact on the accuracy of page classification. In this paper, the means and variances of the words in the document are therefore computed and applied in the similarity measurement, and a novel similarity measurement for Web pages that directly embeds this distribution information is presented. By properly incorporating distribution information into the similarity measurement of Web pages, the approach improves and extends the classical similarity measurement strategies for Web pages. Experimental results show that the proposed method is efficient and flexible.
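
The abstract gives no formula, so the following is only a minimal sketch of how distribution information could be embedded into a VSM-style similarity. It assumes that "means and variances of the words" refers to the mean and variance of each word's normalized position in the page. The function names, the `alpha` parameter, and the exponential damping factor are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def term_stats(tokens, vocabulary):
    """Return {term: (frequency, mean position, position variance)}.

    Positions are normalized to [0, 1] so that pages of different
    lengths are comparable (an assumption of this sketch).
    """
    positions = defaultdict(list)
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok in vocabulary:
            positions[tok].append(i / max(n - 1, 1))
    stats = {}
    for term, pos in positions.items():
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        stats[term] = (len(pos), mean, var)
    return stats


def distribution_similarity(tokens_a, tokens_b, vocabulary, alpha=1.0):
    """Cosine similarity over term frequencies, where each shared term's
    contribution is damped by how differently the term is distributed
    in the two pages.

    The damping factor exp(-alpha * gap) is a hypothetical choice,
    not the paper's formula. With alpha = 0 the function reduces to
    the plain frequency-based VSM cosine similarity.
    """
    a = term_stats(tokens_a, vocabulary)
    b = term_stats(tokens_b, vocabulary)
    norm_a = math.sqrt(sum(f * f for f, _, _ in a.values()))
    norm_b = math.sqrt(sum(f * f for f, _, _ in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = 0.0
    for term in a.keys() & b.keys():
        fa, mean_a, var_a = a[term]
        fb, mean_b, var_b = b[term]
        # Gap between mean positions plus gap between position spreads.
        gap = abs(mean_a - mean_b) + abs(math.sqrt(var_a) - math.sqrt(var_b))
        dot += fa * fb * math.exp(-alpha * gap)
    return dot / (norm_a * norm_b)
```

In this sketch, two pages with identical word frequencies but different word placement score lower than they would under plain VSM, which is the kind of distinction the abstract attributes to distribution information.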


