A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information

Journal of Nanjing Normal University (Engineering and Technology Edition) [ISSN:1006-6977/CN:61-1281/TN]

Issue:
2008, No. 03
Page:
66-70
Research Field:
Publishing date:

Info

Title:
A Novel Similarity Measurement for Web Pages by Incorporating Distribution Information
Author(s):
Sun Chunhong, Yang Ming
School of Mathematics and Computer Science, Nanjing Normal University, Nanjing 210097, China
Keywords:
similarity measurement of Web pages; VSM; distribution information; Web page categorization
PACS:
TP391.1
DOI:
-
Abstract:
Similarity measurement for Web pages is a key issue in Web page categorization, and an effective similarity measurement strategy can improve the accuracy of Web page classification. The traditional Vector Space Model (VSM) uses only the frequency of each selected word in the pages and makes no use of distribution information such as the average position and the bias of the word, which limits the accuracy of page classification. In this paper, therefore, the means and variances of the words in the document are computed and applied to the similarity measurement, and a novel similarity measurement method for Web pages that directly embeds this distribution information is presented. The approach improves and extends the classical similarity measurement strategies for Web pages by properly incorporating the distribution information into them. Experimental results show that the proposed method is efficient and flexible.
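The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so everything below is an assumption made for illustration: the function names (`word_stats`, `cosine_tf`, `distribution_similarity`), the reading of "mean" and "variance" as statistics of a word's normalized token positions, and the damping factor `1 / (1 + gap)` are all hypothetical and are not the authors' method.

```python
import math
from collections import defaultdict


def word_stats(tokens):
    """Return {word: (frequency, mean_position, position_variance)}.

    Positions are normalized to [0, 1] so that pages of different
    lengths are comparable (an assumption of this sketch).
    """
    n = len(tokens)
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i / (n - 1) if n > 1 else 0.0)
    stats = {}
    for word, pos in positions.items():
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        stats[word] = (len(pos), mean, var)
    return stats


def _norm(stats):
    """Euclidean norm of the raw term-frequency vector."""
    return math.sqrt(sum(freq * freq for freq, _, _ in stats.values()))


def cosine_tf(tokens_a, tokens_b):
    """Classical VSM baseline: cosine similarity over term frequencies."""
    a, b = word_stats(tokens_a), word_stats(tokens_b)
    dot = sum(a[w][0] * b[w][0] for w in a.keys() & b.keys())
    denom = _norm(a) * _norm(b)
    return dot / denom if denom else 0.0


def distribution_similarity(tokens_a, tokens_b):
    """Cosine-style similarity damped by distribution mismatch.

    Hypothetical weighting: each shared word's contribution is divided
    by 1 + gap, where gap is the difference in mean position plus the
    difference in standard deviation of position.
    """
    a, b = word_stats(tokens_a), word_stats(tokens_b)
    dot = 0.0
    for w in a.keys() & b.keys():
        fa, mean_a, var_a = a[w]
        fb, mean_b, var_b = b[w]
        gap = abs(mean_a - mean_b) + abs(math.sqrt(var_a) - math.sqrt(var_b))
        dot += fa * fb / (1.0 + gap)
    denom = _norm(a) * _norm(b)
    return dot / denom if denom else 0.0


# Two toy "pages" with the same words in reverse order.
page_a = "web page news sport".split()
page_b = "sport news page web".split()
print(cosine_tf(page_a, page_b))
print(distribution_similarity(page_a, page_b))
```

On this toy pair the frequency-only baseline cannot tell the pages apart, because both contain the same words equally often, while the position-aware variant gives a lower score because the words occur in different places. That is the kind of information the abstract says the traditional VSM leaves unused.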

Last Update: 2013-04-24