Information Extraction for Web Forum Based on Repeated Pattern

Journal of Nanjing Normal University (Engineering and Technology Edition) [ISSN:1006-6977/CN:61-1281/TN]

Issue:
2010, Issue 03
Page:
74-77
Research Field:
Publishing date:

Info

Title:
Information Extraction for Web Forum Based on Repeated Pattern
Author(s):
Han Pu (1, 2), Wang Ze (2)
1. Department of Information Management, Nanjing University, Nanjing 210093, China; 2. Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China
Keywords:
repeated pattern; forum extraction; information extraction
PACS:
TP393.094
DOI:
-
Abstract:
To address the limitations of current methods for extracting web forum information, this paper introduces an information extraction method for web forums based on a repeated pattern discovery algorithm. The method first uses the SgmlReader parser to convert the HTML document to an XHTML document, then calculates the similarity between the DOM trees in the XHTML document, and automatically finds the repeated pattern in the forum pages. The method solves the problem that people have to manually locate the repeated pattern or manually analyze page source code to derive the extraction rules. The experimental results show that the method has high accuracy, good universality, and practicality.
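The pipeline described in the abstract (parse the page into a tree, compare sibling subtrees, keep the run of siblings that repeat) can be illustrated with a minimal sketch. It is not the authors' implementation: Python's standard `html.parser` stands in for the SgmlReader-to-XHTML step, the similarity measure (a Dice coefficient over tag paths) and the 0.8 threshold are assumptions, since the abstract does not specify them, and all names (`TreeBuilder`, `similarity`, `find_repeated_pattern`, `SAMPLE`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree (assumed stand-in for SgmlReader + XHTML DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tag_paths(node, prefix=""):
    """Multiset of tag paths in the subtree, relative to the subtree root."""
    path = prefix + "/" + node.tag
    paths = Counter([path])
    for child in node.children:
        paths += tag_paths(child, path)
    return paths


def similarity(a, b):
    """Dice coefficient over tag paths (assumed measure, in [0, 1])."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


def size(node):
    return 1 + sum(size(child) for child in node.children)


def find_repeated_pattern(root, threshold=0.8, min_repeats=2):
    """Return the largest run of adjacent, mutually similar sibling subtrees."""
    best, best_score = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        run = node.children[:1]
        for prev, cur in zip(node.children, node.children[1:]):
            run = run + [cur] if similarity(prev, cur) >= threshold else [cur]
            score = sum(size(n) for n in run)
            if len(run) >= min_repeats and score > best_score:
                best, best_score = run, score
    return best


def text_of(node):
    """Text fragments of a subtree in document order."""
    parts = list(node.text)
    for child in node.children:
        parts.extend(text_of(child))
    return parts


SAMPLE = """
<html><body>
<div id="header"><a href="/">Home</a></div>
<div class="post"><span>alice</span><p>First post</p></div>
<div class="post"><span>bob</span><p>Reply<br>more</p></div>
<div class="post"><span>carol</span><p>Another reply</p></div>
<div id="footer"><a href="/about">About</a></div>
</body></html>
"""

records = find_repeated_pattern(parse(SAMPLE))
for record in records:
    print(text_of(record))
```

On this hand-made sample the three post blocks are selected and the header and footer are skipped, because each post shares most of its tag paths with its neighbours while the navigation blocks do not. Real forum pages would likely need a more tolerant measure and a tuned threshold.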

Last Update: 2013-04-02