[1] Han Pu, Wang Ze. Information Extraction for Web Forum Based on Repeated Pattern[J]. Journal of Nanjing Normal University (Engineering and Technology), 2010, 10(03): 074-77.

Information Extraction for Web Forum Based on Repeated Pattern

Journal of Nanjing Normal University (Engineering and Technology) [ISSN: 1006-6977 / CN: 61-1281/TN]

Volume:
10
Issue:
2010, No. 03
Pages:
074-77
Section:
Publication date:
2010-03-01

Article Info

Title:
Information Extraction for Web Forum Based on Repeated Pattern
Author(s):
Han Pu 1,2; Wang Ze 2
1. Department of Information Management, Nanjing University, Nanjing 210093, China; 2. Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China
Keywords:
repeated pattern; forum extraction; information extraction
Classification number (CLC):
TP393.094
Abstract:
To address the limitations of current methods for extracting web forum information, this paper introduces a forum information extraction method based on a repeated pattern discovery algorithm. The method first uses the SgmlReader parser to convert the HTML document into a well-formed XHTML document, then computes the similarity between DOM subtrees in the XHTML document structure to automatically discover the repeated patterns in the forum page structure. Because the repeated patterns are located automatically, the method largely removes the need to find and locate them by hand or to write extraction rules by manually analyzing the forum page source code. Experimental results show that the method achieves good accuracy, generality, and practicality.
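The idea of finding repeated patterns through DOM subtree similarity can be illustrated with a minimal sketch. The abstract does not state the similarity measure or the search procedure, so everything here is an assumption made for illustration: the Dice coefficient over tag-path multisets, the `threshold` and `min_repeats` parameters, the helper names, and the sample page. The input is assumed to be already well-formed XHTML, and Python's `xml.etree.ElementTree` stands in for the SgmlReader conversion step.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(node, prefix=""):
    """Collect the multiset of tag paths found inside a subtree."""
    path = f"{prefix}/{node.tag}"
    paths = Counter([path])
    for child in node:
        paths += tag_paths(child, path)
    return paths


def similarity(a, b):
    """Dice coefficient over tag-path multisets (an assumed measure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


def find_repeated_pattern(root, threshold=0.8, min_repeats=2):
    """Return (parent, run) for the largest run of similar sibling subtrees."""
    best_parent, best_run, best_score = None, [], 0

    def consider(parent, run):
        nonlocal best_parent, best_run, best_score
        # Score a run by the number of DOM nodes it covers, so that
        # repeated post blocks outrank repeated trivial leaves.
        score = sum(sum(tag_paths(n).values()) for n in run)
        if len(run) >= min_repeats and score > best_score:
            best_parent, best_run, best_score = parent, run, score

    for parent in root.iter():
        run = []
        for child in parent:
            # Close the current run when the next sibling is dissimilar.
            if run and similarity(run[-1], child) < threshold:
                consider(parent, run)
                run = []
            run.append(child)
        consider(parent, run)
    return best_parent, best_run


# Hypothetical forum page, already well-formed XHTML.
PAGE = """<html><body>
<div id="header"><h1>Forum</h1></div>
<div id="posts">
  <div class="post"><span>alice</span><p>First post</p></div>
  <div class="post"><span>bob</span><p>Reply one</p></div>
  <div class="post"><span>carol</span><p>Reply two</p></div>
</div>
</body></html>"""

parent, run = find_repeated_pattern(ET.fromstring(PAGE))
records = [(post.find("span").text, post.find("p").text) for post in run]
print(parent.get("id"), records)
```

In this sketch the three post blocks share an identical tag structure, so they form one run under the `posts` container, and the author and body fields can then be read from each repeated subtree. Real forum pages vary more between posts, which is why the threshold is left as a tunable parameter rather than fixed at exact equality.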

Memo:
Corresponding author: Han Pu, doctoral candidate; research interests: information extraction, web mining. E-mail: hanpu0725@163.
Last Update: 2013-04-02