Information Extraction for Web Forum Based on Repeated Pattern
Han Pu 1,2, Wang Ze 2
1. Department of Information Management, Nanjing University, Nanjing 210093, China; 2. Department of Educational Technology, Nanjing Normal University, Nanjing 210097, China
repeated pattern; forum extraction; information extraction
To address the limitations of current methods for extracting web forum information, this paper introduces an information extraction method for web forums based on a repeated pattern discovery algorithm. The method first uses the SgmlReader parser to convert the HTML document into an XHTML document, then calculates the similarity between the DOM trees in the XHTML document, and automatically finds the repeated pattern in the forum pages. This removes the need to locate the repeated pattern manually or to analyze the page source code by hand to derive extraction rules. The experimental results show that the method has high accuracy, good generality, and practicality.
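The pipeline described above (well-formed document, DOM tree similarity, repeated pattern) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the abstract does not define the similarity measure, so the sketch substitutes a simple comparison of pre-order tag sequences using Python's `difflib.SequenceMatcher`. The HTML to XHTML conversion (done with SgmlReader in the paper) is assumed to have already happened, the sample page omits the XHTML namespace for brevity, and the names `tag_sequence`, `similarity`, `find_repeated_pattern`, and the `threshold` value are all hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Flatten a subtree into its pre-order list of tag names."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Similarity in [0, 1] between two subtrees, based on tag sequences.

    Assumption: a stand-in measure, not the one used in the paper.
    """
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def find_repeated_pattern(root, threshold=0.8, min_repeats=2):
    """Return (parent, group) for the largest group of sibling subtrees
    that are all similar to one common seed sibling."""
    best_parent, best_group = None, []
    for parent in root.iter():
        children = list(parent)
        for seed in children:
            group = [c for c in children if similarity(seed, c) >= threshold]
            if len(group) >= min_repeats and len(group) > len(best_group):
                best_parent, best_group = parent, group
    return best_parent, best_group


# Hypothetical, already well-formed forum page (no XHTML namespace).
PAGE = """
<html><body>
  <div id="header"><h1>Forum</h1></div>
  <div id="thread">
    <div class="post"><span>alice</span><p>First post</p></div>
    <div class="post"><span>bob</span><p>Reply one</p></div>
    <div class="post"><span>carol</span><p>Reply two</p></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
parent, posts = find_repeated_pattern(root)

# Each repeated subtree is treated as one forum post record.
records = [(p.find("span").text, p.find("p").text) for p in posts]
print(parent.get("id"), records)
```

In this sketch the three `post` subtrees have identical tag sequences, so they form the largest group of similar siblings and their parent is taken as the container of the repeated pattern. A real forum page would likely need a more tolerant similarity measure and a filter against trivial repeats such as runs of bare list items.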


