[1] Zhu Yingwen, Ji Genlin. Clustering GML Documents by Structure Based on Maximal Frequent Induced Subtrees[J]. Journal of Nanjing Normal University (Engineering and Technology), 2008, 08(04): 50-55.

Clustering GML Documents by Structure Based on Maximal Frequent Induced Subtrees

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
08
Issue:
2008, No. 04
Pages:
50-55
Section:
Publication date:
2008-12-30

Article Info

Title:
Clustering GML Documents by Structure Based on Maximal Frequent Induced Subtrees
Author(s):
Zhu Yingwen1; Ji Genlin2
1. Department of Computer Elementary Training, Sanjiang University, Nanjing 210012, China; 2. School of Mathematics and Computer Science, Nanjing Normal University, Nanjing 210097, China
Keywords:
GML; clustering by structure; maximal frequent induced subtrees; closed frequent induced subtrees
CLC number:
TP391.1
Abstract:
This paper presents TBCClustering, a new algorithm for clustering GML documents by structure based on maximal frequent induced subtrees. The algorithm mines the maximal frequent induced subtrees of a GML document collection to construct a feature space, which it then optimizes by selecting a subset of the subtree patterns as clustering features. It clusters the documents with the CLOPE algorithm and determines both the minimum support and the number of clusters automatically, so the user does not have to set them. The approach reduces the dimensionality of the feature space while achieving higher clustering precision. Experimental results show that TBCClustering is more effective and efficient than the PBClustering algorithm.
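The clustering step named in the abstract can be illustrated with a minimal sketch of CLOPE in Python. It is not the authors' TBCClustering code. It assumes each GML document has already been reduced to a non-empty set of subtree identifiers (the labels `T1` to `T6` are hypothetical), and it fixes the CLOPE repulsion parameter `r` by hand, whereas the paper's own parameter handling is not described here. The subtree mining and feature-space optimization steps are not shown.

```python
from collections import Counter


class Cluster:
    """Running statistics CLOPE keeps for one cluster."""

    def __init__(self):
        self.n = 0            # number of documents in the cluster
        self.size = 0         # S: total feature occurrences
        self.occ = Counter()  # feature id -> number of documents containing it

    def delta_add(self, doc, r):
        """Change in this cluster's profit term S*n/W**r if doc joined it."""
        new_size = self.size + len(doc)
        new_width = len(self.occ) + sum(1 for f in doc if f not in self.occ)
        old = self.size * self.n / len(self.occ) ** r if self.n else 0.0
        return new_size * (self.n + 1) / new_width ** r - old

    def add(self, doc):
        self.n += 1
        self.size += len(doc)
        self.occ.update(doc)

    def remove(self, doc):
        self.n -= 1
        self.size -= len(doc)
        for f in doc:
            self.occ[f] -= 1
            if self.occ[f] == 0:
                del self.occ[f]


def clope(docs, r=2.0, max_passes=50):
    """Cluster non-empty feature sets; returns one integer label per document."""
    clusters = []
    assigned = [None] * len(docs)
    for _ in range(max_passes):
        moved = False
        for i, doc in enumerate(docs):
            current = assigned[i]
            if current is not None:
                current.remove(doc)
            # Candidates: every existing cluster plus one fresh, empty cluster.
            best = max(clusters + [Cluster()], key=lambda c: c.delta_add(doc, r))
            if current is not None and current.delta_add(doc, r) >= best.delta_add(doc, r):
                best = current  # prefer staying put on ties
            if best not in clusters:
                clusters.append(best)
            best.add(doc)
            if best is not current:
                assigned[i] = best
                moved = True
        if not moved:  # a full pass with no reassignment: converged
            break
    ids = {}
    return [ids.setdefault(id(c), len(ids)) for c in assigned]


# Hypothetical documents, each described by the subtree patterns it contains.
docs = [{"T1", "T2"}, {"T1", "T2", "T3"}, {"T4", "T5"}, {"T4", "T5", "T6"}]
print(clope(docs))  # [0, 0, 1, 1]
```

The number of clusters is never passed in: a document opens a new cluster whenever that increases the profit more than joining an existing one, which matches the abstract's point that the cluster count need not be supplied by the user.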

References:

[1] Guillaume D, Murtagh F. Clustering of XML documents[J]. Computer Physics Communications, 2000, 127(2/3): 215-227.
[2] Doucet A, Ahonen-Myka H. Naïve Clustering of a Large XML Document Collection[C]//Proc 1st Annual Workshop of the Initiative for the Evaluation of XML Retrieval (INEX). Germany: ACM Press, 2002: 81-88.
[3] Nierman A, Jagadish H V. Evaluating structural similarity in XML documents[C]//Proceedings of the 5th International Workshop on the Web and Database (WebDB). Madison, 2002: 61-66.
[4] Zhang K, Shasha D. Simple fast algorithms for the editing distance between trees and related problems[J]. SIAM Journal on Computing, 1989, 18(6): 1245-1262.
[5] Wang L, Cheung D W, Mamoulis N, et al. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure[J]. IEEE TKDE, 2004, 16(1): 82-96.
[6] Leung H P, Chung F L, Chan S C F. On the use of hierarchical information in sequential mining-based XML document similarity computation[J]. Knowledge and Information Systems, 2005, 7(4): 476-498.
[7] Leung H P, Chung F L, Chan S C F, et al. XML document clustering using common Xpath[C]//2005 International Workshop on Challenges in Web Information Retrieval and Integration. Tokyo: IEEE Computer Society Press, 2005: 91-96.
[8] Nayak R, Xu S. XCLS: a fast and effective clustering algorithm for heterogenous XML documents[C]//Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Singapore: ACM Press, 2006.
[9] Chehredhani M H, Rahgozar M, Lucas C, et al. Clustering rooted ordered trees[C]//Computational Intelligence and Data Mining. Honolulu, Hawaii: IEEE Press, 2007: 450-455.
[10] Francesca F D, Gordano G, Ortale R, et al. A general framework for XML document clustering[R]. Consiglio Nazionale delle Ricerche Istituto di Calcolo e Reti ad Alte Prestazioni, 2003.
[11] Guha S, Rastogi R, Shim K. ROCK: A robust clustering algorithm for categorical attributes[C]//Proceedings of ICDE99. Sydney: IEEE Computer Society Press, 1999.
[12] Yang Y, Guan X, You J. CLOPE: a fast and effective clustering algorithm for transaction data[C]//Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton: ACM Press, 2002.
[13] Zaki M J. Efficiently mining frequent trees in a forest: algorithms and application[J]. IEEE Transaction on Knowledge and Data Engineering: Special Issue on Mining Biological Data, 2005, 17(8): 1021-1035.
[14] Dalamagas T, Cheng T, Winkel K J, et al. Clustering XML documents using structural summaries[C]//Proceedings of the EDBT Workshop on Clustering Information over the Web. Heidelberg: Springer Berlin, 2004: 547-556.
[15] Chi Y, Xia Y, Yang Y, et al. Mining closed and maximal frequent subtrees from databases of labeled rooted trees[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17: 190-202.

Memo:
Funding: Supported by the National Natural Science Foundation of China (40771163).
Corresponding author: Zhu Yingwen, teaching assistant; research interest: data mining. E-mail: zhu-ying-wen@163.com
Last Update: 2013-04-24