[1] Mi Yunlong, Li Jinhai, Mi Chunqiao, et al. Research into Data Cleaning Algorithm Based on Interval Fuzzy Matching Functions and Its Application to Questionnaire Data[J]. Journal of Nanjing Normal University (Engineering and Technology), 2017, 17(03): 070. [doi:10.3969/j.issn.1672-1292.2017.03.011]

Research into Data Cleaning Algorithm Based on Interval Fuzzy Matching Functions and Its Application to Questionnaire Data

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
17
Issue:
2017, No. 03
Pages:
070
Column:
Computer Engineering
Publication date:
2017-09-30

Article Info

Title:
Research into Data Cleaning Algorithm Based on Interval Fuzzy Matching Functions and Its Application to Questionnaire Data
Article number:
1672-1292(2017)03-0070-10
Author(s):
Mi Yunlong1, Li Jinhai2, Mi Chunqiao1,3, Liu Wenqi2, Liu Jia1, Wang Tian3
(1. School of Computer Science and Engineering, Huaihua University, Huaihua 418000, China)
(2. Faculty of Science, Kunming University of Science and Technology, Kunming 650500, China)
(3. Hunan Provincial Key Laboratory of Ecological Agriculture Intelligent Control Technology, Huaihua 418000, China)
Keywords:
data cleaning; matching function; interval-valued fuzzy set; interval-valued fuzzy matching function; questionnaire data
Classification number:
TP311
DOI:
10.3969/j.issn.1672-1292.2017.03.011
Document code:
A
Abstract:
Data cleaning is an important step in ensuring data quality. Because human activity is often subjective and emotional, real-world data, such as questionnaire data, frequently contains unreasonable or even erroneous entries. Such unreasonable data tends to be uncertain, ambiguous, and hidden, which makes it difficult to clean, and traditional data cleaning methods have difficulty handling it. By combining the theory of interval-valued fuzzy sets with matching functions, we propose an interval fuzzy matching function method. Based on this method, we construct an interval fuzzy matching algorithm to clean data and improve data quality, and apply it to questionnaire data. Experimental results show that the algorithm achieves good accuracy and running efficiency and is well suited to handling unreasonable data.


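Memo:
Received: 2017-05-23.
Funding: Hunan Provincial Education Science Planning Project (XJK016QXX003), Natural Science Foundation of Hunan Province (2017JJ3252), National Natural Science Foundation of China (41301084), General Project of Huaihua University (HHUY2016-05).
Corresponding author: Mi Chunqiao, Ph.D., associate professor. Research interests: data mining and analysis, geographic information systems, and informatization of agriculture and education. E-mail: michunqiao@163.com
Last Update: 2017-09-30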