ALBERT-Based Named Entity Recognition of Chinese Medical Records

Journal of Nanjing Normal University (Engineering and Technology Edition) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 21
Issue:: 2021, No. 01

Pages:: 36-43

Section:: Computer Science and Technology

Publication date:: 2021-03-15

Article Info

Title:: ALBERT-Based Named Entity Recognition of Chinese Medical Records

Article number:: 1672-1292(2021)01-0036-08

Author(s):: Chen Jie¹; Xi Xuefeng¹,²; Pi Zhou¹; Victor S Sheng³; Cui Zhiming¹,² (1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China) (2. Suzhou Smart City Research Institute, Suzhou 215009, China) (3. Computer Science Department, Texas Tech University, Texas 79431, USA)

Keywords:: ALBERT; named entity recognition; clinical electronic medical records; bidirectional long short-term memory network (BiLSTM); conditional random field (CRF)

Classification number:: TP181

DOI:: 10.3969/j.issn.1672-1292.2021.01.006

Document code:: A

Abstract:: Named entity recognition on medical records converts the unstructured text of clinical electronic medical records into structured data, which in turn supports data mining for tasks in the medical domain. This paper proposes a named entity recognition model for Chinese medical records based on ALBERT and model fusion. First, the sample dataset is expanded through manual annotation, and ALBERT is fine-tuned on it. Second, a bidirectional long short-term memory network (BiLSTM) extracts the global features of the text. Finally, a conditional random field (CRF) model assigns the sequence labels of the named entities. Experimental results on a standard dataset show that the proposed method further improves the accuracy of named entity recognition on medical text and reduces the time overhead.
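
The final step of the pipeline described in the abstract, CRF sequence labeling, is commonly decoded with the Viterbi algorithm. The following is a minimal sketch of that decoding step only, not the authors' implementation. The tag set, the transition scores, and the emission scores are all made up for illustration; in a model like the one described, the emission scores would presumably come from the BiLSTM layer on top of the ALBERT representations, and the transition scores would be learned.

```python
from typing import List, Sequence

NEG_INF = float("-inf")


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][j]   : score of tag j at position t (hypothetical encoder output)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(transitions)
    # best[j] = best score of any path that ends in tag j at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(num_tags):
            # Pick the previous tag i that maximises the path score into tag j.
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical BIO tag set for a single entity type (not taken from the paper).
TAGS = ["O", "B-DISEASE", "I-DISEASE"]

# Illustrative transition scores: the move O -> I-DISEASE is forbidden.
TRANSITIONS = [
    [0.0, 0.0, NEG_INF],  # from O
    [0.0, 0.0, 1.0],      # from B-DISEASE
    [0.0, 0.0, 1.0],      # from I-DISEASE
]

# Made-up emission scores for a four-character input.
EMISSIONS = [
    [2.0, 0.5, 0.1],
    [0.2, 2.0, 1.5],
    [0.1, 0.3, 2.0],
    [1.5, 0.2, 0.4],
]

decoded = [TAGS[i] for i in viterbi_decode(EMISSIONS, TRANSITIONS)]
print(decoded)  # ['O', 'B-DISEASE', 'I-DISEASE', 'O']
```

The transition matrix is what lets a CRF layer rule out invalid label sequences, such as an I- tag directly after O, which independent per-character classification cannot guarantee.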

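Memo

Memo:: Received: 2020-08-08.
Funding: National Natural Science Foundation of China (61673290, 61876217); Jiangsu Province "Six Talent Peaks" High-Level Talent Project (XYDXX-086); Suzhou Science and Technology Development Plan, Industry Foresight Project (SYG201817); 2020 Jiangsu Postgraduate Research and Innovation Program (KYCX20_2762).
Corresponding author: Xi Xuefeng, associate professor; research interests: natural language processing, high-performance parallel computing, and applications of object-oriented technology. E-mail: xfxi@usts.edu.cn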

Last Update: 2021-03-15