Computer Software and Database Research
LI Li-shuang, HE Hong-lei, LIU Shan-shan, HUANG De-gen
Journal of Chinese Computer Systems, 2016, 37(2): 302-307.
Biomedical named entity recognition is a prerequisite for biomedical information extraction. Current entity recognition methods based on machine learning depend mainly on features that are summarized manually from domain knowledge and experience, and they require repeated experiments to select appropriate features. Moreover, these features rarely exploit deep semantic information. To investigate the effect of semantic information on named entity recognition, this paper obtains semantic information automatically from a large-scale unlabeled corpus, which can be downloaded from public databases such as PubMed, and derives three kinds of word representations: word embeddings, clusters based on word embeddings, and Brown clusters. The three kinds of word representations are used as features of a CRF model and an SVM model for semi-supervised learning. Comparative experiments are conducted under the same conditions, with the dimension of the word embeddings and the number of clusters held the same. The experimental results show that the word representations capture latent semantic information effectively and thus improve the performance of existing entity recognition systems based on machine learning. On the public evaluation corpus BioCreative II GM, precision, recall, and F-score reach 91.24%, 85.80%, and 88.44%, respectively, without a dictionary or any other external resources.
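As an illustration of how two of the word representations could become per-token features, the following is a minimal sketch and not the authors' implementation. It assumes a toy table of 2-dimensional embeddings (hypothetical values; the paper learns its embeddings from an unlabeled corpus), a plain k-means routine to derive the embedding-based clusters, and a hypothetical `token_features` helper that emits feature dictionaries of the kind a CRF or SVM tagger could consume. Brown clustering is not shown.

```python
import math
import random

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
EMBEDDINGS = {
    "p53": [0.9, 0.1],
    "BRCA1": [0.8, 0.2],
    "kinase": [0.7, 0.3],
    "binds": [0.1, 0.9],
    "activates": [0.2, 0.8],
}


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid (Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of the vectors assigned to it.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assignment


def build_cluster_map(embeddings, k):
    """Map each word to the k-means cluster of its embedding."""
    words = sorted(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))


def token_features(tokens, i, embeddings, cluster_map):
    """Hypothetical feature extractor for the token at position i."""
    word = tokens[i]
    feats = {"word.lower": word.lower()}
    vec = embeddings.get(word)
    if vec is None:
        # Out-of-vocabulary token: no representation features available.
        feats["oov"] = True
        return feats
    # Real-valued features: one per embedding dimension.
    for d, value in enumerate(vec):
        feats[f"emb_{d}"] = value
    # Discrete feature: the cluster ID of the word's embedding.
    feats["emb_cluster"] = str(cluster_map[word])
    return feats


cluster_map = build_cluster_map(EMBEDDINGS, k=2)
sentence = ["p53", "activates", "transcription"]
features = [
    token_features(sentence, i, EMBEDDINGS, cluster_map)
    for i in range(len(sentence))
]
```

In this sketch the embedding dimension and the cluster count `k` are the two settings that the abstract says are held the same across comparisons; the values used here are arbitrary.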