A hybrid approach for Persian Named Entity Recognition

Document Type : Regular Paper

Authors

1 Department of Information Technology, Urmia University of Technology, Urmia, Iran

2 Department of Computer Science, University of Tabriz, Tabriz, Iran

Abstract

Named Entity Recognition (NER) is an information extraction subtask that attempts to recognize named entities in unstructured text and classify them into predefined categories such as the names of people, organizations, and locations. Recently, machine learning approaches such as the Hidden Markov Model (HMM), as well as hybrid methods, have frequently been used for Named Entity Recognition. Because no publicly available data sets exist for NER in Persian, to the best of our knowledge there is no machine-learning-based Persian NER system. To compensate for the inherent weaknesses of HMMs, in this paper we use both a Hidden Markov Model and a rule-based method to recognize named entities in Persian texts. Combining the rule-based method with the machine learning method yields highly accurate recognition. In its machine learning section, the proposed system uses an HMM and the Viterbi algorithm; in its rule-based section, it employs a set of lexical resources and pattern bases to recognize named entities, including the names of people, locations, and organizations. For this study, we annotated our own training and test data sets for use in the corresponding phases. On an annotated Persian test corpus of 32,606 tokens, our hybrid approach achieves 89.73% precision, 82.44% recall, and an F-measure of 85.93%.
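To illustrate the machine learning section described in the abstract, the following is a minimal sketch of Viterbi decoding over a first-order HMM tagger. It is not the authors' implementation: the tag set, the probability tables, the transliterated toy tokens, and the `unk` fallback for unseen words are all assumptions chosen for illustration, and the rule-based section is not modeled here.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `tokens` under a first-order HMM.

    `unk` is an assumed fallback emission probability for unseen words.
    """
    if not tokens:
        return []
    # score[t][tag]: best log-probability of any tag path ending in `tag` at position t
    score = [{}]
    back = [{}]
    for tag in tags:
        score[0][tag] = math.log(start_p[tag]) + math.log(emit_p[tag].get(tokens[0], unk))
        back[0][tag] = None
    for t in range(1, len(tokens)):
        score.append({})
        back.append({})
        for tag in tags:
            # Pick the previous tag that maximizes the path score into `tag`.
            best_prev = max(tags, key=lambda p: score[t - 1][p] + math.log(trans_p[p][tag]))
            score[t][tag] = (
                score[t - 1][best_prev]
                + math.log(trans_p[best_prev][tag])
                + math.log(emit_p[tag].get(tokens[t], unk))
            )
            back[t][tag] = best_prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: score[-1][tag])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model (hand-set values, not estimated from any corpus).
TAGS = ["O", "PER", "LOC"]
START = {"O": 0.6, "PER": 0.3, "LOC": 0.1}
TRANS = {
    "O": {"O": 0.7, "PER": 0.15, "LOC": 0.15},
    "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
    "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2},
}
EMIT = {
    "O": {"went": 0.4, "to": 0.4, "visited": 0.2},
    "PER": {"ali": 0.6, "sara": 0.4},
    "LOC": {"tabriz": 0.5, "urmia": 0.5},
}

print(viterbi(["ali", "went", "to", "tabriz"], TAGS, START, TRANS, EMIT))
# ['PER', 'O', 'O', 'LOC']
```

Working in log space avoids numerical underflow on long sentences. In a real system the three tables would be estimated from an annotated training corpus rather than set by hand.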
