Asian Language Parsing Evaluated by Hummingbird SearchServer™ at NTCIR-3

Stephen Tomlinson. To appear in Keizo Oyama, Emi Ishida and Noriko Kando, editors, NTCIR Workshop 3: Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering (September 2001 - October 2002). National Institute of Informatics (NII), Tokyo, Japan.

Abstract

Hummingbird submitted ranked result sets for the Chinese, Japanese and Korean Single Language Information Retrieval tracks of the Cross-Language Retrieval Task of the 3rd NII-NACSIS Test Collection for IR Systems Workshop (NTCIR-3). Compared to an overlapping n-gram approach, SearchServer 5.3's segmenter for Asian text was found to modestly increase precision scores for Japanese, to have a neutral impact for Chinese, and to be detrimental for Korean. SearchServer's option to case-normalize Hiragana and Katakana n-grams increased precision substantially for one Japanese query and had a neutral impact on the others. Newline suppression was found to be of only minor benefit for n-gram parsing. Normalizing Han characters to Hangul had almost no effect on the Korean test collection.
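
As a rough illustration of the overlapping n-gram baseline and of newline suppression mentioned above, the following minimal Python sketch may help. It is not SearchServer's implementation: the function name `overlapping_ngrams`, the default of n = 2, and the choice to split runs on whitespace are all assumptions made for this example only.

```python
def overlapping_ngrams(text, n=2, suppress_newlines=True):
    """Return overlapping character n-grams for each whitespace-free run.

    Hypothetical helper for illustration; not SearchServer's API.
    """
    if suppress_newlines:
        # Assumption: newline suppression means joining lines, so that a
        # term broken across a line end still yields its spanning n-grams.
        text = text.replace("\r", "").replace("\n", "")
    grams = []
    for run in text.split():  # split() also breaks on remaining newlines
        if len(run) < n:
            # Runs shorter than n are kept whole rather than dropped.
            grams.append(run)
        else:
            grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


print(overlapping_ngrams("情報検索"))
print(overlapping_ngrams("情報\n検索", suppress_newlines=False))
print(overlapping_ngrams("情報\n検索"))
```

In this sketch, a line break in the middle of a term removes the n-gram that spans the break unless newlines are suppressed, which is one plausible reason the option could matter for n-gram parsing.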

Last Updated: 2003 Feb 25

Comments are welcome at comments@stephent.com.

Copyright © 2003 Stephen Tomlinson http://www.stephent.com/ir/papers/ntcir3.html