Stephen Tomlinson. To appear in Keizo Oyama, Emi Ishida and Noriko Kando, editors, NTCIR Workshop 3: Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering (September 2001 - October 2002). National Institute of Informatics (NII), Tokyo, Japan.
Hummingbird submitted ranked result sets for the Chinese, Japanese and Korean Single Language Information Retrieval tracks of the Cross-Language Retrieval Task of the 3rd NII-NACSIS Test Collection for IR Systems Workshop (NTCIR-3). Compared to an overlapping n-gram approach, SearchServer 5.3's segmenter for Asian text modestly increased precision scores for Japanese, had a neutral impact for Chinese, and was detrimental for Korean. SearchServer's option to case-normalize Hiragana and Katakana n-grams increased precision substantially for one Japanese query and had a neutral impact on the others. Newline suppression was of only minor benefit for n-gram parsing. Normalizing Han characters to Hangul had almost no effect on the Korean test collection.
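For readers unfamiliar with the baseline, the following is a minimal sketch of overlapping n-gram indexing with optional newline suppression. It is not SearchServer's implementation: the function name, its parameters, and the choice to treat each line as a separate segment when suppression is off are assumptions made for illustration only.

```python
def overlapping_ngrams(text, n=2, suppress_newlines=True):
    """Return overlapping character n-grams of `text` (hypothetical helper).

    With suppress_newlines=True, line breaks are removed first, so n-grams
    may span what was a line boundary. With False, each line is treated as
    a separate segment and no n-gram crosses a line break.
    """
    lines = text.splitlines()
    segments = ["".join(lines)] if suppress_newlines else lines
    grams = []
    for seg in segments:
        if 0 < len(seg) < n:
            # Assumption: a segment shorter than n is kept as a single term.
            grams.append(seg)
        # Slide a window of width n one character at a time.
        grams.extend(seg[i:i + n] for i in range(len(seg) - n + 1))
    return grams


wrapped = "情報\n検索"  # a four-character phrase wrapped across two lines
with_suppression = overlapping_ngrams(wrapped, n=2, suppress_newlines=True)
without_suppression = overlapping_ngrams(wrapped, n=2, suppress_newlines=False)
```

In this sketch, suppression recovers the bigram that straddles the line break, which would otherwise be lost; the abstract reports that the measured benefit of this on the test collections was only minor.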
Last Updated: 2003 Feb 25
Comments are welcome at comments@stephent.com.