CJK Experiments with Hummingbird SearchServer™ at NTCIR-4

Stephen Tomlinson. To appear in Noriko Kando and Haruko Ishikawa, editors, NTCIR Workshop 4: Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies: Information Retrieval, Question Answering and Summarization (April 2003 - June 2004). National Institute of Informatics (NII), Tokyo, Japan.

Abstract

Hummingbird submitted ranked result sets for the Chinese, Japanese, Korean and English Single Language Information Retrieval subtasks of the Cross-Lingual Information Retrieval Task of the 4th NII-NACSIS Test Collection for IR Systems Workshop (NTCIR-4). SearchServer's experimental option of splitting compound words (decompounding) was found to significantly increase mean average precision for Korean and modestly increase it for Japanese and Chinese. After decompounding, the differences between segmenting into words and an overlapping n-gram approach were not statistically significant for any of the 3 languages. Per-topic analysis suggested that segmentation sometimes separates proper names into unrelated shorter words while n-grams may overweight words of length greater than ‘n’.
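
As a rough illustration of the last point, the sketch below generates overlapping character n-grams for a few strings. It is not SearchServer's implementation; the function name `overlapping_ngrams` and the sample strings are assumptions chosen for the example. It shows only that a string of length L yields L - n + 1 overlapping n-grams, which is one plausible way a longer word could contribute more matching index terms than a shorter one.

```python
def overlapping_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text.

    Hypothetical helper for illustration only; strings shorter than n
    are returned whole so that they still produce one index term.
    """
    if not text:
        return []
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Sample strings of length 2, 3 and 5 (assumed examples, not from the paper).
for word in ["東京", "東京都", "東京都庁舎"]:
    grams = overlapping_ngrams(word, 2)
    # The number of bigrams grows with the word length: len(word) - 1.
    print(word, len(grams), grams)
```

With bigrams (n = 2), the two-character string produces a single term while the five-character string produces four, so a query containing the longer word matches on more terms unless the ranking compensates for it. A word segmenter would instead emit one term per recognized word, which avoids that effect but can split a proper name into unrelated shorter words, as the abstract notes.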

Full Paper

Last Updated: 2005 July 1

Comments are welcome at comments@stephent.com.

Copyright © 2004-2005 Stephen Tomlinson http://www.stephent.com/ir/papers/ntcir4.html