Comparison of different lemmatization approaches through the means of information retrieval performance

Kanis, Jakub; Skorkovská, Lucie

Full metadata record

DC field	Value	Language
dc.contributor.author	Kanis, Jakub
dc.contributor.author	Skorkovská, Lucie
dc.date.accessioned	2016-01-08T06:11:38Z
dc.date.available	2016-01-08T06:11:38Z
dc.date.issued	2010
dc.identifier.citation	KANIS, Jakub; SKORKOVSKÁ, Lucie. Comparison of different lemmatization approaches through the means of information retrieval performance. In: Text, speech and dialogue. Berlin: Springer, 2010, p. 93-100. (Lecture notes in computer science; 6231). ISBN 978-3-642-15759-2.	en
dc.identifier.isbn	978-3-642-15759-2
dc.identifier.uri	http://www.kky.zcu.cz/cs/publications/JakubKanis_2010_Comparisonof
dc.identifier.uri	http://hdl.handle.net/11025/17172
dc.description.abstract	Tento článek prezentuje kvantitativní porovnání dvou různých přístupů k lematizaci českého textu. První přístup je založen na použití ručně vytvořeného slovníku lemmat a množiny derivačních pravidel a druhý pak na automatickém odvození slovníku a pravidel z trénovacích dat. Porovnání je provedeno vyhodnocením míry střední zobecněné průměrné přesnosti (angl. mean Generalized Average Precision - mGAP) lematizovaných dokumentů a hledaných dotazů v sérii experimentů zaměřených na vyhledávání informací. Takováto metoda je vhodná pro efektivní a spolehlivé porovnání výkonnosti lematizace, neboť jak bylo prokázáno, správná lematizace je rozhodujícím faktorem při efektivním vyhledávání informací ve vysoce inflektivních jazycích. Navrhované nepřímé porovnání lematizátorů navíc obchází nutnost existence obtížně získatelných ručně lematizovaných testovacích dat a také řeší problém nekompatibilních množin lemmat napříč různými systémy.	cs
dc.format	8 s.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	Springer	en
dc.relation.ispartofseries	Lecture notes in computer science; 6231	en
dc.rights	© Jakub Kanis - Lucie Skorkovská	cs
dc.subject	lemmatizace	cs
dc.subject	vyhledávání informací	cs
dc.title	Comparison of different lemmatization approaches through the means of information retrieval performance	en
dc.title.alternative	Porovnání různých lematizačních přístupů prostřednictvím výkonnosti při vyhledávání informací	cs
dc.type	článek	cs
dc.type	article	en
dc.rights.access	openAccess	en
dc.type.version	publishedVersion	en
dc.description.abstract-translated	This paper presents a quantitative performance analysis of two different approaches to the lemmatization of Czech text data. The first is based on a manually prepared dictionary of lemmas and a set of derivation rules, while the second is based on automatic inference of the dictionary and the rules from training data. The comparison is done by evaluating the mean Generalized Average Precision (mGAP) measure of the lemmatized documents and search queries in a set of information retrieval (IR) experiments. Such a method is suitable for an efficient and rather reliable comparison of lemmatization performance, since correct lemmatization has proven to be crucial for IR effectiveness in highly inflected languages. Moreover, the proposed indirect comparison of the lemmatizers circumvents the need for manually lemmatized test data, which are hard to obtain, and also addresses the problem of incompatible sets of lemmas across different systems.	en
dc.subject.translated	lemmatization	en
dc.subject.translated	information retrieval	en
dc.type.status	Peer-reviewed	en
Appears in collections:	Články / Articles (NTIS) Články / Articles (KKY)

Files in this item:

File	Description	Size	Format
JakubKanis_2010_Comparisonof.pdf	Full text	93.34 kB	Adobe PDF

Use this identifier to cite or link to this item: http://hdl.handle.net/11025/17172

All items in DSpace are protected by copyright, with all rights reserved.
