2018 SEPLN Best Thesis Award goes to Ixa member Aitor Gonzalez – Hizkuntza-teknologiak, the Ixa Group blog

In September, former Ixa member Aitor Gonzalez Agirre received an award from the SEPLN society: he was given the 2018 prize for best thesis at the SEPLN conference. Congratulations to Aitor, and congratulations to his supervisors!

After finishing his thesis, Aitor Gonzalez moved to Barcelona for work.
He is now a researcher at the renowned Barcelona Supercomputing Center.

Aitor defended his thesis on July 7, 2017:

“Computational Models for Semantic Textual Similarity”
In Basque: “Testu-antzekotasun semantikorako eredu konputazionalak”.
His supervisors were Ixa members Eneko Agirre and German Rigau.

The thesis takes a step toward representing the meaning of whole sentences computationally, which makes it possible to measure the degree of similarity between two sentences automatically. The results apply to many practical Language Technology applications, including machine translation, automatic text summarization, information retrieval, question answering systems, and word sense disambiguation.

The thesis abstract:

Measuring semantic similarity between textual items (words, sentences, paragraphs or even documents) is a very important research area in Natural Language Processing (NLP). It has many practical applications in other NLP tasks such as Word Sense Disambiguation, Textual Entailment, Paraphrase detection, Machine Translation, Summarization, Information Retrieval or Question Answering.

The overarching goal of this thesis is to advance on computational models of meaning and their evaluation. To achieve this goal we define two tasks and develop state-of-the-art systems that tackle both tasks: Semantic Textual Similarity (STS) and Typed Similarity.

STS aims to measure the degree of semantic equivalence between two sentences by assigning graded similarity values. This graded similarity captures the notion of intermediate shades of similarity, ranging from pairs of text that differ only in minor nuanced aspects of meaning, to pairs with relatively important differences, down to pairs that share only some details or that only have in common being about the same topic. In the scope of this research, we have collected pairs of sentences to construct datasets for STS, a total of 15,436 pairs of sentences, by far the largest collection of data for STS.

Using these new datasets for STS we have designed, constructed and evaluated a new approach to combine knowledge-based and corpus-based methods using a cube. Without using any Machine Learning (ML), this new system for STS is on par with state-of-the-art approaches that rely on ML; ML can also be applied on top of the system, improving the results.

Typed Similarity tries to identify the type of relation that holds between a pair of similar items in a digital library. Being able to provide a reason why items are similar has applications in recommendation, personalization, and search. We investigate the problem within the context of Europeana, a large digital library containing items related to cultural heritage. A range of types of similarity in this collection was identified, and a set of 1,500 pairs of items from the collection was annotated using crowdsourcing.

Finally, we present three systems capable of resolving the Typed Similarity task: a baseline approach, a knowledge-based approach and an ML system. The high results obtained by our systems suggest that this technology is close to practical applications. In fact, the system based on ML resulted in a real-world application to recommend similar items to users in an online digital library.
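
As a rough illustration of what a graded similarity score between two sentences can look like, here is a minimal sketch of a surface-level baseline: cosine similarity over bag-of-words counts. It is not the method developed in the thesis, which combines knowledge-based and corpus-based methods; the function names `tokenize` and `cosine_similarity` and the example sentences are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors, in [0, 1].

    Assumption: a purely lexical baseline chosen for illustration only;
    it knows nothing about word meaning, synonyms, or word order.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the words the two sentences share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty sentence has no direction, so define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    pairs = [
        ("A man is playing a guitar", "A man is playing a guitar"),
        ("A man is playing a guitar", "A man is playing a piano"),
        ("A man is playing a guitar", "Stocks fell sharply today"),
    ]
    for s1, s2 in pairs:
        print(f"{cosine_similarity(s1, s2):.2f}  {s1!r} vs {s2!r}")
```

A baseline like this scores identical sentences near 1 and sentences with no words in common at 0, with partial overlaps in between. It cannot tell that two differently worded sentences mean the same thing, which is the kind of gap that models of sentence meaning are meant to close.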
