Professor John Carroll of the University of Sussex, England, will be with us from July 9 to 11 (see the programme below).
Syntactic analysis is one of the most important steps in language processing: it identifies the main constituents of a sentence (noun phrase, verb phrase…) and the relations between them (subject, object…). English is the language that has been studied most, and today four parsers are considered the best:
a) Systems based on linguistic knowledge:
Connexor and Xerox
b) Systems based on statistics:
Collins and Charniak
The biggest current challenge is to combine linguistic and statistical knowledge in order to obtain better parsers. Working along these lines, John Carroll has created the Robust Accurate Statistical Parsing (RASP) system. It performs very well and is being used in many research projects and applications.
Venue: Meeting room, Faculty of Informatics (Informatika Fakultatea).
July 9 and 10, 15:30-17:30:
Course: NLP and parsing.
1. techniques for shallow parsing: treebanks, linguistic grammars,
3. parser evaluation
4. high precision parsing
5. efficient deep parsing
6. robust parsing and shallow semantics
July 11, 11:30-13:00:
Talk: Text categorization for improved priors of word meaning.
Distributions of the senses of words are often highly skewed. This fact is exploited by word sense disambiguation (WSD) systems which back off to the predominant (most frequent) sense of a word when contextual clues are not strong enough. The topic domain of a document has a strong influence on the sense distribution of words.
Unfortunately, it is not feasible to produce large manually sense-annotated corpora for every domain of interest. Previous experiments have shown that unsupervised estimation of the predominant sense of certain words using corpora whose domain has been determined by hand outperforms estimates based on domain-independent text for a subset of words and even outperforms the estimates based on counting occurrences in an annotated corpus.
In this talk I will address the question of whether it is possible to _automatically_ produce domain-specific corpora which could be used to acquire predominant senses appropriate for specific domains.
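
As a rough illustration of the back-off idea mentioned in the abstract, the following minimal Python sketch picks the sense favoured by context when the evidence is strong enough and otherwise falls back to the predominant sense for the document's domain. It is not taken from the talk or from any particular WSD system: the function names, the 0.6 threshold, the domain labels, and the sense counts for "bank" are all made-up assumptions for illustration only.

```python
def predominant_sense(sense_counts):
    """Return the most frequent sense from a {sense: count} mapping."""
    return max(sense_counts, key=sense_counts.get)


def disambiguate(context_scores, sense_counts, threshold=0.6):
    """Choose a sense for one word occurrence.

    context_scores: hypothetical {sense: score in [0, 1]} from a context model.
    sense_counts:   {sense: count} prior for the current domain.
    threshold:      assumed cut-off below which context is deemed too weak.
    """
    if context_scores:
        best = max(context_scores, key=context_scores.get)
        if context_scores[best] >= threshold:
            return best  # contextual clues are strong enough
    # Otherwise back off to the predominant (most frequent) sense.
    return predominant_sense(sense_counts)


# Toy, invented sense counts for "bank" in two hypothetical domains.
PRIORS = {
    "finance": {"bank/institution": 90, "bank/river": 10},
    "geography": {"bank/institution": 20, "bank/river": 80},
}

weak_context = {"bank/institution": 0.5, "bank/river": 0.5}
print(disambiguate(weak_context, PRIORS["finance"]))
print(disambiguate(weak_context, PRIORS["geography"]))
```

With the same weak context, the chosen sense differs between the two toy domains, which is why domain-specific estimates of the predominant sense matter.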