The paper presents a method for modernising words in historical Slovene texts. The method comprises modernisation of word-forms with the help of a computational lexicon and transcription rules, together with morphosyntactic tagging and lemmatisation. The modernisation program uses the IMP language resources for historical Slovene, which include a hand-annotated text corpus and a lexicon of historical Slovene. The paper introduces the language resources and the ToTrTaLe program for linguistic annotation, evaluates the accuracy of the program, and outlines directions for future research.
COBISS.SI-ID: 27352871
The paper presents a manually annotated corpus of historical Slovene and a corpus-based study of how clitics have changed in Slovene over time. It discusses the composition, encoding and availability of the corpus, and then examines word-tokenisation mismatches between contemporary and historical Slovene, concentrating on the binding of clitics to their hosts and on the variability of clitic orthography in the corpus.
COBISS.SI-ID: 27197223