Projects / Programmes source: ARIS

Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language

Research activity

Code Science Field Subfield
6.05.00  Humanities  Linguistics   

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords
spoken language resources, spoken language, research of speech, language technologies, speech technologies, corpus linguistics, lexicography
Evaluation (methodology)
source: COBISS
Points
21,095.78
A''
3,122.96
A'
8,471.71
A1/2
12,092.28
CI10
6,336
CImax
482
h10
35
A1
72.06
A3
26.31
Data for the last 5 years (citations for the last 10 years) on October 15, 2025; Data for score A3 calculation refer to period 2020-2024
Data for ARIS tenders (04.04.2019 – Programme tender, archive)
Database Linked records Citations Pure citations Average pure citations
WoS  287  3,716  3,400  11.85 
Scopus  564  7,249  6,362  11.28 
Organisations (9), Researchers (38)
0796  University of Maribor, Faculty of Electrical Engineering and Computer Science
no. Code Name and surname Research area Role Period No. of publications
1.  53072  Špela Antloga  Linguistics  Researcher  2022 - 2025  69 
2.  54519  MSc Andreja Bizjak  Linguistics  Researcher  2022 - 2025  31 
3.  33286  PhD Gregor Donaj  Telecommunications  Researcher  2022 - 2025  88 
4.  51357  Simona Majhenič  Linguistics  Researcher  2022 - 2024  45 
5.  50218  PhD Grega Močnik  Telecommunications  Researcher  2022 - 2025  46 
6.  18168  PhD Mirjam Sepesy Maučec  Telecommunications  Researcher  2022 - 2025  263 
7.  23838  PhD Darinka Verdonik  Linguistics  Head  2022 - 2025  219 
8.  20032  PhD Andrej Žgank  Telecommunications  Researcher  2022 - 2025  249 
0106  Jožef Stefan Institute
no. Code Name and surname Research area Role Period No. of publications
1.  05023  PhD Tomaž Erjavec  Linguistics  Researcher  2022 - 2025  694 
2.  55962  Taja Kuzman  Linguistics  Researcher  2022 - 2025  113 
3.  36871  PhD Nikola Ljubešić  Linguistics  Researcher  2022 - 2025  470 
4.  56348  Peter Rupnik    Technical associate  2022 - 2025  93 
0581  University of Ljubljana, Faculty of Arts
no. Code Name and surname Research area Role Period No. of publications
1.  27674  PhD Špela Arhar Holdt  Linguistics  Researcher  2022 - 2025  275 
2.  36914  PhD Jaka Čibej  Linguistics  Researcher  2022 - 2025  201 
3.  36491  PhD Kaja Dobrovoljc  Linguistics  Researcher  2024 - 2025  197 
4.  16313  PhD Apolonija Gantar  Linguistics  Researcher  2022 - 2025  234 
5.  33796  PhD Iztok Kosem  Linguistics  Researcher  2022 - 2025  349 
6.  26166  PhD Simon Krek  Linguistics  Researcher  2022 - 2025  420 
7.  57100  Nejc Robida  Linguistics  Researcher  2022 - 2025  31 
8.  05799  PhD Vera Smole  Linguistics  Researcher  2022 - 2025  532 
9.  19059  PhD Mojca Smolej  Humanities  Researcher  2022 - 2025  379 
10.  11651  PhD Marko Stabej  Linguistics  Researcher  2022 - 2025  653 
0618  Research Centre of the Slovenian Academy of Sciences and Arts
no. Code Name and surname Research area Role Period No. of publications
1.  15689  PhD Helena Dobrovoljc  Linguistics  Researcher  2022 - 2025  412 
2.  32205  PhD Januška Gostenčnik  Linguistics  Researcher  2022 - 2025  144 
3.  37555  PhD Janoš Ježovnik  Linguistics  Researcher  2022 - 2025  126 
4.  10288  PhD Carmen Kenda-Jež  Linguistics  Researcher  2022 - 2025  319 
5.  34592  PhD Tanja Mirtič  Linguistics  Researcher  2023 - 2025  99 
6.  10353  PhD Jožica Škofic  Linguistics  Researcher  2022 - 2025  705 
1538  University of Ljubljana, Faculty of Electrical Engineering
no. Code Name and surname Research area Role Period No. of publications
1.  11805  PhD Simon Dobrišek  Computer science and informatics  Researcher  2022 - 2025  296 
2.  31985  PhD Janez Križaj  Systems and cybernetics  Researcher  2022 - 2025  43 
1539  University of Ljubljana, Faculty of Computer and Information Science
no. Code Name and surname Research area Role Period No. of publications
1.  16154  PhD Marko Bajec  Computer science and informatics  Researcher  2022 - 2025  501 
2.  21404  PhD Iztok Lebar Bajec  Computer science and informatics  Researcher  2022 - 2025  198 
1822  University of Primorska, Faculty of Humanities
no. Code Name and surname Research area Role Period No. of publications
1.  32126  PhD Klara Šumenjak  Linguistics  Researcher  2022 - 2025  60 
2.  27530  PhD Jana Volk  Linguistics  Researcher  2022 - 2025  134 
1986  ALPINEON R & D
no. Code Name and surname Research area Role Period No. of publications
1.  12000  PhD Jerneja Žganec Gros  Computer science and informatics  Researcher  2022 - 2025  292 
2565  University of Maribor, Faculty of Arts
no. Code Name and surname Research area Role Period No. of publications
1.  12507  PhD Mihaela Koletnik  Linguistics  Researcher  2022 - 2025  558 
2.  20763  PhD Mira Krajnc Ivič  Humanities  Researcher  2022 - 2025  251 
3.  18502  PhD Melita Zemljak Jontes  Linguistics  Researcher  2022 - 2025  509 
Abstract
Spoken language resources are scarce and underdeveloped compared to written language resources, especially for small languages like Slovenian. To perform basic research on spoken language or speech technologies with significant scientific impact, the problem of scarce spoken language resources needs to be addressed first. However, the development of spoken language resources is not only a matter of applied data collection; it also opens up a number of basic research questions. These research questions will be addressed in this project, with a focus on the Slovenian language.

This is a large project proposal, divided into 4 Work Packages (WPs), each including 2-4 tasks, 14 tasks altogether. 4 tasks are solely linguistic, 2 tasks are solely technical, while the majority of the tasks (8) are interdisciplinary. The specific objectives of the WPs and their corresponding tasks are as follows:

WP1: ACQUIRING RECORDINGS OF SPEECH
- Objective 1.1: Analyse the needs for spoken language resources in different linguistic and technical disciplines.
- Objective 1.2: Analyse the advantages and disadvantages of different recording techniques, with particular attention to crowdsourcing as a time- and money-efficient technique.
- Objective 1.3: Evaluate the efficiency of speech recognition models trained on domain-specific speech data obtained with low-cost unsupervised or semi-supervised techniques, compared to general-domain data obtained with high-cost techniques.
- Objective 1.4: Identify speech/speaker tasks that need further investment in labelled data for Slovene speech recognition.

WP2: DIALECT VARIATION
- Objective 2.1: Geolinguistic analysis of selected phonetic features, creation of diachronic phonetic maps of the non-standard phonetic inventory, and creation of a proposal for the standardisation of Slovenian dialect transcription and its conversion into IPA (and SAMPA).
- Objective 2.2: Creation of synthetic synchronic phonetic maps to define the areas of non-standard phonemes in Slovenian dialects, and recommendations to improve pronunciation-based transcription for the Slovenian spoken corpus.
- Objective 2.3: Creation and testing of diasystemic contrastive tables of phonemes (dialect vs. standard), and establishment of standards for phonetic transcription in spoken corpora.
- Objective 2.4: Definition and evaluation of an optimal Slovenian phoneme set for speech recognition, taking into account newly defined dialect phonemes, similarity metrics and various available speech data.

WP3: SPEECH SEGMENTATION AND ANNOTATION
- Objective 3.1: Evaluation of the existing speech segments/utterances in Slovene spoken language resources regarding their appropriateness as the basic units for the analysis of speech on the syntactic and semantic levels.
- Objective 3.2: Analysis of different types of disfluencies in spoken text, creation of a disfluency training corpus, and experiments in the automatic annotation of disfluencies.
- Objective 3.3: Development of a linguistic processing pipeline based on speech and transcription data (both manual and automatic), and linguistic annotation of the GOS 2.0 corpus.
- Objective 3.4: Evaluation of the GORDAN dialogue act annotation scheme, its adjustment to the ISO 24617-2 standard, and creation of a training corpus with dialogue act annotations.

WP4: SPOKEN LEXIS
- Objective 4.1: Evaluation of existing information on spoken Slovene in the Sloleks lexicon, and creation of linguistically sound guidelines for the inclusion of (non-standard) spoken data in Sloleks, comparable with machine-readable lexicons for other languages.
- Objective 4.2: Analysis of existing semantic information included in lexicographic resources for Slovene from the perspective of spoken Slovene, together with analysis of the complementary spoken corpus data, and exploration of the principles for including the findings in lexicographic resources.