Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language
Code | Science | Field | Subfield
6.05.00 | Humanities | Linguistics |

Code | Science | Field
6.02 | Humanities | Languages and Literature
spoken language resources, spoken language, research of speech, language technologies, speech technologies, corpus linguistics, lexicography
Data for the last 5 years (citations for the last 10 years) as of October 15, 2025; data for the A3 score calculation refer to the period 2020-2024.
Data for ARIS tenders (04.04.2019 – Programme tender, archive)
Database | Linked records | Citations | Pure citations | Average pure citations
WoS | 287 | 3,716 | 3,400 | 11.85
Scopus | 564 | 7,249 | 6,362 | 11.28
Organisations (9), Researchers (38)
0796 University of Maribor, Faculty of Electrical Engineering and Computer Science
0106 Jožef Stefan Institute
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 05023 | PhD Tomaž Erjavec | Linguistics | Researcher | 2022 - 2025 | 694
2. | 55962 | Taja Kuzman | Linguistics | Researcher | 2022 - 2025 | 113
3. | 36871 | PhD Nikola Ljubešić | Linguistics | Researcher | 2022 - 2025 | 470
4. | 56348 | Peter Rupnik | | Technical associate | 2022 - 2025 | 93
0581 University of Ljubljana, Faculty of Arts
0618 Research Centre of the Slovenian Academy of Sciences and Arts
1538 University of Ljubljana, Faculty of Electrical Engineering
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 11805 | PhD Simon Dobrišek | Computer science and informatics | Researcher | 2022 - 2025 | 296
2. | 31985 | PhD Janez Križaj | Systems and cybernetics | Researcher | 2022 - 2025 | 43
1539 University of Ljubljana, Faculty of Computer and Information Science
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 16154 | PhD Marko Bajec | Computer science and informatics | Researcher | 2022 - 2025 | 501
2. | 21404 | PhD Iztok Lebar Bajec | Computer science and informatics | Researcher | 2022 - 2025 | 198
1822 University of Primorska, Faculty of Humanities
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 32126 | PhD Klara Šumenjak | Linguistics | Researcher | 2022 - 2025 | 60
2. | 27530 | PhD Jana Volk | Linguistics | Researcher | 2022 - 2025 | 134
1986 ALPINEON R & D
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 12000 | PhD Jerneja Žganec Gros | Computer science and informatics | Researcher | 2022 - 2025 | 292
2565 University of Maribor Faculty of Arts
Abstract
Spoken language resources are scarce and underdeveloped compared to written language resources, especially for small languages such as Slovenian. Basic research on spoken language or speech technologies can only achieve significant scientific impact once this scarcity has been addressed. However, the development of spoken language resources is not merely a matter of applied data collection; it also opens up a number of basic research questions. This project addresses these questions, with a focus on the Slovenian language.
This large project proposal is divided into 4 Work Packages (WPs), each comprising 2-4 tasks, for 14 tasks altogether. Four tasks are solely linguistic and two are solely technical, while the majority (8) are interdisciplinary. The specific objectives of the WPs and their corresponding tasks are as follows:
WP1: ACQUIRING RECORDINGS OF SPEECH
- Objective 1.1: Analyse the needs for spoken language resources in different linguistic and technical disciplines.
- Objective 1.2: Analyse the advantages and disadvantages of different recording techniques, with particular attention to crowdsourcing as a time- and cost-efficient technique.
- Objective 1.3: Evaluate the efficiency of speech recognition models trained on domain-specific speech data obtained with low-cost unsupervised or semi-supervised techniques, compared to general-domain data obtained with high-cost techniques.
- Objective 1.4: Identify speech/speaker tasks that need further investment in labelled data for Slovene speech recognition.
WP2: DIALECT VARIATION
- Objective 2.1: Geolinguistic analysis of selected phonetic features, creation of diachronic phonetic maps of the non-standard phonetic inventory, and creation of a proposal for the standardisation of Slovenian dialect transcription and its conversion into IPA (and SAMPA).
- Objective 2.2: Creation of synthetic synchronic phonetic maps to define the areas of non-standard phonemes in Slovenian dialects. Making recommendations to improve pronunciation-based transcription for the Slovenian spoken corpus.
- Objective 2.3: Creation and testing of diasystemic contrastive tables of phonemes (dialect vs. standard). Establishment of standards for the phonetic transcription of spoken corpora.
- Objective 2.4: Definition and evaluation of an optimal Slovenian phoneme set for speech recognition, taking into account newly defined dialect phonemes, similarity metrics and various available speech data.
WP3: SPEECH SEGMENTATION AND ANNOTATION
- Objective 3.1: Evaluation of the existing speech segments/utterances in Slovene spoken language resources regarding their appropriateness as the basic units for the analysis of speech at the syntactic and semantic levels.
- Objective 3.2: Analysis of different types of disfluencies in spoken text, creation of a disfluency training corpus, and experiments in the automatic annotation of disfluencies.
- Objective 3.3: Development of a linguistic processing pipeline based on speech and transcription data (both manual and automatic), and linguistic annotation of the GOS 2.0 corpus.
- Objective 3.4: Evaluation of the GORDAN dialogue act annotation scheme, its adjustment to the ISO 24617-2 standard, and creation of a training corpus with dialogue act annotations.
WP4: SPOKEN LEXIS
- Objective 4.1: Evaluation of the existing information on spoken Slovene in the Sloleks lexicon, and creation of linguistically sound guidelines for the inclusion of (non-standard) spoken data in Sloleks, comparable with machine-readable lexicons for other languages.
- Objective 4.2: Analysis of the existing semantic information in lexicographic resources for Slovene from the perspective of spoken Slovene, together with analysis of the complementary spoken corpus data, and exploration of the principles for including the findings in lexicographic resources.