A Treebank Approach to the Study of Spoken Slovenian
Code | Science | Field | Subfield
6.05.00 | Humanities | Linguistics |

Code | Science | Field
6.02 | Humanities | Languages and Literature
Keywords: spoken language, spoken grammar; syntactically annotated corpora, treebanks, dependency syntax, syntactic trees; corpus linguistics, corpus-driven research, comparing corpora, language variation
Organisations (1), Researchers (1)
0581 University of Ljubljana, Faculty of Arts
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 36491 | PhD Kaja Dobrovoljc | Linguistics | Head | 2022 - 2025 | 197
Abstract
Building on the unitary approach to the study of language, in which speech and writing are seen as two ends of the same continuum, corpus linguistics has seen an unprecedented increase over the past three decades in research describing speech-specific syntactic phenomena that traditional grammatical frameworks have ignored or insufficiently addressed. This trend is significantly less pronounced in Slovenian linguistics, where research on the syntactic characteristics of spoken Slovenian is still scarce and has mostly focused on top-down investigations of individual syntactic phenomena, based on qualitative analyses of relatively small amounts of data.
To bridge this gap and establish the necessary empirical foundations for future grammatical descriptions of spoken Slovenian, this project will systematically investigate the potential of syntactically annotated corpora, i.e. treebanks, for linguistic research on spoken Slovenian by (1) establishing a coherent framework for the syntactic annotation of spoken Slovenian, (2) providing a high-quality treebank of spoken Slovenian, and (3) developing a methodology for its bottom-up, statistics-driven linguistic analysis, while (4) promoting the use of syntactically annotated corpora in linguistics in general.
Specifically, we will significantly improve the current version of the Spoken Slovenian Treebank (Dobrovoljc and Nivre 2016), the only syntactically annotated corpus of spoken Slovenian to date, in terms of size, documentation, and annotation quality. The new treebank will then be used to perform a pioneering bottom-up identification of speech-specific syntactic patterns in spoken Slovenian by means of a keyness analysis, resulting in a list of syntactic trees that occur with a statistically significantly higher frequency in speech than in writing. We expect the in-depth analysis of this list to empirically confirm the known, prototypical, cognitively most salient speech-specific syntactic phenomena on the one hand, and to lead to the potential discovery of previously unidentified, statistically most salient patterns of spoken language use on the other.
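As an illustration of what such a keyness analysis could look like, the following is a minimal sketch and not the project's actual method. It assumes that each syntactic tree has already been serialised to a string and counted in a spoken and a written corpus, and it assumes the log-likelihood ratio (G2) as the keyness statistic, which the abstract does not specify. The function names, the pattern strings, and all counts are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """G2 statistic for a pattern seen `a` times in corpus A and `b` times in corpus B."""
    # Expected counts if the pattern were equally likely in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


def key_patterns(speech, writing, threshold=3.84):
    """Patterns over-represented in speech, sorted by G2 (highest first).

    The default threshold 3.84 is the chi-square critical value for
    p < 0.05 with one degree of freedom; no correction for multiple
    comparisons is applied in this sketch.
    """
    n_speech = sum(speech.values())
    n_writing = sum(writing.values())
    results = []
    for pattern, a in speech.items():
        b = writing.get(pattern, 0)
        # Keep only patterns that are relatively more frequent in speech.
        if a / n_speech <= b / n_writing:
            continue
        g2 = log_likelihood(a, b, n_speech, n_writing)
        if g2 >= threshold:
            results.append((pattern, g2))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical toy counts of serialised tree patterns (not real data).
speech_counts = Counter({"discourse > VERB": 120, "nsubj < VERB": 300, "reparandum > NOUN": 40})
writing_counts = Counter({"discourse > VERB": 5, "nsubj < VERB": 900})

for pattern, score in key_patterns(speech_counts, writing_counts):
    print(f"{pattern}\tG2={score:.1f}")
```

In this toy setting, the patterns containing the discourse and reparandum relations are returned as key for speech, while the subject pattern, which is relatively more frequent in the written counts, is filtered out. A real analysis would additionally need to decide how trees are delimited and serialised, and how to handle corpus size differences and multiple testing.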
Thus, the project will make several important contributions to Slovenian linguistics by providing new resources, methods, and analyses for the study of spoken Slovenian. It will also contribute to corpus linguistics more broadly by providing new insights into the so far underexploited methodological potential of syntactically parsed corpora, both for spoken language studies and for studies of language variation.