J6-6842 — Final report
1.
Investigating computer-mediated communication

international scientific monograph that includes a chapter contributed by the project group

COBISS.SI-ID: 291555584
2.
Computer-mediated communication

special issue of a scientific journal, with most papers contributed by project group members

COBISS.SI-ID: 286873088
3.
JANES v0.4

The paper presents the current version of Janes, the Slovene corpus of netspeak, which contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present automatic and manual procedures for enriching the corpus with metadata, such as user type, gender and region, as well as text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow, which includes tokenization, sentence segmentation, rediacritisation, normalization, morphosyntactic tagging and lemmatization.

COBISS.SI-ID: 62245218
4.
Enabling access to corpora of Slovene internet texts in the light of legal restrictions

Web texts are becoming increasingly relevant sources of information, and web corpora are useful for corpus linguistic studies and the development of language technologies. Even though web texts are directly accessible, which substantially simplifies the collection procedure, the compilation of web corpora is still complex, time-consuming and expensive. It is crucial that such endeavours are not needlessly repeated, which is why the created corpora must be made easily and widely accessible both to researchers and to a wider audience. While this is logistically and technically a straightforward procedure, legal constraints, such as copyright, privacy and terms of use, severely hinder the dissemination of web corpora. This paper discusses the legal conditions in this area, gives an overview of current practices and, using the example of the Janes corpus of Slovene user-generated content, proposes a range of mitigation measures to ensure free and open dissemination of Slovene web corpora.

COBISS.SI-ID: 62288994
5.
The compilation, processing and analysis of the JANES corpus of Slovene user-generated content

the project's key overview publication on the creation and annotation of the corpus

COBISS.SI-ID: 64650338