Embeddings-based techniques for Media Monitoring Applications (EMMA)
Code | Science | Field | Subfield
2.07.00 | Engineering sciences and technologies | Computer science and informatics |
Code | Science | Field
1.02 | Natural Sciences | Computer and information sciences
machine learning, text mining, natural language processing, deep neural networks, document representation, language models, embeddings, media monitoring
Data for the last 5 years (citations for the last 10 years) on October 15, 2025; data for score A3 calculation refer to period 2020-2024
Data for ARIS tenders (04.04.2019 – Programme tender, archive)
Database | Linked records | Citations | Pure citations | Average pure citations
WoS | 274 | 6,424 | 5,972 | 21.8
Scopus | 398 | 10,266 | 9,341 | 23.47
Organisations (2), Researchers (16)
0106 Jožef Stefan Institute
1539 University of Ljubljana, Faculty of Computer and Information Science
No. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 55754 | Matej Klemen | Computer science and informatics | Young researcher | 2023 | 20
2. | 15295 | PhD Marko Robnik Šikonja | Computer science and informatics | Researcher | 2023 - 2025 | 473
3. | 50769 | PhD Tadej Škvorc | Computer science and informatics | Researcher | 2023 - 2025 | 18
4. | 56007 | Aleš Žagar | Computer science and informatics | Researcher | 2023 - 2025 | 35
Abstract
In machine learning, the analysis of big data remains a great challenge. The term big data refers to data characterised by its large volume, velocity, veracity, and variety. The proposed project tackles the challenge of the language variety and velocity (dynamics) of media content, which we address by using advanced text representation methods (embeddings) and deep learning. The increasing amount of media content spans a spectrum from traditional high-quality news to less reliable social media content. Media monitoring and analysis need to be performed in real time: grouping articles by their content, adding several categories of meta-information, summarising several news sources, performing analyses, and reporting. Clipping agencies, such as the Slovenian agency Kliping d.o.o., which will co-finance this industrial project, therefore face a challenging problem, because many analytical tasks have to be performed manually, especially in less-resourced languages, where many tools are non-existent or do not return results of sufficient quality. Kliping monitors over 70,000 traditional articles and over 1 million social media posts per day, resulting in more than 1,500 daily reports for their respective target users. The agency covers the Slovenian as well as the Western Balkans media space, so the content includes text in six different languages (Slovenian, Croatian, Bosnian, Serbian, Macedonian and Albanian) and two alphabets (Latin and Cyrillic). Recent machine learning techniques for advanced natural language processing, which are based on text embeddings and large pretrained language models, enable the development of advanced text analysis tools, for example for categorising texts by topic or sentiment and for summarising text from multiple sources.
However, even the best of these tools have to be adapted and improved to cope with the specific user needs, the complexity of news category hierarchies, metadata structures used in the news industry, and coverage of multiple languages. To this end, this project aims to develop advanced multilingual news and social media content analysis tools to help automate text analysis processes while increasing society’s ability to understand the rapid flow of information surrounding us.