LISBON, May 28, 2025 | Multilingual open-source initiatives EuroLLM and OpenEuroLLM have joined forces to secure 3 million GPU hours on Leonardo – one of Europe’s most powerful supercomputers – to develop a groundbreaking synthetic dataset covering 40 European languages.
The initiative was selected under the EuroHPC AI Factory Large Scale call, in recognition of its potential to advance Europe’s leadership in multilingual artificial intelligence.
At the heart of this initiative is a mission to build strategic autonomy for Europe in AI development. By generating high-quality, ethically sourced synthetic data, it addresses a long-standing gap in linguistic representation, in particular for low-resource and minority languages.
André Martins, Chief Scientific Officer at Unbabel and EuroLLM project co-lead, said:
“By joining forces through EuroLLM and OpenEuroLLM, we’re bringing together the research strength and open-source ethos needed to tackle one of Europe’s biggest AI challenges: linguistic inclusion at scale. This project is about ensuring Europe owns its language data, reflects its cultural diversity, and sets its own standards in responsible AI development.”
The GPU allocation will power the MultiSynt approach, a key component of the project which seeks to address one of the most persistent bottlenecks in multilingual LLM development: the lack of high-quality pre-training data.
“This is an important step in securing enough computing power to build OpenEuroLLM’s family of open LLMs. I am also glad that this has been done in collaboration with the experienced team from the EuroLLM project. The goal of this subproject is to explore multilingual synthetic data creation and evaluate its use in order to reach a higher common goal: building high-quality multilingual LLMs for all European languages and beyond,” said Jan Hajic of Charles University, coordinator of the OpenEuroLLM project.
While most synthetic data generation for large language models to date has focused on English, MultiSynt will create the first comprehensive multilingual synthetic dataset designed specifically for pre-training. By leveraging generative models to enhance and diversify existing content, it will support the broader aims of EuroLLM and OpenEuroLLM: building open-source, culturally grounded, and linguistically diverse AI for Europe.
This methodology will support linguistic diversity, open access, and data quality, and it aligns with the wider objectives of the European Commission’s Digital Decade and the AI Act.
The awarded 3 million GPU hours reflect a strong endorsement of the project’s technical merit and strategic value.
The initiative will be executed through phased releases of the synthetic dataset.
****ENDS****
About EuroLLM
The EuroLLM project includes Unbabel, Instituto Superior Técnico, the University of Edinburgh, Instituto de Telecomunicações, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam. Together they created EuroLLM-9B, a multilingual AI model supporting all 24 official EU languages. Developed with support from Horizon Europe, the European Research Council, and EuroHPC, this open-source LLM aims to enhance Europe’s digital sovereignty and foster AI innovation.
About OpenEuroLLM
Bringing together 20 of Europe’s leading AI companies, research institutions and EuroHPC centres, the OpenEuroLLM project is creating a new generation of open source large language models for European languages. Co-funded by the European Union’s Digital Europe Programme, the project is laying the foundations for AI infrastructure that will enhance competitiveness, resilience, and digital sovereignty.
About EuroHPC
The European High Performance Computing Joint Undertaking (EuroHPC JU) is a joint initiative between the EU, European countries, and private partners to develop a world-class supercomputing ecosystem in Europe.
Media Contacts:
For more information or interview requests, please reach out to the media contacts below:
• Unbabel: farah.pasha.ext@unbabel.com