Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

📰 arXiv cs.AI

Researchers introduce Aleph-Alpha-GermanWeb, a 628B-word German pre-training dataset created using a model-based data curation pipeline and synthetic data generation

Published 1 Apr 2026
Action Steps
  1. Combine heuristic and model-based filtering to curate high-quality data (a minimal filtering sketch follows this list)
  2. Generate synthetic data to augment the curated dataset and improve model performance (see the synthesis sketch below)
  3. Apply the full curation pipeline to build a large-scale pre-training dataset
  4. Evaluate the dataset's effect on LLM performance and training efficiency
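
A minimal Python sketch of what the two-stage filtering in step 1 could look like. The thresholds and the `quality_score` classifier are illustrative assumptions, not the paper's actual pipeline; in practice the scorer would be a trained German text-quality model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    text: str


def passes_heuristics(doc: Document) -> bool:
    """Cheap rule-based filters, applied before any model-based scoring."""
    words = doc.text.split()
    if len(words) < 50:  # too short to carry useful training signal
        return False
    # Mostly non-alphabetic text is usually markup or boilerplate debris.
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc.text) / max(len(doc.text), 1)
    return alpha_ratio >= 0.8


def curate(
    docs: Iterable[Document],
    quality_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[Document]:
    """Two-stage curation: cheap heuristics first, learned classifier second."""
    for doc in docs:
        if passes_heuristics(doc) and quality_score(doc.text) >= threshold:
            yield doc


# Example with a stand-in scorer; swap in a real quality classifier in practice.
sample = [Document("Ein ausreichend langer deutscher Beispieltext " * 12)]
kept = list(curate(sample, quality_score=lambda text: 0.9))
```

Running heuristics before the classifier keeps the expensive model-based stage off documents that cheap rules can already reject.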
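For step 2, one common recipe for LLM-based synthetic data generation is rephrasing curated passages. The sketch below follows that pattern; the `REPHRASE_PROMPT` and the `generate` callback are hypothetical stand-ins, not the authors' actual prompts or models.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical rephrasing prompt; the paper's actual prompts are not shown here.
REPHRASE_PROMPT = (
    "Formuliere den folgenden Text in klarem, gut strukturiertem Deutsch neu, "
    "ohne Fakten zu verändern:\n\n{passage}"
)


def synthesize(
    passages: Iterable[str],
    generate: Callable[[str], str],  # stand-in for any instruction-tuned LLM call
    n_variants: int = 2,
) -> Iterator[str]:
    """Yield synthetic training text by LLM-rephrasing curated passages."""
    for passage in passages:
        for _ in range(n_variants):
            yield generate(REPHRASE_PROMPT.format(passage=passage))


# Example with a dummy generator; a real LLM client would go here.
synthetic = list(
    synthesize(
        ["Berlin ist die Hauptstadt Deutschlands."],
        generate=lambda prompt: f"[LLM-Ausgabe für: {prompt[:40]}]",
    )
)
```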
Who Needs to Know This

NLP engineers and researchers building German-language LLMs can use this approach to improve model performance and training efficiency, and data scientists can adapt the same techniques to other languages and domains

Key Insight

💡 Data quality can significantly boost LLM performance and training efficiency, and model-based curation combined with synthetic data generation is an effective way to raise it

Share This
🚀 Improve German-language LLMs with model-based data curation & synthetic data generation!
Read full paper →