1st Multilingual Model Workshop - FinGPT: Large Generative Models for a Small Language
Cerebras Systems

Published on Feb 12, 2024

Pre-training large language models (LLMs) requires enormous amounts of text, far exceeding the resources available for smaller languages. Sampo explores the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world's population. In this talk, Sampo describes the curation of an extensive Finnish dataset combining web crawls, news, social media, and eBooks, and two approaches to pre-training models: 1) training seven monolingual models from scratch (186M to 13B parameters), and 2) continuing the pre-training of the multilingual BLOOM model on a mix of its original training data and Finnish.
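The second approach, continued pre-training on a mix of the original BLOOM data and Finnish, amounts to sampling training examples from the two corpora at some chosen ratio. Below is a minimal illustrative sketch of such mixing; the function name, the 50/50 ratio, and the placeholder corpora are assumptions for illustration and are not taken from the talk.

```python
import random

def mixed_stream(original_docs, finnish_docs, finnish_ratio=0.5, seed=0):
    """Yield documents for continued pre-training, drawing from the Finnish
    corpus with probability `finnish_ratio` and from the model's original
    (multilingual) corpus otherwise.

    Note: the ratio and helper names here are illustrative assumptions,
    not values reported in the talk.
    """
    rng = random.Random(seed)
    original_iter = iter(original_docs)
    finnish_iter = iter(finnish_docs)
    while True:
        try:
            if rng.random() < finnish_ratio:
                yield next(finnish_iter)
            else:
                yield next(original_iter)
        except StopIteration:
            # Stop when either corpus runs out; a real pipeline would
            # typically reshuffle and cycle instead of stopping.
            return

# Toy usage with placeholder corpora.
original = [f"multilingual doc {i}" for i in range(10)]
finnish = [f"suomenkielinen dokumentti {i}" for i in range(10)]
for doc in list(mixed_stream(original, finnish, finnish_ratio=0.5))[:5]:
    print(doc)
```

Mixing in the original data this way is a common safeguard against the model forgetting its multilingual abilities while it adapts to Finnish.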
