AI Insights: Machine Learning Meets Physics: The 2024 Nobel Prize Story and .....

Newsletter Series by Marktechpost.com
Hi There,
Dive into the hottest AI breakthroughs of the week—handpicked just for you!
Super Important AI News 🔥 🔥 🔥
🎃 Machine Learning Meets Physics: The 2024 Nobel Prize Story
⭐ Can we improve retrieval for RAG by learning from neighboring contexts? Contextual Document Embeddings shows how incorporating information from neighboring documents during training and encoding can create "context-aware" embeddings that significantly improve retrieval performance, especially in out-of-domain scenarios (see the sketch after this list).
📍 Anthropic challenges OpenAI with affordable batch processing
🧲 Surya Table Recognition Release: It uses a new architecture that outperforms Table Transformer, the current SoTA open-source model
🔖 Qwen2.5-72B is now available to free tier users on the HF Serverless Inference API (with a generous quota)!
⛳ LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens
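To make the contextual-embedding idea above concrete, here is a minimal, heavily simplified sketch. The model name, corpus, and the `context_aware_embedding` helper are all illustrative assumptions, not the paper's actual two-stage architecture: the point is only that a document's retrieval vector is conditioned on embeddings of neighboring corpus documents at encoding time.

```python
# Crude stand-in for "context-aware" embeddings: blend a document's embedding
# with the mean embedding of neighboring corpus documents. Model name, corpus,
# and blending weight are placeholders, not the paper's method.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic bi-encoder stand-in

corpus = [
    "The 2024 Nobel Prize in Physics recognized work on neural networks.",
    "Hopfield networks store patterns as energy minima.",
    "Boltzmann machines learn distributions over binary units.",
]

def context_aware_embedding(doc: str, neighbors: list[str], alpha: float = 0.5):
    """Blend a document embedding with the mean of its neighbors' embeddings.

    Illustrates the 'use the surrounding corpus at encoding time' idea only;
    the actual method trains a two-stage conditioning architecture.
    """
    doc_vec = encoder.encode(doc, normalize_embeddings=True)
    ctx_vec = encoder.encode(neighbors, normalize_embeddings=True).mean(axis=0)
    vec = alpha * doc_vec + (1 - alpha) * ctx_vec
    return vec / np.linalg.norm(vec)

emb = context_aware_embedding(corpus[0], corpus[1:])
print(emb.shape)  # (384,)
```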
Featured AI Research 🛡️🛡️🛡️
EuroLLM Released: A Suite of Open-Weight Multilingual Language Models (EuroLLM-1.7B and EuroLLM-1.7B-Instruct) Capable of Understanding and Generating Text in All Official European Union Languages
Key Takeaways:
Project Overview: The EuroLLM project aims to create multilingual large language models (LLMs) capable of understanding and generating text in all official European Union languages and several additional relevant languages.
Initial Models: The initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, were developed to improve multilingual support, particularly for underrepresented European languages.
Data Collection: The training corpus involved a combination of web data, parallel data, code/math data, and high-quality sources like Wikipedia and ArXiv. The data was filtered and mixed to optimize multilingual performance.
Scaling Laws: Scaling laws were employed to optimize the inclusion of parallel data and to determine the appropriate mixture of high-quality data to enhance the model's performance.
Multilingual Tokenizer: A multilingual tokenizer was developed with a large vocabulary (128,000 pieces) to accommodate the diverse set of languages while balancing efficiency and effectiveness (a tokenizer-training sketch follows this list).
Modeling and Training: EuroLLM uses a standard dense Transformer architecture with specific enhancements, namely grouped query attention (GQA), RMSNorm, and SwiGLU activation (a minimal architecture sketch follows this list). Training was performed on 4 trillion tokens drawn from the mixed data sources above.
Performance: The models were evaluated using multilingual general benchmarks and machine translation tasks, showing competitive performance across European languages and outperforming some baseline models on several tasks.
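Below is a minimal sketch of how a large multilingual BPE tokenizer like the one described above could be trained with the Hugging Face tokenizers library. The corpus file name and special tokens are placeholders; this is not EuroLLM's actual tokenizer-training code, only an illustration of targeting a 128,000-piece vocabulary.

```python
# Illustrative only: trains a byte-level BPE tokenizer with a 128k vocabulary,
# mirroring the vocabulary size reported for EuroLLM (file names are placeholders).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=128_000,                       # large vocab to cover all EU languages
    special_tokens=["<s>", "</s>", "<pad>"],  # placeholder special tokens
    show_progress=True,
)

# multilingual_corpus.txt is a stand-in for the mixed web/parallel/code corpus.
tokenizer.train(files=["multilingual_corpus.txt"], trainer=trainer)
tokenizer.save("eurollm_style_tokenizer.json")
```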

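And here is a compact, self-contained PyTorch sketch of one decoder block combining the components named in the takeaways: grouped query attention (GQA), RMSNorm, and a SwiGLU feed-forward layer. The dimensions and head counts are illustrative defaults, not EuroLLM-1.7B's actual configuration.

```python
# Minimal pre-norm decoder block with GQA, RMSNorm, and SwiGLU.
# Hyperparameters are illustrative, not the EuroLLM configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(x W_gate) * (x W_up), projected back to dim."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GQABlock(nn.Module):
    """One pre-norm Transformer decoder block with grouped query attention."""
    def __init__(self, dim: int = 2048, n_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # GQA: each group of query heads shares a single key/value head.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        x = x + self.o_proj(attn)           # attention residual
        return x + self.ffn(self.ffn_norm(x))  # feed-forward residual


if __name__ == "__main__":
    block = GQABlock()
    out = block(torch.randn(1, 8, 2048))
    print(out.shape)  # torch.Size([1, 8, 2048])
```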