Hi there,
Dive into the hottest AI breakthroughs of the week—handpicked just for you!
Super Important AI News 🔥 🔥 🔥
⭐ Can we improve retrieval for RAG by learning from neighboring contexts? Contextual Document Embeddings shows how incorporating information from neighboring documents, during both training and encoding, can produce "context-aware" embeddings that significantly improve retrieval performance, especially in out-of-domain scenarios (a toy sketch of the idea follows below).
🧲 Surya Table Recognition Released: It uses a new architecture that outperforms Table Transformer, the current SoTA open-source model.
⛳ LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens
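For intuition on the contextual-embedding item above, here is a minimal PyTorch sketch of the general idea, not the paper's actual architecture. The ToyContextualEmbedder class, the attention-based conditioning, and the random feature vectors are all illustrative assumptions: a target document's embedding is refined by attending over embeddings of its corpus neighbors.

```python
import torch
import torch.nn as nn

class ToyContextualEmbedder(nn.Module):
    """Illustrative two-stage embedder (hypothetical, not the paper's model):
    stage 1 embeds each document independently; stage 2 refines the target
    document's embedding by attending over its corpus neighbors."""

    def __init__(self, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.attend = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, doc: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # doc: (batch, feat_dim); neighbors: (batch, n_neighbors, feat_dim)
        d = self.encode(doc).unsqueeze(1)      # (batch, 1, embed_dim)
        ctx = self.encode(neighbors)           # (batch, n_neighbors, embed_dim)
        refined, _ = self.attend(d, ctx, ctx)  # document attends to its corpus context
        return self.norm(d + refined).squeeze(1)  # residual + norm -> context-aware vector

# Random features stand in for real document representations.
model = ToyContextualEmbedder(feat_dim=128)
doc = torch.randn(2, 128)           # 2 target documents
neighbors = torch.randn(2, 8, 128)  # 8 corpus neighbors each
print(model(doc, neighbors).shape)  # torch.Size([2, 256])
```

The point of the design is that the same document can receive a different embedding depending on the corpus it sits in, which is what makes the representation context-aware.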
Featured AI Research 🛡️🛡️🛡️
EuroLLM Released: A Suite of Open-Weight Multilingual Language Models (EuroLLM-1.7B and EuroLLM-1.7B-Instruct) Capable of Understanding and Generating Text in All Official European Union Languages
Key Takeaways:
Project Overview: The EuroLLM project aims to create multilingual large language models (LLMs) capable of understanding and generating text in all official European Union languages and several additional relevant languages.
Initial Models: The initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, were developed to improve multilingual support, particularly for underrepresented European languages.
Data Collection: The training corpus involved a combination of web data, parallel data, code/math data, and high-quality sources like Wikipedia and ArXiv. The data was filtered and mixed to optimize multilingual performance.
Scaling Laws: Scaling laws were employed to optimize the inclusion of parallel data and to determine the appropriate mixture of high-quality data to enhance the model's performance.
Multilingual Tokenizer: A multilingual tokenizer was developed with a large vocabulary (128,000 pieces) to accommodate the diverse set of languages while balancing efficiency and effectiveness.
Modeling and Training: EuroLLM uses a standard dense Transformer architecture with enhancements such as grouped query attention (GQA), RMSNorm, and SwiGLU activations; training was performed on 4 trillion tokens drawn from the mixed data sources (a minimal sketch of these components appears after the takeaways).
Performance: The models were evaluated using multilingual general benchmarks and machine translation tasks, showing competitive performance across European languages and outperforming some baseline models on several tasks.
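To make the architecture bullet concrete, here is a minimal PyTorch sketch of the three named components. It is a toy illustration with made-up dimensions, not EuroLLM's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS of the features,
    with a learned gain and no mean-centering (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(gate(x)) elementwise-multiplies a linear
    'up' projection before projecting back down to the model dimension."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """GQA: many query heads share a smaller set of key/value heads, shrinking
    the KV cache relative to full multi-head attention.
    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    group = q.shape[1] // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Toy shapes: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # torch.Size([1, 8, 16, 64])
```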
Thursday, October 17
Join over 300 GenAI executives from Bayer, Microsoft, and Flagship Pioneering to learn how to build fast, accurate AI search on object storage.
Build a solid data foundation for your GenAI. Act fast—tickets are limited!
✅ Dive deep into cutting-edge strategies for building GenAI-based research systems and multi-modal RAG solutions
✅ Unlock the secrets of lightning-fast search on object storage for AI applications
✅ Learn from 15+ industry pioneers from Fortune 500 companies and top research institutions on how to build highly accurate, LLM-powered solutions that deliver real business value
✅ Network with peers tackling similar challenges in GenAI implementation and scaling
Other AI News 🎖️🎖️🎖️
🥁 Hex-LLM: A New Framework Designed for Efficiently Serving Open LLMs on Google Cloud TPUs
🎙️ SEAL: A Dual-Encoder Framework Enhancing Hierarchical Imitation Learning with LLM-Guided Sub-Goal Representations