
AI Insights: Machine Learning Meets Physics: The 2024 Nobel Prize Story and .....

Newsletter Series by Marktechpost.com

Hi There,

Dive into the hottest AI breakthroughs of the week—handpicked just for you!

Super Important AI News 🔥 🔥 🔥

🎃 Machine Learning Meets Physics: The 2024 Nobel Prize Story

⭐ Can we improve retrieval for RAG by learning from neighboring contexts? Contextual Document Embeddings show how incorporating information from neighboring documents during training and encoding can create "context-aware" embeddings that significantly improve retrieval performance, especially in out-of-domain scenarios.

📍 Anthropic challenges OpenAI with affordable batch processing

🧲 Surya Table Recognition Release: It uses a new architecture to outperform Table Transformer, the current SOTA open-source model

🔖 Qwen2.5-72B is now available to free tier users on the HF Serverless Inference API (with a generous quota)!

⛳ LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens

Featured AI Research 🛡️🛡️🛡️

EuroLLM Released: A Suite of Open-Weight Multilingual Language Models (EuroLLM-1.7B and EuroLLM-1.7B-Instruct) Capable of Understanding and Generating Text in All Official European Union Languages

Key Takeaways:

Project Overview: The EuroLLM project aims to create multilingual large language models (LLMs) capable of understanding and generating text in all official European Union languages and several additional relevant languages.

Initial Models: The initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, were developed to improve multilingual support, particularly for underrepresented European languages.

Data Collection: The training corpus combined web data, parallel data, code/math data, and high-quality sources such as Wikipedia and arXiv. The data was filtered and mixed to optimize multilingual performance.

Scaling Laws: Scaling laws were employed to optimize the inclusion of parallel data and to determine the appropriate mixture of high-quality data to enhance the model's performance.

Multilingual Tokenizer: A multilingual tokenizer was developed with a large vocabulary (128,000 pieces) to accommodate the diverse set of languages while balancing efficiency and effectiveness (see the tokenizer-training sketch after this list).

Modeling and Training: EuroLLM uses a standard dense Transformer architecture with enhancements such as grouped query attention (GQA), RMSNorm, and SwiGLU activation. Training was performed on 4 trillion tokens using a mix of data sources (see the architecture sketch after this list).

Performance: The models were evaluated using multilingual general benchmarks and machine translation tasks, showing competitive performance across European languages and outperforming some baseline models on several tasks.
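The "Multilingual Tokenizer" takeaway mentions a 128,000-piece vocabulary. The snippet below is a hedged sketch of how such a tokenizer could be trained with the SentencePiece library; the corpus file, output prefix, BPE choice, and coverage settings are illustrative assumptions, not EuroLLM's actual configuration.

```python
# Hedged sketch: training a large multilingual subword tokenizer with SentencePiece.
# The 128k vocabulary size comes from the takeaway above; file names and other
# settings here are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",   # hypothetical text file, one sentence per line
    model_prefix="multilingual_tok",   # hypothetical output prefix
    model_type="bpe",                  # assumption; the actual algorithm may differ
    vocab_size=128_000,                # large vocab to cover many EU languages
    character_coverage=0.9995,         # keep rare characters from diverse scripts
    byte_fallback=True,                # fall back to bytes for unseen characters
)

# Load the trained model and inspect how it segments mixed-language text.
sp = spm.SentencePieceProcessor(model_file="multilingual_tok.model")
print(sp.encode("Olá mundo! Hello world! Γειά σου κόσμε!", out_type=str))
```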
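The "Modeling and Training" takeaway names grouped query attention, RMSNorm, and SwiGLU without showing what they look like. Below is a minimal PyTorch sketch of one decoder block built from those components; it is not EuroLLM's actual code, and the dimensions (64 features, 8 query heads sharing 2 key/value heads) are made-up demo values.

```python
# Minimal, illustrative sketch (not the official EuroLLM implementation) of the
# components mentioned above: RMSNorm, SwiGLU, and grouped-query attention (GQA).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features (no mean subtraction).
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: a SiLU-gated feed-forward layer, common in recent dense Transformers.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # GQA: each group of query heads shares one key/value head.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class DecoderBlock(nn.Module):
    """One pre-norm decoder block: RMSNorm -> GQA -> RMSNorm -> SwiGLU, with residuals."""
    def __init__(self, dim=64, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        x = x + self.mlp(self.mlp_norm(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)       # (batch, sequence, features)
    print(DecoderBlock()(x).shape)   # torch.Size([2, 16, 64])
```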

Thursday, October 17

Join over 300 GenAI executives from Bayer, Microsoft, and Flagship Pioneering to learn how to build fast, accurate AI search on object storage.

Build a solid data foundation for your GenAI. Act fast—tickets are limited!

Dive deep into cutting-edge strategies for building GenAI-based research systems and multi-modal RAG solutions

Unlock the secrets of lightning-fast search on object storage for AI applications

Learn from 15+ industry pioneers at Fortune 500 companies and top research institutions how to build highly accurate, LLM-powered solutions that deliver real business value

Network with peers tackling similar challenges in GenAI implementation and scaling

Other AI News 🎖️🎖️🎖️

🎯 Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

♦️ Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

🧩 Meet Open NotebookLM: An Open Source Alternative to Google's NotebookLM

📢 Have you heard about @karpathy's tweet on LLMs and Markov chains? Here is a research paper building on that idea

🥁 Hex-LLM: A New LLM Serving Framework Designed for Efficiently Serving Open LLMs on Google Cloud TPUs

🎙️ SEAL: A Dual-Encoder Framework Enhancing Hierarchical Imitation Learning with LLM-Guided Sub-Goal Representations