Marktechpost Newsletter: NVIDIA RankRAG + CodeGeeX4-ALL-9B + MInference + ESFT...

Good morning, AI aficionados! Today, we dive into the latest advancements and breakthroughs shaping the future of artificial intelligence. From cutting-edge research in machine learning to transformative applications in industries such as healthcare, finance, and entertainment, the AI landscape is evolving at a breathtaking pace. In this edition, we'll explore innovative AI models, discuss ethical considerations in AI deployment, and highlight success stories from startups to tech giants. Join us as we unravel the complexities and celebrate the triumphs of this ever-changing field, ensuring you stay at the forefront of AI innovation.

Stay curious and inspired!

– The MarkTechPost Team

NVIDIA

NVIDIA Introduces RankRAG: A Novel RAG Framework that Instruction-Tunes a Single LLM for the Dual Purposes of Top-k Context Ranking and Answer Generation in RAG

Researchers from NVIDIA and Georgia Tech introduced RankRAG, an innovative framework designed to enhance the capabilities of LLMs in RAG tasks. This approach uniquely instruction-tunes a single LLM to perform both context ranking and answer generation within the RAG framework. RankRAG expands on existing instruction-tuning datasets by incorporating context-rich question-answering, retrieval-augmented QA, and ranking datasets. This comprehensive training approach aims to improve the LLM's ability to filter out irrelevant contexts during both the retrieval and generation phases.

The framework introduces a specialized task that focuses on identifying relevant contexts or passages for given questions. This task is structured for ranking but framed as regular question-answering with instructions, aligning more effectively with RAG tasks. During inference, the LLM first reranks retrieved contexts before generating answers based on the refined top-k contexts. This versatile approach can be applied to a wide range of knowledge-intensive natural language processing tasks, offering a unified solution for improving RAG performance across diverse domains.
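To make the two-stage flow concrete, here is a minimal Python sketch of the rerank-then-generate idea, assuming a single instruction-tuned LLM exposed as a plain `generate_fn(prompt) -> str` callable. The prompt templates and the 0–10 scoring format are illustrative stand-ins, not NVIDIA's actual instructions.

```python
# Hypothetical sketch of RankRAG-style "rerank then generate" inference.
# `generate_fn` stands in for a single instruction-tuned LLM; prompts and
# the relevance-scoring format are illustrative assumptions.
from typing import Callable, List


def rerank_contexts(question: str,
                    contexts: List[str],
                    generate_fn: Callable[[str], str],
                    top_k: int = 5) -> List[str]:
    """Ask the same LLM to judge each retrieved passage's relevance,
    then keep the top_k highest-scoring passages."""
    scored = []
    for ctx in contexts:
        prompt = (
            "Instruction: Rate how relevant the passage is to the question "
            "on a scale from 0 to 10. Answer with a single number.\n"
            f"Question: {question}\nPassage: {ctx}\nRelevance:"
        )
        try:
            score = float(generate_fn(prompt).strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0
        scored.append((score, ctx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ctx for _, ctx in scored[:top_k]]


def answer_with_rankrag(question: str,
                        retrieved: List[str],
                        generate_fn: Callable[[str], str],
                        top_k: int = 5) -> str:
    """Two-stage inference: rerank the retrieved contexts, then generate an
    answer grounded only in the refined top-k contexts."""
    top_contexts = rerank_contexts(question, retrieved, generate_fn, top_k)
    context_block = "\n\n".join(top_contexts)
    prompt = (
        "Instruction: Answer the question using only the contexts below.\n"
        f"Contexts:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )
    return generate_fn(prompt)
```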

Tsinghua University

Tsinghua University Open Sources CodeGeeX4-ALL-9B: A Groundbreaking Multilingual Code Generation Model Outperforming Major Competitors and Elevating Code Assistance

The CodeGeeX4-ALL-9B model is a product of extensive training on the GLM-4-9B framework, which has markedly improved its capabilities in code generation. With a parameter count of 9.4 billion, this model stands out as one of the most powerful in its class, surpassing even larger general-purpose models. It excels in inference speed and overall performance, making it a versatile tool for various software development tasks.

One of the standout features of CodeGeeX4-ALL-9B is its ability to handle various functions seamlessly. This model covers all critical aspects of software development, from code completion and generation to code interpretation and web searches. It offers repository-level code Q&A, enabling developers to interact with their codebase more intuitively and efficiently. This comprehensive functionality makes CodeGeeX4-ALL-9B an invaluable asset for developers in diverse programming environments.
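For readers who want to try the model, below is a minimal sketch of loading it for code completion with Hugging Face Transformers. The repository id `THUDM/codegeex4-all-9b` and the `trust_remote_code` flag follow the usual GLM-family convention and should be verified against the official model card before use.

```python
# Minimal sketch: running CodeGeeX4-ALL-9B for code completion with
# Hugging Face Transformers. The repo id below is an assumption; check the
# official model card for the exact identifier and loading instructions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/codegeex4-all-9b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 9.4B params: bf16 keeps memory manageable
    device_map="auto",
    trust_remote_code=True,
)

prompt = "# Python function that returns the n-th Fibonacci number\ndef fib(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```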

Microsoft

MInference (Million-Tokens Inference): A Training-Free Efficient Method for the Pre-Filling Stage of Long-Context LLMs Based on Dynamic Sparse Attention

Researchers from Microsoft Corporation and the University of Surrey have developed MInference (Million-Tokens Inference), a method to speed up long-sequence processing in LLMs. By identifying three distinct attention patterns (A-shape, Vertical-Slash, and Block-Sparse), they optimize sparse calculations for GPUs. MInference dynamically builds sparse indices for these patterns during inference, significantly reducing latency without altering pre-training or requiring fine-tuning. Tests on various LLMs and benchmarks, such as LLaMA-3-8B-1M and InfiniteBench, show up to a 10x speedup, cutting the pre-filling stage from 30 minutes to 3 minutes on a single A100 GPU while maintaining accuracy.
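The sketch below illustrates the Vertical-Slash idea in plain PyTorch: approximate the attention map using only the last few queries, pick the dominant key columns (verticals) and diagonal offsets (slashes), and restrict attention to those positions. It is a conceptual toy, not Microsoft's optimized GPU kernel; the function names and default budgets are assumptions.

```python
# Conceptual sketch of a Vertical-Slash dynamic sparse pattern (toy version,
# not the official MInference kernel).
import torch


def vertical_slash_mask(q, k, n_last=64, n_vertical=256, n_slash=64):
    """q, k: [seq, head_dim]. Returns a boolean [seq, seq] causal mask that
    keeps only the selected vertical columns and slash diagonals."""
    seq, dim = q.shape
    n_last = min(n_last, seq)

    # Cheap estimate: attention of only the last n_last queries over all keys
    # (causal masking of this estimate is omitted for brevity).
    approx = torch.softmax(q[-n_last:] @ k.T / dim ** 0.5, dim=-1)  # [n_last, seq]

    # Verticals: key columns with the highest total estimated attention mass.
    vert_idx = approx.sum(dim=0).topk(min(n_vertical, seq)).indices

    # Slashes: diagonal offsets (query index minus key index) with most mass.
    rows = torch.arange(seq - n_last, seq).unsqueeze(1)       # absolute query ids
    offsets = rows - torch.arange(seq).unsqueeze(0)           # [n_last, seq]
    valid = offsets >= 0                                      # causal entries only
    slash_mass = torch.zeros(seq)
    slash_mass.scatter_add_(0, offsets[valid], approx[valid].float())
    slash_idx = slash_mass.topk(min(n_slash, seq)).indices

    mask = torch.zeros(seq, seq, dtype=torch.bool)
    mask[:, vert_idx] = True                                  # vertical lines
    all_off = torch.arange(seq).unsqueeze(1) - torch.arange(seq).unsqueeze(0)
    mask |= torch.isin(all_off, slash_idx)                    # slash lines
    mask |= torch.eye(seq, dtype=torch.bool)                  # always keep diagonal
    return mask & torch.tril(torch.ones(seq, seq, dtype=torch.bool))


def sparse_attention(q, k, v, mask):
    """Dense reference that just zeroes masked scores; a real kernel would
    skip the masked blocks entirely, which is where the speedup comes from."""
    scores = (q @ k.T / q.shape[-1] ** 0.5).masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```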

Sparse attention methods aim to improve Transformer efficiency by reducing the quadratic complexity of attention. These methods include static sparse patterns (e.g., sliding windows, dilated attention), cluster-based approaches (e.g., hash-based, kNN-based), and dynamic sparse attention. However, they typically require pre-training, limiting their direct applicability to ready-to-use LLMs. Recent approaches extend LLM context windows through staged pre-training, modified position embeddings, and external memory modules but do not reduce high inference costs. Other studies optimize pre-filling and decoding in long-context LLMs yet often involve training from scratch or substantial overhead, making them impractical for existing pre-trained models.
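For contrast with the dynamic pattern above, a static sparse pattern such as a sliding window is fixed in advance and identical for every input, which is why such methods typically have to be baked in during pre-training; a minimal illustration:

```python
# A static sliding-window mask: the sparsity pattern depends only on the
# sequence length and window size, never on the input content.
import torch


def sliding_window_mask(seq_len: int, window: int = 512) -> torch.Tensor:
    """Causal mask where each query attends only to itself and the
    `window - 1` most recent keys."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)
```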

DeepSeek AI

DeepSeek AI Researchers Propose Expert-Specialized Fine-Tuning (ESFT) to Reduce Memory by up to 90% and Time by up to 30%

DeepSeek AI and Northwestern University researchers have introduced a novel method called Expert-Specialized Fine-Tuning (ESFT) tailored for sparse-architecture LLMs, specifically those using a mixture-of-experts (MoE) architecture. This method aims to fine-tune only the most relevant experts for a given task while freezing the other experts and model components. By doing so, ESFT enhances tuning efficiency and maintains the specialization of the experts, which is crucial for optimal performance. The ESFT method capitalizes on the MoE architecture's inherent ability to assign different tasks to experts, ensuring that only the necessary parameters are updated.

In more detail, ESFT involves calculating the affinity scores of experts to task-specific data and selecting a subset of experts with the highest relevance. These selected experts are then fine-tuned while the rest of the model remains unchanged. This selective approach significantly reduces the computational costs associated with fine-tuning. For instance, ESFT can reduce storage requirements by up to 90% and training time by up to 30% compared to full-parameter fine-tuning. This efficiency is achieved without compromising the model's overall performance, as demonstrated by the experimental results.
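A hedged sketch of this selection-and-freezing logic is shown below. The attribute names (`moe_layer.experts`, `moe_layer.router`) are placeholders for whatever the actual MoE implementation exposes, and the affinity score is simplified to the average router gate probability over a sample of task tokens.

```python
# Hedged sketch of ESFT-style selective tuning for a mixture-of-experts layer.
# Attribute names and the exact affinity definition are assumptions.
import torch


@torch.no_grad()
def expert_affinity(router_logits_per_batch):
    """router_logits_per_batch: iterable of [tokens, n_experts] router logits
    collected while running a sample of task data through one MoE layer.
    Returns the mean gate probability per expert (the affinity score)."""
    totals, count = None, 0
    for logits in router_logits_per_batch:
        probs = torch.softmax(logits.float(), dim=-1)
        totals = probs.sum(dim=0) if totals is None else totals + probs.sum(dim=0)
        count += logits.shape[0]
    return totals / count  # [n_experts]


def freeze_all_but_top_experts(moe_layer, affinity, keep_ratio=0.1):
    """Keep gradients only for the most task-relevant experts; everything
    else in the layer (other experts, the router) stays frozen."""
    n_experts = len(moe_layer.experts)          # placeholder attribute name
    k = max(1, int(n_experts * keep_ratio))
    selected = set(affinity.topk(k).indices.tolist())
    for idx, expert in enumerate(moe_layer.experts):
        for p in expert.parameters():
            p.requires_grad = idx in selected
    for p in moe_layer.router.parameters():     # placeholder attribute name
        p.requires_grad = False
    return selected
```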

Also, don't forget to follow us on Twitter and join our 46k+ ML SubReddit, 26k+ AI Newsletter, Telegram Channel, and LinkedIn Group.

If you are interested in a promotional partnership (content/ad/newsletter), please fill out this form.