
Marktechpost Newsletter: Apple DCLM, Llama 3.1 Released, DVC.AI DataChain Released, and many more....


Featured Research

Researchers from Apple, the University of Washington, and several other institutions have introduced DataComp for Language Models (DCLM) to address the lack of standardized, comparable benchmarks for dataset curation. They have recently open-sourced the DCLM models and datasets on the Hugging Face Platform. The open-source release comprises DCLM-7B, DCLM-1B, dclm-7b-it, DCLM-7B-8k, dclm-baseline-1.0, and dclm-baseline-1.0-parquet. This testbed allows controlled experiments with large datasets to improve language models. The DCLM framework includes a comprehensive corpus of 240 trillion tokens from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. This setup provides a standardized approach to dataset curation, enabling consistent and comparable experiments.

DCLM offers a structured workflow for researchers. Participants can choose scales ranging from 412M to 7B parameters and experiment with data curation strategies such as deduplication, filtering, and data mixing. Researchers can train models on curated datasets using a standardized training recipe and specific hyperparameters. The performance of these models is then evaluated on a suite of downstream tasks, providing a clear measure of dataset quality. This systematic approach helps identify the most effective data curation strategies.
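
The workflow can be pictured as a short loop. The sketch below is purely schematic: every function name in it is a hypothetical placeholder, not the DCLM API. It exists only to show the shape of the loop DCLM standardizes, where the training recipe and evaluation suite are held fixed so that only the curation step varies.

```python
# Hypothetical sketch of the DCLM participant workflow: curate a raw pool,
# train with a fixed recipe at a chosen scale, then score on a fixed suite.
# None of these names come from DCLM itself.

SCALES = ["400M", "1B", "7B"]  # illustrative labels for the 412M-7B range

def curate(raw_pool: list[str]) -> list[str]:
    """Hypothetical curation step: deduplicate and quality-filter the pool."""
    deduplicated = list(dict.fromkeys(raw_pool))           # stand-in for dedup
    return [doc for doc in deduplicated if len(doc) > 40]  # stand-in filter

def train(dataset: list[str], scale: str) -> dict:
    """Hypothetical stand-in for the fixed OpenLM training recipe."""
    return {"scale": scale, "num_docs": len(dataset)}

def evaluate(model: dict) -> float:
    """Hypothetical stand-in for the 53-task downstream evaluation suite."""
    return min(1.0, model["num_docs"] / 10.0)

raw_pool = ["a long crawled web page " * 4, "short page", "a long crawled web page " * 4]
for scale in SCALES:
    model = train(curate(raw_pool), scale)
    print(scale, evaluate(model))
```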

Editor’s Picks…

Llama 3.1 Released: Meta’s New Open-Source AI Model that You Can Fine-Tune, Distill, and Deploy Anywhere, Available in 8B, 70B, and 405B Sizes

Meta announced the release of Llama 3.1, the most capable model in the Llama series to date. This latest iteration, particularly the 405B model, represents a substantial advancement in open-source AI capabilities, positioning Meta at the forefront of AI innovation.

The Llama 3.1 405B model stands out for its exceptional flexibility, control, and performance, rivaling even the most advanced closed-source models. It is designed to support various applications, including synthetic data generation and model distillation, thus enabling the community to explore new workflows and innovations. With support for eight languages and an expanded context length of 128K, Llama 3.1 is versatile and robust, catering to diverse use cases such as long-form text summarization and multilingual conversational agents.

Multilingual - English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai

🥇 MMLU - 405B (85.2), 70B (79.3) & 8B (66.7)

Trained on 15 trillion tokens plus 25M synthetically generated outputs.

Used a massive 39.3 million GPU hours (16K H100s for the 405B model).

Excels at code generation tasks, too!

Released Prompt Guard, a BERT-based classifier to detect jailbreaks, malicious code, and similar threats.

Llama Guard 8B with 128K context for moderating prompts and responses across a range of topics.
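
For readers who want to try the release, here is a minimal inference sketch for the 8B instruct variant using Hugging Face transformers. It assumes transformers 4.43 or newer (the first version with Llama 3.1 support) and that you have accepted Meta's license for the gated repository and authenticated with `huggingface-cli login`; it shows one common way to run the model, not the only deployment path.

```python
import torch
from transformers import pipeline

# Gated repo: requires accepted license + `huggingface-cli login`.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does a 128K-token context window enable?"},
]
result = generator(messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])
```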

ADVERTISEMENT

[Synthetic Data Webinar] Learn how Gretel’s synthetic data platform, powered by generative AI, makes data generation easier than ever before.

During this webinar, you will see live demos of the Gretel platform and learn about the latest product additions:

🐝 Gretel Navigator: Our new agent-based, compound AI system tailor-made for tabular data generation

🐝 Gretel Open Datasets: We’ve released a few open source datasets including the world’s largest text-to-SQL dataset

🐝 Navigator Fine Tuning: Fine-tune a specialized language model on your unique, domain-specific data

🐝 Transform v2: Apply flexible de-identification and rule-based transformations to real and synthetic datasets

and many more…

DVC.ai Released DataChain: A Groundbreaking Open-Source Python Library for Large-Scale Unstructured Data Processing and Curation

DVC.ai has announced the release of DataChain, a revolutionary open-source Python library designed to handle and curate unstructured data at an unprecedented scale. By incorporating advanced AI and machine learning capabilities, DataChain aims to streamline the data processing workflow, making it invaluable for data scientists and developers.

Key Features of DataChain:

AI-Driven Data Curation: DataChain utilizes local machine learning models and large language model (LLM) API calls to enrich datasets. This combination ensures the processed data is structured and enhanced with meaningful annotations, adding significant value for subsequent analysis and applications.

GenAI Dataset Scale: The library is built to handle tens of millions of files or snippets, making it ideal for extensive data projects. This scalability is crucial for enterprises and researchers who manage large datasets, enabling them to process and analyze data efficiently.

Python-Friendly: DataChain employs strictly typed Pydantic objects instead of JSON, providing a more intuitive and seamless experience for Python developers. This approach integrates well with the existing Python ecosystem, allowing for smoother development and implementation.
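
To make the Pydantic-centric, chain-style API concrete, here is a minimal sketch modeled on the patterns in DataChain's announcement-era examples. Exact class and method names may have changed since release, so verify against the current documentation; the Caption model and caption function are invented for the example.

```python
from datachain import Column, DataChain, DataModel

class Caption(DataModel):  # strictly typed result schema (Pydantic-based)
    text: str
    num_words: int

def caption(file) -> Caption:
    # Stand-in enrichment step; in practice this is where a local ML model
    # or an LLM API call would annotate each file.
    name = file.path.rsplit("/", 1)[-1]
    text = f"photo of {name}"
    return Caption(text=text, num_words=len(text.split()))

chain = (
    DataChain.from_storage("gs://datachain-demo/dogs-and-cats/", type="image")
    .filter(Column("file.path").glob("*.jpg"))
    .map(caption=caption)    # new typed column derived per file
    .save("captioned-pets")  # persist the curated dataset
)
```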

Microsoft Research Introduces E5-V: A Universal AI Framework for Multimodal Embeddings with Single-Modality Training on Text Pairs

Researchers from Beihang University and Microsoft Corporation introduced the E5-V framework, designed to adapt MLLMs for universal multimodal embeddings. This innovative approach leverages single-modality training on text pairs, significantly reducing training costs and eliminating the need for multimodal data collection. By focusing on text pairs, the E5-V framework demonstrates substantial improvements in representing multimodal inputs compared to traditional methods, offering a promising alternative for future developments in the field.

The E5-V framework employs a novel prompt-based representation method to unify multimodal embeddings into a single space. During training, the model uses text pairs exclusively, simplifying the process and cutting the costs associated with collecting multimodal data. The key innovation lies in instructing MLLMs to represent multimodal inputs as words, effectively removing the modality gap. This method allows the model to handle demanding tasks such as composed image retrieval with high accuracy. By unifying different embeddings into the same space based on their meanings, the E5-V framework enhances the robustness and versatility of multimodal representations.
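
The core trick is easy to sketch: prompt an MLLM to compress any input, text or image, "in one word", then read off the last token's hidden state as the embedding. The snippet below illustrates this idea using LLaVA-1.5 via Hugging Face transformers as a stand-in MLLM; E5-V's exact prompts, base model, and pooling details may differ, so treat this as illustrative only.

```python
from typing import Optional

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in MLLM, not the E5-V release
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

def embed(prompt: str, image: Optional[Image.Image] = None) -> torch.Tensor:
    """Return the hidden state of the final token as the embedding."""
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # The "in one word" prompt squeezes the input's meaning into the next-token
    # position, so the last token's hidden state acts as a modality-free embedding.
    return out.hidden_states[-1][:, -1, :].squeeze(0)

# Parallel prompts map text and images into the same space:
text_emb = embed("USER: Summarize 'a dog running on the beach' in one word: ASSISTANT:")
image_emb = embed(
    "USER: <image>\nSummarize the above image in one word: ASSISTANT:",
    image=Image.open("dog.jpg"),
)
score = torch.cosine_similarity(text_emb.float(), image_emb.float(), dim=0)
```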

Upcoming AI Webinars (July 22-31, 2024)

Here is a list of upcoming AI webinars (July 22-31, 2024) from various AI and data companies.