AI News: Microsoft's new Multimodal Large Language Model, Kosmos-1, Google's Grounded Decoding research, SpikeGPT, UC Berkeley's proposed D5 task, as well as Plenoxels and DepthGen

This newsletter brings AI research news that is much more technical than most resources, but still digestible and applicable.

Welcome to this edition of our AI newsletter, where we showcase some of the most exciting recent advancements in the field. We'll be covering Microsoft's new Multimodal Large Language Model, Kosmos-1, Google's Grounded Decoding research, SpikeGPT, UC Berkeley's proposed D5 task, as well as Plenoxels and DepthGen, two innovative developments in the world of AI.

Microsoft introduces Kosmos-1: Kosmos-1 is a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). The model was trained from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and plain text data. It performs strongly on language understanding, OCR-free NLP, perception-language tasks, visual question answering, and more.
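
As a rough illustration of what "arbitrarily interleaved text and images" means in practice, here is a toy sketch that flattens alternating text and image segments into a single token stream with image boundary markers. The marker names, placeholder embedding slots, and Image stand-in are all hypothetical, not Kosmos-1's actual tokenizer or vision encoder.

```python
# Toy sketch of an interleaved image-text training sequence.
# All special tokens and the Image class are illustrative placeholders.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Image:
    """Stand-in for a raw image; a real pipeline would hold pixel data."""
    path: str

IMG_START, IMG_END = "<image>", "</image>"

def build_interleaved_sequence(segments: List[Union[str, Image]]) -> List[str]:
    """Flatten alternating text and image segments into one token stream,
    wrapping each image's placeholder embedding slots in boundary tokens."""
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, str):
            tokens.extend(seg.split())  # stand-in for subword tokenization
        else:
            tokens.append(IMG_START)
            # a vision encoder would emit a fixed number of embedding slots here
            tokens.extend(f"<img_slot_{i}>" for i in range(4))
            tokens.append(IMG_END)
    return tokens

print(build_interleaved_sequence(["A photo of", Image("cat.jpg"), "sitting on a mat."]))
```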

Grounded Decoding (GD): Language models have gathered enormous world knowledge by learning human language, but can they ever speak "robot language"? Grounded Decoding is a scalable way to decode grounded text from an LLM for robots. The prior work SayCan grounds an LLM for robots using affordances, but instead of speaking with its full vocabulary, the LLM only ranks a pre-set list of skills; imagine scratching your head through 700+ choices, or O(billions) of natural-language choices. How can we do better? To let the LLM speak a "robot language" that both scales and stays grounded, Google researchers look at its most basic functional unit: tokens. The proposed formulation decodes tokens that are likely under both the LLM and grounded models (GMs): the GMs reward tokens that respect the robot's embodiment, while the LLM provides world knowledge and coherent behavior. Using affordances as the GM, the LLM can generate plans for "separating vowels from other letters" without being prompted with a list of the objects present. A token-level sketch of the idea follows.
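
Below is a minimal sketch of the token-level combination that Grounded Decoding describes: at each step, choose the token that maximizes the sum of the language model's log-probability and a grounding model's log-probability. Both model stubs are hypothetical toys, not the paper's actual LLM or learned grounding functions.

```python
# Minimal sketch of token-level grounded decoding with toy model stubs.
import math

def lm_logprobs(prefix: str) -> dict:
    """Toy LM: fixed next-token distribution (placeholder for a real LLM)."""
    return {"fly": math.log(0.5), "pick": math.log(0.4), "place": math.log(0.1)}

def grounded_logprobs(prefix: str) -> dict:
    """Toy grounding model: scores tokens by physical feasibility for the robot."""
    return {"pick": math.log(0.9), "place": math.log(0.09), "fly": math.log(0.01)}

def grounded_decode_step(prefix: str) -> str:
    """Choose argmax of log p_LM(token | prefix) + log p_GM(token | prefix)."""
    lm, gm = lm_logprobs(prefix), grounded_logprobs(prefix)
    return max(lm, key=lambda tok: lm[tok] + gm.get(tok, float("-inf")))

# "fly" is most likely under the LM alone, but the grounding term vetoes it.
print(grounded_decode_step("To tidy the table, the robot should "))  # -> "pick"
```

Because the combination happens per token, the approach scales to the LLM's full vocabulary rather than a fixed skill list.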

SpikeGPT: SpikeGPT is a generative language model built on spiking neural networks (SNNs) for energy-efficient language generation, achieving performance comparable to conventional ANNs while consuming 5x less energy. The research team trained three model variants with 45M, 125M, and 260M parameters; according to the team, this is 4x larger than any functional backprop-trained SNN to date.
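
For context on the building block, here is a generic leaky integrate-and-fire (LIF) neuron, the textbook spiking unit underlying SNN language models like SpikeGPT. The decay factor and threshold are arbitrary illustrative values; this is not SpikeGPT's actual neuron implementation.

```python
# A minimal leaky integrate-and-fire (LIF) neuron over a sequence of inputs.
import numpy as np

def lif_neuron(inputs: np.ndarray, beta: float = 0.9, threshold: float = 1.0) -> np.ndarray:
    """Run one neuron over a sequence of input currents, emitting binary spikes."""
    membrane = 0.0
    spikes = np.zeros_like(inputs)
    for t, current in enumerate(inputs):
        membrane = beta * membrane + current  # leaky integration of input current
        if membrane >= threshold:             # fire a spike and reset by subtraction
            spikes[t] = 1.0
            membrane -= threshold
    return spikes

print(lif_neuron(np.array([0.3, 0.4, 0.5, 0.1, 0.9])))  # [0. 0. 1. 0. 1.]
```

The energy savings come from communicating with sparse binary spikes instead of dense floating-point activations.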

D5 task: Although AI is still far from conducting research on its own, a group of researchers from UC Berkeley proposed the D5 task: the goal-driven discovery of distributional differences via language descriptions. Each input problem consists of a research goal and a pair of large corpora; the output is a natural-language predicate (the discovery) that describes corpus-level differences. The discovery must be (1) valid: it describes a true difference, i.e., the predicate is indeed more often true for corpus A than for corpus B; and (2) meaningful: it is driven by the research goal and thus relevant, novel, and significant. To benchmark D5 systems, the researchers constructed a new dataset, OpenD5, aggregating 675 open-ended problems across the social sciences, humanities, business, health, and machine learning; they surveyed papers and courses related to text analysis and scraped the data over nine months. A toy check of the validity criterion appears below.
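
As a toy illustration of the validity criterion, the snippet below measures how much more often a predicate holds on corpus A than on corpus B. The keyword predicate and tiny corpora are invented for the example; a real D5 system validates natural-language predicates with a language model.

```python
# Toy check of D5 validity: P(predicate | A) - P(predicate | B).
from typing import Callable, List

def validity_gap(predicate: Callable[[str], bool],
                 corpus_a: List[str], corpus_b: List[str]) -> float:
    """Return the rate gap; a positive value means the predicate favors corpus A."""
    rate = lambda corpus: sum(predicate(x) for x in corpus) / len(corpus)
    return rate(corpus_a) - rate(corpus_b)

corpus_a = ["the patient reports severe headaches", "migraine symptoms worsened"]
corpus_b = ["routine checkup, no complaints", "patient reports mild fatigue"]
predicate = lambda text: "headache" in text or "migraine" in text  # invented example

print(validity_gap(predicate, corpus_a, corpus_b))  # 1.0: valid for A over B
```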

🚨 A Study on Various Deep Learning-based Weather Forecasting Models: The past few years have seen the development of deep learning-based weather forecasting models such as MetNet-2, WF-UNet, ClimaX, GraphCast, Pangu-Weather, and more. This article briefly discusses these models and how they are rapidly outperforming traditional meteorological simulators by large margins.

1. Pangu-Weather For Global Weather Forecasting

2. A Multi-Resolution Deep Learning Framework

3. Real-time Bias Correction of Wind Field Forecasts

4. Predicting Wind Farm Power And Downstream Wakes Using Weather Patterns

5. GraphCast: Providing Efficient Medium-Range Global Weather Forecasting

6. WeatherFusionNet For Predicting Precipitation from Satellite Data

7. WF-UNet: Weather Fusion UNet for Precipitation Nowcasting

8. ClimaX: Foundation Model For Weather & Climate

Plenoxels: One of the main limitations of Neural Radiance Fields (NeRFs) is that training them requires many images and a lot of time (several days on a single GPU). Recent research shows this can be made much faster by eliminating deep learning altogether: researchers introduce Plenoxels, a NeRF-inspired scene representation that models scenes with fidelity similar to NeRFs without using neural networks, speeding up optimization by roughly 100x (only about 11 minutes per scene).
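
To make the "no neural nets" point concrete, here is a minimal sketch of the kind of grid lookup at the heart of a voxel representation: scene quantities are stored directly in grid cells and read out by trilinear interpolation, so gradients flow straight into voxel values instead of network weights. The tiny dense grid and two channels are illustrative; Plenoxels itself uses a sparse grid holding density plus spherical-harmonic color coefficients.

```python
# Sketch of trilinear interpolation from a voxel grid of scene values.
import numpy as np

def trilinear_sample(grid: np.ndarray, point: np.ndarray) -> np.ndarray:
    """Interpolate grid values (shape [X, Y, Z, C]) at a continuous 3D point."""
    lo = np.floor(point).astype(int)   # corner voxel indices
    frac = point - lo                  # fractional position inside the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # weight of each of the 8 surrounding voxels
                w = ((frac[0] if dx else 1 - frac[0])
                     * (frac[1] if dy else 1 - frac[1])
                     * (frac[2] if dz else 1 - frac[2]))
                out += w * grid[lo[0] + dx, lo[1] + dy, lo[2] + dz]
    return out

grid = np.random.rand(4, 4, 4, 2)  # 2 toy channels, e.g. density + one color coeff
print(trilinear_sample(grid, np.array([1.5, 2.25, 0.75])))
```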

DepthGen: DepthGen is a diffusion model for monocular depth estimation. It achieves state-of-the-art performance on the NYU depth dataset, enables zero-shot depth completion and text-to-3D generation, and captures uncertainty through multimodal depth predictions.
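
As a rough illustration of the underlying mechanism, the snippet below implements one standard DDPM reverse (denoising) step, the kind of update a diffusion model like DepthGen iterates to produce a depth map from noise. The stub noise predictor and schedule are placeholders, not DepthGen's actual RGB-conditioned architecture.

```python
# Generic DDPM reverse step applied to a toy "depth map".
import numpy as np

def ddpm_reverse_step(x_t: np.ndarray, t: int, eps_model, alphas: np.ndarray) -> np.ndarray:
    """Estimate noise with eps_model, then step x_t toward x_{t-1}."""
    alpha_t = alphas[t]
    alpha_bar_t = np.prod(alphas[: t + 1])
    eps = eps_model(x_t, t)  # in DepthGen this would be conditioned on the RGB image
    mean = (x_t - (1 - alpha_t) / np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_t)
    noise = np.random.randn(*x_t.shape) if t > 0 else 0.0  # no noise at the last step
    return mean + np.sqrt(1 - alpha_t) * noise

# Toy usage: a 4x4 map denoised over 10 steps with a stub noise predictor.
alphas = np.linspace(0.99, 0.95, 10)
x = np.random.randn(4, 4)
for t in reversed(range(10)):
    x = ddpm_reverse_step(x, t, lambda x_t, t: np.zeros_like(x_t), alphas)
```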
