What is Google DeepMind’s Michelangelo and why is it important?
Major News from DeepMind's Michelangelo, Walmart's Wallaby, Pyramid Flow, Writer's Palmyra X 004, ApertureData and Scope3

DeepMind’s Michelangelo: A New Benchmark for Long-Context Language Models

Google DeepMind has introduced Michelangelo, a novel benchmark for evaluating long-context reasoning capabilities of large language models (LLMs). While current LLMs excel at retrieving information from extensive contexts, they struggle with tasks requiring reasoning over data structures. Michelangelo addresses this gap by focusing on three core tasks: Latent List, Multi-round Co-reference Resolution, and “I Don’t Know” scenarios. These tasks assess a model’s ability to understand relationships within large context windows, rather than simply retrieving isolated facts. The benchmark reveals that even frontier models with very long context windows have significant room for improvement in reasoning over large amounts of information.
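To make the idea of reasoning over a latent structure concrete, here is a toy sketch in the spirit of the Latent List task. It is not the benchmark's actual format: the choice of operations, the distractor lines, and the helper names make_latent_list_task and replay are all assumptions made for illustration. The point is that the answer is never stated anywhere in the context; it only exists as the result of applying every relevant operation in order.

```python
import random


def make_latent_list_task(num_ops, seed=0):
    """Build a toy task: a sequence of statements plus the list they produce.

    Hypothetical format, assumed for illustration only.
    """
    rng = random.Random(seed)
    state = []        # the "latent" list the model would have to track
    statements = []   # the long context the model would actually see
    for _ in range(num_ops):
        op = rng.choice(["append", "pop", "noise"])
        if op == "append":
            value = rng.randint(0, 99)
            state.append(value)
            statements.append(f"a.append({value})")
        elif op == "pop" and state:
            state.pop()
            statements.append("a.pop()")
        else:
            # Distractor line that does not change the list.
            statements.append(f"print({rng.randint(0, 99)})")
    return statements, state


def replay(statements):
    """Recover the final list by applying only the relevant statements."""
    a = []
    for line in statements:
        if line.startswith("a.append("):
            a.append(int(line[len("a.append("):-1]))
        elif line == "a.pop()":
            a.pop()
    return a


statements, expected = make_latent_list_task(200, seed=1)
assert replay(statements) == expected
```

A plain retrieval strategy cannot solve this kind of task, because no single line in the context contains the final list.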

Walmart Develops Wallaby: A Retail-Focused AI Language Model

Walmart is testing Wallaby, a suite of retail-focused large language models (LLMs) trained on decades of company data. The models are designed to understand Walmart’s employee and customer communication styles and to align with the company’s customer service values. While not yet deployed, Wallaby is undergoing extensive internal testing, particularly with Walmart associates. The retail giant plans to use a mix of AI models, including Wallaby and third-party options, for various applications. Walmart’s multi-layered AI approach includes the Element platform, which manages different models and directs them to specific uses. The company has already implemented AI in areas including customer support, inventory management, and personalized recommendations, with plans to expand its AI integration further.

Pyramid Flow: Open-Source AI Video Generator Challenges Proprietary Models

Researchers from Peking University, Beijing University of Posts and Telecommunications, and Kuaishou Technology have launched Pyramid Flow, a new open-source AI video generator. This model can create high-quality video clips up to 10 seconds long using a novel technique called pyramidal flow matching. Pyramid Flow generates videos in stages, mostly at low resolution, producing a full-res version only at the end. This approach significantly reduces computational costs while maintaining visual quality. The model is freely available for download and use, even for commercial purposes, potentially competing with paid services like Runway’s Gen-3 Alpha and Luma’s Dream Machine. While Pyramid Flow shows promise, it currently lacks some advanced features offered by proprietary models.
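A rough back-of-the-envelope sketch shows why spending most generation steps at low resolution can save compute. Everything here is an assumption made for illustration: the resolutions, the step counts, and the use of "pixels processed per step" as a cost proxy are hypothetical and are not taken from the Pyramid Flow paper.

```python
def pixel_steps(schedule):
    """Sum width * height * steps over every (width, height, steps) stage."""
    return sum(width * height * steps for width, height, steps in schedule)


# Hypothetical baseline: all 30 steps run at full resolution.
full_res = [(1280, 768, 30)]

# Hypothetical pyramid: the same 30 steps, split across three resolutions,
# with only the last stage running at full resolution.
pyramid = [(320, 192, 10), (640, 384, 10), (1280, 768, 10)]

ratio = pixel_steps(pyramid) / pixel_steps(full_res)
print(f"pyramid cost relative to full resolution: {ratio:.4f}")
```

Under these made-up numbers the staged schedule touches less than half as many pixels as the full-resolution one. Real savings depend on the model architecture and are not implied by this proxy.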

Writer’s Palmyra X 004: A Leap Forward in AI Function Calling for Enterprises

Writer has unveiled Palmyra X 004, a new large language model (LLM) that excels in function calling and workflow execution. This model outperforms offerings from major tech companies on the Berkeley Function Calling Leaderboard by nearly 20%, achieving a score of 78.76%. Palmyra X 004 has a 128,000-token context window, supports 30+ languages, and can handle multimodal inputs. Despite having only around 150 billion parameters, it ranks in the top 10 on Stanford’s HELM benchmark. Writer attributes this efficiency to innovative training techniques and synthetic data use.

The model offers various deployment options, including on-premises hosting, addressing enterprise data privacy concerns. This release signifies a shift towards AI systems capable of executing complex business workflows, potentially transforming enterprise applications in the near future.
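For readers unfamiliar with function calling, the general pattern can be sketched as follows: the model emits a structured request naming a function and its arguments, and the application looks that function up and runs it. This is a generic illustration, not Writer's API; the tool get_order_status, the JSON shape, and the dispatch helper are all hypothetical.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical business function the model is allowed to call."""
    return {"order_id": order_id, "status": "shipped"}


# Registry of functions exposed to the model (assumed for illustration).
TOOLS = {"get_order_status": get_order_status}


def dispatch(tool_call_json: str) -> dict:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    func = TOOLS.get(call["name"])
    if func is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return func(**call["arguments"])


# Stand-in for what a model might emit; no real model is called here.
model_output = '{"name": "get_order_status", "arguments": {"order_id": "A123"}}'
result = dispatch(model_output)
print(result)
```

Function calling benchmarks broadly measure how reliably a model picks the right function and fills in valid arguments, which is the step the hard-coded model_output string stands in for here.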

ApertureData Revolutionizes Multimodal Data Management for AI Applications

ApertureData, a California-based startup, has introduced ApertureDB, a unified data layer that combines graph and vector databases with multimodal data management. The product aims to streamline the handling of diverse data types for AI applications, potentially reducing data infrastructure and preparation times by several months. The company recently secured $8.25 million in seed funding and launched a cloud-native version of its graph-vector database. ApertureDB centralizes various datasets, including images, videos, and documents, offering efficient retrieval and query handling. ApertureData claims that this approach to multimodal data management increases productivity for data science and AI teams tenfold on average, addressing a critical challenge in the AI industry.

Scope3 Expands to Track AI’s Carbon Footprint

Scope3, founded by Brian O’Kelley, is expanding its focus from tracking carbon emissions in digital advertising to measuring the environmental impact of AI. The company, which initially aimed to reduce waste and carbon footprint in digital ads, has secured new funding to venture into the AI sector. Scope3’s approach involves gathering data and building models to identify inefficiencies and their associated carbon emissions. By addressing these issues, the company aims to help clients reduce both economic waste and environmental impact. This expansion comes as AI increasingly intersects with media and advertising, presenting new challenges and opportunities for sustainability in the tech industry. Scope3’s approach could reshape how businesses view and manage the environmental costs of AI implementation.

Frequently asked questions

Michelangelo is a new benchmark developed by Google DeepMind to evaluate how well large language models (LLMs) can reason with long-context information. It focuses on three key tasks: Latent List, Multi-round Co-reference Resolution, and “I Don’t Know” scenarios. The benchmark is significant because it tests LLMs’ ability to understand relationships within large context windows rather than just retrieving isolated facts, revealing that even advanced models still have considerable room for improvement in this area.
Wallaby is Walmart’s proprietary suite of retail-focused large language models trained on decades of company data. It’s designed to understand Walmart’s specific employee and customer communication styles while aligning with the company’s service values. Currently in internal testing, Wallaby will be part of a broader AI strategy that includes the Element platform and third-party models, aimed at enhancing customer support, inventory management, and personalized recommendations.
Pyramid Flow is an open-source AI video generator that uses a unique pyramidal flow matching technique to create high-quality video clips up to 10 seconds long. Unlike proprietary models, it generates videos primarily at low resolution before producing a final high-res version, significantly reducing computational costs. It’s freely available for commercial use, making it a competitive alternative to paid services like Runway’s Gen-3 Alpha and Luma’s Dream Machine.
Palmyra X 004 distinguishes itself through strong function calling capabilities, outperforming offerings from major tech companies by nearly 20% on the Berkeley Function Calling Leaderboard. It features a 128,000-token context window, supports 30+ languages, and handles multimodal inputs. Despite having only around 150 billion parameters, it ranks in the top 10 on Stanford’s HELM benchmark, which Writer attributes to innovative training techniques and synthetic data use.
ApertureDB combines graph and vector databases with multimodal data management in a unified data layer. It centralizes various data types, including images, videos, and documents, making them easily accessible for AI applications. ApertureData claims the product can reduce data infrastructure and preparation times by months while increasing the productivity of data science and AI teams tenfold on average.
Scope3 is expanding from digital advertising to measure and track AI’s carbon footprint. The company uses specialized data collection and modeling to identify inefficiencies and their associated carbon emissions in AI operations. This initiative aims to help businesses reduce both economic waste and environmental impact as AI becomes increasingly integrated with media and advertising.
According to the Michelangelo benchmark findings, current LLMs struggle with tasks requiring complex reasoning over large data structures, even when they have very long context windows. While they excel at retrieving specific information from extensive contexts, they have difficulty understanding and maintaining relationships between different pieces of information over long passages, particularly in tasks involving list manipulation and multi-round reasoning.
Gor Gasparyan

Optimizing digital experiences for growth-stage & enterprise brands through research-driven design, automation, and AI