xAI Announces High-Performance AI Model 'Grok 4' - Achieves Industry-Leading Results Across Multiple Benchmarks

On July 9, 2025, xAI announced its latest AI models, “Grok 4” and the high-performance variant “Grok 4 Heavy”¹. The announcement highlights scaled reinforcement learning, native tool use capabilities, and industry-leading achievements across multiple academic benchmarks.

Particularly impressive is Grok 4 Heavy’s achievement of 44.4% on the “Humanity’s Last Exam” benchmark when using tools¹. Additionally, it scored 15.9% on the ARC-AGI V2 benchmark, nearly double Claude Opus 4’s score of approximately 8.6%¹.

Performance Improvements Through Scaled Reinforcement Learning

xAI leveraged its “Colossus” cluster consisting of 200,000 GPUs to conduct reinforcement learning training at an unprecedented scale¹. This enabled successful refinement of Grok 4’s reasoning capabilities at pretraining scale.

Technical Improvements

The training process achieved a 6x improvement in computational efficiency through innovations across the entire stack¹. The company also conducted a massive data collection effort, significantly expanding verifiable training data from primarily math and coding domains to many other areas.

Integration of Native Tool Use

Grok 4 has been trained with reinforcement learning to use tools such as code interpreters and web browsing¹. This allows the model to augment its thinking with tools in situations that typically challenge large language models.

Advanced Search Capabilities Within X

As a notable feature, Grok 4 can use powerful tools to search deeply within X (formerly Twitter)¹. It employs advanced keyword and semantic search tools and can even view media to improve response quality.

Grok 4 Heavy and Parallel Test-Time Compute

xAI has made further progress on parallel test-time compute, which allows consideration of multiple hypotheses simultaneously¹. Grok 4 Heavy, leveraging this technology, establishes new standards for performance and reliability.

The system employs multiple agents working on problems simultaneously, then comparing their work “like a study group” to find the best answer².

Benchmark Results and Industry Positioning

According to Artificial Analysis, Grok 4 achieves an Intelligence Index of 73, surpassing OpenAI o3’s 70 and Google Gemini 2.5 Pro’s 70². Key benchmark results include:

Humanity’s Last Exam: Grok 4 Heavy achieves 44.4% (with tools), significantly outperforming Gemini 2.5 Pro’s 26.9%²
ARC-AGI V2: Sets a new record with 15.9%, approximately double the next best commercial AI model¹
USAMO 2025: Tops the leaderboard with 61.9%¹
AIME 2025: Excellent performance in competitive mathematics¹

API and Subscription Plans

Grok 4 is available to developers as an API with a 256k context window and multimodal understanding capabilities¹. The API ensures enterprise-grade security and compliance, including SOC 2 Type 2, GDPR, and CCPA certifications.

For subscription plans, a new $300/month “SuperGrok Heavy” plan has been introduced, providing early access to Grok 4 Heavy and priority access to new features². Regular Grok 4 is available to SuperGrok and Premium+ subscribers, as well as through the xAI API¹.

Future Roadmap

xAI has announced plans to continue scaling reinforcement learning to unprecedented levels, building on Grok 4’s advancements¹. The company plans to expand from verifiable rewards in controlled domains to tackling complex real-world problems where models can learn and adapt in dynamic environments.

The company has also revealed plans to release an AI coding model in August 2025, a multimodal agent in September, and a video generation model in October².

For those interested in learning more about xAI’s Grok 4 technology, the official announcement page provides detailed benchmark results and demonstration videos. For developers, the xAI API documentation offers integration methods and usage examples.