Together AI Achieves Breakthrough Inference Speed with NVIDIA's Blackwell GPUs
Together AI has announced a significant advancement in AI performance by offering the fastest inference for the DeepSeek-R1-0528 model, utilizing an inference engine designed for the NVIDIA HGX B200 platform. This development positions Together AI as a leading platform for running open-source reasoning models at scale, according to together.ai.
NVIDIA Blackwell Integration
Earlier this year, Together AI invited select customers, including major corporations like Zoom and Salesforce, to test NVIDIA Blackwell GPUs on its GPU Clusters. The results have led to a broader rollout of NVIDIA Blackwell support, unlocking enhanced performance for AI applications. As of July 17, 2025, the company claims to have achieved the fastest serverless inference performance for DeepSeek-R1 using this technology.
Technological Advancements
The new inference stack is optimized at every layer, incorporating bespoke GPU kernels and a proprietary inference engine. These innovations aim to boost speed and efficiency without compromising model quality, combining state-of-the-art speculative decoding methods with advanced model optimization techniques.
Performance Metrics
Together AI's inference stack achieves up to 334 tokens per second, outperforming previous benchmarks. This performance is enabled by NVIDIA's fifth-generation Tensor Cores and the ThunderKittens framework, which Together AI uses to develop optimized GPU kernels.
Speculative Decoding and Quantization
Speculative decoding significantly accelerates large language models by using a smaller, faster speculator model to draft multiple tokens ahead, which the larger target model then verifies in a single pass, so output quality is preserved. Together AI's Turbo Speculator outperforms existing speculators by maintaining high target-speculator alignment across a wide range of workloads. Additionally, Together AI has pioneered a lossless quantization technique that maintains model accuracy while reducing computational overhead.
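The core idea of speculative decoding can be illustrated with a minimal sketch. The toy "target" and "draft" functions below stand in for the large model and the speculator (they are invented for illustration and have nothing to do with Together AI's actual models or Turbo Speculator): the draft proposes several tokens cheaply, the target checks them all at once, and the accepted prefix lets the decoder emit multiple tokens per expensive target call.

```python
# Minimal sketch of speculative decoding with toy deterministic models.
# "target_next" plays the large model; "draft_next" plays the cheap
# speculator. Names and the greedy accept/correct rule are illustrative
# assumptions, not Together AI's implementation.

def target_next(seq):
    # Toy "large" model: next token is the sum of the last two, mod 10.
    return (seq[-1] + seq[-2]) % 10

def draft_next(seq):
    # Toy speculator: usually agrees with the target, but drifts when
    # the last token is large (simulating imperfect alignment).
    guess = (seq[-1] + seq[-2]) % 10
    return (guess + 1) % 10 if seq[-1] >= 8 else guess

def speculative_decode(seq, steps, k=4):
    seq = list(seq)
    target_calls = 0
    while steps > 0:
        # 1. Draft model speculates up to k tokens ahead (cheap).
        draft = list(seq)
        for _ in range(min(k, steps)):
            draft.append(draft_next(draft))
        proposed = draft[len(seq):]

        # 2. Target verifies every proposal in one pass (one "call");
        #    accept the matching prefix, correct the first mismatch.
        target_calls += 1
        accepted, ctx = [], list(seq)
        for tok in proposed:
            correct = target_next(ctx)
            if tok == correct:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(correct)  # target's correction
                break                     # discard the rest of the draft
        seq.extend(accepted)
        steps -= len(accepted)
    return seq, target_calls

out, calls = speculative_decode([1, 1], steps=12)
print(out, calls)  # 12 tokens generated with 5 target calls instead of 12
```

Because the target's verification pass recomputes the exact tokens it would have produced on its own, the output matches plain decoding token for token; the speedup comes entirely from how often the speculator's guesses are accepted.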
Real-World Application
The enhancements are designed to support a range of AI workloads, offering flexible infrastructure options for both inference and training. Dedicated Endpoints provide additional optimization, delivering substantial speed improvements while maintaining quality and performance standards.
As the AI landscape continues to evolve, Together AI's collaboration with NVIDIA and its innovative approach to inference engine development positions it as a formidable player in the race for AI supremacy.