Business leaders know they need to put AI front and center. But that’s easier said than done. AI can be complex, expensive, and risky. And both the technology and the ecosystem are evolving rapidly.
First, there is a clear shift away from a one-size-fits-all approach. Predictive AI/ML, generative AI, and now agentic AI are each being adapted for specific industries and applications. As purpose-built AI models proliferate, the AI landscape is becoming increasingly diverse.
It’s now clear that AI applications require tailored infrastructure: not only optimized for performance, cost, and energy efficiency, but also able to keep pace with the rapidly evolving needs of AI models, applications, and agents. A perfect example is Model Context Protocol (MCP), a powerful innovation that didn’t even exist just a few months ago.
As organizations race to take advantage of generative AI and, increasingly, AI agents, some are building their own dedicated data centers. Others are turning to specialized providers deploying cloud-scale infrastructures tailored to support multiple large language models (LLMs). Often called AI factories or Neoclouds, these platforms feature massive investments in accelerated computing, networking, and storage, all purpose-built to meet the intense performance and scale demands of AI workloads.
Building sovereign, scalable AI and LLM inference infrastructure requires tackling four key challenges:
At F5, we are collaborating with NVIDIA to help ensure AI factories and cloud-scale AI infrastructure rise to the demands of modern AI. Today, at NVIDIA GTC Paris 2025, we’re unveiling the next level of innovation with new capabilities for F5 BIG-IP Next for Kubernetes deployed on NVIDIA BlueField-3 DPUs. This builds on the enhanced performance, multi-tenancy, and security that we introduced at GTC San Jose 2025. Part of the F5 Application Delivery and Security Platform, F5 BIG-IP Next for Kubernetes runs natively on NVIDIA BlueField-3 DPUs, powerful, programmable processors purpose-built for data movement and processing.
By offloading tasks like network processing, storage management, and security operations (e.g., encryption and traffic monitoring), DPUs free up valuable CPU cycles and GPU resources to focus on AI training and inference. This reduces bottlenecks, boosts performance, and lowers latency, helping AI factories operate faster and more efficiently while delivering more tokens.
Located on network interface cards, DPUs manage data flow across servers and between external customers/users/agents and the AI factory, orchestrating networking and security at scale. F5 BIG-IP Next for Kubernetes deployed on NVIDIA BlueField-3 DPUs became generally available in April.
LLMs have advanced rapidly in recent months and now span a wide range of sizes, costs, and domain-specific expertise. Choosing the right model for each prompt not only ensures better responses and regulatory compliance but also optimizes resource consumption, cost, and latency.
With today’s integration of NVIDIA NIM microservices, organizations can now intelligently route each AI prompt request to the LLM best suited to the task. For example, lightweight, energy-efficient models can handle simple requests, while complex or specialized prompts are directed to larger or domain-specific models.
This approach allows AI factories to use computing resources more efficiently, reducing inference costs by up to 60%. It’s a win-win for model providers and model users alike: better responses, delivered faster and at lower cost.
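As an illustration, the sketch below shows the kind of task-aware routing logic described above, written in Python. The model tiers, complexity heuristic, and thresholds are hypothetical placeholders for this example, not part of the F5 or NVIDIA NIM implementation.

```python
# Minimal sketch of task-aware LLM routing (illustrative only).
# Model names, the complexity heuristic, and thresholds are hypothetical;
# in practice the routing decision is made by the delivery layer in front of NIM.

from dataclasses import dataclass

@dataclass
class Route:
    model: str           # target LLM endpoint
    max_complexity: int  # route prompts scoring at or below this value here

# Ordered from cheapest to most capable (hypothetical model tiers).
ROUTES = [
    Route(model="small-8b-general", max_complexity=3),
    Route(model="mid-70b-general", max_complexity=7),
    Route(model="large-domain-expert", max_complexity=10),
]

def complexity_score(prompt: str) -> int:
    """Toy heuristic: longer prompts and reasoning/domain keywords score higher."""
    score = min(len(prompt) // 200, 5)
    if any(k in prompt.lower() for k in ("prove", "derive", "legal", "diagnosis")):
        score += 3
    return min(score, 10)

def pick_model(prompt: str) -> str:
    score = complexity_score(prompt)
    for route in ROUTES:
        if score <= route.max_complexity:
            return route.model
    return ROUTES[-1].model

if __name__ == "__main__":
    print(pick_model("What is the capital of France?"))      # lands on the small tier
    print(pick_model("Derive the closed-form solution for")) # lands on a larger tier
```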
In addition to GPUs, NVIDIA continues to innovate at the software level to tackle key challenges in AI inference. NVIDIA Dynamo and KV cache, both included with NVIDIA NIM, are prime examples. NVIDIA Dynamo introduces disaggregated serving for inference, separating context understanding (prefill), which is GPU compute heavy, from response generation (decode), which is memory-bandwidth heavy, across different GPU clusters. This improves GPU utilization and simplifies scaling across data centers by efficiently handling scheduling, routing, and memory management. KV cache optimizes how model context is stored and accessed. By keeping frequently used data in GPU memory and offloading the rest to CPU or storage, it eases memory bottlenecks, allowing larger models or more users to be served without extra hardware.
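To make the KV cache offloading idea concrete, here is a minimal two-tier cache in Python that keeps hot entries in a fixed-size "GPU" tier and spills the rest to a host tier. This is a conceptual toy under assumed names, not the NVIDIA Dynamo implementation.

```python
# Illustrative two-tier KV cache: hot context stays in a bounded "GPU" tier,
# colder entries are offloaded to a "host" tier (CPU/storage). Conceptual only.

from collections import OrderedDict
from typing import Optional

class TieredKVCache:
    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity  # max entries kept in the GPU tier
        self.gpu = OrderedDict()          # hot tier, ordered by recency of use
        self.host = {}                    # cold tier (CPU or storage)

    def put(self, session_id: str, kv_blocks: bytes) -> None:
        self.gpu[session_id] = kv_blocks
        self.gpu.move_to_end(session_id)
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_blocks = self.gpu.popitem(last=False)
            self.host[cold_id] = cold_blocks  # offload least recently used entry

    def get(self, session_id: str) -> Optional[bytes]:
        if session_id in self.gpu:
            self.gpu.move_to_end(session_id)
            return self.gpu[session_id]
        if session_id in self.host:           # promote back to GPU tier on reuse
            self.put(session_id, self.host.pop(session_id))
            return self.gpu[session_id]
        return None                           # cache miss: prefill must run again
```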
A powerful new capability of BIG-IP Next for Kubernetes is its support for KV caching, which speeds up AI inference while reducing time and energy use. Combined with intelligent routing from NVIDIA Dynamo, based on a few explicit metrics such as GPU memory usage and other criteria, this enables significantly lower time to first token (TTFT), higher token generation rates, and ultimately greater prompt throughput. DeepSeek has shown gains of 10x to 30x in capacity.
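Below is a hedged sketch of how a router might combine a few such metrics, such as KV cache reuse, GPU memory utilization, and queue depth, when picking a backend. The metric names and weights are assumptions for illustration, not the actual Dynamo or BIG-IP scoring logic.

```python
# Sketch of KV-cache-aware backend selection using a few explicit metrics.
# Field names and weights are hypothetical stand-ins for real telemetry.

from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    cached_prefix_tokens: int  # tokens of this prompt already held in KV cache
    gpu_mem_util: float        # 0.0 - 1.0
    queue_depth: int           # requests waiting on this backend

def score(b: Backend, prompt_tokens: int) -> float:
    cache_hit = b.cached_prefix_tokens / max(prompt_tokens, 1)  # reuse skips prefill work
    return 2.0 * cache_hit - 1.0 * b.gpu_mem_util - 0.1 * b.queue_depth

def choose_backend(backends: list[Backend], prompt_tokens: int) -> Backend:
    return max(backends, key=lambda b: score(b, prompt_tokens))

backends = [
    Backend("pod-a", cached_prefix_tokens=900, gpu_mem_util=0.82, queue_depth=4),
    Backend("pod-b", cached_prefix_tokens=0,   gpu_mem_util=0.35, queue_depth=1),
]
print(choose_backend(backends, prompt_tokens=1000).name)  # favors the cache hit, lowering TTFT
```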
Customers can use F5 programmability to extend and adapt F5 BIG-IP capabilities to meet their precise and unique needs at very high performance.
For most organizations, and particularly large ones like financial services, telcos, and healthcare companies with complex legacy systems, agentic AI holds strong appeal. Built on LLMs, these AI agents can navigate complex databases, servers, tools, and applications to retrieve precise information, unlocking new levels of efficiency and insight.
Introduced by Anthropic in November 2024, MCP is transforming how AI systems interact with real-world data, tools, and services. Acting as standardized connectors, MCP servers enable AI models to access APIs, databases, and file systems in real time, allowing AI to transcend the limitations of static training data and execute tasks efficiently. As adoption grows, these servers require advanced reverse proxies with load balancing, strong security, authentication, and authorization for data and tools, as well as seamless Kubernetes integration, making MCP a key pillar of sovereign AI infrastructure and a foundation for securing and enabling agentic AI.
Deployed as a reverse proxy in front of MCP servers, BIG-IP Next for Kubernetes on NVIDIA BlueField-3 DPUs can scale and secure them, verifying requests, classifying data, and checking data integrity and privacy, thereby protecting both organizations and LLMs from security threats and data leaks. Meanwhile, F5 programmability makes it straightforward to ensure the AI application complies with the requirements of MCP and other protocols.
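For illustration, the following Python sketch shows the kinds of checks a reverse proxy could apply before forwarding a request to an MCP server: credential presence, tool allow-listing, and a basic scan for leaked secrets. The rules and helper names here are hypothetical, not BIG-IP’s actual policy engine.

```python
# Simplified sketch of policy checks in front of MCP servers (illustrative only):
# authentication, tool allow-listing, and basic data classification.

import re

ALLOWED_TOOLS = {"search_docs", "read_ticket"}   # tools this client may call (hypothetical)
SECRET_PATTERN = re.compile(r"(api[_-]?key|password)\s*[:=]", re.IGNORECASE)

def authorize(request: dict) -> tuple[bool, str]:
    if not request.get("auth_token"):
        return False, "missing credentials"
    if request.get("tool") not in ALLOWED_TOOLS:
        return False, f"tool '{request.get('tool')}' not permitted for this client"
    payload = str(request.get("arguments", ""))
    if SECRET_PATTERN.search(payload):
        return False, "request appears to contain credentials; blocked to prevent leakage"
    return True, "forward to MCP server"

print(authorize({"auth_token": "t", "tool": "search_docs", "arguments": {"q": "status"}}))
print(authorize({"auth_token": "t", "tool": "delete_db", "arguments": {}}))
```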
In recent earnings announcements, some major organizations have begun disclosing the number of tokens generated each quarter, their growth, and the revenue tied to them. This reflects a growing need among our customers: the ability to track, manage, and control token usage like a budget, avoiding the unexpected costs that sometimes occur with public clouds.
That’s why BIG-IP Next for Kubernetes now includes new capabilities for metering and governing token consumption across the organization. When customers ask, we listen and deliver with care.
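Conceptually, metering token consumption looks something like the sketch below, which tracks per-tenant usage against a quota and flags requests that would exceed it. Tenant names, limits, and the interface are hypothetical illustrations, not the product’s API.

```python
# Sketch of per-tenant token metering against a budget (illustrative only).

from collections import defaultdict

class TokenMeter:
    def __init__(self, quotas: dict[str, int]):
        self.quotas = quotas              # tokens allowed per tenant per billing period
        self.used = defaultdict(int)      # tokens consumed so far

    def record(self, tenant: str, prompt_tokens: int, completion_tokens: int) -> bool:
        """Return True if the request fits the tenant's remaining budget."""
        total = prompt_tokens + completion_tokens
        if self.used[tenant] + total > self.quotas.get(tenant, 0):
            return False                  # over budget: throttle, alert, or bill separately
        self.used[tenant] += total
        return True

meter = TokenMeter({"finance-app": 1_000_000, "support-bot": 250_000})
print(meter.record("support-bot", prompt_tokens=1_200, completion_tokens=800))   # True
print(meter.record("support-bot", prompt_tokens=300_000, completion_tokens=0))   # False
```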
As industries develop AI factories, countries build their sovereign AI, and AI agents emerge, infrastructure, ecosystems, and applications must be flexible and adaptable. Organizations that deploy AI efficiently will move faster, serve customers better, and reduce costs. But to realize this potential, AI must remain secure, scalable, and cost-effective without slowing the pace of innovation.
That’s where F5 comes in. Last March, we delivered performance, multi-tenancy, and security. Now, with BIG-IP Next for Kubernetes, we’re enabling innovation built to move at the speed of AI.
Our promise: More tokens per dollar, per watt. Try it and see the difference firsthand.
F5 is proud to be a Gold Sponsor of NVIDIA GTC Paris 2025. Visit us at Booth G27 to experience how the F5 Application Delivery and Security Platform supports secure, high-performance AI infrastructure, and attend our joint session with NVIDIA, Secure Infrastructure by Design: Building Trusted AI Factories, on Thursday, June 12 at 10:00 AM CEST.
To learn more about F5 BIG-IP Next for Kubernetes deployed on NVIDIA BlueField-3 DPUs, see my previous blog post. Also, be sure to read our press release for today’s announcement.
F5’s focus on AI doesn’t stop here—explore how F5 secures and delivers AI apps everywhere.