Hedgehog Simplifies On-Premises AI Network Configuration
The potential of on-premises AI infrastructure is immense, offering unparalleled control, security, and performance for temperamental machine learning workloads. Realizing that potential, however, means tackling the notoriously complex problem of network configuration and management. Modern networks built for AI training are a maze of high-speed connections, multiple network types, and intricate multi-tenancy requirements. Traditionally, this has meant painstaking manual configuration of VRFs, BGP with EVPN, VXLANs, and VLANs, a process that is prone to errors and carries significant operational overhead.
Enter Hedgehog, a platform designed to dramatically simplify this complexity through intelligent abstraction, robust automation, and the power of the Kubernetes API.
The Headaches of High-Performance AI Networks
Imagine equipping a single server with eight 400 GbE ports for backend RDMA traffic and another pair of 200 GbE ports for frontend traffic. Now multiply that across dozens or even hundreds of servers in an AI training cluster. The volume of physical wiring alone is daunting, let alone the intricate software configurations required to ensure optimal performance and isolation. Traditional multi-tenancy in these environments demands specialized networking expertise and plenty of manual configuration, so every change carries the risk that a single mistake causes a misconfiguration or a network disruption. Each disruption has a cost, both in idle equipment and in data left waiting for the system to be ready to deliver value.
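To put rough numbers on that scale, here is a small back-of-the-envelope sketch. The per-server port counts come from the example above; the cluster sizes and the `cluster_totals` helper are illustrative assumptions, and the link count covers only server-facing cables, not switch-to-switch links.

```python
# Ports per server, taken from the example in the text.
BACKEND_PORTS = 8      # 400 GbE ports for backend RDMA traffic
BACKEND_GBPS = 400
FRONTEND_PORTS = 2     # 200 GbE ports for frontend traffic
FRONTEND_GBPS = 200


def cluster_totals(servers: int) -> dict:
    """Return server-facing link count and aggregate bandwidth for a cluster."""
    links = servers * (BACKEND_PORTS + FRONTEND_PORTS)
    backend_tbps = servers * BACKEND_PORTS * BACKEND_GBPS / 1000
    frontend_tbps = servers * FRONTEND_PORTS * FRONTEND_GBPS / 1000
    return {
        "links": links,
        "backend_tbps": backend_tbps,
        "frontend_tbps": frontend_tbps,
    }


if __name__ == "__main__":
    # Cluster sizes below are arbitrary examples, not figures from Hedgehog.
    for servers in (1, 32, 128):
        print(servers, cluster_totals(servers))
```

Even a modest 128-server cluster works out to 1,280 server-facing links, each of which has to land on the right switch port with the right configuration.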
Simplification Through Abstraction: A New Network Language
Hedgehog takes this challenge head-on with two key abstractions that fundamentally change how network engineers interact with the infrastructure. The first is the Low-Level Wiring Diagram. This abstraction allows users to describe their physical network topology to Hedgehog using intuitive YAML configurations, a format that is quickly becoming the common language of the network. Instead of wrestling with cryptic CLI commands, users can specify switch models, roles (spine/leaf), and connections. This simplifies the enablement of AI-specific features like RoCE (RDMA over Converged Ethernet) without requiring an in-depth understanding of its myriad options.
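The value of describing wiring as data is that it can be checked before anything is plugged in. The sketch below is not Hedgehog's schema: the `Switch` and `Link` types, their field names, and the rule that every fabric link joins one spine and one leaf are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    model: str  # hypothetical model string, e.g. "model-x"
    role: str   # "spine" or "leaf"


@dataclass(frozen=True)
class Link:
    a: str  # name of the switch on one end
    b: str  # name of the switch on the other end


def validate_wiring(switches, links):
    """Return a list of problems; an empty list means the diagram is consistent."""
    problems = []
    by_name = {s.name: s for s in switches}
    for s in switches:
        if s.role not in ("spine", "leaf"):
            problems.append(f"{s.name}: unknown role {s.role!r}")
    for link in links:
        end_a, end_b = by_name.get(link.a), by_name.get(link.b)
        if end_a is None or end_b is None:
            problems.append(f"{link.a}<->{link.b}: references an undeclared switch")
            continue
        # Assumed rule for a two-tier fabric: each link joins a spine and a leaf.
        if {end_a.role, end_b.role} != {"spine", "leaf"}:
            problems.append(f"{link.a}<->{link.b}: expected one spine and one leaf")
    return problems


if __name__ == "__main__":
    switches = [
        Switch("spine-1", "model-x", "spine"),
        Switch("leaf-1", "model-y", "leaf"),
        Switch("leaf-2", "model-y", "leaf"),
    ]
    links = [Link("leaf-1", "spine-1"), Link("leaf-2", "spine-1")]
    print(validate_wiring(switches, links))
```

A typo in a switch name or a leaf cabled to another leaf shows up as a readable message rather than as a fabric that fails to converge.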
The second key piece is the Operational Abstraction. Drawing inspiration from cloud giants like Amazon, Hedgehog introduces Virtual Private Clouds (VPCs). These declarative YAML configurations allow for easy partitioning of a cluster into multiple tenants. Users define network parameters like DHCP, IP ranges, and host routes, and choose between Layer 2 or Layer 3 modes of operation. This empowers teams to manage multi-tenancy with unprecedented ease and clarity. Hedgehog walked through its use of VPCs in a video from its recent appearance at Networking Field Day in July.
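A declarative tenant definition also lends itself to simple sanity checks. The following is a minimal sketch using Python's standard `ipaddress` module; the `Subnet` type, its field names, and the assumption that subnets inside one VPC must not overlap are illustrative and do not describe Hedgehog's actual validation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Subnet:
    name: str
    cidr: str        # e.g. "10.0.1.0/24"
    dhcp_start: str  # first address handed out by DHCP
    dhcp_end: str    # last address handed out by DHCP


def check_vpc(subnets):
    """Return a list of problems found in a VPC's subnets (empty if none)."""
    problems = []
    seen = []  # (name, network) pairs already checked
    for s in subnets:
        net = ipaddress.ip_network(s.cidr)
        start = ipaddress.ip_address(s.dhcp_start)
        end = ipaddress.ip_address(s.dhcp_end)
        if start not in net or end not in net or start > end:
            problems.append(f"{s.name}: DHCP range is not inside {net}")
        for other_name, other_net in seen:
            if net.overlaps(other_net):
                problems.append(f"{s.name}: overlaps {other_name}")
        seen.append((s.name, net))
    return problems


if __name__ == "__main__":
    vpc = [
        Subnet("train", "10.0.1.0/24", "10.0.1.10", "10.0.1.99"),
        Subnet("infer", "10.0.2.0/24", "10.0.2.10", "10.0.2.99"),
    ]
    print(check_vpc(vpc))
```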
At the heart of Hedgehog’s approach is the Kubernetes API. Every action, every configuration within the Hedgehog platform, is built upon this well-known and widely adopted API. Kubernetes has become one of the de facto standards in the cloud world, so building on it gives network engineers a familiar interface and access to a vast ecosystem of existing tools, including robust Role-Based Access Control (RBAC). By extending Kubernetes with Custom Resource Definitions (CRDs) to configure the physical network, Hedgehog allows users to describe their desired infrastructure state in a language they understand, hiding the complex, vendor-specific CLI commands and logic that have historically plagued network operations. Everyone works from a common language, and outcomes become much more predictable.
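Every Kubernetes object, custom or built in, shares the same envelope of `apiVersion`, `kind`, `metadata`, and `spec`. The sketch below builds such an object for a VPC. The API group `fabric.example.com/v1`, the `mode` and `subnets` fields, and the `vpc_manifest` helper are placeholders invented for illustration; the real field names are defined by Hedgehog's CRDs.

```python
import json


def vpc_manifest(name, namespace, mode, subnets):
    """Build a Kubernetes-style custom resource describing one tenant VPC."""
    if mode not in ("l2", "l3"):
        raise ValueError(f"mode must be 'l2' or 'l3', got {mode!r}")
    return {
        "apiVersion": "fabric.example.com/v1",  # hypothetical group/version
        "kind": "VPC",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"mode": mode, "subnets": subnets},  # hypothetical spec fields
    }


if __name__ == "__main__":
    manifest = vpc_manifest(
        "tenant-a",
        "default",
        "l3",
        [{"name": "train", "cidr": "10.0.1.0/24"}],
    )
    # Kubernetes accepts manifests as JSON as well as YAML.
    print(json.dumps(manifest, indent=2))
```

Because the object is ordinary Kubernetes data, existing tooling such as RBAC policies and GitOps pipelines can apply to it without knowing anything about switches.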
Automated Provisioning and Lifecycle Management
Once the wiring diagram is fed into the Kubernetes API, the magic of automation begins. Hedgehog automatically handles zero-touch provisioning, booting, and installing the network operating system on all switches. It takes on the tasks of managing upgrades and lifecycle issues for switches, and configures an agent on each switch to enforce the specified configuration. No need for manual intervention any longer.
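The enforcement step follows the general desired-state pattern used throughout Kubernetes: compare what was declared with what is observed, then apply only the difference. Here is a generic sketch of that idea. The flat key/value configuration model and the function names are assumptions for illustration, not a description of how Hedgehog's switch agent is implemented.

```python
def plan_changes(desired: dict, observed: dict) -> list:
    """Return (action, key, value) tuples that move observed toward desired."""
    changes = []
    for key, value in sorted(desired.items()):
        if key not in observed:
            changes.append(("add", key, value))
        elif observed[key] != value:
            changes.append(("update", key, value))
    # Anything present on the device but absent from the spec is removed.
    for key in sorted(observed.keys() - desired.keys()):
        changes.append(("remove", key, None))
    return changes


def reconcile(desired: dict, observed: dict) -> dict:
    """Apply the planned changes to a copy of the observed state."""
    result = dict(observed)
    for action, key, value in plan_changes(desired, observed):
        if action == "remove":
            del result[key]
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    desired = {"vlan-100": "tenant-a", "vlan-200": "tenant-b"}
    observed = {"vlan-100": "tenant-a", "vlan-300": "stale"}
    print(plan_changes(desired, observed))
    print(reconcile(desired, observed) == desired)
```

Running the loop repeatedly is safe: once the device matches the spec, the plan is empty and nothing changes.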
For control nodes and gateways, users can boot from a generated ISO, providing an appliance-style experience without extra configuration. Future enhancements include PXE boot over the management network for even simpler provisioning of additional nodes. This automated system is designed to prevent common configuration mistakes, so users don’t misconfigure something and lock themselves out of a device. That makes the network more resilient for operations teams and users alike.
Simplified Multi-Tenancy and Intelligent Peering
Hedgehog’s VPC abstraction dramatically simplifies multi-tenancy. It automatically configures the underlying network components like BGP EVPN and VLANs. This gets your tenants up and running without manual intervention.
For inter-tenant communication, Hedgehog offers two powerful peering options:
- Switch-Based Peering: This method leverages the inherent capabilities of the network switches for direct communication between VPCs. The primary benefit is full cut-through bandwidth and extremely low latency, making it ideal for high-bandwidth, low-latency applications like AI training. However, switches have limited CPU and RAM, making them unsuitable for stateful network functions like firewalls or NAT. If you need “quick and dirty,” this is your solution.
- Gateway-Based Peering: This introduces a CPU-rich, high-bandwidth server into the traffic flow. This is what Hedgehog calls the Gateway. While it introduces slightly higher latency due to the additional hop, it enables a wide array of advanced network functions on commodity hardware. This includes basic firewalling (ACLs), port forwarding, implied NAT specific to the peering, and future capabilities like DoS protection, IDS/IPS, and Layer 7 inspection. The Gateway’s ability to run Hedgehog’s own data plane also provides enhanced monitoring and insights into traffic flow, offering detailed counters for NAT and firewall rules. If you need features and don’t mind a tiny bit of latency, you should be deploying the Gateway instead.
The choice between switch-based and gateway-based peering depends on the specific requirements of the workload. Switch-based peering is for pure speed and low latency between trusted tenants, while gateway-based peering prioritizes rich network services and security for more complex inter-VPC communication or connections to external networks.
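That decision can be written down as a small rule of thumb. The function below simply encodes the criteria from the paragraph above; the parameter names are hypothetical, and a real deployment would weigh more factors than these three flags.

```python
def choose_peering(needs_stateful_services: bool,
                   tenants_trusted: bool,
                   connects_external: bool = False) -> str:
    """Suggest 'switch' or 'gateway' peering for a pair of VPCs."""
    # Firewalling, NAT, and similar stateful functions need the Gateway's
    # CPU and memory; switches cannot provide them.
    if needs_stateful_services:
        return "gateway"
    # Untrusted tenants or connections to external networks call for the
    # security features the Gateway offers.
    if not tenants_trusted or connects_external:
        return "gateway"
    # Otherwise favor full bandwidth and the lowest latency.
    return "switch"


if __name__ == "__main__":
    print(choose_peering(needs_stateful_services=False, tenants_trusted=True))
    print(choose_peering(needs_stateful_services=True, tenants_trusted=True))
```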
Enhanced Monitoring and Troubleshooting
Hedgehog doesn’t stop at configuration and provisioning. It also provides comprehensive monitoring and troubleshooting capabilities. It collects and exports all metrics and logs from switches and control nodes to compatible monitoring systems like Grafana Cloud, Prometheus, and Loki.
The Kubernetes API itself becomes a single source of truth for the desired state of the network, with observed device statuses propagated back into the API. This reduces the need to log directly into devices for diagnostics. Furthermore, intuitive CLI tools allow for deep inspection into the fabric’s status, answering questions about connectivity between servers and even tracing the precise path traffic would take. This democratizes troubleshooting, making it more accessible and less reliant on low-level network CLI commands, which benefits software engineers who may not be network specialists. It also means that when the stakes are high and advanced troubleshooting is required, the A-team has access to the kinds of tools it needs.
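Once the topology lives in an API as data, a question like "how would traffic get from this server to that one?" becomes a graph search. The sketch below uses breadth-first search over an assumed list of links to find one shortest path. It is not Hedgehog's tooling, the node names are made up, and a real fabric would pick among equal-cost paths by hashing rather than always taking the first one found.

```python
from collections import deque


def trace_path(links, src, dst):
    """Return one shortest hop-by-hop path from src to dst, or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    previous = {src: None}  # node -> node it was first reached from
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # dst is unreachable from src


if __name__ == "__main__":
    links = [
        ("server-1", "leaf-1"),
        ("leaf-1", "spine-1"),
        ("leaf-1", "spine-2"),
        ("spine-1", "leaf-2"),
        ("spine-2", "leaf-2"),
        ("leaf-2", "server-2"),
    ]
    print(trace_path(links, "server-1", "server-2"))
```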
Bringing It All Together
By abstracting complexity, automating repetitive tasks, and leveraging the ubiquitous Kubernetes API, Hedgehog empowers organizations to deploy and operate high-performance AI networks with unprecedented ease, reliability, and operational simplicity. Thinking about how to deploy these networks is evolving slowly as we come to understand how their needs differ. Companies like Hedgehog are leading the way in ensuring that a complicated network mess doesn’t sink a multimillion-dollar project before it ever produces output.
To learn more about Hedgehog and their AI networking solutions, make sure to check out their website at https://Hedgehog.cloud. To see more of their presentation from Networking Field Day, check out their Networking Field Day presentation page.