
Delivering Infrastructure Managed Services for AI & HPC Workloads

Unlike traditional IT systems, HPC & AI infrastructures use different processors, platforms, and networks, and they demand precision operations. These differences can limit your internal IT team’s ability to manage performance and uptime.


Solving Architecture
Precision Management

Sensitive Equipment

AI & HPC clusters use specialty components with unique failure signatures. Traditional monitoring tools may need to be modified to detect these signatures and manage the components properly.

Expensive GPUs

As with any cluster, those used for AI & HPC must be continuously managed with health checks, because performance issues and failures can carry a significant financial impact.

Reliable Methods

Persistent monitoring, alerting, and escalation management, conducted by NVIDIA-certified Managed Services engineers and backed by SLA-based uptime reporting, help prevent workload delays.

Best-in-Class Architecture

AI Success Requires Proven Management Experience

Penguin Solutions has over 25 years of experience in building and managing HPC clusters and over 8 years of experience with very large clusters. This certified experience has allowed us to develop unmatched capabilities with very large AI factories.


2+ Billion Hours

Driving uptime and throughput for large-scale, complex environments with over 2 billion hours of GPU runtime.


85,000 GPUs

With more than 85,000 GPUs deployed and under our managed services, we continue to meet current and evolving AI infrastructure requirements.


Centers of Excellence (CoEs)

From engineering to technical operations, Penguin offers specialized knowledge and orchestrates key functional areas to ensure optimal performance.

In the News

Expertise in Managing Large NVIDIA DGX Clusters

Our years of experience have allowed us to develop unmatched capabilities in running large AI factories. For example, we are helping Meta manage its AI Research SuperCluster, with over 2,000 NVIDIA DGX systems, 16,000 NVIDIA A100 Tensor Core GPUs, 500 PB of storage, and 40,000 NVIDIA InfiniBand networking links.

Penguin Solutions worked with Meta’s operations team on the hardware integration to deploy the cluster and set up major parts of the control plane. Penguin’s hardware and software expertise helped to unite contributions from NVIDIA and Pure Storage.

Together, these three partners were key to supplying Meta with an optimized solution—the new AI Research SuperCluster (RSC)—which enabled Meta to lay the groundwork for the Metaverse.


Delivering AI-Optimized Architecture and Managed Services

Penguin Solutions continues to provide exceptional uptime and availability for Meta’s large NVIDIA DGX cluster.


Certified NVIDIA DGX-Ready Managed Services Partner

Penguin Solutions has designed large NVIDIA DGX clusters, with high-speed NVIDIA InfiniBand networking and optimized storage. We have relationships and expertise with most storage vendors, allowing us to provide bespoke solutions for every customer.


Talk to the Experts at Penguin Solutions

Reach out today to learn more about how we can help assure production readiness and change management as a certified NVIDIA DGX-Ready Managed Services provider, with a full set of end-to-end services including complete 24/7 support.
