
Delivering Infrastructure Managed Services for AI & HPC Workloads

Unlike traditional IT systems, HPC & AI infrastructures use different processors, platforms, and networks, and they involve precision operations. These differences can limit your internal IT team’s ability to manage performance and uptime.


Solving Architecture
Precision Management

Sensitive Equipment

AI & HPC clusters use specialty components with unique failure signatures. Traditional monitoring tools may need to be modified to detect these signatures and manage the components properly.

Expensive GPUs

As with any cluster, those used for AI & HPC must be continuously managed with health checks, because performance issues and failure patterns can have a significant financial impact.

Reliable Methods

Persistent monitoring, alerting, and escalation management, conducted by NVIDIA-certified Managed Services engineers with SLA-based uptime reporting, help prevent workload delays.

Best-in-Class Architecture

AI Success Requires Proven Management Experience

Penguin Solutions has over 25 years of experience in building and managing HPC clusters and over 8 years of experience with very large clusters. This certified experience has allowed us to develop unmatched capabilities in managing very large AI factories.


2+ Billion Hours

Driving uptime and throughput for large-scale, complex environments, with over 2 billion hours of GPU runtime.


85,000 GPUs

With more than 85,000 GPUs deployed and under our management, we continue to meet current and evolving AI infrastructure requirements.


Centers of Excellence (CoEs)

From engineering to technical operations, Penguin offers specialized knowledge and orchestrates key functional areas to ensure optimal performance.

In the News

Expertise in Managing Large NVIDIA DGX Clusters

Our years of experience have allowed us to develop unmatched capabilities in running large AI factories. For example, we are helping Meta manage the Meta Research Super Cluster, with over 2,000 NVIDIA DGX systems, 16,000 NVIDIA A100 Tensor Core GPUs, 500 PB of storage, and 40,000 NVIDIA InfiniBand networking links.

Penguin Solutions worked with Meta’s operations team on the hardware integration to deploy the cluster and set up major parts of the control plane. Penguin’s hardware and software expertise helped to unite contributions from NVIDIA and Pure Storage.

Together, these three partners were key to supplying Meta with an optimized solution—the new AI Research SuperCluster (RSC)—which enabled Meta to lay the groundwork for the Metaverse.

Read full story
Read press release

Delivering AI-Optimized Architecture and AI Managed Services

Penguin Solutions continues to provide exceptional uptime and availability for Meta’s large NVIDIA DGX cluster.


Certified NVIDIA DGX-Ready AI Managed Services Partner

Penguin Solutions has designed large NVIDIA DGX clusters, with high-speed NVIDIA InfiniBand networking and optimized storage. We have relationships and expertise with most storage vendors, allowing us to provide bespoke solutions for every customer.

Our Process: Additional Services

AI & HPC Infrastructure Comprehensive Services

Penguin Solutions is dedicated to our customers’ success. With 25 years of HPC experience in designing, building, deploying, and managing AI and accelerated computing clusters, we have enabled some of the world’s most sophisticated workloads.

Design

Design Infrastructure Services

Accelerate time to value by basing system architectures on a proven set of designs that have been validated at scale in numerous production deployments.

Discover Our Design Service
Build

Building Infrastructure Services

Achieve high rates of system stability with our in-factory experts, who validate all components of the compute cluster, including rack integration, network configuration, and burn-in testing.

Discover Our Build Service
Deploy

Deployment Infrastructure Services

Drive on-site installations, including coordination with data storage partners, data center staff, and system cooling infrastructure, and use our ClusterWare software to validate production readiness.

Discover Our Deployment Service
Request a callback

Talk to the Experts at Penguin Solutions

Reach out today to learn more about how we can help assure production readiness and change management as a certified NVIDIA DGX-Ready AI Managed Services provider, with a full set of end-to-end services including complete 24/7 support.

Let's Talk