Optimize HPC & AI Workloads With AI Managed Services

Solving Architecture Precision Management

Sensitive Equipment

AI & HPC clusters use specialty components with unique failure signatures. Traditional monitoring tools may need to be adapted to detect these signatures and manage the components properly.

Expensive GPUs

As with any cluster, those used for AI & HPC must be continuously managed with health checks, because performance issues and failure patterns can have a significant financial impact.

Reliable Methods

Persistent monitoring, alerting, and escalation management conducted by NVIDIA-certified Managed Services engineers with SLA-based uptime reporting prevents workload delays.
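As a rough illustration of what SLA-based uptime reporting involves, the sketch below computes availability for a reporting period from a list of outage windows and compares it with a target. The function name, the sample outages, and the 99.9% target are hypothetical and are not taken from any Penguin Solutions tooling or contract.

```python
from datetime import datetime, timedelta


def uptime_percent(period_start, period_end, outages):
    """Return availability (%) over the reporting period.

    `outages` is a list of (start, end) datetime pairs. Overlapping
    outages are merged so downtime is not counted twice.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip each outage to the reporting window and drop empty ones.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in outages
        if min(e, period_end) > max(s, period_start)
    )

    down = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged outage and start a new one.
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = s, e
        else:
            # Overlapping outage: extend the current merged window.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()

    return 100.0 * (total - down) / total


# Hypothetical 100-hour reporting window with three outage records,
# two of which overlap.
start = datetime(2024, 1, 1)
end = start + timedelta(hours=100)
sample_outages = [
    (start + timedelta(hours=10), start + timedelta(hours=11)),
    (start + timedelta(hours=10, minutes=30), start + timedelta(hours=11, minutes=30)),
    (start + timedelta(hours=50), start + timedelta(hours=50, minutes=30)),
]
SLA_TARGET = 99.9  # assumed target, for illustration only
availability = uptime_percent(start, end, sample_outages)
meets_sla = availability >= SLA_TARGET
```

Merging overlapping windows before summing matters when several alerts fire for the same incident. Without it, the report would overstate downtime.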

Best-in-Class Architecture

AI Success Requires Proven Management Experience

Penguin Solutions has over 25 years of experience in building and managing HPC clusters and over 8 years of experience with very large clusters. This experience has allowed us to develop unmatched capabilities in managing very large AI factories.

2+ Billion Hours

Driving uptime and throughput for large-scale, complex environments with over 2 billion hours of GPU runtime.

85,000 GPUs

With more than 85,000 GPUs deployed and under our management, we continue to meet current and evolving AI infrastructure requirements.

Centers of Excellence (CoEs)

From engineering to technical operations, Penguin offers specialized knowledge and orchestrates key functional areas to ensure optimal performance.

In the News

Expertise in Managing Large NVIDIA DGX Clusters

Our years of experience have allowed us to develop unmatched capabilities in running large AI factories. For example, we are helping Meta manage its AI Research SuperCluster, with over 2,000 NVIDIA DGX systems, 16,000 NVIDIA A100 Tensor Core GPUs, 500 PB of storage, and 40,000 NVIDIA InfiniBand networking links.

Penguin Solutions worked with Meta’s operations team on the hardware integration to deploy the cluster and set up major parts of the control plane. Penguin’s hardware and software expertise helped to unite contributions from NVIDIA and Pure Storage.

Together, these three partners were key to supplying Meta with an optimized solution, the new AI Research SuperCluster (RSC), which enabled Meta to lay the groundwork for the Metaverse.

Delivering AI-Optimized Architecture and AI Managed Services

Penguin Solutions continues to provide exceptional uptime and availability for Meta’s large NVIDIA DGX cluster.

Certified NVIDIA DGX-Ready AI Managed Services Partner

Penguin Solutions has designed large NVIDIA DGX clusters, with high-speed NVIDIA InfiniBand networking and optimized storage. We have relationships and expertise with most storage vendors, allowing us to provide bespoke solutions for every customer.

Our Process: Additional Services

AI & HPC Infrastructure Comprehensive Services

Penguin Solutions is dedicated to our customers’ success. With 25 years of HPC experience in designing, building, deploying, and managing AI and accelerated computing clusters, we have enabled some of the world’s most sophisticated workloads.

Design

Design Infrastructure Services

Accelerate time to value by basing system architectures on a proven set of designs that have been validated at scale in numerous production deployments.

Build

Building Infrastructure Services

Achieve high rates of system stability with our in-factory experts, who validate all components of the compute cluster, including rack integration, network configuration, and burn-in testing.

Deploy

Deployment Infrastructure Services

Drive on-site installations, including coordination with data storage partners, data center staff, and system cooling infrastructure, and use our ClusterWare software to validate production readiness.

Request a callback

Talk to the Experts at Penguin Solutions

Reach out today to learn more about how we can help ensure production readiness and change management as a certified NVIDIA DGX-Ready AI Managed Services provider, with a full set of end-to-end services including complete 24/7 support.
