+1-415-954-2800 Support

Penguin Solutions Delivers
NVIDIA DGX-Ready Managed Services

Penguin Solutions™ is an expert in managing large NVIDIA DGX clusters.

Penguin Solutions™ has 25 years of experience in managing HPC clusters and
6 years of experience in managing very large NVIDIA DGX clusters. Our years of experience have allowed us to develop  unmatched capabilities with running large AI factories. For example, we are helping Meta manage the Meta Research Super Cluster, with over 2000 NVIDIA DGX systems, 16,000 NVIDIA A100 Tensor Core GPUs, 500 PB of storage and 40,000 NVIDIA InfiniBand networking links.

Penguin continues to provide exceptional uptime and availability for Meta’s large NVIDIA DGX cluster.

“Working in partnership with our implementation partner, Penguin Solutions, we improved our overall cluster management. By the time we completed the second phase of building RSC, availability stayed above 95 percent on a consistent basis. This was no small feat given that we added a 10K GPU cluster while concurrently running multiple research projects.” (Meta website, May 18, 2023).




Penguin Solutions is a certified NVIDIA DGX-ready Managed Services partner.

Penguin has designed large NVIDIA DGX clusters, with high-speed NVIDIA InfiniBand networking and optimized storage. We have relationships and expertise with most storage vendors, allowing us to provide bespoke solutions for every customer. Our designs are field proven, scalable, future proof, and they de-risk our customers’ investments.

Tailored

  • System, network and storage designs that de-risk deployments and enhance stability and productivity

Proven

  • More than 25 years in HPC
  • Over seven years building AI Factories
  • Deployment of over 86,000 GPUs

Innovative

  • Hybrid Cloud architectures to accelerate system availability and enable burst and disaster recovery
  • Sustainable computing with immersion cooling

Tailored

  • Provision AI and HPC server clusters at scale
  • Sophisticated factory capabilities to rack, cable, and validate full AI clusters in Penguin’s environment

Proven

  • Expert cluster integration
  • Validated software stack process eliminating compatibility issues

Innovative

  • In-factory burn-in testing and performance validation drives smooth deployments and rapid user access

Penguin is an optimal integration partner offering greater buying power, accurate and rapid installation, comprehensive supply chain management, and predictable project orchestration that addresses the most complex solutions. Our solutions complete full in-factory integration and burn-in testing to ensure that deployed systems are stable and robust at initial deployment.

Thanks to our NVIDIA DGX DevOps and Services teams, Penguin provides:

  • Monitoring
  • Detection and remediation of issues with all core components (nodes, network, storage, overall cluster, DNS servers, overall compute availability)
  • Automated ticketing and integration to customer dashboards
  • Automation scripting (Ansible, Python)
  • Operational playbooks
  • Regular improvements on bandwidth targets and compute targets, node deployment and provisioning

Expert Factory Cluster Integration

Validated Software Stack

Penguin Cluster Provisioning

In-Factory Performance Validation

Tailored

  • On-site integration and validation by Penguin’s expert services team

Proven

  • System-level testing and project management expertise accelerates system availability and performance
  • 700 racks built, delivered, live within 8 months

Innovative

  • Penguin-supplied monitoring software continuously validates system health – and maintains cluster availability

White Glove Services

Measurable Success

Production Readiness Reporting

Complete DevOps-based Monitoring

Tailored

  • On-site spares inventory and services personnel for maximum system availability

Proven

  • Expert support team and Service Level Agreement (SLA) management

Innovative

  • Cloud-first deployment model to drive immediate value
  • Cloud consumption cost management reporting and guardrails

With Penguin Managed Services, customers enjoy enhanced system availability levels and stable system operation for long-running AI workloads. The result is improved return on investment in AI technology.

Professional Services

Hosting Services

Managed Services

Contact Penguin Solutions today to discuss your HPC for AI and Generative AI for business requirements. Click the button below or call us at 415-954-2800