CoreWeave SUNK

The industry's first unified training system for the most demanding AI workloads—delivering production-grade reliability and operational visibility for large, long-running training jobs.

Redefining the AI research cluster for production-grade training

SUNK is built for AI research teams running large, long-running training jobs, where predictability, reliability, and operational visibility matter as much as raw performance. SUNK preserves the Slurm workflows researchers rely on while bringing Kubernetes-native operational discipline to the cluster.

Lifecycle unity

Unify how researchers run Slurm and how platform teams operate clusters, without weeks of bespoke setup. SUNK User Provisioning (SUP) automates secure onboarding and reduces identity and configuration drift so teams stay aligned from day one.

Reliability

Run large, long-running training jobs with production-grade reliability. CoreWeave Mission Control monitors cluster health end to end, detects silent hardware issues and GPU stragglers, and mitigates disruption before it compounds into lost training time.

Performance

Maximize productive training time with topology-aware scheduling and predictable cluster behavior tuned for distributed training. Keep multi-day runs moving forward by reducing disruption, retries, and fragmentation across GPU resources.

Observability

Get operational visibility from infrastructure health to job-level behavior. Correlate Slurm metrics with GPU, network, and storage signals to spot bottlenecks fast, validate performance, and keep training on track.

The market's most proven Slurm‑on‑Kubernetes offering

SUNK is the industry’s first unified training system for the most demanding AI workloads, built for large, long-running training jobs where reliability and operational visibility matter.

SUNK self-service

Bring production-ready SUNK clusters online faster. SUNK self-service streamlines how teams deploy and manage clusters while reducing setup friction before productive training begins.

SUNK Anywhere

Extend the same unified training system beyond CoreWeave so teams can preserve one way of running demanding AI workloads as infrastructure environments expand. SUNK Anywhere helps reduce fragmentation across environments.

Mission Control

Monitor cluster health end to end with disciplined lifecycle control. Mission Control detects hardware anomalies and GPU stragglers and automatically mitigates failures to keep long-running training jobs on track.

Unified scheduling and observability

SUNK integrates Slurm and Kubernetes with tighter synchronization, unified scheduling, and built-in observability hooks. Reduce fragmented tooling and manual coordination as workloads scale.

Proven by leading pioneers at production scale

CoreWeave SUNK Self-Service is a big improvement for customers who want easy deployments. Even if you've got a long term committed contract, there is lots of reasons to spin up clusters quickly in a self-service manner. At the end of the day, speed is the moat. CoreWeave recognizes this, and supports their customers by moving at the speed they need to be successful.

Dylan Patel

Founder, CEO, and Chief Analyst, SemiAnalysis

On rack-scale GB200 systems, CoreWeave SUNK's topology-aware scheduling and custom dashboards enabled faster, more efficient training runs and higher cluster utilization. Integrated health checks, automated node remediation, and deep observability reduced interruptions, enabling researchers to iterate faster and platform engineers to spend less time firefighting.

Brian Belgodere

Senior Technical Staff Member, IBM

We needed infrastructure that scales without dragging operations along with it. SUNK delivered that out of the box: shared file systems, automated user provisioning, and customizable environments that let our researchers focus on research instead of fighting their tooling. CoreWeave Mission Control's node health checks and remediation alone have saved us significant operational overhead. It just works, and at our pace of growth, that is critical.

Sam Kottler

ML Infra, Cursor

A faster path to production-ready SUNK clusters

SUNK self-service streamlines how teams deploy and manage SUNK clusters, capturing CoreWeave’s operational learnings from supporting research clusters of all sizes. Reduce setup friction, simplify operations, and move from cluster bring-up to productive training faster.

Run on industry-leading cloud infrastructure services

SUNK runs on CoreWeave infrastructure services built for AI training performance, scale, and operational consistency.

Compute services

Get the latest GPU compute you need for your most complex AI workloads through a Kubernetes-native environment.

Storage services

Flexible, high-performance storage solutions that are purpose-built for AI.

Networking services

High-performance networking designed for optimal cluster scale-out and connectivity.

Supercomputing scale and enterprise-grade security

CoreWeave's massive GPU megaclusters help support multi-trillion-parameter model training.

Frequently asked questions

What is CoreWeave SUNK?

CoreWeave SUNK, part of CoreWeave’s research fabric, is the industry’s first unified training system for the most demanding AI workloads. It applies cloud-native scale and agility to deliver production-grade reliability and operational visibility for large, long-running training jobs.

How does SUNK self-service help teams bring clusters into operation faster?

With the general availability of SUNK self-service, customers can bring SUNK clusters into operation using a guided path that captures CoreWeave’s operational learnings. Standardized setups reduce drift over time and provide a production-ready starting point that is easier to onboard, easier to manage, and more consistent as clusters evolve.

What is SUNK Anywhere?

SUNK Anywhere extends CoreWeave’s unified training system beyond CoreWeave infrastructure so teams can operate demanding AI workloads with consistent workflows and operational discipline across environments. This helps platform teams expand without fragmentation and helps researchers keep familiar scheduling behavior and workflows as their infrastructure footprint grows.

How does Mission Control support reliability and operational visibility for long-running training jobs?

Mission Control provides continuous health monitoring, automated remediation, and deep operational visibility to help keep large, long-running jobs operating reliably at scale. Mission Control also expands observability with GPU straggler detection, available in private preview, to help identify performance outliers that can degrade synchronized training.
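
To make the idea of a straggler concrete, here is a minimal Python sketch of one generic way to flag ranks that are unusually slow in a synchronized training step. It is an illustration only and does not describe how Mission Control detects stragglers. The function name `find_stragglers`, the per-rank step-time input, and the 3.5 threshold are all assumptions. The technique is a standard modified z-score based on the median absolute deviation (MAD).

```python
from statistics import median


def find_stragglers(step_times: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return the ranks whose step time is an unusually slow outlier.

    Hypothetical helper: uses the modified z-score built on the median
    absolute deviation (MAD), which is robust to the outliers it looks for.
    """
    values = list(step_times.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the ranks report the same time, so the MAD is
        # degenerate; this sketch flags nothing in that case.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    # Only positive deviations (slower than the median) are flagged.
    return sorted(
        rank
        for rank, t in step_times.items()
        if 0.6745 * (t - med) / mad > threshold
    )
```

In synchronized training every rank waits for the slowest one, so a single rank that takes 1.6 s while its peers take about 1.0 s sets the pace for the whole step. Flagging it early is what limits the lost training time.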

What performance and reliability outcomes has SUNK demonstrated?

SUNK is designed to maximize productive training time, with high training goodput, a high effective training time ratio (ETTR), and improved reliability in benchmark scenarios. Published proof points include up to 96% training goodput, 97–98% ETTR, 10× longer mean time to failure (MTTF), and more than 50% model FLOPs utilization (MFU) at large scale.
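
For readers who want to see how metrics like these are commonly computed, the Python sketch below uses widely used definitions of ETTR, MTTF, and MFU. It is not CoreWeave's published methodology. The `RunStats` fields, the choice of wall-clock time as the MTTF numerator, and the 6 × parameters × tokens estimate of training FLOPs for dense transformers are all assumptions. Goodput is left out because its definition varies between sources.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical summary of one training run (all fields are assumptions)."""
    wall_clock_hours: float    # total elapsed time the job held the cluster
    productive_hours: float    # time spent making forward training progress
    failures: int              # interruptions observed during the run
    params: float              # model parameter count
    tokens_processed: float    # tokens trained on during productive time
    num_gpus: int
    peak_flops_per_gpu: float  # vendor peak FLOP/s for the chosen precision


def ettr(s: RunStats) -> float:
    """Effective training time ratio: productive time over wall-clock time."""
    return s.productive_hours / s.wall_clock_hours


def mttf_hours(s: RunStats) -> float:
    """Mean time to failure: wall-clock hours divided by the failure count."""
    return s.wall_clock_hours / s.failures if s.failures else float("inf")


def mfu(s: RunStats) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over aggregate peak FLOP/s.

    Training FLOPs are estimated as 6 * params * tokens, a common
    approximation for dense transformers (forward plus backward pass).
    """
    achieved = 6.0 * s.params * s.tokens_processed / (s.productive_hours * 3600.0)
    return achieved / (s.num_gpus * s.peak_flops_per_gpu)
```

Under these definitions, a run that holds a cluster for 100 hours and spends 97 of them making training progress has an ETTR of 97%. With two interruptions in that window, its MTTF is 50 hours.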