CoreWeave SUNK

The industry's first unified training system for the most demanding AI workloads—delivering production-grade reliability and operational visibility for large, long-running training jobs.

Redefining the AI research cluster for production-grade training

SUNK is built for AI research teams running large, long-running training jobs, where predictability, reliability, and operational visibility matter as much as raw performance. SUNK preserves the Slurm workflows researchers rely on while bringing Kubernetes-native operational discipline to the cluster.

Lifecycle unity

Unify how researchers run Slurm and how platform teams operate clusters, without weeks of bespoke setup. SUNK User Provisioning (SUP) automates secure onboarding and reduces identity and configuration drift so teams stay aligned from day one.

Reliability

Run large, long-running training jobs with production-grade reliability. CoreWeave Mission Control monitors cluster health end to end, detects silent hardware issues and GPU stragglers, and mitigates disruption before it compounds into lost training time.

Performance

Maximize productive training time with topology-aware scheduling and predictable cluster behavior tuned for distributed training. Keep multi-day runs moving forward by reducing disruption, retries, and fragmentation across GPU resources.

Observability

Get operational visibility from infrastructure health to job-level behavior. Correlate Slurm metrics with GPU, network, and storage signals to spot bottlenecks fast, validate performance, and keep training on track.

The market's most proven Slurm‑on‑Kubernetes offering

SUNK is the industry’s first unified training system for the most demanding AI workloads, built for large, long-running training jobs where reliability and operational visibility matter.

SUNK self-service

Bring production-ready SUNK clusters online faster. SUNK self-service streamlines how teams deploy and manage clusters while reducing setup friction before productive training begins.

SUNK Anywhere

Extend the same unified training system beyond CoreWeave so teams can preserve one way of running demanding AI workloads as infrastructure environments expand. SUNK Anywhere helps reduce fragmentation across environments.

Mission Control

Monitor cluster health end to end with disciplined lifecycle control. Mission Control detects hardware anomalies and GPU stragglers and automatically mitigates failures to keep long-running training jobs on track.

Unified scheduling and observability

SUNK integrates Slurm and Kubernetes with tighter synchronization, unified scheduling, and built-in observability hooks. Reduce fragmented tooling and manual coordination as workloads scale.

Proven by leading pioneers at production scale

Zoho
Rev.com
Altum
Alethea
Databricks
OpenAI
Google
MistralAI
Cohere
Jane Street
Decart
Cloudflare
Abridge
Stability AI
RunDiffusion
Mozilla
Inflection
Fireworks AI
Debuild
Augment
Conjecture
Chai
NovelAI
Runway
General Intuition
World Labs

A faster path to production-ready SUNK clusters

SUNK self-service streamlines how teams deploy and manage SUNK clusters, capturing CoreWeave’s operational learnings from supporting research clusters of all sizes. Reduce setup friction, simplify operations, and move from cluster bring-up to productive training faster.

Run on industry-leading cloud infrastructure services

SUNK runs on CoreWeave infrastructure services built for AI training performance, scale, and operational consistency.

Compute services

Get the latest GPU compute you need for your most complex AI workloads through a Kubernetes-native environment.

Storage services

Flexible, high-performance storage solutions purpose-built for AI.

Networking services

High-performance networking designed for optimal cluster scale-out and connectivity.

Supercomputing scale and enterprise-grade security

CoreWeave megaclusters help support the training of multi-trillion-parameter models.

Frequently asked questions

What is CoreWeave SUNK?

CoreWeave SUNK, part of CoreWeave’s research fabric, is the industry’s first unified training system for the most demanding AI workloads. It applies cloud-native scale and agility to deliver production-grade reliability and operational visibility for large, long-running training jobs.

How does SUNK self-service help teams bring clusters into operation faster?

With the general availability of SUNK self-service, customers can bring SUNK clusters into operation using a guided path that captures CoreWeave’s operational learnings. Standardized setups reduce drift over time and provide a production-ready starting point that is easier to onboard, easier to manage, and more consistent as clusters evolve.

What is SUNK Anywhere?

SUNK Anywhere extends CoreWeave’s unified training system beyond CoreWeave infrastructure so teams can operate demanding AI workloads with consistent workflows and operational discipline across environments. This helps platform teams expand without fragmentation and helps researchers keep familiar scheduling behavior and workflows as their infrastructure footprint grows.

How does Mission Control support reliability and operational visibility for long-running training jobs?

Mission Control provides continuous health monitoring, automated remediation, and deep operational visibility to help keep large, long-running jobs operating reliably at scale. Mission Control also expands observability with GPU straggler detection, available in private preview, to help identify performance outliers that can degrade synchronized training.
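As a rough illustration of what straggler detection involves, the sketch below flags ranks whose per-step time sits well above the median across all ranks. It is not Mission Control's implementation; the function name `find_stragglers`, the `threshold` parameter, and the sample timings are hypothetical.

```python
from statistics import median


def find_stragglers(step_times: dict[int, float], threshold: float = 1.5) -> list[int]:
    """Return the ranks whose step time exceeds `threshold` times the median.

    `step_times` maps a rank ID to its most recent step duration in seconds.
    A simple median comparison is assumed here purely for illustration.
    """
    typical = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * typical)


# Hypothetical per-rank step times (seconds) for one synchronized training step.
sample = {0: 1.00, 1: 1.02, 2: 0.99, 3: 1.65}
print(find_stragglers(sample))  # [3]
```

In synchronized training every rank waits for the slowest one, so a single outlier like rank 3 above sets the pace for the whole job.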

What performance and reliability outcomes has SUNK demonstrated?

SUNK is designed to maximize productive training time with high training goodput, a high effective training time ratio (ETTR), and improved reliability in benchmark scenarios. Published proof points include up to 96% training goodput, 97–98% ETTR, 10× longer mean time to failure (MTTF), and more than 50% model FLOPs utilization (MFU) at large scale.
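For readers unfamiliar with these metrics, the sketch below shows how ETTR, MTTF, and MFU are commonly computed. Definitions vary between publications, so this is not necessarily how the figures above were measured. The function names, the 6 × parameters FLOPs-per-token approximation for dense transformer training, and every sample number are assumptions for illustration only.

```python
def ettr(productive_seconds: float, wall_clock_seconds: float) -> float:
    """Effective training time ratio: share of wall-clock time spent making progress."""
    return productive_seconds / wall_clock_seconds


def mttf(total_run_hours: float, failure_count: int) -> float:
    """Mean time to failure: average running hours per failure."""
    return total_run_hours / failure_count


def mfu(tokens_per_second: float, parameter_count: float,
        gpu_count: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over theoretical peak FLOP/s.

    Assumes roughly 6 FLOPs per parameter per token (forward plus backward pass),
    a common approximation that ignores attention cost.
    """
    achieved = 6 * parameter_count * tokens_per_second
    peak = gpu_count * peak_flops_per_gpu
    return achieved / peak


# Hypothetical run: 100 hours of wall-clock time, 97.5 of them productive.
print(ettr(351_000, 360_000))  # 0.975
# Hypothetical run: 240 hours with 3 failures.
print(mttf(240, 3))  # 80.0
# Hypothetical 70B-parameter model at 1M tokens/s on 1,000 GPUs of 1 PFLOP/s peak each.
print(mfu(1_000_000, 70_000_000_000, 1000, 1e15))  # 0.42
```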

See what SUNK can do for you

Experience the resource flexibility your teams need to build, train, and deploy new models.