SUNK is designed for AI research teams running the most demanding training workloads—where job duration, scale, and failure tolerance make reliability and predictability as critical as raw performance. SUNK delivers a production-ready, researcher-first training system that abstracts infrastructure complexity while preserving the Slurm workflows researchers rely on.
Now you can spin up a SUNK cluster with SUNK Self-Service. In one click, researchers and platform teams get a unified training system that can handle the most critical workloads, without the operational burden.
1
00:00:03,520 --> 00:00:06,360
Hi, I'm Deok, a PM here at CoreWeave.
2
00:00:06,520 --> 00:00:07,920
Let me tell you about some exciting
3
00:00:07,920 --> 00:00:10,080
things we're doing with SUNK.
4
00:00:10,080 --> 00:00:11,880
SUNK Self-Service
5
00:00:11,880 --> 00:00:12,920
turns spinning up
6
00:00:12,920 --> 00:00:14,040
a Slurm-on-Kubernetes
7
00:00:14,040 --> 00:00:14,640
cluster
8
00:00:14,640 --> 00:00:16,320
from a week of infrastructure
9
00:00:16,320 --> 00:00:18,200
work into a few clicks.
10
00:00:18,200 --> 00:00:19,440
With only one click,
11
00:00:19,440 --> 00:00:20,160
all your nodes
12
00:00:20,160 --> 00:00:23,240
magically flow into Slurm, IAM users get
13
00:00:23,360 --> 00:00:24,240
SSH access,
14
00:00:24,240 --> 00:00:25,000
the control plane
15
00:00:25,000 --> 00:00:26,320
is automatically right-sized
16
00:00:26,320 --> 00:00:27,240
to your needs,
17
00:00:27,240 --> 00:00:28,560
and a shared file system spans
18
00:00:28,560 --> 00:00:30,120
your whole cluster.
19
00:00:30,120 --> 00:00:31,720
One click to get a production
20
00:00:31,720 --> 00:00:33,120
research cluster.
21
00:00:33,120 --> 00:00:34,880
Users can change the default
22
00:00:34,880 --> 00:00:36,440
to fit their needs:
23
00:00:36,440 --> 00:00:37,800
You can safely manage access
24
00:00:37,800 --> 00:00:38,400
through IAM
25
00:00:38,400 --> 00:00:39,360
groups, reduce
26
00:00:39,360 --> 00:00:42,360
CPU costs by turning off login pods,
27
00:00:42,360 --> 00:00:43,520
or drop into the YAML
28
00:00:43,520 --> 00:00:45,040
for advanced configs.
29
00:00:45,040 --> 00:00:46,160
There's no Helm charts,
30
00:00:46,160 --> 00:00:47,840
and there's no waiting.
31
00:00:47,840 --> 00:00:48,880
CoreWeave manages
32
00:00:48,880 --> 00:00:49,680
the end-to-end
33
00:00:49,680 --> 00:00:51,720
SUNK cluster lifecycle
34
00:00:51,720 --> 00:00:54,000
with automated upgrades and patches,
35
00:00:54,000 --> 00:00:55,520
so you can get the Slurm experience
36
00:00:55,520 --> 00:00:56,680
your researchers expect
37
00:00:56,680 --> 00:00:59,360
without owning the operational burden.
38
00:00:59,360 --> 00:01:01,040
Customers can also deploy SUNK
39
00:01:01,040 --> 00:01:03,600
through a Kubernetes custom resource.
40
00:01:03,600 --> 00:01:05,600
You edit the CR directly
41
00:01:05,600 --> 00:01:07,600
for advanced workflows
42
00:01:07,600 --> 00:01:09,120
so you can change things
43
00:01:09,120 --> 00:01:10,720
like your Slurm configurations,
44
00:01:10,720 --> 00:01:12,800
QOS settings, partitions.
45
00:01:12,800 --> 00:01:14,680
And because it's just a CR,
46
00:01:14,680 --> 00:01:16,360
it drops right into your existing
47
00:01:16,360 --> 00:01:17,640
GitOps workflow,
48
00:01:17,640 --> 00:01:18,760
so you can use a tool
49
00:01:18,760 --> 00:01:19,360
like Argo
50
00:01:19,360 --> 00:01:20,760
CD or whatever continuous
51
00:01:20,760 --> 00:01:22,720
delivery thing you use.
52
00:01:22,720 --> 00:01:23,720
You keep everything
53
00:01:23,720 --> 00:01:25,880
customers love about SUNK:
54
00:01:25,880 --> 00:01:27,000
the SUNK pod scheduler
55
00:01:27,000 --> 00:01:28,480
for unifying workloads
56
00:01:28,480 --> 00:01:30,600
like inference, sandboxes, and training
57
00:01:30,600 --> 00:01:32,400
in the same cluster,
58
00:01:32,400 --> 00:01:34,520
driving up the utilization.
59
00:01:34,520 --> 00:01:35,520
You can also use
60
00:01:35,520 --> 00:01:36,600
the same prebuilt
61
00:01:36,600 --> 00:01:38,120
dashboards and custom metrics
62
00:01:38,120 --> 00:01:40,440
with our deep observability capabilities,
63
00:01:40,440 --> 00:01:41,280
and you get the benefit
64
00:01:41,280 --> 00:01:42,560
of the CoreWeave integration
65
00:01:42,560 --> 00:01:44,240
with health checks and burn-in tests.
66
00:01:45,240 --> 00:01:45,960
Finally, you get
67
00:01:45,960 --> 00:01:47,920
optimized job performance
68
00:01:47,920 --> 00:01:50,160
with topology-aware scheduling.
69
00:01:50,160 --> 00:01:52,840
So it's the same SUNK, same scheduler
70
00:01:52,840 --> 00:01:54,360
your researchers trust,
71
00:01:54,360 --> 00:01:56,400
but now self-service
72
00:01:56,400 --> 00:01:57,960
and managed from day one.