Training Your Own ChatGPT on DGX Spark: Getting nanochat to Work with CUDA 13.0

The $100 ChatGPT Challenge

When Andrej Karpathy released nanochat with the tagline “The best ChatGPT that $100 can buy,” I knew I had to try it. The promise is compelling: train your own ChatGPT-like model from scratch - including tokenization, pretraining, finetuning, evaluation, and web serving - all in a clean, hackable codebase under 8,500 lines of code.

I have an NVIDIA DGX Spark workstation with an NVIDIA GB10 GPU running Ubuntu 24.04 and CUDA 13.0. Naturally, I wanted to see if I could get nanochat running locally. Spoiler alert: it worked, and after 261 hours of training, I have a working ChatGPT-like model!

The Hardware Reality Check

Here’s what I’m working with:

The challenge? The upstream nanochat repository is built for CUDA 12.8, which is what most cloud GPU providers (Lambda, AWS, etc.) are running. CUDA 13.0 is newer, and while you’d think “newer is better,” in the deep learning world, bleeding edge often means broken dependencies.

What is nanochat?

Before diving into the fixes, let me explain why nanochat is special. Unlike massive frameworks like Transformers or megarepos with thousands of configuration options, nanochat is designed to be:

The speedrun.sh script trains a competent 1.9 billion parameter model in about 4 hours on an 8xH100 node for roughly $100. The resulting model outperforms GPT-2 (2019) while being completely yours to modify and experiment with.

The CUDA 13.0 Problem

When I first cloned the repo and tried to run bash speedrun.sh, I immediately hit dependency issues. The errors were the usual suspects:

The root cause: nanochat’s pyproject.toml pulls PyTorch from the pytorch-cu128 index, which expects CUDA 12.8.

The Fix: Four Key Changes

I had to make coordinated changes across the dependency stack. Here’s what worked:

1. Upgrade PyTorch to 2.9.0 with CUDA 13.0 Support

In pyproject.toml, I updated the PyTorch version and changed the CUDA build target:

# Updated dependencies
dependencies = [
    # ... other deps ...
    "numpy==1.26.4",      # Pin to stable version
    "torch>=2.9.0",        # Bumped from 2.8.0
    "triton>=3.5.0",       # Explicitly add Triton
    # ... other deps ...
]

Then I updated the PyTorch source indices:

# Changed from pytorch-cu128 to pytorch-cu130
[[tool.uv.index]]
name = "pytorch-cu130"
url = "https://download.pytorch.org/whl/cu130"
explicit = true

[project.optional-dependencies]
gpu = [
    "torch>=2.9.0",  # Now pulls from cu130 index
]

Why pin NumPy to 1.26.4? I found that NumPy 2.x had breaking changes that caused subtle issues with PyTorch tensor operations. Pinning to 1.26.4 eliminated those headaches.

Why explicitly add Triton? Triton is PyTorch’s JIT compiler for GPU kernels. It needs to match your CUDA version, and making it explicit prevents version conflicts.
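
Before committing to a multi-day run, it's worth confirming that the resolved versions actually line up. A quick check along these lines (a minimal sketch, not part of nanochat) does the trick:

# check_versions.py - a minimal sketch (not part of nanochat) to confirm the resolved versions
import numpy
import torch
import triton

print("numpy  :", numpy.__version__)    # expect 1.26.4
print("torch  :", torch.__version__)    # expect 2.9.0+cu130 or newer
print("triton :", triton.__version__)   # expect 3.5.x
print("cuda   :", torch.version.cuda)   # expect 13.0

Running it with uv run python check_versions.py ensures it sees the project's locked environment rather than whatever Python is on your PATH.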

2. Point Everything at CUDA 13.0

In speedrun.sh, I added environment variables to make sure all tools know where CUDA lives:

#!/bin/bash

# ... existing setup code ...

# Point to CUDA 13.0 installation
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}

# Triton needs to know where the CUDA assembler lives
export TRITON_PTXAS_PATH=/usr/local/cuda-13.0/bin/ptxas

# ... rest of speedrun script ...

These variables ensure that both the build tools and the runtime resolve to the CUDA 13.0 toolkit: CUDA_HOME and PATH point compilers and ptxas at the right install, LD_LIBRARY_PATH lets the loader find the matching CUDA libraries, and TRITON_PTXAS_PATH makes Triton compile kernels with the CUDA 13.0 assembler instead of its bundled copy.
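
Once those exports are in place, a quick check along these lines (a minimal sketch, not something the speedrun script does) confirms that PyTorch and Triton can actually see the right toolchain:

# gpu_check.py - a minimal sketch (not part of the speedrun) to confirm the toolchain is visible
import os
import shutil
import torch

print("CUDA available   :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device           :", torch.cuda.get_device_name(0))   # should report the GB10
print("CUDA_HOME        :", os.environ.get("CUDA_HOME"))
print("ptxas on PATH    :", shutil.which("ptxas"))                # expect /usr/local/cuda-13.0/bin/ptxas
print("TRITON_PTXAS_PATH:", os.environ.get("TRITON_PTXAS_PATH"))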

3. Adjust for Single GPU

The upstream speedrun.sh assumes you have 8 GPUs. My DGX Spark has one GB10, so I changed:

# Number of processes/GPUs to use
NPROC_PER_NODE=1  # Changed from 8

The beauty of nanochat’s implementation is that it automatically adapts. With fewer GPUs, it does more gradient accumulation steps - essentially trading parallel processing for sequential processing. You’ll wait longer, but you’ll get the same results.
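
To make that trade-off concrete, the bookkeeping works roughly like the sketch below (illustrative numbers and variable names, not nanochat's actual code):

# Illustrative sketch: how a fixed token budget per optimizer step is preserved
# when the GPU count drops. Values are examples, not nanochat's exact settings.
total_batch_size = 524288        # target tokens per optimizer step (example value)
device_batch_tokens = 65536      # tokens one GPU processes per forward/backward pass (example value)

def grad_accum_steps(num_gpus: int) -> int:
    # One optimizer step always covers total_batch_size tokens; with fewer
    # GPUs, each step simply needs more sequential passes.
    tokens_per_pass = device_batch_tokens * num_gpus
    assert total_batch_size % tokens_per_pass == 0
    return total_batch_size // tokens_per_pass

print(grad_accum_steps(8))   # 1 -> eight GPUs cover the batch in a single pass
print(grad_accum_steps(1))   # 8 -> one GPU loops eight times before each weight update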

4. Regenerate the Lock File

After changing pyproject.toml, I regenerated the dependency lock file:

uv lock

This pulled in the CUDA 13.0-compatible builds of PyTorch and its NVIDIA runtime dependencies (nvidia-cublas-cu13, nvidia-cudnn-cu13, etc.) and resolved everything correctly.

Does It Actually Work?

Yes! After these changes, I could run:

bash speedrun.sh

The training kicked off successfully. Here’s what the actual training looked like on the DGX Spark:

[Screenshot: nanochat training run on the DGX Spark - status as of Nov 12, 08:56 Eastern Time]

As you can see from the screenshot, training was progressing steadily through the iterations: the loss decreased, the learning rate followed its schedule, and gradient norms stayed stable around 1.0 - all good signs.

Training Complete: The Results

After 261 hours (nearly 11 days) of continuous training, the full pipeline completed successfully! Here’s what the model achieved:

Model Specifications

I trained a 560 million parameter model (smaller than the default 1.9B to fit my single GPU setup better). The training consumed:

Tokenizer Performance

The custom-trained tokenizer showed solid performance compared to established tokenizers:

vs GPT-2:

vs GPT-4:

This is impressive for a tokenizer trained in under 50 seconds!
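
If you want to run this kind of comparison yourself, the idea is simply to tokenize the same text with each tokenizer and compare bytes per token. Here's a rough sketch using the tiktoken package (pip install tiktoken) for the GPT-2 and GPT-4 side; it's illustrative, not how nanochat computes its own report:

# Rough tokenizer comparison: bytes per token on the same text sample.
import tiktoken

# Swap in a larger, representative text sample for a meaningful comparison.
text = "The quick brown fox jumps over the lazy dog. " * 200
n_bytes = len(text.encode("utf-8"))

gpt2 = tiktoken.get_encoding("gpt2")          # GPT-2 tokenizer
gpt4 = tiktoken.get_encoding("cl100k_base")   # GPT-4 tokenizer

print(f"bytes per GPT-2 token: {n_bytes / len(gpt2.encode(text)):.2f}")
print(f"bytes per GPT-4 token: {n_bytes / len(gpt4.encode(text)):.2f}")
# n_mine = len(my_tokenizer.encode(text))     # plug in your trained tokenizer here
# Fewer tokens (more bytes per token) on the same text means better compression.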

Base Model Results

After 21,400 training iterations, the base model achieved:

Key benchmark scores:

Supervised Fine-Tuning Results

After SFT on 22,439 conversation examples, the model improved across the board:

Benchmark     | Mid-training | Post-SFT | Improvement
--------------|--------------|----------|------------
ChatCORE      | 0.2317       | 0.2482   | +7.1%
ARC-Easy      | 0.4053       | 0.4386   | +8.2%
ARC-Challenge | 0.2952       | 0.3106   | +5.2%
MMLU          | 0.3221       | 0.3292   | +2.2%
GSM8K         | 0.0326       | 0.0591   | +81%
SpellingBee   | 0.9883       | 0.9922   | +0.4%

The SFT phase completed in just 701 iterations (1 epoch), showing that even a relatively small amount of high-quality instruction data can meaningfully improve a pretrained model’s capabilities.
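
For reference, the improvement column is just the relative gain over the mid-training score, which a couple of lines of Python reproduce:

# The improvement column is the relative gain over the mid-training score.
scores = {
    "ChatCORE": (0.2317, 0.2482),
    "GSM8K":    (0.0326, 0.0591),
}
for name, (mid, sft) in scores.items():
    print(f"{name}: {100 * (sft - mid) / mid:+.1f}%")   # ChatCORE: +7.1%, GSM8K: +81.3%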

What This Means

These results demonstrate a few important things:

  1. You can train competitive models on modest hardware - A single GB10 GPU can train a functional language model; it just takes patience (11 days vs 4 hours on 8xH100s).

  2. Small models can be surprisingly capable - At 560M parameters, this model is about 1/3 the size of GPT-2, yet shows reasonable performance across diverse tasks.

  3. SFT provides consistent improvements - The supervised fine-tuning phase showed gains across nearly all benchmarks, particularly in reasoning tasks like GSM8K.

  4. The full pipeline works - From raw text to a conversational model, every stage of nanochat executed successfully on CUDA 13.0.

Sample Output

Here are a few samples from the final model:

User: The capital of France is
Model: Paris. It is the largest city in France and the second largest city in Europe

User: The chemical symbol of gold is
Model: Au. It is a soft, malleable, ductile, and corrosion-resistant metal.

User: If yesterday was Friday, then tomorrow will be
Model: Saturday. If tomorrow is Friday, then tomorrow will be Saturday.

User: The planets of the solar system are:
Model: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,

As you can see, the model has solid factual recall and completes common prompts reasonably well, though the day-of-week prompt trips it up (the correct answer is Sunday). It's far from GPT-4, but it's genuinely impressive considering it was trained from scratch in my office!

Serving the Model

Now that training is complete, I can run the model locally:

python -m scripts.chat_web

And interact with my own tiny ChatGPT at http://localhost:8000!

The model loads quickly and responds in real-time. While it makes mistakes and has limitations, there’s something magical about chatting with a model you trained yourself with full understanding of every line of code that created it.

What I Learned

CUDA version compatibility is no joke. Every piece of the stack - PyTorch, Triton, CUDA drivers, runtime libraries - needs to align. One mismatch and nothing works.

Environment variables solve runtime mysteries. Half my debugging time was saved by properly setting CUDA_HOME, LD_LIBRARY_PATH, and TRITON_PTXAS_PATH.

Dependency pinning prevents future headaches. NumPy 1.26.4 might not be the latest, but it’s stable and works. Chasing the newest version isn’t always worth it.

nanochat is impressively robust. Despite being designed for 8 GPUs, it gracefully scaled down to my single GPU without manual gradient accumulation tuning. The training code automatically compensates for fewer GPUs.

Patience pays off in ML. 11 days of training gave me not just a working model, but deep appreciation for the compute requirements of modern AI and the engineering that goes into making it efficient.

Small models have their place. While everyone chases bigger models, a well-trained 560M parameter model can be surprisingly useful and is much easier to understand, modify, and serve.
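
As a back-of-the-envelope illustration of that last point: in bfloat16, the weights of a 560M-parameter model fit in a little over a gigabyte (ignoring the KV cache and activation memory):

# Rough weight-memory estimate for a 560M-parameter model served in bfloat16.
params = 560_000_000
bytes_per_param = 2          # bfloat16 stores each weight in 2 bytes
print(f"{params * bytes_per_param / 1e9:.2f} GB")   # ~1.12 GB of weights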

The nanochat Experience

Training a model from scratch - even a small one - is eye-opening. You realize how much data these models need, how sensitive they are to hyperparameters, and how much compute goes into even “toy” models.

The resulting model is charming in its limitations. It hallucinates, makes silly mistakes, and sometimes gives hilariously wrong answers - like a smart kindergartener. But it’s yours. You understand every line of code that created it. You can modify the training data, tweak the architecture, or experiment with different optimization strategies.

For anyone serious about understanding modern LLMs, nanochat is an invaluable educational tool. The complete transparency - from raw bytes to conversational AI - is something you simply can’t get from using commercial APIs or even most open-source frameworks.

Code Changes Summary

If you’re running CUDA 13.0, here are the files you’ll need to modify:

pyproject.toml: bump torch to >=2.9.0, pin numpy to 1.26.4, add triton>=3.5.0 explicitly, and switch the uv index from pytorch-cu128 to pytorch-cu130.

speedrun.sh: export CUDA_HOME, PATH, LD_LIBRARY_PATH, and TRITON_PTXAS_PATH for /usr/local/cuda-13.0, and set NPROC_PER_NODE to match your GPU count (1 in my case).

Then regenerate:

uv lock

You can see all my changes in my fork: github.com/gavi/nanochat

What’s Next?

Now that the base model is trained, I’m planning to:

The beauty of nanochat is that I now have a fully functional baseline model to experiment with. Every modification I make will teach me something about how LLMs work.

Final Thoughts

Getting nanochat running on CUDA 13.0 took a few hours of dependency wrangling, and training took 261 hours, but it was absolutely worth it. If you’re on DGX hardware or any system running CUDA 13.0, these changes should save you the debugging time.

More importantly, nanochat itself is a gem. In an era of massive, opaque AI frameworks, having a complete, understandable, hackable LLM training pipeline is refreshing. Whether you’re learning how LLMs work or want to experiment with your own models, nanochat is an excellent starting point.

The journey from “bash speedrun.sh” to a working conversational AI was educational at every step. I now understand tokenization, transformer architectures, optimization strategies, and evaluation metrics at a much deeper level than any tutorial could teach.

Now go train your own ChatGPT!


Resources:

Update Log:

By: Gavi Narra on: