Getting reproducible training results with Fast.ai + PyTorch

Mark Pors

2025-05-25

Getting reproducible PyTorch/Fast.ai training results requires more than just seeding random number generators. Reproducibility is affected by three factors:

  1. seeding RNGs,
  2. using deterministic algorithms, and
  3. recreating an identical starting state.

Key insights:

  • DataLoaders maintain internal state that persists between training runs. You must recreate DataLoaders before each training run, not just reseed.
  • DataLoader workers need to be seeded before each run.

Minimal setup for reproducibility:

  • Set torch.backends.cudnn.deterministic = True
  • Recreate DataLoaders between runs
  • Basic RNG seeding and DataLoader worker seeding

While training an image classifier, I became a bit annoyed by the fact that subsequent runs gave different results, even though I didn’t change anything.

How can we conduct experiments if we don’t know whether the one parameter we modified caused the change, or whether something else under the hood was responsible?

Why this matters

This cost me days of debugging. When you’re tuning hyperparameters or comparing model architectures, you need to know whether that 2% accuracy improvement is real or just random variance. Without reproducibility, you’re flying blind.

Jump to Conclusion

What have I tried?

When I did my experiments, I used this at the start of my notebooks:

import os
import random

import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)                          # Python's built-in RNG
    os.environ['PYTHONHASHSEED'] = str(seed)   # hash randomization
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # PyTorch RNG (CPU and CUDA)
    torch.cuda.manual_seed_all(seed)           # all CUDA devices
    torch.cuda.manual_seed(seed)               # current CUDA device (redundant after the line above)
    torch.backends.cudnn.deterministic = True  # ask cuDNN for deterministic kernels

seed_everything(42)

And in the DataBlock, I provided a seed for the splitter like:

    splitter=RandomSplitter(valid_pct=0.2, seed=42),
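For context, here is a minimal sketch of where that splitter line typically sits inside a Fast.ai DataBlock; the surrounding pieces (get_image_files, parent_label, Resize(224), the path variable) are generic placeholders, not taken from my actual pipeline:

from fastai.vision.all import *

dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # fixed seed for the train/valid split
    item_tfms=Resize(224),
)
dls = dblock.dataloaders(path)  # `path` points at the image folder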

I picked up these snippets in the Fast.ai course, but I don’t really know what each of these statements does in detail. I understand the general idea: ensure that when a random number is generated, it starts from the same seed, so that the random number is consistent each time. Like:

import random

print("Without seed:", random.randint(1, 100), random.randint(1, 100), random.randint(1, 100))
random.seed(42)
print("With seed 42:", random.randint(1, 100), random.randint(1, 100), random.randint(1, 100))
random.seed(42)
print("With seed 42 again:", random.randint(1, 100), random.randint(1, 100), random.randint(1, 100))
Without seed: 15 60 38
With seed 42: 81 14 3
With seed 42 again: 81 14 3

Unfortunately, it is not that simple, as proven by the fact that it didn’t work for me. Here is an experiment showing that the loss numbers are not reproducible despite seeding “everything”: Getting reproducible results with Fast.ai / PyTorch - Attempt #1.

Perhaps a good time to read the documentation.

RTFM

Fast.ai docs

The current Fast.ai docs have no section covering reproducibility, but the (now deprecated) version 1 docs did: Getting reproducible results:

In some situations you may want to remove randomness for your tests. To get identical reproducible results, you’ll need to set num_workers=1 (or 0) in your DataLoader/DataBunch, and, depending on whether you are using torch’s random functions, or python’s (numpy), or both:

seed = 42

# python RNG
import random
random.seed(seed)

# pytorch RNGs
import torch
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True
if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

# numpy RNG
import numpy as np
np.random.seed(seed)

The Python and numpy parts speak for themselves. I will dive into the PyTorch statements in a moment.

If we compare this code with what I did in my first attempt, the only thing that is new is num_workers=1 for dataloaders.
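In Fast.ai that just means passing num_workers when building the DataLoaders. A minimal sketch, assuming the DataBlock from earlier is called dblock:

dls = dblock.dataloaders(path, num_workers=1)  # a single worker, as the old docs suggest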

I tried that, and nope: still not reproducible.

The current docs don’t have a section like the one above, but there is a set_seed function (a Fast.ai utility, not part of plain PyTorch). Let’s have a look at the source code:

def set_seed(s, reproducible=False):
    "Set random seed for `random`, `torch`, and `numpy` (where available)"
    try: torch.manual_seed(s)
    except NameError: pass
    try: torch.cuda.manual_seed_all(s)
    except NameError: pass
    try: np.random.seed(s%(2**32-1))
    except NameError: pass
    random.seed(s)
    if reproducible:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
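A minimal usage sketch; set_seed lives in fastai.torch_core and, as far as I can tell, is also pulled in by the usual from fastai.vision.all import * star import:

from fastai.torch_core import set_seed

set_seed(42, reproducible=True)  # reproducible=True also sets the two cudnn flags shown above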

So, mostly the same stuff, except: torch.backends.cudnn.benchmark = False. Which leads us to the PyTorch docs…

PyTorch docs

Luckily, there is a note about reproducibility in the PyTorch docs: Reproducibility.

There is a warning at the top of the document that I will repeat here.

Warning

Deterministic operations are often slower than nondeterministic operations, so single-run performance may decrease for your model. However, determinism may save time in development by facilitating experimentation, debugging, and regression testing.

So it might be worth making everything deterministic during our early experiments (with a smaller dataset and a smaller base model), and then switching to nondeterministic once we are confident we are on the right track.

Sounds great! But first we have to get the deterministic setup working at all…

The PyTorch doc consists of three sections. Let’s go through them one by one.

1. Controlling sources of randomness

Apart from reiterating the assignment of a fixed seed for Python, NumPy, and PyTorch, it also discusses the cudnn.benchmark feature we saw earlier. cuDNN is an NVIDIA library built on top of CUDA that accelerates neural-network training. With benchmarking enabled, it tries out a couple of algorithm implementations on the first batches (that’s the benchmarking) and picks the fastest one for the rest of the training. Since the winner can vary between runs, setting cudnn.benchmark to False should make this selection deterministic.
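Putting that section of the PyTorch docs together gives a sketch like this; nothing here is new compared to seed_everything above, except the explicit benchmark flag:

import random

import numpy as np
import torch

seed = 42
random.seed(seed)                       # Python RNG
np.random.seed(seed)                    # NumPy RNG
torch.manual_seed(seed)                 # PyTorch RNG (CPU and CUDA)
torch.backends.cudnn.benchmark = False  # don't let cuDNN auto-tune its algorithms between runs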

What happened when I tried this?

-> Still no reproducible results!

2. Avoiding nondeterministic algorithms

Now it gets interesting! PyTorch provides a method that might solve our problems: torch.use_deterministic_algorithms(True).

This call instructs PyTorch to use the deterministic variant of each operation where one exists, and to throw an exception when it doesn’t.

And that is exactly what I got, an exception:

RuntimeError                              Traceback (most recent call last)
<ipython-input-12-25c053374517> in <cell line: 0>()
      1 learn = vision_learner(dls, resnet18, metrics=error_rate)
----> 2 learn.fine_tune(3)

21 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    123 
    124     def forward(self, input: Tensor) -> Tensor:
--> 125         return F.linear(input, self.weight, self.bias)
    126 
    127     def extra_repr(self) -> str:

RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility

It didn’t happen in some exotic part of the library, but at the most elemental level that even I understand: return F.linear(input, self.weight, self.bias).

The error message is super specific and helpful:

Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility

Let’s see what version of CUDA we are using on Colab:

print(torch.version.cuda)
12.4

Following the advice and setting this at the top of the notebook

os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'   # or ':16:8'

got me over this hurdle!

Tip

A plain “Restart session and run all” is not enough after introducing this environment variable. You have to actually disconnect from the Colab runtime, connect again, and run all cells.

We got a step further, but another error popped up, now in the backward pass:

RuntimeError                              Traceback (most recent call last)
<ipython-input-11-af2d886d9870> in <cell line: 0>()
      1 print(os.environ['CUBLAS_WORKSPACE_CONFIG'])
      2 learn = vision_learner(dls, resnet18, metrics=error_rate)
----> 3 learn.fine_tune(3)

22 frames
/usr/local/lib/python3.11/dist-packages/torch/autograd/graph.py in _engine_run_backward(t_outputs, *args, **kwargs)
    821         unregister_hooks = _register_logging_hooks_on_whole_graph(t_outputs)
    822     try:
--> 823         return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    824             t_outputs, *args, **kwargs
    825         )  # Calls into the C++ engine to run the backward pass

RuntimeError: adaptive_max_pool2d_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.

Oh oh, bad news. We have no choice now but to set the warn_only flag to True on the use_deterministic_algorithms call. This indeed allows the fine-tuning run to finish, but the results are still not reproducible :(
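For reference, the call then looks like this; operations without a deterministic implementation (such as adaptive_max_pool2d_backward_cuda above) now only emit a warning instead of raising:

torch.use_deterministic_algorithms(True, warn_only=True)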

3. DataLoader

Finally, the PyTorch reproducibility doc mentions the possibility of seeding the DataLoader workers identically. As we previously used only one worker, I assumed this wouldn’t have any impact on our case. And guess what: it doesn’t [1]!

Here is the notebook where I did all the experiments as recommended in the docs: Getting reproducible results with Fast.ai / PyTorch - Attempt #2. BTW, the Colab GPU was an L4.

Speaking of GPUs: would we get reproducible results if we ran on a CPU? The answer is NO (I tried) [2].

Let’s get some help (hello Claude)

The documentation didn’t really help: all the tricks to keep everything deterministic failed. The remaining nondeterminism could be in the backward pass, in which case we don’t have much hope of resolving it. But how do people train their models if they can’t run experiments where just one hyperparameter changes?

Anyway, we are not giving up! Let’s ask our friend the LLM.

Claude speaks:

When you create a DataLoader in PyTorch, it establishes its own internal random state that becomes decoupled from the global random seed. The DataLoader contains several stateful components that persist between training runs:

The DataLoader’s internal random number generator gets initialized when you first create it, and this generator maintains its own state independently of any subsequent global seed resets. Even though you call seed_everything(42) before each run, the DataLoader’s internal RNG has already advanced its state from the previous run and won’t reset back to its initial position.

The data shuffling mechanism also maintains state. Even with a fixed seed in your RandomSplitter, the underlying shuffling operations within the DataLoader can accumulate state changes that persist across runs, leading to subtle differences in the exact order that batches are presented to your model.

Ah, of course, state is the key thing here! Re-seeding the global random generators is all well and good, but a stateful object like the DataLoader keeps its own generator, which has already advanced during the previous run, so re-seeding alone has no effect on it!

So let’s apply that wisdom by setting the seeds and recreating the DataLoaders before starting a new training run…
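In Fast.ai terms, a minimal sketch of that pattern looks like this (dblock is the DataBlock from earlier; the key point is that dataloaders() is called again before every run):

seed_everything(42)             # reseed all the RNGs...
dls = dblock.dataloaders(path)  # ...and rebuild the DataLoaders from scratch
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)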

Victory, we have reproducible results now! Have a look at this winning notebook: Getting reproducible results with Fast.ai / PyTorch - Attempt #3.

Warning

Even with the most conservative setup, like in this notebook, there can be small differences, for example across hardware: I noticed this when changing the GPU model from L4 to A100. This is acknowledged in the PyTorch documentation, which does not guarantee reproducibility across different platforms or PyTorch releases.

There is just one problem with it: it has become really slow.

^H^H^H^H^H

For you younger kids: ^H is the backspace character in Unix-type terminals. In other words, let’s delete some of the things we have done to increase reproducibility at the cost of performance.

Observations

num_workers:

Changing from num_workers=0 to num_workers=12 [3] maintains reproducibility, but only if the workers are seeded as follows (adapted from the PyTorch docs):

import random

import numpy
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Each worker derives its seed from the main process, so re-seeding the
    # main process also re-seeds every worker on the next run.
    worker_seed = torch.initial_seed() % 2**32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

train_loader = DataLoader(
    train_dataset,                # the Dataset to load from
    batch_size=batch_size,
    num_workers=num_workers,
    worker_init_fn=seed_worker,   # seed Python and NumPy inside each worker
    generator=g,                  # fixed generator for shuffling/sampling
)

This also maintains reproducibility across sessions.

I’m very happy this works as it speeds up things significantly!

use_deterministic_algorithms:

Removing torch.use_deterministic_algorithms(True) also maintains reproducibility, both in the notebook and across sessions [4].

set_seed:

Leaving the second parameter of set_seed, reproducible, at False (the default) means set_seed no longer touches the cudnn flags; in our case the only thing dropped is the explicit torch.backends.cudnn.benchmark = False, which is PyTorch’s default anyway, while we keep setting cudnn.deterministic = True ourselves. This also maintains reproducibility, both in the notebook and across sessions.

cudnn.deterministic:

Setting torch.backends.cudnn.deterministic = False breaks reproducibility, so keeping it set to True is essential when running meaningful experiments. I hardly saw any performance degradation from setting it to True, but that might be different for other use cases.

Key findings: as long as you seed every DataLoader worker and keep torch.backends.cudnn.deterministic = True, you can crank num_workers back up, drop torch.use_deterministic_algorithms(True), and rely on the default set_seed(..., reproducible=False): reproducibility still holds across notebook sessions while you recover full training speed.

The notebook that has applied the above, and is fast and reproducible, is here: Getting reproducible results with Fast.ai / PyTorch - Attempt #4

Conclusion

Note

These conclusions probably don’t generalize to all training pipelines and base models, but they likely hold for convolutional networks like ResNet18. It might be different for transformer-based models, who knows. At least we know which knobs we can turn to get reproducible results.

Reproducible training comes down to three levers:

  1. Seed every RNG: Python random, NumPy, PyTorch, and each DataLoader worker (worker_init_fn + torch.Generator).
  2. Force deterministic kernels: torch.backends.cudnn.deterministic = True (and leave cuDNN benchmarking off while you experiment).
  3. Start from the same state: rebuild the DataLoaders before each fresh run so their internal samplers and generators are reset.

If you are using Fast.ai, use the Fast.ai notebook as a reference for reproducible training.

In the end it comes down to:

# First run
seed_everything(42)
g = torch.Generator()
g.manual_seed(42)
torch.backends.cudnn.deterministic = True   # only needs to be set once per session
dls = create_dataloaders(g)                 # builds fresh DataLoaders, seeding the workers as shown earlier
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)

# Next run(s): reseed AND recreate the DataLoaders
seed_everything(42)
g = torch.Generator()
g.manual_seed(42)
dls = create_dataloaders(g)
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(3)

If you prefer to just use PyTorch, use the PyTorch notebook (Claude generated this based on the Fast.ai version).

Once you move from experimentation to full-scale training, you can flip torch.backends.cudnn.deterministic = False to potentially regain speed. Just remember that it will sacrifice strict repeatability.
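A minimal sketch of that switch; re-enabling cuDNN benchmarking is my own addition here, on the assumption that you no longer care about run-to-run repeatability:

torch.backends.cudnn.deterministic = False  # allow faster, nondeterministic kernels again
torch.backends.cudnn.benchmark = True       # let cuDNN auto-tune algorithms (my addition, for extra speed)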

What’s next

Another interlude that got a bit out of hand. Time to go back to our image classifier and see if we can improve it. Read on…

Footnotes

  [1] As I found out later, it certainly does matter when using more than one worker, which is what you always want in order to have acceptable performance.

  [2] With what I know now, this is because I didn’t recreate the DataLoaders between runs. If we did that and used a CPU, the results would be reproducible.

  [3] This is the default in my case; it is calculated as min(16, os.cpu_count()).

  [4] This stays repeatable unless your model calls a layer that can’t guarantee identical results on the GPU; in that case, the numbers may shift a bit between runs, or PyTorch will throw a warning.