Checkpointing your Machine Learning

Checkpointing is a valuable skill when running machine learning code - it allows you to take a snapshot of your work in a particular state and save key information such as the current epoch, learning rate, and model weights. These checkpoints let you resume training from any saved point, or revert to an earlier epoch of training. This gives you the ability to return to an earlier training state with higher accuracy, or to run tests from a common starting state while tweaking hyper-parameters.
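As a minimal sketch of the idea, the snippet below saves and restores a training state (epoch, weights, learning rate) using Python's standard pickle module. The function names and checkpoint contents here are illustrative; in practice, libraries such as PyTorch or TensorFlow provide their own save/load helpers.

```python
import pickle

def save_checkpoint(path, epoch, weights, learning_rate):
    """Snapshot the training state so it can be restored later."""
    state = {"epoch": epoch, "weights": weights, "learning_rate": learning_rate}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore a previously saved training state."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Save a snapshot after epoch 5, then restore it.
save_checkpoint("checkpoint.pkl", epoch=5,
                weights=[0.1, -0.3, 0.7], learning_rate=0.01)
state = load_checkpoint("checkpoint.pkl")
```

From the restored state you can rebuild the model and continue training exactly where you left off.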

Checkpointing is also a vital skill when working on HPC systems, as it allows you to save your work when running long jobs. Being able to resume is important if you run out of wall-time before the job completes, or if other faults occur - it is the difference between restarting your entire training process and resuming part way through. Many machine learning libraries make it easy to checkpoint your code, and you can find a brief guide on the MASSIVE M3 documentation website.
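A common pattern on HPC systems is to check for an existing checkpoint at job start and resume from it, saving periodically during the loop. The sketch below assumes a hypothetical checkpoint file and a stand-in training step; the structure, not the details, is the point.

```python
import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint filename

def run_training(total_epochs=10, save_every=2):
    # Resume from the last checkpoint if a previous job was cut short,
    # otherwise start fresh from epoch 0.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"epoch": 0, "loss": None}

    for epoch in range(state["epoch"], total_epochs):
        state["loss"] = 1.0 / (epoch + 1)   # stand-in for a real training step
        state["epoch"] = epoch + 1
        if state["epoch"] % save_every == 0:  # checkpoint periodically
            with open(CKPT, "wb") as f:
                pickle.dump(state, f)
    return state

final_state = run_training()
```

If the scheduler kills the job mid-run, simply resubmitting it reruns this function and training continues from the last saved epoch rather than from scratch.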
