Job Resource Monitoring on HPCs

One challenge researchers face when moving from a local workstation to HPC is the transition from interactive programming to the black box of job submission. A particular point of obscurity when submitting jobs is which resources your machine learning code is actually using; this information is important because it lets you see whether your code is taking full advantage of resources like GPUs, or spending a large amount of time reading in data.

To combat this lack of visibility, we have put together some scripts for monitoring the resources a Python job uses, such as GPUs and CPUs, and a Jupyter Notebook to analyse the results. These scripts have been tested on MASSIVE M3 and should make it relatively easy to gain visibility over the resources used by Python jobs.
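As a rough illustration of the idea (not the scripts from the repository themselves), a monitoring process might periodically sample CPU usage and GPU utilisation alongside a running job and append the samples to a CSV file. The sketch below assumes `psutil` is available in the job's Python environment and that `nvidia-smi` is on the path; the file name and sampling interval are placeholders.

```python
# Minimal sketch of a resource-monitoring loop (illustrative only).
# Samples CPU usage via psutil and GPU utilisation via nvidia-smi,
# appending rows to a CSV file that can later be analysed in a notebook.
import csv
import subprocess
import time

import psutil  # assumed to be installed in the job's environment

LOG_FILE = "resource_log.csv"   # hypothetical output path
INTERVAL_SECONDS = 5            # hypothetical sampling interval


def gpu_utilisation() -> str:
    """Return utilisation (%) of the first GPU, or 'NA' if nvidia-smi fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip().splitlines()[0]
    except (OSError, subprocess.CalledProcessError):
        return "NA"


with open(LOG_FILE, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "gpu_percent"])
    while True:  # in practice, stop when the monitored job finishes
        writer.writerow([time.time(),
                         psutil.cpu_percent(interval=None),
                         gpu_utilisation()])
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```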

Output of the provided Jupyter Notebook, looking at GPU utilisation of a job. You can clearly see the GPU spike with each epoch of training.

Find them here: https://github.com/ML4AU/job-monitoring
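As a rough sketch of the kind of analysis the notebook performs, a log in the CSV format assumed above could be plotted with pandas and matplotlib; the column and file names here are placeholders, not the repository's actual ones.

```python
# Hypothetical notebook cell: plot GPU utilisation over the course of a job
# from the CSV produced by the monitoring sketch above.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("resource_log.csv")  # hypothetical log file name
df["elapsed_min"] = (df["timestamp"] - df["timestamp"].iloc[0]) / 60

plt.plot(df["elapsed_min"], pd.to_numeric(df["gpu_percent"], errors="coerce"))
plt.xlabel("Elapsed time (minutes)")
plt.ylabel("GPU utilisation (%)")
plt.title("GPU utilisation over the course of a job")
plt.show()
```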
