This page contains information on how to make your own program checkpointable and how to use checkpoints to resume your computation. Currently we do not provide means (e.g. BLCR) to checkpoint precompiled software.
Imagine your job has been running for several hours when an event occurs that aborts it. Such events can be:
- Exceeding the time limit
- Exceeding allocated memory
- Preemption by another job (can happen only in the 'gpu' partition)
- Node failure
If you never saved intermediate results, you can resubmit your job, but the computation will start from the beginning: you have lost valuable time and wasted valuable resources. Checkpointing your job means that you regularly save its state so that, should such an event occur, you can resume the computation from the last checkpoint.
How to Checkpoint
To checkpoint your own programs, you can simply add logic to your code that saves the state at regular intervals and resumes from the saved state when the job is restarted.
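The following is a minimal sketch of such logic in Python, not a definitive implementation. It assumes the state of the computation fits into a small dictionary that can be written as JSON. The file name `checkpoint.json`, the functions `save_checkpoint`, `load_checkpoint` and `run`, and the summation used as a stand-in for the real work are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Write the state to a temporary file first, then move it into place.

    If the job is aborted while writing, the previous checkpoint stays intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(n_steps, checkpoint_every=100, path=CHECKPOINT_FILE):
    """Run the computation, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # placeholder for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state, so a restart has nothing left to do
    return state["total"]
```

With this structure, resubmitting the same job after an abort continues from the last saved step instead of from the beginning. How often to checkpoint is a trade-off: frequent checkpoints cost I/O time, while infrequent ones mean more work is repeated after a restart.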