UBELIX

Checkpointing

Description

This page explains how to make your own program checkpointable and how to use checkpoints to resume a computation. We currently do not provide tools (e.g. BLCR) for checkpointing precompiled software.

Why Checkpointing?

Imagine your job has been running for several hours when an event occurs that aborts it. Such events include:

  • Exceeding the time limit
  • Exceeding allocated memory
  • Preemption by another job (possible only in the 'gpu' partition)
  • Node failure

If you never saved intermediate results, you can resubmit your job, but the computation will start from the beginning: you have lost valuable time and wasted valuable resources. Checkpointing means that you regularly save the state of your job so that you can resume the computation from the last checkpoint if such an event occurs.

How to Checkpoint

To checkpoint your own program, you can simply add logic to your code that saves its state at regular intervals and, when the job is restarted, resumes from the saved state.
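
As an illustration, the save-and-resume logic could look like the following minimal Python sketch. It is not a UBELIX-provided tool: the file name checkpoint.json, the function names, the shape of the state dictionary, and the summation standing in for real work are all hypothetical and chosen only for this example. The checkpoint is written to a temporary file first and then renamed, so that a job killed in the middle of a write does not leave a corrupted checkpoint behind.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical file name, adapt to your job


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Renaming within one directory replaces the old checkpoint in a single step.
    os.replace(tmp_path, path)


def run(path=CHECKPOINT_FILE, n_steps=1000, checkpoint_every=100):
    """Sum the integers below n_steps, resuming from the last checkpoint if present."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step            # placeholder for the real computation
        state["next_step"] = step + 1     # the step to start from after a restart
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)          # final state, so a rerun does no extra work
    return state
```

Calling run() in a freshly submitted job starts at step 0. If the job is aborted and resubmitted unchanged, the same call picks up at the step recorded in the last checkpoint, so at most checkpoint_every steps of work are repeated. How often to checkpoint is a trade-off between the cost of writing the state and the amount of work you are willing to redo.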
