Garbage collection

Domino provides configurable values that let you tune your cluster to balance performance against cost. The more idle volumes you allow, the more likely it is that a run can reuse a volume and avoid copying project files from the blob store. The trade-off is the cost of keeping additional idle persistent volumes (PVs).

By default, Domino will:

  • Limit the total number of idle PVs to 32. This can be adjusted by setting the following option in the central config: common com.cerebro.domino.computegrid.kubernetes.volume.maxIdle

  • Terminate any idle PV that has not been used within a configured number of days. This can be adjusted by setting the following option in the central config: common com.cerebro.domino.computegrid.kubernetes.volume.maxAge. The value is expressed in days. The default value is empty, which means idle PVs are never terminated because of their age. A value of 7d will terminate any idle PV that has not been used for seven days.
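To make the interaction between the two settings concrete, the following Python sketch models how a count limit and an age limit could combine. It is an illustration only, not Domino's implementation: the function names are hypothetical, and the choice to remove the least recently used volumes first when the count limit is exceeded is an assumption, since the eviction order is not stated here.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def parse_max_age(value: str) -> Optional[timedelta]:
    """Parse a day-based setting such as '7d'. An empty value means no age limit."""
    value = value.strip()
    if not value:
        return None
    match = re.fullmatch(r"(\d+)d", value)
    if match is None:
        raise ValueError(f"expected a value like '7d', got {value!r}")
    return timedelta(days=int(match.group(1)))


def select_idle_pvs_to_terminate(
    last_used: Dict[str, datetime],
    now: datetime,
    max_idle: int = 32,
    max_age: Optional[timedelta] = None,
) -> List[str]:
    """Return the names of idle PVs that exceed the age or count limit.

    last_used maps a (hypothetical) PV name to the time it was last used.
    """
    # Age limit: anything unused for longer than max_age is removed.
    expired = {
        name
        for name, used_at in last_used.items()
        if max_age is not None and now - used_at > max_age
    }
    # Count limit: keep only the max_idle most recently used survivors.
    # ASSUMPTION: least recently used volumes are removed first.
    survivors = sorted(
        (name for name in last_used if name not in expired),
        key=lambda name: last_used[name],
        reverse=True,
    )
    over_limit = survivors[max_idle:]
    return sorted(expired | set(over_limit))
```

With an empty maxAge value, only the count limit applies, which matches the default behavior described above.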

Salvaged volumes

When a user's job fails unexpectedly, Domino preserves the volume so that data can be recovered. After a workspace or job ends, claimed PVs are placed into one of the following states, indicated by the dominodatalab.com/volume-state label.

  • available

    If the run ends normally, the underlying PV will be available for future runs.

  • salvaged

    If the run fails, the underlying PV will not be eligible for reuse, and is held in this state to be salvaged.

Salvaged PVs will not be reused automatically by future workspaces or jobs, but they can be manually mounted to a workspace to recover work.
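Because the state is recorded as a label, volumes in a given state can be found with a standard Kubernetes label selector. The Python sketch below only builds the kubectl command line; it assumes the dominodatalab.com/volume-state label is set on the PV objects themselves, and the helper name is hypothetical.

```python
import shlex
from typing import List

VOLUME_STATE_LABEL = "dominodatalab.com/volume-state"
VALID_STATES = ("available", "salvaged")


def list_pvs_command(state: str) -> List[str]:
    """Build a kubectl command that lists PVs carrying the given state label."""
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {VALID_STATES}, got {state!r}")
    # -l applies a label selector of the form key=value.
    return ["kubectl", "get", "pv", "-l", f"{VOLUME_STATE_LABEL}={state}"]


if __name__ == "__main__":
    # Print the command so it can be reviewed before it is run.
    print(shlex.join(list_pvs_command("salvaged")))
```

Returning the command as a list rather than a single string keeps it safe to pass to subprocess.run without shell quoting issues.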

By default, Domino will:

  • Limit the total number of salvaged PVs to 64. This can be adjusted by setting the following option in the central config: common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged

  • Terminate any salvaged PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config: common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge

    The value is expressed in terms of days. The default value is seven days. A value of 14d will terminate any salvaged PV after fourteen days.

Recover a salvaged volume:
  1. Find the PV that was attached to your job or workspace, which will be in the Deployment logs for your job or workspace.

  2. Create a pod attached to the salvaged volume.

  3. Recover the files with whichever method is most convenient (scp, AWS CLI, kubectl cp, and so on).

This script will do Step 2 and will provide the appropriate commands in its output. Remember to delete the PVC and PV when you are finished; otherwise, these resources will continue to be used.
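For readers who want to see what Step 2 involves, here is a rough Python sketch that generates a PVC bound to a specific PV and a pod that mounts it. It is not the script mentioned above. The function name, the recover- naming prefix, the busybox image, the /salvaged mount path, and the ReadWriteOnce access mode are all assumptions for illustration. The storage class and size passed in must match those of the salvaged PV, and depending on the state of the PV its existing claim reference might need to be cleared before a new PVC can bind to it.

```python
import json
from typing import Any, Dict


def recovery_manifests(
    pv_name: str,
    namespace: str,
    storage_class: str,
    size: str,
    image: str = "busybox",
) -> Dict[str, Any]:
    """Build a Kubernetes List containing a PVC pinned to pv_name and a pod that mounts it."""
    name = f"recover-{pv_name}"  # hypothetical naming scheme
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # ASSUMPTION: must match the PV
            "storageClassName": storage_class,
            "volumeName": pv_name,  # request binding to this specific PV
            "resources": {"requests": {"storage": size}},
        },
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "recover",
                    "image": image,
                    # Keep the pod alive long enough to copy files out.
                    "command": ["sleep", "3600"],
                    "volumeMounts": [{"name": "salvaged", "mountPath": "/salvaged"}],
                }
            ],
            "volumes": [
                {"name": "salvaged", "persistentVolumeClaim": {"claimName": name}}
            ],
        },
    }
    return {"apiVersion": "v1", "kind": "List", "items": [pvc, pod]}


if __name__ == "__main__":
    # All argument values below are placeholders, not real cluster names.
    manifests = recovery_manifests("pv-example", "example-namespace", "example-class", "10Gi")
    print(json.dumps(manifests, indent=2))
```

The JSON output can be saved to a file and applied with kubectl apply -f. Once the pod is running, files can be copied out with a command of the form kubectl cp NAMESPACE/POD:/salvaged ./recovered, after which the pod, PVC, and PV should be deleted.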