Resolved

Decreased reliability for GPU workers that need to spawn large numbers of processes

May 9, 2024 at 7:00pm UTC

Resolved
May 10, 2024 at 8:00pm UTC

Summary:

GPU pods were being assigned a Process ID (PID) limit that was too low, which could cause unexpected failures when they launched more than 1024 processes.
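To see which PID limit a running container actually has, the limit can usually be read from inside it. The sketch below is a minimal check, assuming the limit is enforced through the Linux cgroup PID controller and that the controller files are visible at the standard cgroup v2 or v1 paths; the report does not state how the limit is applied, so both are assumptions. The helper names (`parse_pids_max`, `read_pid_limit`, `has_headroom`) and the 2048 figure are illustrative only.

```python
from pathlib import Path
from typing import Optional

# Candidate locations of the cgroup PID controller files inside a container.
# cgroup v2 exposes them directly under /sys/fs/cgroup; cgroup v1 uses a
# separate "pids" hierarchy. Which layout a given pod uses is an assumption.
CGROUP_DIRS = (Path("/sys/fs/cgroup"), Path("/sys/fs/cgroup/pids"))


def parse_pids_max(text: str) -> Optional[int]:
    """Parse the contents of a pids.max file; None means unlimited ("max")."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_pid_limit(dirs=CGROUP_DIRS):
    """Return (limit, current) from the first directory containing pids.max.

    limit is None when no limit is set. Raises FileNotFoundError when the
    controller files are not present in any candidate directory.
    """
    for d in dirs:
        max_file = d / "pids.max"
        if max_file.is_file():
            limit = parse_pids_max(max_file.read_text())
            current = int((d / "pids.current").read_text().strip())
            return limit, current
    raise FileNotFoundError("no pids.max found; PID controller may not be visible")


def has_headroom(limit: Optional[int], current: int, needed: int) -> bool:
    """True if `needed` additional tasks would still fit under the limit."""
    return limit is None or current + needed <= limit


if __name__ == "__main__":
    try:
        limit, current = read_pid_limit()
    except FileNotFoundError as exc:
        print(exc)
    else:
        shown = "max" if limit is None else limit
        print(f"pids.max={shown} pids.current={current}")
        # 2048 is an arbitrary example workload size, not a platform value.
        print("room for 2048 more tasks:", has_headroom(limit, current, 2048))
```

Note that the cgroup PID controller counts threads as well as processes, so a workload can reach the limit with fewer processes than the number suggests.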

Source of Bug:

A logic error introduced while adding AMD GPU vendor support.

Timeline

START: ~12:00 PDT 2024-05-09 (19:00 UTC)
END: ~13:00 PDT 2024-05-10 (20:00 UTC)

Suggested Actions by Category:

Serverless

This should resolve itself automatically once your workers scale to zero. Alternatively, force a scale to zero or deploy a new release; newly created workers will have the correct PID limit.

GPU Pods

You will need to stop and start the pod or reset the container.

Created
May 9, 2024 at 7:00pm UTC

Affected Products

GPU pods (NVIDIA, AMD)
GPU serverless