Downtime
Decreased reliability for GPU workers that need to spawn large numbers of processes
May 09 at 12:00pm PDT
Affected services
runpod.io/console/login
Resolved
May 10 at 01:00pm PDT
Summary:
GPU pods were assigned a Process ID (PID) limit that was too low, which could cause unexpected failures when a pod launched more than 1024 processes.
Source of Bug:
- Logic error introduced while adding AMD GPU vendor support.
Timeline
- START: ~12:00 PDT 2024-05-09
- END: ~13:00 PDT 2024-05-10
Suggested Actions by Category:
Serverless
This should resolve itself automatically once your workers scale to zero. Alternatively, force-scale to zero or deploy a new release: newly created workers will have the correct PID limit.
GPU Pods
You will need to stop and start the pod or reset the container.
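To confirm that a restarted pod or a new worker has picked up a higher limit, the PID limit can be read from inside the container. The following is a minimal sketch, not an official RunPod tool. It assumes the limit is enforced through the Linux cgroup pids controller and exposed at one of the usual `pids.max` paths; both the paths and the 1024 threshold (taken from the summary above) are assumptions about your environment.

```python
from pathlib import Path
from typing import Optional

# Candidate locations of the cgroup PID limit. Which one exists depends on
# whether the host uses cgroup v2 or v1; these paths are assumptions about
# how the container's cgroup filesystem is mounted.
PIDS_MAX_PATHS = (
    Path("/sys/fs/cgroup/pids.max"),       # cgroup v2
    Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1
)

# Threshold taken from the incident summary (failures above 1024 processes).
# The exact limit that was applied is not stated in the report.
SUSPECT_LIMIT = 1024


def parse_pids_max(raw: str) -> Optional[int]:
    """Return the limit as an int, or None when the kernel reports 'max' (unlimited)."""
    value = raw.strip()
    return None if value == "max" else int(value)


def read_pid_limit(paths=PIDS_MAX_PATHS) -> Optional[int]:
    """Read the first pids.max file that exists; raise if none is found."""
    for path in paths:
        if path.is_file():
            return parse_pids_max(path.read_text())
    raise FileNotFoundError("no pids.max file found; cgroup layout may differ")


if __name__ == "__main__":
    try:
        limit = read_pid_limit()
    except FileNotFoundError as exc:
        print(f"Could not determine PID limit: {exc}")
    else:
        if limit is None:
            print("PID limit: unlimited")
        elif limit <= SUSPECT_LIMIT:
            print(f"PID limit: {limit} (may still be affected; restart the pod)")
        else:
            print(f"PID limit: {limit}")
```

Run the script inside the pod or worker after restarting it. A value at or below 1024 suggests the container is still using the old limit.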
Created
May 09 at 12:00pm PDT
Incident Report: Decreased reliability for GPU workers that need to spawn large numbers of processes
Affected Products
- GPU pods (NVIDIA, AMD)
- GPU serverless