Summary:
GPU pods were being given a Process ID (PID) limit that was too low, which could cause unexpected failures when they launched more than 1024 processes.
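To check whether a running pod has a low PID limit, one option is to read the cgroup `pids.max` file from inside the container. The following is a minimal sketch, not an official diagnostic. It assumes a Linux container where the cgroup filesystem is mounted at `/sys/fs/cgroup`, and it assumes the faulty limit was at or below the 1024 figure mentioned above. The function names are hypothetical, and a limit enforced on a parent cgroup may not be visible from inside the container.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed cgroup locations; which one exists depends on the host's cgroup version.
CANDIDATE_PATHS = (
    "/sys/fs/cgroup/pids.max",       # cgroup v2
    "/sys/fs/cgroup/pids/pids.max",  # cgroup v1
)

# Assumption: taken from the ">1024 processes" figure in the summary.
AFFECTED_THRESHOLD = 1024


def parse_pids_max(text: str) -> Optional[int]:
    """Return the numeric PID limit, or None when the file says 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_pid_limit(paths: Iterable[str] = CANDIDATE_PATHS) -> Optional[int]:
    """Read the PID limit from the first candidate file that exists."""
    for candidate in paths:
        path = Path(candidate)
        if path.is_file():
            return parse_pids_max(path.read_text())
    raise FileNotFoundError("no pids.max file found in the assumed cgroup paths")


def looks_affected(limit: Optional[int]) -> bool:
    """True when a numeric limit is at or below the assumed threshold."""
    return limit is not None and limit <= AFFECTED_THRESHOLD


if __name__ == "__main__":
    try:
        limit = read_pid_limit()
    except FileNotFoundError as err:
        print(f"Could not determine PID limit: {err}")
    else:
        shown = "unlimited" if limit is None else limit
        print(f"PID limit: {shown}; possibly affected: {looks_affected(limit)}")
```

Running this inside a pod prints the limit it found and whether it falls at or below the assumed threshold. A result of "unlimited" or a value well above 1024 suggests the pod is not affected.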
Source of Bug:
- Logic error introduced while adding AMD GPU vendor support.
Timeline:
- START: ~12:00 PST 2024-05-09
- END: ~13:00 PST 2024-05-10
Suggested Actions by Category:
Serverless
This should resolve itself automatically if you allow your workers to scale to zero. Alternatively, force-scal...