Previous incidents

May 2024
May 09, 2024
1 incident

Decreased reliability for GPU workers that need to spawn large numbers of pro...

Downtime

Resolved May 10 at 01:00pm PDT

Summary:

GPU pods were assigned a Process ID (PID) limit that was too low, which could cause unexpected failures when a pod launched more than 1024 processes.
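As a rough way to check whether a container is subject to a PID limit like the one described above, the sketch below reads the Linux cgroup `pids.max` file from inside the pod. It assumes the pod exposes the standard cgroup v2 or v1 pids controller paths. The helper names (`parse_pids_max`, `read_pid_limit`, `limit_allows`) are hypothetical and are not part of any platform API.

```python
from pathlib import Path
from typing import Optional

# Standard Linux locations of the pids controller limit (cgroup v2, then v1).
# Assumption: the container mounts one of these; adjust for your environment.
CGROUP_PID_FILES = (
    Path("/sys/fs/cgroup/pids.max"),
    Path("/sys/fs/cgroup/pids/pids.max"),
)


def parse_pids_max(raw: str) -> Optional[int]:
    """Parse the contents of pids.max; None means no limit ("max")."""
    value = raw.strip()
    return None if value == "max" else int(value)


def read_pid_limit(candidates=CGROUP_PID_FILES) -> Optional[int]:
    """Return the PID limit from the first readable candidate file."""
    for path in candidates:
        try:
            return parse_pids_max(path.read_text())
        except (OSError, ValueError):
            continue  # missing or unparsable file, try the next location
    raise FileNotFoundError("no readable pids.max file found")


def limit_allows(limit: Optional[int], needed: int) -> bool:
    """True if a workload needing `needed` processes fits under the limit."""
    return limit is None or limit >= needed


if __name__ == "__main__":
    try:
        limit = read_pid_limit()
    except FileNotFoundError as exc:
        print(f"Could not determine PID limit: {exc}")
    else:
        shown = "unlimited" if limit is None else limit
        print(f"PID limit: {shown}")
        print(f"Room for 2000 processes: {limit_allows(limit, 2000)}")
```

Running this inside a worker prints the effective limit. A value at or near 1024 would be consistent with the behavior described in this incident, though the exact limit applied is not stated here.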

Source of Bug:

  • Logic error introduced while adding AMD GPU vendor support.

Timeline

  • START: ~12:00 PDT 2024-05-09
  • END: ~13:00 PDT 2024-05-10

Suggested Actions by Category:

Serverless

This should resolve itself automatically if you allow your workers to scale to zero. Alternatively, force-scal...

April 2024
No incidents reported
March 2024
No incidents reported