Downtime

Decreased reliability for GPU workers that need to spawn large numbers of processes

May 09 at 12:00pm PDT
Affected services
runpod.io/console/login

Resolved
May 10 at 01:00pm PDT

Summary:

GPU pods were assigned a Process ID (PID) limit that was too low, which could cause unexpected failures when a pod launched more than 1024 processes.
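To check whether a given container may be affected, one option is to read the PID limit the kernel reports for it. The sketch below is only an illustration. It assumes the limit is enforced through the Linux cgroup pids controller and exposed at one of the standard `pids.max` paths, which this report does not confirm. The helper names and the 1024 threshold (taken from the summary above) are illustrative choices, not part of any RunPod tooling.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed locations of the cgroup PID limit; which one exists depends on
# whether the host uses cgroup v2 or cgroup v1.
CANDIDATE_PATHS = (
    Path("/sys/fs/cgroup/pids.max"),       # cgroup v2
    Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1
)


def parse_pids_max(raw: str) -> Optional[int]:
    """Return the limit as an int, or None when the file says 'max' (no limit)."""
    value = raw.strip()
    return None if value == "max" else int(value)


def read_pid_limit(paths=CANDIDATE_PATHS) -> Tuple[bool, Optional[int]]:
    """Return (found, limit); found is False when no pids.max file is readable."""
    for path in paths:
        try:
            return True, parse_pids_max(path.read_text())
        except OSError:
            continue  # file missing or unreadable, try the next candidate
    return False, None


def looks_affected(limit: Optional[int], threshold: int = 1024) -> bool:
    """True when a finite limit is at or below the threshold (hypothetical check)."""
    return limit is not None and limit <= threshold


if __name__ == "__main__":
    found, limit = read_pid_limit()
    if not found:
        print("No pids.max file found; cannot determine the PID limit here.")
    elif looks_affected(limit):
        print(f"PID limit is {limit}; this container may be affected.")
    else:
        print(f"PID limit is {'unlimited' if limit is None else limit}.")
```

Run it inside the pod or worker container. A limit at or below 1024 suggests the container was created with the low limit and should be recreated as described under Suggested Actions.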

Source of Bug:

  • A logic error introduced while adding AMD GPU vendor support.

Timeline

  • START: ~12:00 PDT 2024-05-09
  • END: ~13:00 PDT 2024-05-10

Suggested Actions by Category:

Serverless

This should resolve itself automatically once your workers scale to zero. Alternatively, force-scale to zero or deploy a new release; newly created workers will have the correct PID limit.

GPU Pods

You will need to stop and start the pod, or reset its container, for the corrected PID limit to take effect.

Created
May 09 at 12:00pm PDT

Incident Report: Decreased reliability for GPU workers that need to spawn large numbers of processes

Affected Products

  • GPU pods (NVIDIA, AMD)
  • GPU serverless