Previous incidents

April 2025
Apr 28, 2025
1 incident

US-NC-1 Network Issue

Resolved Apr 28 at 06:50pm PDT

Our US-NC-1 data center is currently experiencing a network issue. The team is actively investigating.


The network has been restored.

Apr 21, 2025
1 incident

Error rates elevated for Serverless endpoints

Downtime

Resolved Apr 21 at 11:40am PDT

The issue has been resolved and error rates have returned to normal levels.

Apr 10, 2025
1 incident

RunPod console shows Pods and Serverless endpoints unavailable

Resolved Apr 10 at 12:06pm PDT

Monitoring - All services are returning to normal operating baselines; however, we are continuing to monitor overall service recovery.


On April 10, 2025, between 18:26:30 UTC and 18:53:00 UTC, a service disruption occurred because a software release depended on a database change that had not yet been applied. This caused our primary API to become temporarily non-functional. As a result, customers experienced issues including missing pods and serverless endpoints in the dashbo...
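For readers comparing the UTC window in this summary with the PDT timestamps used elsewhere on this page, the arithmetic can be sketched as follows. This is only an illustration: the two timestamps are taken from the summary above, while the helper names (`outage_duration`, `to_pdt`) and the fixed UTC-7 offset used for PDT are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Window as quoted in the incident summary (UTC).
start = datetime(2025, 4, 10, 18, 26, 30, tzinfo=timezone.utc)
end = datetime(2025, 4, 10, 18, 53, 0, tzinfo=timezone.utc)

# Assumption: PDT is modeled as a fixed UTC-7 offset, which holds in April.
PDT = timezone(timedelta(hours=-7), "PDT")


def outage_duration(begin: datetime, finish: datetime) -> timedelta:
    """Return the length of the disruption window."""
    return finish - begin


def to_pdt(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to the fixed-offset PDT zone."""
    return ts.astimezone(PDT)


duration = outage_duration(start, end)
print(duration)                                   # 0:26:30
print(to_pdt(start).strftime("%I:%M:%S%p %Z"))    # 11:26:30AM PDT
print(to_pdt(end).strftime("%I:%M:%S%p %Z"))      # 11:53:00AM PDT
```

Under these assumptions the window lasts 26 minutes 30 seconds and ends at 11:53am PDT, shortly before the "Resolved" time shown for this incident.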

Apr 07, 2025
1 incident

Billing and Audit Log pages down

Degraded

Resolved Apr 07 at 02:08pm PDT

Resolved - Users were unable to access the Billing and Audit Log pages in User Settings. We rolled out a fix and this issue is now resolved.

March 2025
Mar 27, 2025
1 incident

EUR-IS-1 Network Issue

Resolved Mar 27 at 03:00pm PDT

Investigating - We are currently experiencing an issue with the EUR-IS-1 data center.
We are currently investigating and will post an update as soon as we are able.


Update - This incident requires extended resolution time.
Next update scheduled for 03/27/2025 23:59 UTC

Update - This incident requires extended resolution time.
Next update scheduled for 03/28/2025 01:00 UTC

Update - This incident requires extended resolution time.
Next update scheduled for 03/28/2025 ...

Mar 11, 2025
1 incident

Urgent: Emergency Firmware Update for US-TX-4 at 21:00 UTC (March 11, 2025)

Resolved Mar 11 at 11:59am PDT

Our engineering team has identified a network disruption at our US-TX-4 data center, caused by a required firmware update for our router.

To resolve this, we will deploy an emergency fix at 21:00 UTC on March 11, 2025, with a maximum expected downtime of 10-15 minutes.


The update was successfully completed.

Mar 06, 2025
1 incident

US-NC-1 Network Issue

Resolved Mar 06 at 10:44am PST

Our primary ISP circuit for the US-NC-1 data center experienced an outage. The secondary router failed to take over due to a known firmware issue that was scheduled for a later patch. We’ve now upgraded the router to the latest patched version and are running on the secondary circuit.


The issue has been resolved.

February 2025
Feb 25, 2025
1 incident

Issue with Volume Storage in CA-MTL-1

Resolved Feb 25 at 06:53am PST

We have discovered an issue affecting pods running in CA-MTL-1 when using volume disk or network storage. When executing commands, the process may hang, although the file is still created successfully.

So far, this issue affects most H100 GPUs and a few A40 GPUs. Our team is actively investigating and will provide updates here as we learn more.


We have identified the root cause of the issue, and the team is pushing updates to the machines.


All machines have been updated, and...

Feb 15, 2025
1 incident

EU-CZ-1 Data Center Upgrade

Resolved Feb 15 at 09:00am PST

We are currently upgrading the EU-CZ-1 data center, and all machines are offline during this process. Services hosted in this region are temporarily unavailable during this period.


We’ve successfully brought most of the machines online. However, due to some technical issues, we need a bit more time to restore the remaining ones. Thanks for your patience; we’ll keep you posted!


All machines in the EU-CZ-1 data center are now fully online. The data center upgrade is complete, t...

Feb 13, 2025
1 incident

Serverless Request Issue

Resolved Feb 13 at 03:23pm PST

We experienced an issue affecting serverless requests from 10:00 PM to 10:23 PM UTC. This was due to an update made to improve system capacity in the NYC region, which led to temporary request issues.

The issue has been identified and resolved, and we’ve taken steps to minimize future risks.


We are still seeing issues, and our team is actively investigating. We’ll provide further updates as soon as we have more information.


We have identified the issue and will be rolling out a...

Feb 11, 2025
1 incident

🚨 CA-MTL-1 Network Volume Performance Issue 🚨

Resolved Feb 11 at 08:00am PST

We’re currently experiencing performance issues with network volumes in the CA-MTL-1 data center. Our team is investigating the issue, and we’ll provide updates as soon as possible.


We detected a performance issue with one of the chunk servers and have isolated the affected server.


The issue has been resolved.

Feb 05, 2025
1 incident

Main UI Console Page Down

Resolved Feb 05 at 05:52pm PST

We are currently experiencing issues accessing the Main UI Console Page. Our team is actively investigating the cause, and we will provide updates as soon as we have more information.


Our authentication provider, Clerk, is experiencing issues and is currently down. We are closely monitoring the situation and will provide updates as soon as we have more information.


Workaround:

  1. Our GraphQL API and serverless endpoints are unaffected.
  2. Users can still call the GraphQL ...