Post Mortem on API Outage May 5th 2011

On Thursday, May 5th, 2011, we had an API outage from ~9:00 UTC to ~14:00 UTC. It was caused by a deploy that included a migration with unexpected behavior and a bug in the application code. The migration ran on tables related to API applications and made these tables unavailable for the duration of the migration. Once the deploy had finished, a bug causing huge memory consumption brought our app servers down. We recovered by rolling back the faulty code and migrating the database back down.

After the API was running again, we discovered an inconsistency in the OAuth 2 token data that made all existing tokens invalid. It took us until ~00:00 UTC to restore the tokens to their correct state. During this time it was possible to create new valid tokens, so although some users were confused that they had to log in to their apps again, the API was fully functional. No data was lost during these events.

We apologize for the outage. In the future we will test migrations more thoroughly and make an extra effort to ensure that they do not conflict with API operations.