As some of you might know, we had an outage yesterday. We believe there is something to learn from every mistake, so we write a post-mortem after each outage. Usually we do this internally because the issues we run into are very specific to our infrastructure.
This time we ran into a rather nasty issue that could affect anyone running a Linux system with a lot of sessions on it, and we thought you might be interested to know about that pitfall.
At 4:40pm CEST, we got reports about Yikes (503/504 errors) on SoundCloud. Around the same time, our monitoring alerted on a high number of 503s at our caching layer, and right after that one of our L7 routing nginx instances was…