Sunday Nov 17th, 11:08am PST. The site is down. The dev and ops teams are looking into it. More information in the next 20 minutes. Sorry for the inconvenience.
Sunday Nov 17th, 11:22am PST. We have identified the problem. It is related to a timeout issue in our caching layer. We are working on a fix and on bringing the system back up.
Sunday Nov 17th, 12:29pm PST. A new build is needed to protect the caching layer from a set of rogue requests coming from a third-party client that appears to be stuck in an infinite loop. The dev team is working on the code change. More in 30 minutes.
Sunday Nov 17th, 1:27pm PST. We blocked the IP range of the rogue third-party clients and restarted the app servers. The service should be back online now.
We learned a couple of lessons from this painful experience: 1) we are going to change the code to prevent this type of rogue request from jamming the cache and 2) we are going to better partition requests so that one third-party client cannot impact the quality of service for other clients (this is going to take a little bit of time).
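For the curious, here is a rough illustration of what partitioning requests per client can look like. It is only a minimal sketch, not the actual code change: the names `ClientBulkhead`, `handle_request` and `fetch_from_cache` are hypothetical, and the idea shown (capping the number of in-flight requests per client so a looping client gets rejected early instead of tying up the cache) is just one common way to do it.

```python
import threading
from collections import defaultdict


class ClientBulkhead:
    """Caps the number of in-flight requests per client (hypothetical helper)."""

    def __init__(self, max_in_flight_per_client: int):
        self._limit = max_in_flight_per_client
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)  # client_id -> requests in progress

    def try_acquire(self, client_id: str) -> bool:
        """Reserve a slot for this client; return False if it is at its cap."""
        with self._lock:
            if self._in_flight[client_id] >= self._limit:
                return False
            self._in_flight[client_id] += 1
            return True

    def release(self, client_id: str) -> None:
        """Give the slot back once the request has finished."""
        with self._lock:
            if self._in_flight[client_id] > 0:
                self._in_flight[client_id] -= 1


def handle_request(bulkhead: ClientBulkhead, client_id: str, fetch_from_cache):
    """Serve one request, rejecting it early if the client is over its cap."""
    if not bulkhead.try_acquire(client_id):
        return "rejected"  # assumption: the caller maps this to an error response
    try:
        return fetch_from_cache()
    finally:
        bulkhead.release(client_id)
```

With a cap like this, a client stuck in a loop can only occupy its own slots, so requests from other clients still reach the cache.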
Sorry for the inconvenience. Enjoy the rest of the weekend.
/Seb, Oliv, David, Kireet, Michal and Edwin