Surprising Results! Notes from the ashes of my servers.
Ouch! When we tried to launch the game tonight, the servers literally couldn’t take the insane load that we underwent. Here’s everything that happened:
- One day before launch, DOS attacks began hitting the wordpress and forum servers. These were mitigated as they were pretty small scale, and against normal web servers which can deal with a reduced load much better.
- When setting up the API server on our large scale servers, DOS attacks began coming through to the API server recently added to the DNS. These were rate limited, but somehow were still enough to take down the server. It became clear something was wrong with the Nancy driven API server from this.
- By rate limiting to 10 requests a minute (yikes!) we were able to reduce the load on the registration enough to keep it alive, before it was even announced to anyone. During this period, I was able to log in with what seemed like no issues. In this period, 100 users registered.
- I released the registration link on the discord channel. The whole thing blew up, immediately 500 concurrent requests were made to the api server, taking it down immediatly. Hours later, these have still not simmered down – there are a few hundred requests being made and simultaneously active. The large delay each of these requests is incurring is causing the majority of them to fail to get through.
Throughout this process, it became clear that there was a serious issue with the amount of load the API server was able to take. I would say that it should have been able to take at least 25x as much load, and it certainly should not have repeatedly crashed. After some quick experimentation tonight, it seems that the HTTP serving library used to create the API server, Nancy, was serving even plain non-sql requests way slower than other libraries.
We also did not expect this many registrations. Of all of the requests, 885 users from unique IPs signed up. This makes things really difficult. I’m only one developer, and the flash interest in a game that basically died of inactivity years ago is very surprising. It’s also pretty satisfying; it is fantastic to see this much interest in FreeSO, it’s just also very hard to manage.
Could this have been mitigated by opening registrations earlier? Could users whose registrations got through get to play the game?
No, the API server does not only drive registration. If we were to somehow mitigate impact on registration, then the impact caused by users logging into the game would bring it down instead. This would be a lot more certain, as authentication makes 4 times the requests as registration. Registration was also not ready until close to the deadline. This is mainly due to the DOS attack to the wordpress mentioned before.
So what is the plan?
We will perform further load and api testing, look into the root causes of load issues and the server stability. The current plan is to rewrite the API server using ASP.NET instead of Nancy, as it is a lot more “battle-tested”. This could take some amount of time.
We cannot announce a retry date. We might even need a new launch plan altogther, to avoid the load of a few thousand users wanting to play the game on day one. Whatever the plan is, it will be posted on here eventually.