Update on AO3 performance issues
Since last month, we've been experiencing frequent and worsening performance problems on the Archive of Our Own as the site has expanded suddenly and dramatically. The number of new users joining the site doubled between April and May, and we currently have over 17,000 users waiting for an invitation. We've been working hard to deal with the 502 errors and site slowdowns, and we've implemented a number of emergency fixes which have slightly alleviated the issues, but these haven't been as effective as we'd hoped. We're confident that we will be able to fix the problems, but unfortunately we expect the next round of fixes to take at least two weeks to implement.
We know that it's really frustrating for users when the site is inaccessible, and we're sorry that we're not able to fix the problems more quickly. We wanted to give you an update on what's going on and what we're doing to fix it: see below for some more details on the problems. While we work on these issues, you should get better performance (and alleviate the load on the servers) by browsing logged-out where possible (more details below).
Why so many problems?
As we mentioned in our previous post on performance issues, the biggest reason for the site slowdowns is that site usage has increased dramatically! We've almost doubled our traffic since January, and since the beginning of May the pace of expansion has accelerated rapidly. In the last month, more than 8,000 new user accounts were created, and more than 31,000 new works were posted. This is a massive increase: April saw just 4,000 new users and 19,000 new works. In addition to the growing number of registered users, we know we've had a LOT more people visiting the site: between 10 May and 9 June we had over 3,498.622 GB of traffic. In the past week, there were over 12.2 million page views - this number only includes the ones where the page loaded successfully, so it represents a lot of site usage!
This sudden and dramatic expansion has come about largely as a result of changes on Fanfiction.net, who have recently introduced more stringent enforcement of their policies relating to explicit fanworks, which has resulted in some fans no longer being able to host their works there. One of the primary reasons the AO3 was created was to provide a home for fanworks which were at risk of deletion elsewhere, so we're very keen to welcome these new users, but in the short term this does present us with some challenges!
We'd already been preparing for site expansion and identifying areas of the site which needed work in order to ensure that we could grow. This means some important performance work has been ongoing; however, we weren't expecting quite such a rapid increase, so we've had to implement some changes on an emergency basis. This has sometimes meant a few additional unexpected problems: we're sorry if you ran into bugs while our maintenance was in progress.
What we've done so far
Our sys-admins and coders have implemented a number of things designed to reduce the load on the site over the last week:
- Implemented Squid caching for a number of the most performance-intensive places on the site, including work index pages. For the biggest impact, we focused on caching the pages which are delivered to logged-out users. This is because all logged-out users usually see the same things, whereas logged-in users might have set preferences (e.g. to hide warnings) which can't be respected by the cache. We initially implemented Squid caching for individual works, but this caused quite a few bugs, so we've suspended that for now while we figure out ways of making it work right. (You can read more about what Squid is and what it does in Release Notes 0.8.17.)
- Redistributed and recalibrated our unicorns (which deliver requests to the server and retrieve the data) to make sure they're focused on the areas where we need them most. This included setting priorities on posting actions (so that you're less likely to lose data when posting or commenting), increasing the numbers of unicorns, and adjusting the time they wait for an answer.
- Simplified bookmark listings, which were using lots of processing power. We'll be looking into revamping these in the future, but right now we've stripped them back to the basics to try to reduce the load on the site.
- Cached the listing of guest kudos so the number doesn't have to be fetched from the database every time there are new kudos (which caused a big strain on the servers).
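To illustrate why caching works best for logged-out visitors, here is a minimal sketch in Python. It is not the Archive's actual code or its Squid configuration: the `PageCache` class, the `render_page` function and the `/works` URL are hypothetical names made up for this example. It only shows the general idea that logged-out requests can share one saved copy of a page, while logged-in requests skip the cache because individual preferences may change what is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class PageCache:
    """Toy shared cache: one saved copy per URL, served only to logged-out visitors."""

    # Hypothetical renderer: takes a URL and an optional user name, returns the page.
    render: Callable[[str, Optional[str]], str]
    store: Dict[str, str] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def get(self, url: str, user: Optional[str] = None) -> str:
        if user is not None:
            # Logged-in users may have preferences (e.g. hidden warnings),
            # so their pages are always rendered fresh and never stored.
            return self.render(url, user)
        if url in self.store:
            # A saved copy exists: no rendering or database work needed.
            self.hits += 1
            return self.store[url]
        # First logged-out request for this URL: render once and keep the copy.
        self.misses += 1
        page = self.render(url, None)
        self.store[url] = page
        return page


def render_page(url: str, user: Optional[str]) -> str:
    """Stand-in for the expensive work of building a page."""
    viewer = user if user is not None else "guest"
    return f"<page {url} for {viewer}>"


cache = PageCache(render=render_page)
cache.get("/works")           # logged out, first visit: rendered and stored
cache.get("/works")           # logged out, second visit: served from the cache
cache.get("/works", "alice")  # logged in: bypasses the cache entirely
```

In this sketch the second logged-out request never touches the renderer, which is the saving that caching is meant to provide. A real cache also needs expiry and invalidation when content changes, which this example leaves out.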
Implementing these changes has involved sustained work on the part of our sys-admins, coders and testers; in particular, the Squid caching involved a great deal of hard work in order to set up and test. Several members of the team worked through the night in the days leading up to the weekend (when we knew we would have lots of visitors) in order to implement the performance fixes. So, we're disappointed that the changes so far haven't done as much as we'd hoped to get rid of the performance problems - we were hoping to be able to restore site functionality quickly for our users, but that hasn't been possible.
What we're going to do next
Although the emergency fixes we've implemented haven't had as much impact as we'd hoped, we're confident that there are lots of things we can do to address the performance problems. We're now working on the following:
- New search and browse code. As we announced in our previous post on performance issues, we've been working for some time on refactoring our search and browse code, which is used on some of the most popular pages and needs to be more efficient. This is almost ready to go -- in fact, we delayed putting it onto our test archive in order to test and implement some of the emergency fixes -- so as soon as we have been able to test it and verify that it's working as it should, we will deploy this code.
- More Squid caching. We weren't able to cache as many things as we'd initially hoped because the Squid caching threw up some really tricky bugs. We're continuing to work on that and we'll implement more caching across the site once we've tested it more thoroughly.
- More servers. We're currently looking at purchasing a more robust database server and moving our old database server (aka 'the Beast') into an application slot, giving us three app servers. We'll also be upgrading the database software we use so that we can make the most of this server power.
When we'll be able to implement the fixes
We're working as fast as we can to address the problems -- we poured all our resources into the emergency fixes this week to try to get things up and running again quickly. Now that we've implemented those emergency fixes, we think that we need to focus on making some really substantive changes. This means we will have to slow down a little bit in order to make the bigger changes and test them thoroughly (to minimise the chances of introducing new bugs while we fix the existing problems). Buying servers will also take us some time because we need to identify the right machines, order them and install them. For this reason, we expect it to take at least two weeks for us to implement the next round of major fixes.
We're sorry that we're not able to promise that we'll fix these problems right away. We're working as hard as we can, but we think it's better to take the time to fix the problems properly rather than experimenting with lots of emergency fixes that may not help. Since the AO3 is run entirely by volunteers, we also need to make sure we don't burn out our staff, who have been working many hours while also managing their day jobs. So, for the long term health of the site as a whole, we need to ensure we're spending time and resources on really effective fixes.
Invitations and the queue
As a result of the increasing demand for the site, we're experiencing a massive increase in requests for invitations: our invitations queue now stands at over 17,000. We know that people are very disappointed at having to wait a long time for an invitation, and we'd love to be able to issue them faster. However, the main reason we have an invitations system for creating accounts is to help manage the growth of the site -- if the 17,000 people currently waiting for an invitation all signed up and started posting works on the same day, the site would definitely collapse. So, we're not able to speed up issuing invitations at this time: right now we're continuing to issue 100 invitations to the queue each day, but we'll be monitoring this closely and we may consider temporarily suspending issuing invitations if we need to.
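For a rough sense of what these numbers mean for waiting times, the short Python sketch below works through the arithmetic. It uses only the two figures given above (a queue of about 17,000 and 100 invitations per day) and makes simplifying assumptions that are not part of the announcement: that nobody new joins the queue and that the daily rate never changes. The function name `days_to_clear` is made up for this example, and the result is an estimate, not a promise.

```python
import math


def days_to_clear(queue_length: int, invites_per_day: int) -> int:
    """Days needed to empty a queue at a fixed daily rate, with no new arrivals."""
    if invites_per_day <= 0:
        raise ValueError("invites_per_day must be positive")
    # Round up: a partly used final day still counts as a day.
    return math.ceil(queue_length / invites_per_day)


# Figures from the post: roughly 17,000 people waiting, 100 invitations per day.
print(days_to_clear(17_000, 100))  # prints 170
```

Because new requests keep arriving, the real wait for someone joining the queue today is likely to be longer than this simple estimate suggests.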
Until recently, we were releasing some invitations to existing users who requested them. However, we've taken the decision to suspend issuing invitations this way for the present, to enable us to better monitor site usage. We know that this will be a disappointment to many users who want to be able to invite friends to the site, but we feel that the fairest and most manageable way to handle account creation at present is via the queue alone.
What can users do?
We've been really moved by the amount of support our users have given us while we've been working on these issues. We know that it's incredibly annoying when you arrive at the Archive full of excitement about the latest work in your fandom, only to be greeted by the 502 error. We appreciate the way our users have reached out to ask if they can help. We've had lots of questions about whether we need donations to pay for our servers. We always appreciate donations to our parent Organization for Transformative Works, but thanks to the enormous generosity fandom showed in the last OTW membership drive, we aren't in immediate need of donations for new servers. In fact, thanks to your kindness in donating during the last drive, we're in good financial shape and we're able to buy the new server we need just as soon as we've done all the necessary work.
As we've mentioned a few times over the weekend, we can always use additional volunteers who are willing to code and test. If this is you or anyone you know, stop by GitHub or our IRC chat room #otw-dev!
There are a few things users can do when browsing which will make the most of the performance fixes we've implemented so far. Doing the following should ease the pressure on the site and also get you to the works you want to see faster:
- Browse while logged out, and only log in when you need to (e.g. to leave comments, subscribe to a work, etc). Most of our caching is currently working for logged-out users, as those pages are easier to cache, so this will mean you get the saved copies which come up faster.
- Go direct to works when you can - for example, follow the feeds for your favourite fandoms to keep up with new works without browsing the AO3 directly, so you can click straight into the works you like the sound of.
Our server problems have caused some problems accessing our support form. If you have an urgent query, you can reach our Support team via the backup Support form. It's a little more difficult to manage queries coming through this route, so we'd appreciate it if you'd avoid submitting feature requests through this form, to enable us to keep on top of bug reports. Thanks!
We'd like to say a big, big thank you to all our staff who have been working really hard to address these problems. A particular shoutout to James, Elz, Naomi and Arrow, who have been doing most of the high level work and have barely slept in the last few days! We're also incredibly grateful to all our coders and testers who have been working on fixing issues and testing them, to our Support team, who have done an amazing job of keeping up with the many support tickets, and to our Communications folk who've done their best to keep our users updated on what's going on.
We'd also like to say a massive thank you to all our users for your incredible patience and support. It means so much to us to hear people sending us kind words while we work on these issues, and we hope we can repay you by restoring the site to full health soon.
A note on comments: We've crossposted this notice to multiple OTW news sites in order to ensure that as many people see it as possible. We'll do our best to keep up with comments and questions; however, it may be difficult for us to answer quickly (and on the AO3, the performance issues may also inhibit our responses). We're also getting lots of traffic on our AO3_Status Twitter! Thanks for your patience if we don't respond immediately.