Update on AO3 performance issues

Since last month, we’ve been experiencing frequent and worsening performance problems on the Archive of Our Own as the site has expanded suddenly and dramatically. The number of new users joining the site doubled between April and May, and we currently have over 17,000 users waiting for an invitation. We’ve been working hard to deal with the 502 errors and site slowdowns, and we’ve implemented a number of emergency fixes which have slightly alleviated the issues, but these haven’t been as effective as we’d hoped. We’re confident that we will be able to fix the problems, but unfortunately we expect the next round of fixes to take at least two weeks to implement.

We know that it’s really frustrating for users when the site is inaccessible, and we’re sorry that we’re not able to fix the problems more quickly. We wanted to give you an update on what’s going on and what we’re doing to fix it: see below for some more details on the problems. While we work on these issues, you should get better performance (and alleviate the load on the servers) by browsing logged-out where possible (more details below).

Why so many problems?

As we mentioned in our previous post on performance issues, the biggest reason for the site slowdowns is that site usage has increased dramatically! We’ve almost doubled our traffic since January, and since the beginning of May the pace of expansion has accelerated rapidly. In the last month, more than 8,000 new user accounts were created, and more than 31,000 new works were posted. This is a massive increase: April saw just 4,000 new users and 19,000 new works. In addition to the growing number of registered users, we know we’ve had a LOT more people visiting the site: between 10 May and 9 June we had over 3,498.622 GB of traffic. In the past week, there were over 12.2 million page views – this number only includes the ones where the page loaded successfully, so it represents a lot of site usage!

This sudden and dramatic expansion has come about largely as a result of changes on Fanfiction.net, who have recently introduced more stringent enforcement of their policies relating to explicit fanworks which have resulted in some fans no longer being able to host their works there. One of the primary reasons the AO3 was created was in order to provide a home for fanworks which were at risk of deletion elsewhere, so we’re very keen to welcome these new users, but in the short term this does present us with some challenges!

We’d already been preparing for site expansion and identifying areas of the site which needed work in order to ensure that we could grow. This means some important performance work has been ongoing; however, we weren’t expecting quite such a rapid increase, so we’ve had to implement some changes on an emergency basis. This has sometimes meant a few additional unexpected problems: we’re sorry if you ran into bugs while our maintenance was in progress.

What we’ve done so far

Our sys-admins and coders have implemented a number of things designed to reduce the load on the site over the last week:

  • Implemented Squid caching for a number of the most performance-intensive places on the site, including work index pages. For the biggest impact, we focused on caching the pages which are delivered to logged-out users. This is because all logged-out users usually see the same things, whereas logged-in users might have set preferences (e.g. to hide warnings) which can’t be respected by the cache. We initially implemented Squid caching for individual works, but this caused quite a few bugs, so we’ve suspended that for now while we figure out ways of making it work right. (You can read more about what Squid is and what it does in Release Notes 0.8.17.)
  • Redistributed and recalibrated our unicorns (which deliver requests to the server and retrieve the data) to make sure they’re focused on the areas where we need them most. This included setting priorities on posting actions (so that you’re less likely to lose data when posting or commenting), increasing the numbers of unicorns, and adjusting the time they wait for an answer.
  • Simplified bookmark listings, which were using lots of processing power. We’ll be looking into revamping these in the future, but right now we’ve stripped them back to the basics to try to reduce the load on the site.
  • Cached the listing of guest kudos so the number doesn’t have to be fetched from the database every time there are new kudos (which caused a big strain on the servers).
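
To illustrate why caching for logged-out users is the easier win, here is a minimal sketch in Python. It is not the Archive's actual code or Squid configuration: the names PageCache, fetch, render_work_index and ttl_seconds are hypothetical, and a real caching proxy sits in front of the application rather than inside it. The sketch only shows the idea that one saved copy can be reused for every logged-out visitor, while logged-in requests are always built fresh so that preferences are respected.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PageCache:
    """Tiny in-memory stand-in for a caching proxy (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.renders = 0  # how many times the application had to build a page
        self._store: Dict[str, Tuple[float, str]] = {}

    def fetch(self, path: str, user: Optional[str],
              render: Callable[[str, Optional[str]], str]) -> str:
        # Logged-in users may have preferences (e.g. hidden warnings),
        # so their pages are never served from the shared cache.
        if user is not None:
            self.renders += 1
            return render(path, user)

        # Logged-out users all see the same page, so reuse a fresh saved copy.
        now = self.clock()
        cached = self._store.get(path)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]

        # No saved copy, or it has expired: build the page once and save it.
        self.renders += 1
        body = render(path, None)
        self._store[path] = (now, body)
        return body


def render_work_index(path: str, user: Optional[str]) -> str:
    """Hypothetical page builder standing in for an expensive database query."""
    viewer = user if user is not None else "guest"
    return f"<html>{path} for {viewer}</html>"


if __name__ == "__main__":
    cache = PageCache(ttl_seconds=60)
    cache.fetch("/works", None, render_work_index)     # built and saved
    cache.fetch("/works", None, render_work_index)     # served from the saved copy
    cache.fetch("/works", "alice", render_work_index)  # always built fresh
    print(cache.renders)
```

In this sketch, two logged-out visits to the same page cost one render, while every logged-in visit costs a render of its own. The saved copy is thrown away after ttl_seconds, which is the trade-off any such cache makes between freshness and load.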

Implementing these changes has involved sustained work on the part of our sys-admins, coders and testers; in particular, the Squid caching involved a great deal of hard work in order to set up and test. Several members of the team worked through the night in the days leading up to the weekend (when we knew we would have lots of visitors) in order to implement the performance fixes. So, we’re disappointed that the changes so far haven’t done as much as we’d hoped to get rid of the performance problems – we were hoping to be able to restore site functionality quickly for our users, but that hasn’t been possible.

What we’re going to do next

Although the emergency fixes we’ve implemented haven’t had as much impact as we’d hoped, we’re confident that there are lots of things we can do to address the performance problems. We’re now working on the following:

  • New search and browse code. As we announced in our previous post on performance issues, we’ve been working for some time on refactoring our search and browse code, which is used on some of the most popular pages and needs to be more efficient. This is almost ready to go — in fact, we delayed putting it onto our test archive in order to test and implement some of the emergency fixes — so as soon as we have been able to test it and verify that it’s working as it should, then we will deploy this code.
  • More Squid caching. We weren’t able to cache as many things as we’d initially hoped because the Squid caching threw up some really tricky bugs. We’re continuing to work on that and we’ll implement more caching across the site once we’ve tested it more thoroughly.
  • More servers. We’re currently looking at purchasing a more robust database server and moving our old database server (aka ‘the Beast’) into an application slot, giving us three app servers. We’ll also be upgrading the database software we use so that we can make the most of this server power.

When we’ll be able to implement the fixes

We’re working as fast as we can to address the problems — we poured all our resources into the emergency fixes this week to try to get things up and running again quickly. Now that we’ve implemented those emergency fixes, we think that we need to focus on making some really substantive changes. This means we will have to slow down a little bit in order to make the bigger changes and test them thoroughly (to minimise the chances of introducing new bugs while we fix the existing problems). Buying servers will also take us some time because we need to identify the right machines, order them and install them. For this reason, we expect it to take at least two weeks for us to implement the next round of major fixes.

We’re sorry that we’re not able to promise that we’ll fix these problems right away. We’re working as hard as we can, but we think it’s better to take the time to fix the problems properly rather than experimenting with lots of emergency fixes that may not help. Since the AO3 is run entirely by volunteers, we also need to make sure we don’t burn out our staff, who have been working many hours while also managing their day jobs. So, for the long term health of the site as a whole, we need to ensure we’re spending time and resources on really effective fixes.

Invitations and the queue

As a result of the increasing demand for the site, we’re experiencing a massive increase in requests for invitations: our invitations queue now stands at over 17,000. We know that people are very disappointed at having to wait a long time for an invitation, and we’d love to be able to issue them faster. However, the main reason we have an invitations system for creating accounts is to help manage the growth of the site: if the 17,000 people currently waiting for an invitation all signed up and started posting works on the same day, the site would definitely collapse. So, we’re not able to speed up issuing invitations at this time: right now we’re continuing to issue 100 invitations to the queue each day, but we’ll be monitoring this closely and we may consider temporarily suspending issuing invitations if we need to.

Until recently, we were releasing some invitations to existing users who requested them. However, we’ve taken the decision to suspend issuing invitations this way for the present, to enable us to better monitor site usage. We know that this will be a disappointment to many users who want to be able to invite friends to the site, but we feel that the fairest and most manageable way to manage account creation at present is via the queue alone.

What can users do?

We’ve been really moved by the amount of support our users have given us while we’ve been working on these issues. We know that it’s incredibly annoying when you arrive at the Archive full of excitement about the latest work in your fandom, only to be greeted by the 502 error. We appreciate the way our users have reached out to ask if they can help. We’ve had lots of questions about whether we need donations to pay for our servers. We always appreciate donations to our parent Organization for Transformative Works, but thanks to the enormous generosity fandom showed in the last OTW membership drive, we aren’t in immediate need of donations for new servers. In fact, thanks to your kindness in donating during the last drive, we’re in good financial shape and we’re able to buy the new server we need just as soon as we’ve done all the necessary work.

As we’ve mentioned a few times over the weekend, we can always use additional volunteers who are willing to code and test. If this is you or anyone you know, stop by Github or our IRC chat room #otw-dev!

There are a few things users can do when browsing which will make the most of the performance fixes we’ve implemented so far. Doing the following should ease the pressure on the site and also get you to the works you want to see faster:

  • Browse while logged out, and only log in when you need to (e.g. to leave comments, subscribe to a work, etc). Most of our caching is currently working for logged-out users, as those pages are easier to cache, so this will mean you get the saved copies which come up faster.
  • Go direct to works when you can – for example, follow the feeds for your favourite fandoms to keep up with new works without browsing the AO3 directly, so you can click straight into the works you like the sound of.

Support form

Our server problems have caused some problems accessing our support form. If you have an urgent query, you can reach our Support team via the backup Support form. It’s a little more difficult to manage queries coming through this route, so we’d appreciate it if you’d avoid submitting feature requests through this form, to enable us to keep on top of bug reports. Thanks!

Thank you

We’d like to say a big, big thank you to all our staff who have been working really hard to address these problems. A particular shoutout to James, Elz, Naomi and Arrow, who have been doing most of the high level work and have barely slept in the last few days! We’re also incredibly grateful to all our coders and testers who have been working on fixing issues and testing them, to our Support team, who have done an amazing job of keeping up with the many support tickets, and to our Communications folk who’ve done their best to keep our users updated on what’s going on.

We’d also like to say a massive thank you to all our users for your incredible patience and support. It means so much to us to hear people sending us kind words while we work on these issues, and we hope we can repay you by restoring the site to full health soon.

A note on comments: We’ve crossposted this notice to multiple OTW news sites in order to ensure that as many people see it as possible. We’ll do our best to keep up with comments and questions; however, it may be difficult for us to answer quickly (and on the AO3, the performance issues may also inhibit our responses). We’re also getting lots of traffic on our AO3_Status Twitter! Thanks for your patience if we don’t respond immediately.

37 thoughts to “Update on AO3 performance issues”

  1. Thanks for all your hard work and for the clear and understandable updates.

    As stated, it’s much better to be patient and do things right the first time rather than rush and have to fix things twice. So, I’m more than willing to wait while you all get caught up, get the hardware you need, and the code written.

    Meanwhile, please take good care of yourselves! Rest, please! AO3 is a wonderful resource for fandom but not at the expense of anyone’s health.

    Please don’t answer this email; you all need a break more than I need an answer.

    Best to all of you,

    ~Mischief

  2. You all are doing such a great job in the face of these issues! I’ll be sure to donate to the site to help out very soon. Thank you for working hard and ensuring that people have a good place to post their works!

    1. Thank you! We really appreciate your support.

      Lucy
      AD&T / Communications / Support

  3. I guess you’ve already thought about this, but just in case: did you look at combining MongoDB or one of the other document no-sql solutions in the mix to try to get any speed ups?

    And if you provided an API, some of your bottleneck traffic might start to go away as other folks (like me!) develop tools for fetching/displaying stories. Then your front-end issues might reduce over time. Just a thought…

    1. Seconding a hopeful call for an API! I know it’s on the roadmap. 🙂

      (What tools are you thinking about? I have no experience in making Android apps, but I WOULD LEARN HOW for an AO3 app, omg.)

    2. We currently use Redis, which is a NoSQL solution. We’re looking at other options as we go forward.

      Making an API available is definitely in our plans; right now the site code is still changing quite a bit, so it might pose some challenges, but it’s definitely something we’d like to do.

      Lucy
      AD&T / Communications / Support

  4. I have to say I’m one of the new members of Ao3 that arrived here ’cause the site is amazing!! I have known it for a while but since I was able to become a member and leave reviews and kudos more easily it’s been my place to be and check every day… several times a day :).

    I hope things work out fine but I just wanted to thank you all for having a place like this open to everyone where we can enjoy the writing of such talented people like all the authors here. It’s been a mental health thing for me to be able to find something great to read every day, especially lately.

    Thank you, good luck and I hope things work out soon. Be well.

    1. A belated thanks for your support! We really appreciate it.

      Lucy
      AD&T / Communications / Support

  5. I love AO3. It’s an amazing (and one of my favorite) site. And I really appreciate all the hard work everyone is putting to keep the site going. Thanks! 🙂

    And don’t push yourself too much. Health comes before anything. We can wait. Once again, all the good work is really appreciated.

    1. A belated thanks for your support! We really appreciate it.

      Lucy
      AD&T / Communications / Support

  6. You need to pull yourselves together. This is getting ridiculous now. Stop playing happy clappy, and get the people who know how to fix your problems to fix them. You’re all so worried about hurting your members’ feelings that you’re letting the site run itself into the ground.

    1. happy clappy? well, that’s a constructive comment, Amy. I’m a member. OTW/Ao3 communications are consistently clear and respectful. There’s been a huge uptick in usage. end of story.

    2. I think perhaps you need to reread that announcement. It very clearly states that they’re doing all that they can do, that they are doing it as fast as they can, and that they are sorry for the inconvenience. What more do you want?? It’s not like they can just snap their fingers and have the site up and running again! That’s absolutely absurd to expect! Let them do their jobs with patience. Don’t go demanding things that cannot be given to you. You look rude, childish, and foolish. They’re doing the best they can, that’s no reason to put them down.

    3. Clearly you didn’t actually read this update. They’ve outlined what the problems are and what they need to do to fix them. This is an all volunteer organization, so it’s not exactly shocking that they’re having trouble keeping up with the sudden influx in usage, and will need a couple of weeks to get things running smoothly again. If it was a for-profit, paid subscription service then your complaints (while rudely phrased) might have a point, but not with a volunteer organization.

    4. Amy, why don’t you offer some help, instead of putdowns? Criticism is one thing. Putdowns of people who are volunteering their time is something else again. Have you given the AO3 any money to help pay for better servers and so on? Do you have any expertise that you could volunteer to help speed up the process of fixing the problems? Do you have any suggestions other than ‘get your act together’?

  7. Let’s run a contest! I’ll bet that each of us has someone in our fandoms who could assist in obtaining new servers and getting them up and running. Rodney McKay could build the servers from scratch and get them running all by himself if you gave him enough coffee. Mycroft Holmes would have the servers delivered yesterday. In a black car. Merlin might be somewhat baffled as to what a server does but hey, a little magic never hurt anything.

    Maybe a team approach would work better. Would Rodney McKay and Tony Stark get the job done twice as fast – or would they kill each other first?

    Prizes for the most ingenious idea, team least likely to kill each other, etc.

    1. Ax(imili) would have no trouble upgrading existing servers. The real trick would be whether he could get one or have to break into the Yeerk Pool to steal from the Yeerks. Marco, once tempted by the prospect of Marco/Rachel and Jake/Cassie fic, would come along; Jake would chew them out afterwards; Rachel would be sad she missed it; Tobias has been stealthily reading FFN over other people’s shoulders and doesn’t need to; Cassie is afraid of getting sucked into another AU and has a judgmental attitude towards the whole affair.

    2. Lex would get the AO3 up and running quite brilliantly, with the best new servers and lots of money. 🙂

      (However, Clark Kent would show up, accuse him of some evil plot, and bash the servers into scrap metal.)

  8. Thank you for keeping us in the loop. I’m sure that this whole thing is just as frustrating for you as it is for the users, so I really appreciate all the hard work you guys put into the site and make it as accessible as is possible under the current circumstances while maintaining contact to your user base as well. You guys are great!

    (Here’s to hoping that the hardware purchase and implementation will go as planned. I know these things can get tricky when you least expect (or need) it! Fingers crossed)

  9. Thanks for the detailed updates, not promising things you can’t do, and your absolute dedication to fanfic and to doing things the correct way. When there are slowdowns, you promptly let us know what’s going on and give us timetables of when it should be fixed. That’s really appreciated, especially since you are all volunteers.

    So, thanks for existing, and thanks for letting us know what is going on and offering us solutions we can effect ourselves while waiting for the “pros” to do the heavy lifting, as it were. 😉

    1. A belated thanks for your support! We really appreciate it, and we’ll do our best to keep people in the loop as we continue to work.

      Lucy
      AD&T / Communications / Support

  10. Just want to say thank you. Your hard work is definitely appreciated, and those of us versed in reading comprehension (unlike some like the Amy above) can tell that you are working very hard to get the issues sorted. Working in functional systems support, I have experiences of a short term fix making things worse in the long term, and I really appreciate that you’re taking the time to do this the right way, even with the overwhelming pressure. I wish I could offer you more than my words and money, but my coding is limited to C++, VBA and SQL, and most of that was taught to me by Google (well, C++ by uni, but that was loooooong ago); I’d probably do more harm than good! Also, working in IT support for a multi-billion dollar corporation, I’d like to add that 2 weeks to review, select and install/migrate servers is phenomenal!! Wish we had turnaround times like that. Finally, although some people may feel otherwise, no one will die for not getting their fandom fix. Ladies and gentlemen of the AO3 community, if you’re reading this, remember: breathe, count to 10, and click refresh.

    1. Seconded with enthusiasm (except for the “knowing C++ and all that coding stuff” part.) You folks are doing, as Ai says, a phenomenal job. Thanks for all your hard work. I’ll be patient.

    2. Thank you! We’re always happy to welcome coders, even if you don’t have experience of Ruby on Rails (lots of our coders had experience in other areas and learnt RoR with us).

      We may not get a new server installed within two weeks, but we can expect a reasonable turnaround because we only recently upgraded our servers, so a lot of the research we did then is still useful (and we were already thinking about it as a system which might need to have more machines plugged in).

      Thanks for your support!

      Lucy
      AD&T / Communications / Support

  11. but the site is crashing bad.

    If it’s clear that more servers aren’t the answer, maybe it’s time for some experts? I’m not saying hire pros fulltime, because not sustainable, but maybe hire someone to fix the immediate problem ASAP and someone else to audit the code so far and tell you what the next several problems will be, and where to learn how to fix them. I mean, you guys keep saying you need more coders, but since they haven’t shown up because of the cause or the challenge, maybe it is time to offer a different incentive?

    1. We’re looking into all our options at the moment – we’ll be doing some server upgrades in the short term, and thinking about where we need to focus our efforts most and how we can do that.

      Lucy
      AD&T / Communications / Support

  12. I’m not a member yet, but I’m on the waiting list (currently expecting to be a member mid-late July). I love the site and want to thank you for your work to make it what it is. Also, whatever you’re currently doing must be making some improvement, because this week I hardly ever have trouble getting the site to load and last week I could hardly get it to load at all.

    Thanks for the updates and for making the site fun and (usually) easy to use.

    1. Thanks for your support! It’s much appreciated.

      Lucy
      AD&T / Communications / Support

  13. In my previous fandom (SGA) most creators publish on their own or comm journals (LJ/DW/IJ), and then they also upload to the Archive.

    For my newest fandom (Sherlock BBC) most creators seem to publish on the Archive first and only. I know I’ve been command-R-ing a lot recently on fully-formed links from announcement comms on the journals to AO3.

    Was your original plan to provide a permanent archive as a backup? Has that plan changed?

    1. Regardless of whether it was their original plan or not, it’s unrealistic to expect them to A) police their users and B) enforce using it as a backup archive only, especially at this late date. It would be extremely unfair to both OTW/AO3 and the userbase, not to mention quite literally impossible even if they weren’t a volunteer organisation.

      Also, it’s rare to find two fandoms that have similar cluster patterns – but I could definitely put this huge difference down to the age of the fandoms. Stargate is old; the fandom’s been around almost as long as LJ, if I have my dates anywhere near correct (which I might not because frankly too tired to go compare launch dates :P). BBC Sherlock is, by comparison, very, very new, and the fandom was birthed at a time when many people were starting to go “Oh hey you sass that hoopy AO3? There’s a frood that really knows where its towel is.” while also getting really fed up with LJ of the who-knows-how-many-rings circus. *cough*

      Also, fandom as a generic whole is not big on DW; I know, I’ve looked. (This is very sad because I know for a fact that a lot of people who put time and effort into building, running, and maintaining DW are very fannish.) (Similarly, it only has as much presence as it does on IJ, I suspect, because it was the default during Strikethrough and the aftermath thereof.)

      TL;DR Lots of people will – and do – use AO3 as a primary/sole archive because it’s a viable option for that, and in fact designed for such usage. It can be frustrating when people only post in one place, though, I freely admit that.

      (Disclaimer: Not an AO3 volunteer; just an informed person with internetty-things volunteer experience (Hi from DW!) and possibly entirely too long spent in fandom. :P)

    2. The purpose of the AO3 was to provide a home for fanworks which would not be subject to the kind of takedowns which have plagued fandom in the past, and would be purpose-built and so offer features that fans want. While when the site was first launched, a large part of this was providing a backup for fandoms who already had a thriving home elsewhere, the intention was never that it should be restricted to being a backup, but that fans should be able to see it as a secure home which they could use in whatever way suited their fandom. As the commenter below noted, for newer fandoms it was somewhat inevitable that some people would see it as the primary home rather than the backup, and we’re totally fine with that! Our commitment to permanency is the same whether the fanwork was posted exclusively on the AO3 or crossposted in many other places (although we always urge people to keep a personal backup of their works, wherever they choose to post them). Even if we wanted to make the Archive only a backup site, there wouldn’t really be any way of enforcing that.

      The recent expansion (which has been partly driven by people looking for a safe backup after takedowns on another site) has been more rapid than we expected, but we’ve got lots of things we know we can work on to make the site more robust, so we’re confident we’ll be able to get back on track. 😀

      Lucy
      AD&T / Communications / Support

  14. The main reason for this expansion is that fanfiction.net are doing a story hunt and are *finally* enforcing all their guidelines, leaving in their wake a lot of angry readers and writers. I’m a victim myself and I am thankful that I had the foresight to create an account here 3 years ago.

    Looks like new waves will be coming up for the next couple of weeks so good luck to all.

  15. I just wanted to thank you guys for working so quickly to try and fix the situation.

    I, myself, was affected by the purge on fanfiction.net of my more explicit fanfictions. Luckily, I finally got an invite and now I direct my readers over here.

    Thanks for all the fandom love!
