Some of our customers have contacted us about slowness in our back-end system.
We want not only to say “we’re sorry,” which of course we are, but also to spend some time explaining what happened and how we’re fixing it.
It is always a bit uncomfortable to admit where we have come up short, but we believe we need to be transparent in our efforts to make WP Engine the best WordPress platform for your website. Nothing is more important to us than our customers, and our entire office wants every single one of you to have an amazing experience.
This post addresses a recent technical problem with our platform. What initially looked like a one-off engineering problem ended up being a back-end issue affecting a number of you. We understand that this has been incredibly frustrating. Our aim here is to explain the technical difficulties clearly, along with what we’re doing to fix them.
The Short Version of the Problem
Our system was experiencing random back-end slowness, which often surfaced as 50X errors.
For those who want just the basics:
1. We started looking into this several weeks ago.
2. We have deployed several releases as of this morning to help with this particular issue, as well as a general class of related issues.
3. We have added to our robust monitoring tools to make sure this issue or related potential issues don’t happen again.
4. We do listen and care about every one of you. We prefer to have these discussions in public and will continue to do so because you, our customers, are what we care about most.
For those who want to dig deeper and learn about the why and how, I ask you to go on a journey with me.
Let’s Move Back in Time
It started with a phone call.
One of the most experienced developers on our platform reached out to us to say he was having some issues with our system. We took some steps and it died down. We felt awful that he had a hard time getting to where he needed to be.
Then another call came in. And an IM. And an email. What had been one-offs became a troubling trend, and before long Engineering knew we had a problem.
While we’ve always actively monitored the uptime and performance of WP Engine’s systems as we’ve scaled, incremental changes in performance can go unnoticed for longer than anyone would like. A recent example: until only about a month ago, we were unaware of how slow the WordPress user interface had become for many of our customers. Thanks to a few customers bringing it to our attention, we looked more closely at the performance and at our measurements. We realized that our high-level metrics didn’t indicate a problem only because we weren’t measuring certain aspects of our platform in sufficient detail to show that there was one. We are very grateful to the customers who came forward, and we encourage all of you to reach out to us in the future as well.
For those customers and many others, the experience of using WordPress while logged in had become unsustainable. While customers’ sites were still delivered in a speedy fashion, our customers had a poor experience every time they logged in to WordPress. That is not the standard we hold ourselves to, and it was a complete mismatch with our expectations. This is especially true since we’ve invested in additional systems (not to mention faster hardware) to improve performance over the last several months, as you can see in this graph showing the average ratio of customers to systems.
Number of Customers per System
Our Response to the Red Flags
After several customers contacted us, we spun up a SWAT team of engineers to look into a number of issues we suspected to be related to this back-end slowness, and our Support Team helped us find a sample set of customers who were able to reproduce the issues.
Every day since then, those engineers and senior leadership from across WP Engine have held a standup to discuss the current state of the investigation, research, and communication, as well as the progress made since the day before.
Initially the problem looked nearly impossible to diagnose, as each site we investigated had different symptoms. However, as we continued to look at more sites, we saw patterns emerging. When we looked at the percentage of wp-admin page loads that exceeded three seconds, excluding things like admin-ajax and other system or background requests, we finally found a metric that revealed a smoking gun:
Slowness of the Back End Over Time
We can see that slow page loads took a large step up in March, and that some changes we released in early May helped to bring them back to “normal,” but we weren’t convinced that we had found all of the issues.
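For readers who want to see what a metric like this looks like in practice, here is a minimal Python sketch of “percent of wp-admin page loads over three seconds.” It is an illustration only, not our production monitoring code: the `Request` record, the function names, and the exact exclusion list (`admin-ajax.php`, `admin-post.php`) are assumptions made for the example.

```python
from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 3.0

# Assumed exclusion list for illustration; the real set of system and
# background requests that were excluded is not spelled out in this post.
EXCLUDED_SUBSTRINGS = ("admin-ajax.php", "admin-post.php")


@dataclass
class Request:
    path: str
    duration_seconds: float


def is_admin_page_load(req: Request) -> bool:
    """True for wp-admin page loads that are not background-style requests."""
    if not req.path.startswith("/wp-admin/"):
        return False
    return not any(part in req.path for part in EXCLUDED_SUBSTRINGS)


def slow_admin_percent(requests: list[Request]) -> float:
    """Percentage of qualifying wp-admin page loads slower than the threshold."""
    durations = [r.duration_seconds for r in requests if is_admin_page_load(r)]
    if not durations:
        return 0.0
    slow = sum(1 for d in durations if d > SLOW_THRESHOLD_SECONDS)
    return 100.0 * slow / len(durations)
```

The point of the filter is that a site-wide average hides this signal: fast front-end page views and noisy background calls drown out the handful of slow dashboard loads that a logged-in user actually feels.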
Another Suspect Appears
Early on, we began to suspect issues with transient variables and object caching throughout our system, as we watched plugins make external API calls on every single page load instead of caching those results. Additionally, when we disabled object caching, some sites’ performance became significantly better, which hinted that something was wrong with our object caching platform. As it turns out, shim code we put in place several years ago (to mitigate login issues some plugins faced with wp-cache) was clearing the object cache any time a user attempted to log in. We have since released a number of changes to the object caching layer to help mitigate and prevent this issue going forward.
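To make the pattern concrete, here is a minimal Python sketch of caching an external API result with a time-to-live, in the spirit of a transient. It is not our platform code and not the WordPress transients API; `TransientCache`, `cached_call`, and the `fetch` callback are hypothetical names used only for illustration.

```python
import time


class TransientCache:
    """A tiny in-memory cache whose entries expire after a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def flush(self):
        """Drop every entry, as a global cache flush would."""
        self._items.clear()


def cached_call(cache, key, ttl_seconds, fetch):
    """Return the cached value, calling fetch() only on a miss.

    Note: a fetch() result of None is never cached in this sketch.
    """
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl_seconds)
    return value
```

With this in place, repeated page loads inside the TTL reuse one API response. A `flush()` on every login throws all of that away, so the next page load pays for the external call again, which is the kind of behavior the shim was causing.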
We Need to Have Even More Metrics That Matter
First and foremost, we updated our measurement and monitoring of the object cache. Previously we would simply validate that the system would accept a new cache key, and that we could retrieve that value a second later. This metric lets us know that the cache is available, but not whether it is effective. It’s akin to checking whether a hospital patient has a heart, but not looking to see if they have a heartbeat. We now regularly use metrics for the object cache that look much deeper, including verifying that the cache is serving keys many times before they either expire or are expunged.
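The difference between “available” and “effective” can be sketched in a few lines of Python. This is an illustrative model, not our monitoring stack: `InstrumentedCache` and `availability_probe` are hypothetical names, and a real deployment would read hit and miss counters from the cache server rather than count them in application code.

```python
class InstrumentedCache:
    """An in-memory cache that counts hits and misses."""

    def __init__(self):
        self._items = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._items:
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._items[key] = value

    def flush(self):
        self._items.clear()

    def hit_ratio(self):
        """Fraction of lookups served from the cache (0.0 if no lookups yet)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def availability_probe(cache):
    """The older style of check: write a key, then read it straight back."""
    cache.set("__probe__", "ok")
    return cache.get("__probe__") == "ok"
```

A cache that is flushed after every request still passes `availability_probe`, yet its `hit_ratio()` stays at zero because no key survives long enough to be served twice. That gap is what the deeper metrics are meant to catch.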
Additionally, we’ve updated our platform code to remove the shim code, which is no longer necessary because plugins have updated their code to handle wp-cache better. Finally, we’ve tightened up the code that requests an object cache flush to make certain it removes only the elements that absolutely must be removed, as opposed to the much more liberal default object cache flushing.
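As a rough illustration of targeted removal versus a blanket flush, consider the Python sketch below. It is a simplified model under our own assumptions, not the code we deployed: `GroupedCache` and its methods are hypothetical, loosely mirroring the key-and-group addressing that WordPress’s `wp_cache_delete()` and `wp_cache_flush()` expose.

```python
class GroupedCache:
    """An in-memory cache addressed by (group, key), for illustration only."""

    def __init__(self):
        self._items = {}  # (group, key) -> value

    def set(self, group, key, value):
        self._items[(group, key)] = value

    def get(self, group, key):
        return self._items.get((group, key))

    def delete(self, group, key):
        """Targeted removal: drop one entry and leave everything else warm."""
        self._items.pop((group, key), None)

    def flush_all(self):
        """Blanket removal: every entry is lost and must be rebuilt."""
        self._items.clear()

    def __len__(self):
        return len(self._items)
```

After a change to a single post, calling `delete("posts", post_id)` invalidates just that entry, while `flush_all()` would also discard unrelated options, menus, and query results that were still perfectly valid.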
You can see below that recent changes have nearly halved our cache miss rate in a very short period of time, providing much better performance for sites using object caching. This chart did bring us joy, but it is just a first step.
Miss Rates Due to Object Caching After Today’s Deployment (showing effects)
We Wish There Was a Universal Answer
It’s been said many times in software development that there is “no silver bullet,” but rather just a lot of lead bullets, and it’s absolutely true in this case. We’re convinced this was a significant improvement, and worth speaking about publicly, but we also want to assure our customers that we’re committed to providing a delightful experience when hosting with WP Engine, and we’re not stopping with just this one set of updates. We’re continuing to investigate a number of other ideas, and adding or updating our measurements as we go.
Our Commitment to You
This has been a very important learning experience for us. When something of this scope happens, it is never a proud moment for any company.
If your site has experienced back-end slowness and you’re willing to submit it to the pool of data we check our changes against, please fill out the Google Form below. Of course our support system is there for you as well, but more data helps us validate changes.
Additionally, our Community Team is always there for more organic contact, so please feel free to ping them at any time. They are introduced in this blog post here:
To every customer reading this: we care, and we hope to continue to grow, learn, and do great things together.