From Jason Cohen, founder of WP Engine.
On June 1, 21:12:00 GMT, seven of our customers’ sites simultaneously experienced either major slow-downs or 500 errors.
But we didn’t know that at the time. We did know that those seven sites were in trouble, but not that it was coordinated. When you host 30,000+ sites, having seven issues within a small timeframe is within the normal margin of random fluctuations in ticket-arrival-times.
Still, any time a site is in trouble we consider it top priority.
One of the benefits of our recent rapid growth in headcount is that when an emergency happens, we have more heads and hands to tackle the problem. So we split up these sites between four people and got to work.
We discovered that all seven had been hacked in exactly the same way — PHP code injected into the top of wp-config.php (the main WordPress configuration file) which was making outbound network connections to a server in India.
When we discover malware, there’s a bunch of things we do, including:
- Alert Sucuri*, our awesome partners in scanning and eliminating malware, so they can investigate in parallel. (P.S. If you’re not hosted with us, we highly recommend that you check them out and become a customer yourself!)
- Grab snapshots of all relevant logs (front-end web, back-end PHP, SSHD authentication logs, even kernel messages).
- Change filesystem permissions so that nothing (except root) can write to disk, thereby “locking down” the blog against further changes, in case the attack is active or the code left behind is continuing to infect files. (Note: The blog functions completely normally during this time, including adding new posts and comments. It’s just the filesystem that’s locked down, so for example you can’t upgrade plugins or modify a theme file.)
- Remove the malware by hand, including looking through custom code in themes and plugins. (Sucuri does this too, so this amounts to a “double-check.” Two heads are better than one!)
- Investigate the “vector of attack,” which is security-speak for “the vulnerable code which was exploited to allow the hacker to gain access to the filesystem in the first place.” This is hard because the location of the malware (i.e. in this case wp-config.php) is very rarely associated or correlated in any way with the vector (i.e. the plugin which allowed the hacker to get in and be able to modify wp-config.php in the first place). For example, the most frequent place for a hacker to insert code is inside WordPress core (e.g. something like wp-config.php or a file inside wp-includes), whereas we have never in the past two years found an attack vector which was itself in core — it has always been in a theme or plugin which then altered core files.
- Eventually, when we’re confident we’ve found the vector, plugged the attack, and removed any malware injected, we’ll un-lock-down the blog for normal operation.
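To make the lock-down step in the list above concrete, here is a minimal sketch in Python. It is an illustration only, not our actual tooling: the function names `lock_down` and `unlock` are made up for this example, and it assumes the lock-down is done purely by clearing permission bits (which root can still bypass, as described above).

```python
import os
import stat

# Every "write" permission bit: owner, group, and other.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def lock_down(root):
    """Clear all write bits under `root`.

    Returns {path: original_mode} so the change can be undone later.
    Symlinks are skipped because os.chmod would follow them.
    """
    saved = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            if os.path.islink(path):
                continue
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            saved[path] = mode
            os.chmod(path, mode & ~WRITE_BITS)
    return saved


def unlock(saved):
    """Restore the permission modes recorded by lock_down."""
    for path, mode in saved.items():
        os.chmod(path, mode)
```

Keeping the original modes around is what makes the final "un-lock-down" step safe: the site goes back to exactly the permissions it had before, rather than to a guessed default.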
Step #5 is always the difficult one, because it’s open-ended. How do you “prove” there’s no more malware? How do you “know” the attack vector?
Often it’s clear: malware with an obvious signature, which we can add to our existing scanner and then scan all files. Or the vector shows up when we look in the web logs around the time the site’s behavior changed, or around the time the malware-infected files were last modified, and find an unusual POST request to a plugin that writes code to disk without sanitizing its parameters.
But in this case, with the seven sites, the vector was elusive. The web logs showed nothing special during the time frame, the SSHD/SFTP logs showed nothing at all during that time, and so on.
This is also the point where we noticed something that couldn’t have been a coincidence: all seven sites experienced problems at exactly the same point in time. Now that’s unusual. It implies the hacks are related, even coordinated.
Except these sites had nothing in common: they were located not just in different clusters on different hardware, but even in different data centers in different cities! They had different themes, different plugins (except for extremely common ones also shared by thousands of other customers), and so on.
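As a rough illustration of that log search, the Python sketch below pulls out POST requests that landed within a few minutes of a file's last-modified time. It assumes access logs in the common Apache/nginx "combined" layout; the function name `posts_near`, the five-minute window, and the sample log lines in the usage example are all invented for this example, not taken from the incident.

```python
import re
from datetime import datetime, timedelta

# Leading fields of a "combined"-style access log line (assumed format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def posts_near(log_lines, modified_at, window=timedelta(minutes=5)):
    """Return (timestamp, ip, path) for POSTs within `window` of `modified_at`.

    `modified_at` must be a timezone-aware datetime, e.g. built from the
    infected file's mtime.
    """
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match or match.group("method") != "POST":
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if abs(ts - modified_at) <= window:
            hits.append((ts, match.group("ip"), match.group("path")))
    return hits
```

A usage example with made-up log lines (documentation IP ranges, hypothetical plugin path):

```python
from datetime import datetime, timezone

sample = [
    '203.0.113.9 - - [01/Jan/2000:21:10:05 +0000] "POST /wp-content/plugins/example-plugin/upload.php HTTP/1.1" 200 512',
    '198.51.100.4 - - [01/Jan/2000:21:11:00 +0000] "GET /index.php HTTP/1.1" 200 1024',
    '203.0.113.9 - - [01/Jan/2000:18:00:00 +0000] "POST /wp-login.php HTTP/1.1" 302 0',
]
modified = datetime(2000, 1, 1, 21, 12, 0, tzinfo=timezone.utc)
for ts, ip, path in posts_near(sample, modified):
    print(ts.isoformat(), ip, path)
```

Only the first line qualifies: it is a POST and falls inside the window. This kind of search only helps when the file's modification time is close to the actual break-in, which is exactly what failed us in the case described next.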
Also, how was the attack that coordinated? Even if someone were targeting us — which we are certainly not paranoid enough to believe without overwhelming evidence — you’d still expect slightly different times as the attack hit different IP addresses for all sorts of different sites. And why these seven blogs?
It turned out it wasn’t coordinated at all. The root cause of the site errors was that the remote server which the malware was hitting itself went down (possibly, ironically, because it couldn’t handle all the traffic we were unwittingly sending it). The remote server going down caused all seven blogs to start misbehaving at exactly the same time.
But that also meant these sites were hacked not at 21:12:00 GMT last Friday, but at arbitrary points in the past. Which means finding the vector would be difficult. And if those seven were affected, others might well be affected too.
So for the past seven days we’ve been in Code Red Turn Ourselves Inside Out mode to do these things:
- Immediately lock down sites across the board. If we don’t understand the vector, we can’t take the risk.
- Enhance our own malware scanner with many more related cases and deep-scan every blog.
- Engage Sucuri for all this as well.
- Also engage SecTheory — a highly-respected security firm in Austin.
- Develop a long-term solution to the general problem of attackers able to write files to disk.
Items #1-4 we did quickly and continuously. And we did find (and automatically fix) some additional issues in a few dozen sites. (Which is still fewer than 1% of all our customers, but at the same time is a large absolute number of sites.)
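For readers curious what a signature-based deep scan looks like in principle, here is a small Python sketch. It is not our scanner: the function name `scan_tree` is invented, and the single signature shown (`eval(base64_decode(`, a pattern widely seen in obfuscated PHP malware in general) is a generic example, not the signature of the malware in this incident.

```python
import os
import re

# Generic example signature only; a real scanner carries many more.
SIGNATURES = {
    "eval-base64": re.compile(rb"eval\s*\(\s*base64_decode\s*\("),
}


def scan_tree(root, signatures=SIGNATURES, suffixes=(".php",)):
    """Return a list of (path, signature_label) for every match under `root`."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            for label, pattern in signatures.items():
                if pattern.search(data):
                    findings.append((path, label))
    return findings
```

The weakness of this approach is the one described above: it only finds malware that resembles something already seen, which is why scanning alone was not enough here.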
Item #5 is the trick.
The obvious thing to do is just lock down the filesystem completely, but then you can’t do things like upgrade plugins, install new plugins, edit theme files, and so on, and lots of popular plugins and themes don’t work either.
We had one idea which we pursued for a few days before deciding it wasn’t going to work well enough — it locked things down, but locked down so much normal activity that it wasn’t tenable. I mention this only to explain why it took a whole week to implement the final solution, which was actually attempt #2.
But now we’re there! If any code whatsoever (PHP or otherwise) in any web server attempts to write PHP to disk (or other files like .htaccess), it is rejected at the kernel level. However, we also have a set of exceptions which allow the usual administrative functions, as well as common plugins and themes, to still write to disk. For example, CAPTCHA plugins typically need to write a temporary PHP file to disk to generate images and validate input, and the Pagelines theme needs to be able to write packages to disk.
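The decision logic behind a rule like this (deny writes to protected file types unless an exception applies) can be sketched in a few lines of Python. This is only a model of the policy, not the kernel-level mechanism itself, and every name and path pattern below (`may_write`, `example-captcha`, `example-theme`) is hypothetical.

```python
import posixpath
from fnmatch import fnmatchcase

# File types the web server should not be able to write (per the rule above).
PROTECTED_SUFFIXES = (".php",)
PROTECTED_NAMES = (".htaccess",)

# Hypothetical exception list; real exceptions would be reviewed one by one.
ALLOWED_PATTERNS = (
    "*/wp-content/plugins/example-captcha/tmp/*.php",
    "*/wp-content/themes/example-theme/packages/*",
)


def is_protected(path):
    """True if the file name is a type the policy guards."""
    name = posixpath.basename(path).lower()
    return name in PROTECTED_NAMES or name.endswith(PROTECTED_SUFFIXES)


def may_write(path, allowed=ALLOWED_PATTERNS):
    """Deny writes to protected files unless an exception pattern matches."""
    # Normalize first so "../" tricks cannot smuggle a path past an exception.
    path = posixpath.normpath(path)
    if not is_protected(path):
        return True
    return any(fnmatchcase(path, pattern) for pattern in allowed)
```

With this model, `may_write("/site/wp-content/uploads/photo.jpg")` is allowed because the file type is not protected, `may_write("/site/wp-config.php")` is denied, and a PHP file under the hypothetical CAPTCHA temp directory is allowed by exception. Defaulting to "deny" for protected types and listing exceptions explicitly is what keeps the exception list short and auditable.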
Of course there will be more exceptions. We’ll err on the side of locking down too much, then slowly add exceptions if we feel they’re warranted. And we’ll be stingy about adding them.
If you have any questions or concerns about all this, please send an email to support at wp engine dot com. And please be patient if, after this post goes up, we get 200 emails in 10 minutes and need to take an hour or two before answering, giving priority to normal support tickets. 🙂
P.S. We have a new feature in beta test now with 50 customers which completely solves this issue (among others) if you’re willing to develop WordPress in a certain way, and another feature coming after that, so this isn’t the last you’ve heard from us on this subject!
* Please note that while we work closely with Sucuri, our relationship with Sucuri does not include all of the services that Sucuri provides when you’re a client of theirs. For maximum security, WP Engine recommends you use the strongest security measures that you can afford, which includes website security services provided by Sucuri, DDoS and proxy services like CloudFlare and CloudProxy (a Sucuri product), and WordPress platform hosting services like ourselves.
For the most up-to-date information on WP Engine’s security policies, please see WP Engine’s Security Environment.