Ready for the gory details? This is your last chance to hit “back” and get back to happy-soft-marketing-speak.
We serve billions of hits and hundreds of terabytes per month, so it takes a serious infrastructure to make sure we can serve lots of traffic, really quickly, with a minimum of down-time.
Here’s how we do it.
CLUSTERED FOR HIGH-AVAILABILITY AND SCALE
When you’re running on a single server — even one with redundant hard drives and power supplies — there’s always down-time. Hardware upgrades, network upgrades, software upgrades, software faults, and other people on your server causing trouble.
You run on a cluster of machines. That means a set of machines, all identical, all serving your site traffic all the time. If any one (or two) goes down — intentionally or unintentionally — your traffic is still served from the rest.
The front of the cluster is a collection of appliances and servers for security, caching, and distributing the load to the web servers. This bit is described in detail below. Next is a set of web servers which run PHP. Because of the efficiency of the caching systems, our PHP servers are CPU-bound, meaning they run out of CPU before running out of RAM or I/O. Therefore, those servers have 16 cores of top-of-the-line Xeon processors. Finally there are MySQL servers separate from the web servers, which are also described in detail below.
…BUT NOT TOO CROWDED
The problem with clusters is that it’s really easy for them to become overcrowded. Some of our competitors also use clusters but put 100s of customers on a cluster. This doesn’t work because of a rule we have: If it doesn’t fit into RAM, it’s too slow. Once there are too many files to fit in the kernel filesystem cache, too much table data to fit into MySQL’s memory, or too many objects to fit into memcached, then suddenly all sites become much slower because only the few, largest sites will stay in memory.
Using a cluster doesn’t help matters, because all the web servers and database servers have the same data, which means they’ll all exhibit this behavior at once. A cluster helps with high-availability and with scaling with traffic surges, but put on too many sites and it all falls apart.
Therefore, we’re very careful to balance clusters with the right number of customers. The exact number depends of course on the nature of the traffic and the size of the sites, but we actively monitor the relevant system details to produce an internal “capacity” score per cluster. We never let it get too high, and we’ll migrate sites between clusters if needed to ensure everyone has enough resources. (And no, there’s no site down-time when we migrate!)
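To make the “capacity” idea concrete, here is a minimal Python sketch of one way such a score could be computed from the RAM rule above. The function names, the input format, and the 0.8 threshold are all assumptions made for illustration; the real score is internal and is not described on this page.

```python
def capacity_score(usage):
    """Hypothetical capacity score for one cluster.

    `usage` maps a resource name to (working_set_gb, available_ram_gb).
    The score is the worst ratio across resources, so a value above 1.0
    means at least one working set no longer fits in RAM.
    """
    return max(working_set / ram for working_set, ram in usage.values())


def needs_rebalancing(score, threshold=0.8):
    """Assumed policy: migrate sites away before memory is actually full."""
    return score > threshold


# Illustrative numbers only.
cluster = {
    "filesystem_cache": (12, 32),
    "mysql": (20, 24),
    "memcached": (4, 8),
}
score = capacity_score(cluster)
print(round(score, 2), needs_rebalancing(score))
```

Taking the maximum rather than the average reflects the rule itself: one resource spilling out of memory is enough to slow every site on the cluster.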
If all this sounds good, sign up today and see for yourself.
…AND NOT IF YOU DON’T NEED IT
The other problem with clusters is they’re really expensive. Taking the logic from above, if the smallest possible cluster has two web servers, one primary database server, all replete with lots of CPUs, RAM, and SSD disks, plus associated load balancers and firewalls, plus a redundant shared filesystem, you’re up to $2500/mo in raw server costs. If we put 10 customers on a cluster, that’s $250/mo per customer — even before IT staff, tech support, CDN bandwidth, backups, and so forth. So if we charged you $250/mo, we’d lose money!
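The arithmetic above is simple enough to check in a couple of lines of Python. The function name is made up for this example; the $2500 and 10-customer figures come from the paragraph above, and the 100-customer case is included only for comparison.

```python
def raw_cost_per_customer(cluster_cost_per_month, customers):
    """Raw server cost per customer, before staff, support, CDN, backups."""
    return cluster_cost_per_month / customers


print(raw_cost_per_customer(2500, 10))   # 250.0
print(raw_cost_per_customer(2500, 100))  # 25.0
```

The second line shows why overcrowding is tempting: the raw cost per customer only drops when many more sites share the same hardware.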
Since you can see from our pricing page that we have account options starting as low as $29/mo, clearly we have other tricks up our sleeve. For our smaller — but no less valued! — customers, we modify our architecture so that we run exactly the same server structures, but we do it on a single, large machine.
To mitigate the loss of availability that comes with a single machine, we actually run a virtual server, which means problems in the underlying hardware (bad disks, processor replacements, and so on) are masked. So you still have almost the same level of availability without the large additional costs.
If we overcrowded these servers, of course you’d also be in trouble — back to the problems of the shared hosting providers or even some of our direct competitors who stuff 100 customers on one box (even a large one). So we do the same capacity-logic that we do for our multi-machine clusters, ensuring that servers always have enough capacity, even if you see a spike in traffic.
Even on these “pod” clusters (as we call them) we process 1,500 hits/second! To put that in perspective, customers who land on the front page of Yahoo typically see 100 hits/second.
The same rule applies here: if it doesn’t fit into RAM, it’s too slow. Maintaining this rule on pods is another way we ensure high capacity and fast response times at more affordable prices.
MULTI-PART FRONT-END: SECURITY, RULES, CACHING, AND STATICS
All your web traffic comes through a sophisticated front-end system involving multiple components.
First is a DoS appliance — hardware that detects Denial of Service attacks where hackers try to make a website unavailable by flooding it with packets from servers around the world. This appliance blocks all such attacks from reaching the back-end system (keeping it healthy), and is rated for 1.5 million packets per second of attack while still allowing normal traffic through.
Next is a firewall. We’ve got things locked down. We don’t even have port 22 open for SSH! (We log in using a VPN.) ‘Nuff said.
Next is an IDS (Intrusion Detection System), a hardware appliance that scans packets for malicious activity such as known application exploits, email-harvesting, cross-site scripting attacks, SQL-injection attacks, etc. The list of known attacks comes from a database which is regularly updated by security agencies. Right now we’re blocking about 7,000 attacks per day!
Next is our rules engine. We look at the request and decide which customer it’s for (which can be fairly complicated with wildcard domains and domains configured for WordPress Multisite). This decides not just the home server, but a large set of behavioral rules, many of which are common to all WordPress installations but some of which are specifically tuned for your site. These rules are part of our “secret sauce” so we can’t get into too much detail, but to give you a simple example, we handle various 301 redirects that WordPress will do anyway, but we can do them at this layer incredibly quickly and without touching the web servers or databases.
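As a rough illustration of the two steps just described (working out which customer a request belongs to, then answering a simple redirect without touching the back-end), here is a small Python sketch. Every name in it is an assumption for the example: the routing tables, the domain names, and the choice of a "canonical host" redirect. It is not the actual rules engine.

```python
from fnmatch import fnmatchcase

# Hypothetical routing data; a real system would load this from configuration.
EXACT_HOSTS = {"www.example.com": "acme", "example.com": "acme"}
WILDCARD_HOSTS = [("*.blogs.example.org", "multisite-co")]  # e.g. a Multisite network
CANONICAL_HOST = {"acme": "www.example.com"}


def resolve_customer(host):
    """Return the customer for a normalized host name, or None if unknown."""
    if host in EXACT_HOSTS:            # exact matches win over wildcards
        return EXACT_HOSTS[host]
    for pattern, customer in WILDCARD_HOSTS:
        if fnmatchcase(host, pattern):
            return customer
    return None


def handle(host, path):
    """Return (status, value): a redirect target, a customer name, or None."""
    host = host.lower().rstrip(".")
    customer = resolve_customer(host)
    if customer is None:
        return (404, None)
    canonical = CANONICAL_HOST.get(customer)
    if canonical and host != canonical:
        # Answer the 301 here, without touching web servers or databases.
        return (301, f"http://{canonical}{path}")
    return (200, customer)


print(handle("example.com", "/about/"))
print(handle("Alice.blogs.example.org", "/"))
```

Checking exact host names before wildcard patterns keeps the common case to a single dictionary lookup.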
Next is our EverCache system. Again this is part of our “secret sauce,” so we can’t divulge everything. But, what we can say is that we run the most sophisticated caching system in the world, which we’re able to do because we support only WordPress. This makes common pages (like home pages and feeds) load in 30ms (not including the time to make it all the way to the end user’s browser), shields the back-end system from most traffic, including the specific traffic generated by popular plug-ins and 3rd-party services, and yet is tied back into your WordPress system (via our system plugin) so that when the cache needs to be refreshed (e.g. when a post is updated) we refresh immediately so there’s never stale data. There’s also a lot of special logic protecting your database from various bots and spiders, which for larger sites can represent 50% of the site’s traffic.
Finally, for requests that make it through this gauntlet and really do need to be served up by your home server, a load-balancer does the job of routing your traffic to the rest of the cluster.
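The general pattern of a page cache that is purged the moment content changes can be sketched in a few lines of Python. This is only an illustration of that pattern under simple assumptions (an in-memory dictionary, a hypothetical `PageCache` class and `render` callback); it says nothing about how EverCache itself is built.

```python
class PageCache:
    """Toy page cache: serve stored HTML, re-render only after a purge."""

    def __init__(self):
        self._pages = {}  # url -> rendered HTML

    def get(self, url, render):
        """Return the cached page, calling render(url) only on a miss."""
        if url not in self._pages:
            self._pages[url] = render(url)
        return self._pages[url]

    def purge(self, urls):
        """Drop pages that a content change has made stale."""
        for url in urls:
            self._pages.pop(url, None)


render_calls = []

def render(url):
    # Stand-in for the expensive PHP + MySQL page build.
    render_calls.append(url)
    return f"<html>{url}, build {len(render_calls)}</html>"


cache = PageCache()
cache.get("/", render)
cache.get("/", render)           # served from cache, no second render
cache.purge(["/", "/feed/"])     # e.g. triggered when a post is updated
cache.get("/", render)           # rebuilt once, then cached again
print(render_calls)
```

Purging on update, rather than waiting for a timer to expire, is what lets a cache be aggressive without serving stale pages.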
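For readers who want a picture of what "routing your traffic to the rest of the cluster" can look like, here is a minimal round-robin sketch in Python that skips servers marked as down. Round-robin is an assumption chosen for simplicity, and the class and server names are hypothetical; the page does not say which balancing algorithm is used.

```python
class RoundRobinBalancer:
    """Hand out web servers in turn, skipping any that are marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        # Try each server at most once per call.
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy web servers")


lb = RoundRobinBalancer(["web1", "web2", "web3"])
print([lb.pick() for _ in range(4)])
lb.mark_down("web2")             # planned upgrade or unexpected fault
print([lb.pick() for _ in range(2)])
```

This is the mechanism behind the earlier claim that a machine can go down, intentionally or not, while traffic keeps being served by the others.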
Phew! Sign up today, and get all this for your site too.
Of course we run beefy servers for MySQL, and again the rule is “everything must run out of RAM” for it to be fast.
Plus we use SSD drives for MySQL data, which results in lightning-fast queries even with large data sets.
On top of that, we split reads and writes coming out of WordPress so that the reads, which comprise the overwhelming majority of database activity, don’t hit the master server but are distributed over read-only slaves. This speeds up queries, increases cache-coherency on those queries, and allows the master to scale further than it would otherwise.
The read-slaves can become a master if the master becomes unavailable, so this doubles as a live backup.
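The read/write split and the promotion step can be sketched as a tiny query router in Python. This is a simplified illustration, not the production setup: the `QueryRouter` class and server names are hypothetical, it treats only plain `SELECT` statements as reads, and it ignores details such as replication lag and reads that must see a just-written row.

```python
import random


class QueryRouter:
    """Send reads to read-only replicas and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def route(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self.replicas:
            return random.choice(self.replicas)
        return self.master  # writes, and reads when no replica is left

    def promote(self):
        """Make the first replica the new master after a master failure."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.master = self.replicas.pop(0)
        return self.master


router = QueryRouter("db-master", ["db-replica-1", "db-replica-2"])
print(router.route("SELECT * FROM wp_posts"))
print(router.route("UPDATE wp_posts SET post_status = 'publish'"))
print(router.promote())
```

Because every replica already holds a full copy of the data, promoting one is what lets the read servers double as a live backup.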
That’s just the stuff outside of WordPress itself, but a lot of the optimizations happen inside the page-request, thanks to plug-ins, some of which we wrote and some 3rd-party ones we use and even contribute code back to (open source, of course!).
There are a lot of parts here, some of which are again secret sauce, but you can get a flavor for the types of things we do by checking out W3 Total Cache. We actually don’t use this plugin because we have better ways of handling each of the things that it does, and in a way that doesn’t require you to learn how to configure it yourself, but those are the same categories of things that we do.
Although there are big things we do here with caching database calls, pages, changing HTTP headers, etc., sometimes it’s the little stuff that makes a big difference. For example, WordPress comes with support for most versions of jQuery; plugins and themes can declare their dependence on jQuery, and WordPress will automatically serve up that library. However, this means you’re serving this standard library yourself (through us), and Google has graciously decided to host jQuery for free for the whole world if you use their URL instead. This is better for your site, not just because Google’s servers are fast, but because so many websites use Google’s URLs that it’s also likely the browser already has that file cached, which means it’s even faster to load your site! So, one of the zillions of optimizations we make is to automatically detect when WordPress is supplying jQuery, and we re-write the URL to go to Google instead.
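Here is a minimal Python sketch of that kind of URL re-write, applied to a page's HTML. It assumes the bundled library is referenced at WordPress's usual `/wp-includes/js/jquery/jquery.js` path and that the caller already knows which jQuery version to request; the function name is made up, and the real optimization may well work differently (for example, inside WordPress rather than on the finished HTML).

```python
import re

# Matches the bundled jQuery URL, absolute or relative, with an optional
# query string such as "?ver=1.7.1".
LOCAL_JQUERY = re.compile(
    r"""(?:(?:https?:)?//[^/"'\s]+)?/wp-includes/js/jquery/jquery\.js(?:\?[^"'\s]*)?(?=["'\s]|$)"""
)


def use_hosted_jquery(html, version):
    """Point references to the bundled jQuery at Google's hosted copy."""
    hosted = f"https://ajax.googleapis.com/ajax/libs/jquery/{version}/jquery.min.js"
    return LOCAL_JQUERY.sub(lambda match: hosted, html)


page = '<script src="http://example.com/wp-includes/js/jquery/jquery.js?ver=1.7.1"></script>'
print(use_hosted_jquery(page, "1.7.1"))
```

The version passed in has to match what the theme and plugins expect, which is why a real implementation would read it from WordPress instead of hard-coding it.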
Little things add up! That’s our job. Are you ready to sign up yet?
CDN (Content Delivery Network)
A CDN keeps copies of your static content on servers around the world and serves each visitor from a server near them. The effect is two-fold: Content loads faster because it’s closer, and your site is more scalable because content is coming from many different servers.
WP Engine is the only hosting company on Earth (that we know of!) that bundles a CDN as part of its base service. So this awesome technology that’s normally reserved for large sites that want to spend a bunch of money is now accessible to everyone.
On top of that, we do all the configuration for you, so you don’t lift a finger. You just run your site normally — uploading content to our servers, nothing weird — and we automatically change the HTML links to use the CDN and automatically ensure your content is loaded into the CDN network.
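The link-changing step can be pictured with a short Python sketch. It assumes static files live under WordPress's `/wp-content/` directory and uses placeholder host names (`example.com`, `cdn.example.net`); the function name is hypothetical and this is not a description of the actual implementation.

```python
import re


def rewrite_to_cdn(html, site_host, cdn_host):
    """Point URLs under /wp-content/ on the site's host at the CDN host."""
    pattern = re.compile(
        r"(https?:)?//" + re.escape(site_host) + r"(?=/wp-content/)"
    )
    # Keep whatever scheme (or lack of one) the original link used.
    return pattern.sub(lambda m: f"{m.group(1) or ''}//{cdn_host}", html)


page = (
    '<img src="http://example.com/wp-content/uploads/2012/01/cat.png"> '
    '<a href="http://example.com/about/">About</a>'
)
print(rewrite_to_cdn(page, "example.com", "cdn.example.net"))
```

Only the static asset link changes; ordinary page links keep pointing at the site itself, so visitors and search engines still see the normal URLs.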
We run separate memcached servers to support caching page content, database queries, and the WordPress “Transient API.”
We run daily backups which also get transferred off-site to S3.
We run third-party website uptime/speed monitoring services so we know before you do if your site is down or performing badly.
We run through your image files once a week, using Yahoo’s SmushIt to squeeze every last byte out of them with no loss in image quality. That makes your site that much faster to load and hey, it takes less space on our hard drives too.
Our hardware is awesome too: 10,000 RPM drives in RAID-10 configuration, Xeon processors, running in VMware so we can switch off of bad hardware, and top-quality network infrastructure.
I MUST HAVE MISSED SOMETHING…
We do a lot, so maybe you still have questions and maybe I’ve missed something in this long description. Give us a ring, or talk to us on Twitter @wpengine so we can fill in the blanks.
If you’ve read all this and you still want to learn more, it’s time to give us a call at 1-877-WPENGIN (1-877-973-6446). Ask your hardest questions. We studied for the test.