
Scale Fail (part 2)


May 20, 2011

This article was contributed by Josh Berkus

In Part One of Scale Fail, I discussed some of the major issues which prevent web sites and applications from scaling. As was said there, most scalability issues are really management issues. The first article covered a few of the chronic bad decisions — or "anti-patterns" — which companies suffer from, including compulsive trendiness, lack of metrics, "barn door troubleshooting", and single-process programming. In Part Two, we'll explore some more general failures of technology management which lead to downtime.

No Caching

"Your query report shows that you're doing 7,000 read queries per second. Surely some of these could be cached?"

"We have memcached installed somewhere."

"How is it configured? What data are you caching? How are you invalidating data?"

"I'm ... not sure. We kind of leave it up to Django."

I'm often astonished at how much money web companies are willing to spend on faster hardware, and how little effort on simple things which would make their applications much faster in a relatively painless way. For example, if you're trying to scale a website, the first thing you should be asking yourself is: "where can I add more useful caching?"

While I mention memcached above, I'm not just talking about simple data caching. In any really scalable website you can add multiple levels of caching, each of them useful in their own way:

  • database connection, parse and plan caching
  • complex data caching and materialized views
  • simple data caching
  • object caching
  • server web page caching

Detailing the different types of caching and how to employ them would be an article series on its own. However, every form of caching does the same basic things: it brings data closer to the user and serves it in a more stateless way, reducing response times. More importantly, by reducing the resources consumed by repetitive application requests, it improves the efficiency of your platform and thus makes it more scalable.
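As an illustration of the "simple data caching" level, here is a minimal sketch of the read-through, invalidate-on-write pattern against memcached, using the pymemcache client; the database helpers and key layout are hypothetical:

    # A minimal sketch of simple data caching with explicit invalidation.
    # db.fetch_profile() and db.update_profile() are hypothetical helpers;
    # the pattern is what matters: read through the cache, and drop the
    # cached entry whenever the underlying row changes.
    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def get_user_profile(db, user_id):
        key = f"user_profile:{user_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no database round trip
        row = db.fetch_profile(user_id)        # cache miss: one query, then remember it
        cache.set(key, json.dumps(row).encode("utf-8"), expire=300)
        return row

    def update_user_profile(db, user_id, fields):
        db.update_profile(user_id, fields)         # the write goes to the database...
        cache.delete(f"user_profile:{user_id}")    # ...and the stale cache entry is dropped

Deleting the key on write rather than updating it keeps invalidation trivially correct, at the cost of one extra database read after each change.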

Seems obvious, doesn't it? Yet I can count on one hand the clients who were employing an effective caching strategy before they hired us.

A common mistake we see clients making with data caching is to leave caching entirely up to the object-relational mapper (ORM). The problem is that, out of the box, the ORM is going to be very conservative about how it uses the cache, only retrieving cached data for a user request which is absolutely identical, which drives the number of cache hits to nearly zero. For example, I have yet to see an ORM which dealt well with caching the data for a paginated application view on its own.
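To make the pagination point concrete, here is a hedged sketch of doing that caching explicitly rather than leaving it to the ORM, assuming Django's low-level cache API and a hypothetical Auction model; the version counter lets a single write invalidate every cached page at once:

    # A sketch, not any particular application's code: cache whole pages of a
    # paginated listing under explicit keys instead of hoping the ORM will.
    from django.core.cache import cache

    PAGE_SIZE = 25

    def listing_version():
        # Small integer bumped whenever the listing's source table changes.
        return cache.get_or_set("auction_listing_version", 1)

    def listing_page(page):
        key = f"auction_listing:v{listing_version()}:page{page}"
        rows = cache.get(key)
        if rows is None:
            qs = Auction.objects.order_by("-created").values()  # hypothetical model
            rows = list(qs[(page - 1) * PAGE_SIZE : page * PAGE_SIZE])
            cache.set(key, rows, timeout=60)   # identical page requests now hit the cache
        return rows

    def on_auction_changed():
        # Cheap invalidation: bump the version so old page keys are never read again.
        try:
            cache.incr("auction_listing_version")
        except ValueError:
            cache.set("auction_listing_version", 1)

Because the cache key encodes both the page number and the data version, any two users asking for the same page share one cached result, instead of each request having to be absolutely identical to ever hit the cache.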

The worst case I've seen of this was an online auction application where every single thing the user did ... every click, every pagination, every mouse-over ... resulted in a query to the back-end PostgreSQL database. Plus the auction widget polled the database for auction updates 30 times a second per user. This meant that each active application user resulted in over 100 queries per second to the core transactional database.

As common a bad decision as a lack of caching is, it's really symptomatic of a more general anti-pattern I like to call:

Scaling the Impossible Things

"In Phase III, we will shard the database into 12 segments, dividing along the lines of the statistically most common user groupings. Any data which doesn't divide neatly will need to be duplicated on each shard, and I've invented a replication scheme to take care of that ..."

"Seems awfully complicated. Have you considered just caching the most common searches instead?"

"That wouldn't work. Our data is too dynamic."

"Are you sure? I did some query analysis, and 90% of your current database hits fall in one of these four patterns ..."

"I told you, it wouldn't work. Who's the CTO here, huh?"

Some things which you need for your application are very hard to scale, consuming large amounts of system resources, administration time, and staff creativity to get them to scale up or scale out. These include transactional databases, queues, shared filesystems, complex web frameworks (e.g. Django or Rails), and object-relational mappers (ORMs).

Other parts of your infrastructure are very easy to scale to many user requests, such as web servers, static content delivery, caches, local storage, and client-side software (e.g. JavaScript).

Basically, the more stateful, complex, and featureful a piece of infrastructure is, the more resources it's going to use per application user and the more prone it's going to be to locking — and thus the harder it's going to be to scale out. Given this, you would think that companies who are struggling with rapidly growing scalability problems would focus first on scaling out the easy things, and put off scaling the hard things for as long as possible.

You would be wrong.

Instead, directors of development seem to be in love with trying to scale the most difficult item in their infrastructure first. Sharding plans, load-balancing master-slave replication, forwarded transactional queues, 200-node clustered filesystems — these get IT staff excited and get development money flowing, even when the scalability problems could be easily and cheaply overcome by adding a Varnish cache or fixing some unnecessarily resource-hungry application code.

For example, one of our clients had issues with their Django servers constantly becoming overloaded and falling over. They had already gone from four to eight application servers, were still having to restart them on a regular basis, and wanted to discuss doubling the number of application servers again, which would also have required scaling up the database server. Instead, we did some traffic analysis and discovered that most of the resource usage on the Django servers was from serving static images. We moved all the static images to a content delivery network, and they were able to reduce their server count.

After a month of telling us why we "didn't understand the application", of course.
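One way such a move typically looks on the Django side is a one-line configuration change; the CDN hostname below is hypothetical, but STATIC_URL is the standard setting controlling the URLs emitted by the {% static %} template tag:

    # settings.py -- hypothetical CDN hostname
    # Templates using {% static "img/logo.png" %} now render
    # "https://cdn.example.com/static/img/logo.png", so image requests
    # never reach the Django worker processes at all.
    STATIC_URL = "https://cdn.example.com/static/"

The CDN then pulls the files from a single lightweight origin, which is a far cheaper thing to scale than Python application workers.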

SPoF

"How are we load-balancing the connection from the middleware servers to the database servers?"

"Through a Zeus load-balancing cluster."

"From the web servers to the middleware servers?"

"The same Zeus cluster."

"Web servers to network file storage? VPN between data centers? SSH access?"

"Zeus."

"Does everything on this network go through Zeus?"

"Pretty much, yes."

"Uh-huh. Well, what could possibly go wrong?"

SPoF, of course, stands for Single Point of Failure. Specifically, it refers to a single component which will take down your entire infrastructure if it fails, no matter how much redundancy you have in other places. It's dismaying how many companies fail to remove SPoFs despite having lavished hardware and engineering time on making several levels of their stack highly available. Your availability is only as good as your least available component.

The company in the dialog above went down for most of a day only a few weeks after that conversation. A sysadmin had loaded a buggy configuration into Zeus, and instantly the whole network ceased to exist. The database servers, the web servers, the other servers were all still running, but not even the sysadmins could reach them.

Sometimes your SPoF is a person. For example, you might have a server or even a data center which needs to be failed over manually, and only one staff member has the knowledge or login to do so. More sinister SPoFs often lurk in your development or recovery processes. I once witnessed a company try to deploy a hot code fix in response to a DDoS attack, only to have their code repository — their only code repository — go down and refuse to come back up.

A "Cascading SPoF" is a SPoF which looks like it's redundant. Here's a simple math exercise: You have three application servers. Each of these servers is operating at 80% of their capacity. What happens when one of them fails and its traffic gets load balanced onto the other two?

A component doesn't have to be the only one of its kind to be a SPoF; it just has to be the case that its failure will take the application down. If all of the components at any level of your stack are operating at near-capacity, you have a problem, because the failure of even a minority of your servers, or a modest increase in traffic, can result in cascading failure.

Cloud Addiction

"... so if you stay on AWS, we'll have to do major horizontal scaling, which will require a $40K consulting project. If you move to conventional hosting, you'll need around $10K of our services for the move, and get better application performance. Plus your cloud fees are costing you around three times what you would pay to rent racked servers."

"We can't discuss a move from until the next fiscal year."

"So, you'll be wanting the $40K contract then?"

Since I put together the Ignite talk early this year, I've increasingly seen a new anti-pattern we call "Cloud addiction". Several of our clients are refusing to move off cloud hosting even when it is demonstrably killing their businesses. This problem is at its worst on Amazon Web Services (AWS), because there is no way to move off Amazon's cloud without leaving Amazon entirely, but I've seen it with other public clouds as well.

The advantage of cloud hosting is that it allows startups to get a new application running and serving real users without ever making an up-front investment in infrastructure. As a way to lower the barriers to innovation, cloud hosting is a tremendous asset.

The problem comes when the application has outgrown the resource limitations of cloud servers and has to move to a different platform. Usually a company discovers these limits in the form of timeouts, outages, and a rapidly escalating number of server instances which fail to improve application performance. By limitations, I'm referring to the restrictions on memory, processing power, storage throughput, and network configuration inherent in a large-scale public cloud, as well as the high cost of cloud instances that are busy around the clock. These are "good enough" for getting a project off the ground, but start failing when you need to make serious performance demands on each node.

That's when you've reached scale fail on the cloud. At that point, the company has no experience managing infrastructure, no systems staff, and no migration budget. More critically, management doesn't have any process for making decisions about infrastructure. Advice that a change of hosting is required is met with blank stares or even panic. "Next fiscal year", in a startup, is a euphemism for "never".

Conclusion

Of course, these are not all the scalability anti-patterns out there. Personnel mismanagement, failure to anticipate demand spikes, lack of deployment process, dependencies on unreliable third parties, or other issues can be just as damaging as the eight issues I've outlined above. There are probably as many ways to not scale as there are web companies. I can't cover everything.

Hopefully this article will help you recognize some of these "scale fail" patterns when they occur at your own company or at your clients. Every one of the issues I've outlined here comes down to poor decision-making rather than any technical limits in scalability. In my experience, technical issues rarely hold back the growth of a web business, while management mistakes frequently destroy it. If you recognize the anti-patterns, you may be able to make one less mistake.

[ Note about the author: to support his habit of hacking on the PostgreSQL database, Josh Berkus is CEO of PostgreSQL Experts Inc., a database and applications consulting company which helps clients make their PostgreSQL applications more scalable, reliable, and secure. ]




Scale Fail (part 2) Hardware often works ...

Posted May 20, 2011 14:58 UTC (Fri) by mrjk (subscriber, #48482) [Link]

I used to buy hardware at the level (or above) that is kind of derided in this article, and I'll defend it. Just throwing in a bunch of hardware works a tremendous amount of the time for several reasons.

First, it can be capitalized with no arguments, so it is moved off the books for expenses. You can do this with software and consulting too, but not without wrangling. This isn't a technical issue, but it is a real organizational one.

Second, it is mostly immune to staff changes. If you actually apply thought and put in a smart caching system and perfect redundancy, then when you leave and your wonderful documentation leaves with you, or some background knowledge is lost, I as your thickheaded successor will have gaps in my knowledge of the system. They will bite back at some point. That is how single points of failure creep in over time. With just a cursory look at some of the most obvious issues, many, many applications will live happily on a big chunk of hardware. If the hardware is 90% unused, so what? It will likely cost less overall, because the unthinking apps just roll along without all the manpower needed to maintain them.

Third, it actually makes the IT department more resilient. They are always putting in some new server or moving applications to a new one. This, to me, is really more important to disaster recovery than staged tests. The systems folk and the DBAs know directly what is going on and where stuff is. This also gives slack for growth and the ability to ride the technology wave forward.

The fact that systems might be more standard (except in areas directly worked on) and not smart saves a bunch of time, and when you do get the boss with the bright idea, it isn't going to crush what you have.

The thing is, the cost of computing power has crashed and killed companies (hi Sun ...) over the last 20 years. This goes back decades. I remember looking at our major database server footprint around 2005 and realizing it was 20 times bigger and ten times cheaper than it was in 1994 ... When that is true and consultants are no cheaper, I think the "dumb" throw-hardware-at-it idea made, and still makes, a lot of sense.

Not that bad settings shouldn't be corrected and decent modeling of scale and flow shouldn't be done!

Scale Fail (part 2) Hardware often works ...

Posted May 20, 2011 15:43 UTC (Fri) by jberkus (guest, #55561) [Link]

Mrjk,

First, you make a lot of good points.

The problems with the "just throw hardware at it" solution are two-fold:

1. The cost balance is often extremely disproportionate. That is, it's frequently the case that $20,000 worth of smarter software will save you $200,000 worth of additional hardware, rack space, cooling and sysadmin time.

2. In the fairly common cases where dumb software consumes geometrically increasing quantities of hardware for a linearly increasing workload, the "more hardware" solution is a very temporary measure.

Obviously there are cases where the tradeoff of "let's just buy more hardware" is completely viable, and I've implemented a few. But that only works if you're making an informed tradeoff, where you actually calculate the costs and capabilities of each path.

Scale Fail (part 2) Hardware often works ...

Posted May 22, 2011 3:56 UTC (Sun) by willy (subscriber, #9762) [Link]

One extreme example (in the opposite direction :-)
http://thedailywtf.com/Articles/That-Wouldve-Been-an-Opti...

Scale Fail (part 2) Hardware often works ...

Posted May 26, 2011 1:30 UTC (Thu) by adavid (guest, #42044) [Link]

Twice in my career as a sysadmin I have pushed back on applications that wanted to spend more money on hardware to rush a project that had major memory problems. One project saved one hundred thousand dollars and the other saved nearly a million dollars by tracking and fixing memory leaks.

What it means to capitalize something

Posted May 20, 2011 18:51 UTC (Fri) by jhhaller (guest, #56103) [Link]

Many people don't understand what it means to have a capital asset. Yes, capitalizing something does take much of the expense off the books. But it affects cash flow, as you still have to pay for the equipment when you get it. On top of that, a portion of the capitalization comes back as expense every fiscal period, as the purchase still has to come off revenue eventually. This shows up on budgets as depreciation.

Think of buying something on capital as being like buying a car. There are payments every month, and your outstanding loan amount counts against your available credit. You still have to pay off the car, but not all at once. Capitalization is slightly different in that no one is loaning the money up front, but to an organizational budget it looks more like the loan, while corporate worries about where the money comes from and whether they can meet the payroll next month.

With capitalization, it's quite easy to get oneself into a bind, where last year you bought a huge amount of equipment, this year you fixed the software bottleneck which makes the extra equipment no longer necessary. But, your organization will be paying the depreciation on that now useless equipment until it's fully depreciated or you sell it. The first organization I worked at had a seven-year depreciation schedule, and everything was worthless by year four. We eventually started buying everything on expense accounts, as it allowed us to adjust to different staff and work levels. But, we were stuck with that depreciation for quite some time.

What it means to capitalize something

Posted May 30, 2011 0:43 UTC (Mon) by giraffedata (guest, #1954) [Link]

> With capitalization, it's quite easy to get oneself into a bind, where last year you bought a huge amount of equipment, this year you fixed the software bottleneck which makes the extra equipment no longer necessary. But, your organization will be paying the depreciation on that now useless equipment until it's fully depreciated or you sell it.

And that's the whole reason buying stuff with capital dollars is often better than buying stuff for which you have to use expense dollars. You certainly won't hold onto that equipment you're not using anymore and keep paying for the depreciation. You'll sell it and get back some of what you originally spent. But if you had originally spent money on consulting instead of hardware (and the consulting wasn't capitalizable), you're stuck. The money is gone forever. That's why Management is more willing to authorize hardware purchase than consulting.

Obviously, you can do the accounting incorrectly and make it look like some decision is better when it's not -- for example, depreciating equipment over 7 years when you know it will be worthless in 4.

What it means to capitalize something

Posted May 30, 2011 10:41 UTC (Mon) by nix (subscriber, #2304) [Link]

> That's why Management is more willing to authorize hardware purchase than consulting.

Not anywhere I've ever worked (though this is perhaps because ostentatiously pointless expenditure is *the* way to retain empires in the City).

Scale Fail (part 2) Hardware often works ...

Posted May 29, 2011 14:53 UTC (Sun) by hein.zelle (guest, #33324) [Link]

Some good points in there, although, coming from a company which has long suffered from your proposed approach, there are some important cons to consider too:

> Third it actually makes the IT department more resilient. They are
> always putting in some new server or moving applications to a new one.

In our case, exactly the opposite. We had one (old/ancient) machine per web service, where we should instead have been using one virtual machine on a properly maintained piece of hardware. IT didn't know a single thing about these machines, because they'd been running so long that no one dared to touch them. Hardware failure equaled disaster, almost by definition.

Once we switched to a cluster for computations and virtual machines for web services, IT has finally become able to deal with all these machines and services.

> This also gives slack for growth and the ability to ride the
> technology wave forward.

That also worked exactly the wrong way around, as we were ending up with ancient machines. It's still running - why replace it?

Our management used to think buying hardware was more cost-efficient. If you add up the required hardware support contracts, installation and configuration costs though, it turned out rather badly. We now clone machines for expansion regularly, either in a VM or as part of a cluster. Expanding has become a matter of hours instead of weeks.

I suppose you could argue that, now that we've finally caught up with some state-of-the-art systems (cluster / virtualization), hardware expansion will actually work well again. I've yet to see how that will work out in the future. The first proponents of installing cheap, single machines for a single service are already coming up again.

Scale Fail (part 2)

Posted May 20, 2011 15:27 UTC (Fri) by nix (subscriber, #2304) [Link]

I've actually embraced single points of failure when there is no monetary alternative (e.g. on my home network). But if I'm going to have a SPOF, make it the *only* SPOF. Thus I have a central server with a RAID array with my home directories and most of my computing power on it. If that machine dies, I'm screwed -- but since failures are rare and there is no alternative (there's no way I can afford *another* huge expensive central server just in case the first one fails, and distributed filesystems aren't good enough to let me do the same thing with fewer than three or four not-very-much-smaller systems), I embrace the SPOF and just make damn sure there is a site-replacement warranty. It *will* fail and I *will* have downtime -- but it will cost less than avoiding the SPOF would.

But for other things (e.g. domestic Internet access), where avoiding the SPOF is easy and failures are common, I'm avoiding like hell.

For corporations past the startup phase, with more than a few people relying on their services and no longer horribly cash-strapped, retaining SPOFs once identified is foolishness. The problem is often identifying the bloody things before they strike, and making sure they don't creep back in afterwards. They can be very hard to spot :(

Scale Fail (part 2)

Posted May 20, 2011 20:28 UTC (Fri) by b7j0c (subscriber, #27559) [Link]

indeed. embracing SPoF at some level is fundamental to getting on with things.

everything has downtime. there are no 100% solutions. your bank's site will be down. the stock market suspends trading. power to your house fails. no water comes out of the faucet.

remember amazon's last big outage? what was the date? bet you can't tell me unless you look it up. the public moves on, there isn't much to be gained by engineering a path around these scenarios.

engineering around certain types of failure states is pointless, you create a huge opportunity cost with regard to allocating resources to new features that make your service more attractive. being pathological about availability is unrealistic and can be deadly to a business.

Scale Fail (part 2)

Posted May 25, 2011 16:58 UTC (Wed) by baldridgeec (guest, #55283) [Link]

Wasn't their last big outage something like 4 days before your comment?

Release It

Posted May 20, 2011 15:44 UTC (Fri) by cpeterso (guest, #305) [Link]

For more stories about scalability patterns (and amusing anti-patterns), I recommend Michael Nygard's "Release It!: Design and Deploy Production-Ready Software":

http://pragprog.com/titles/mnee/release-it

Scale Fail (part 2)

Posted May 21, 2011 3:30 UTC (Sat) by kjp (guest, #39639) [Link]

Ironic, but our single point of failure is postgres. I really don't want to have to move to cassandra. But I also don't want to be paged in the middle of the night if an ec2 datacenter goes down.

Scale Fail (part 2)

Posted May 21, 2011 14:38 UTC (Sat) by jberkus (guest, #55561) [Link]

Scale Fail (part 2)

Posted May 23, 2011 21:51 UTC (Mon) by kjp (guest, #39639) [Link]

Nothing there is as attractive as a simple (self-healing) quorum system that works over encrypted WAN links. No quorum = down. Else, it works. We have no STONITH, we have no redundant network links. Thus the typical stuff that seems to be used (DRBD) and even the future postgres stuff (Postgres-R and Postgres-XC) do not appear to be strong candidates.

Scale Fail (part 2)

Posted May 23, 2011 21:56 UTC (Mon) by dlang (guest, #313) [Link]

you would use the quorum system to control the DRBD or postgres stuff.

this is available today with the linux-ha project.

Scale Fail (part 2)

Posted May 24, 2011 13:58 UTC (Tue) by kjp (guest, #39639) [Link]

Thanks. The quorum server looks promising.

Scale Fail (part 2)

Posted May 26, 2011 5:58 UTC (Thu) by ferringb (subscriber, #20752) [Link]

Really been enjoying these articles for the rant factor and the past experience -- I had a few years of paying work doing a similar sort of scalability work, and lots of WTF moments along the lines of what you're describing. Django in particular *always* seemed to pop up ;)

Scale Fail (part 2)

Posted Jun 11, 2011 23:17 UTC (Sat) by apollock (subscriber, #14629) [Link]

So what do you advise for startups then, if the public cloud options run into scaling problems and migration headaches down the track? Are they better off not using them in the first place and taking the development and deployment productivity hit?


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds