Scaling Up vs. Scaling Out: Hidden Costs

You’ve equated “Open Source” with “Free as in beer”, which isn’t necessarily true… buying Red Hat Enterprise Linux is plenty expensive on its own.

Just for the fun of arguing:
are 4 virtual servers on 1 big iron (say 8 procs / 8 GB RAM / 8 TB disk) more interesting than 4 pizza boxes with the same total specifications (2 procs / 2 GB / 2 TB each)?

It seems like you ought to be able to find some middle ground: somewhere between the big monster and 83 smaller systems. Perhaps 6 medium-level systems spread over 3 different sites, to improve both performance and redundancy. What would a setup like that cost? What happens when you throw cloud computing into the mix?

Even scaling out on open source software isn’t frictionless. The cost of maintaining one server (or two in an active/passive setup) is far lower than the cost of maintaining a hundred smaller ones… It’s the difference between one big problem a year and five small changes a week. The cost of developing software that gets the full benefit of distribution, rather than raw horsepower, is also typically higher. State management and distributed transactions can consume at least one developer’s full-time effort, depending on the nature of the business.

Only people who have never tried scaling out a non-trivial DB app participate in these (stupid) discussions. Fact is, unless your DB can be easily partitioned, by design, you will have one heck of a time scaling it out and you will need a TON of hardware. Not everything fits into BigTable. Most applications have latency, throughput and concurrency requirements which often make partitioning/sharding difficult or infeasible. If you want decent response times on a large partitioned DB and you don’t want to scale up, you will have to consider things like Netezza, Teradata, Vertica, Greenplum, Aster, etc. For large amounts of data, those can cost $1M a year or more. The cost of scale up becomes easier to swallow in this case, assuming your problem can indeed be solved by scaling up.
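Just to make the partitioning pain concrete, here is a minimal sketch of the kind of routing layer a sharded DB app ends up needing (the ShardRouter name, the customerId key and the modulo scheme are all hypothetical, not anyone’s actual design). Anything that is not keyed by the shard key, say a cross-customer report or a join, immediately stops fitting this model:

```java
import java.util.List;

// Minimal sketch of hash-based shard routing. The shard key (customerId)
// and the modulo scheme are illustrative placeholders.
public class ShardRouter {
    private final List<String> shardJdbcUrls; // one entry per physical DB

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    // Everything keyed by customerId routes cleanly to exactly one shard...
    public String shardFor(long customerId) {
        int index = (int) Math.floorMod(customerId, (long) shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }

    // ...but anything that spans customers has to fan out to every shard
    // and merge results in application code, which is where the latency,
    // throughput and consistency trouble starts.
    public List<String> shardsForCrossCustomerQuery() {
        return shardJdbcUrls;
    }
}
```

Resharding is the other hidden cost: a plain modulo scheme forces you to move most of the data whenever you add hardware, which is why consistent hashing or a lookup directory shows up almost immediately in real designs.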

My point is, just because Google solved some of their (very specialized) problems by scaling out 10 years ago doesn’t mean you should do so now. It usually makes sense to scale up within individual nodes (e.g. have 8 cores and 32/64GB of RAM instead of the typical 4 and 16GB of RAM) and have more cores on fewer nodes overall. It also usually makes sense to have a good RAID controller and a wide RAID array (600-800MB/sec read / 400-500MB/sec write). It does not make sense to use Hadoop for anything unless you can cough up more than 15-20 nodes; ideally significantly more, even orders of magnitude more. See, Hadoop only begins to beat one decent 8-core box when you have about 10 nodes in the cluster.
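As a rough sanity check on that crossover claim (the per-node figure below is an assumption for illustration, not a measurement): if the single well-built box scans its wide RAID array at about 700 MB/s, and each commodity Hadoop node delivers an effective 70 MB/s once replication and framework overhead are paid for, the break-even point lands right around ten nodes:

```latex
n_{\text{break-even}} \approx \frac{700~\text{MB/s (one box, wide RAID)}}{70~\text{MB/s per node (assumed)}} = 10~\text{nodes}
```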

I mean, Jesus tittyfucking Christ, if you haven’t done anything large-scale, just STFU and don’t regurgitate the same old bullshit you heard five years ago.

That is all. Thank you.

I haven’t seen anyone mention using an in-memory data grid (IMDG) with a scale-out/scale-up approach. An IMDG scales out to handle more load while lowering the load on the database box. Buying a smaller number of 1U servers with maxed-out memory, plus a database box at a fraction of the cost quoted in the post, is more predictably scalable.

Depending on your grid capacity, a database failure may not be as catastrophic as it once was. If the entire contents of your database fit in the grid, the application using it will continue to perform finds and updates just as if the database were up. The grid batches all updates and will write them to the database when it is available again.
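Here’s a rough sketch of that write-behind behavior; the WriteBehindCache and BackingStore types are hypothetical stand-ins, not any particular vendor’s API (WebSphere eXtreme Scale, GigaSpaces and friends each have their own):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Queue;

// Hypothetical write-behind cache: reads and writes hit memory; the
// database is updated asynchronously, so a DB outage only delays the flush.
public class WriteBehindCache<K, V> {
    private final Map<K, V> grid = new ConcurrentHashMap<>();
    private final Queue<K> dirtyKeys = new ConcurrentLinkedQueue<>();
    private final BackingStore<K, V> database; // hypothetical DB adapter

    public WriteBehindCache(BackingStore<K, V> database) {
        this.database = database;
    }

    public V find(K key) {
        return grid.get(key);   // served from memory, DB not touched
    }

    public void update(K key, V value) {
        grid.put(key, value);   // succeeds even if the DB is down
        dirtyKeys.add(key);     // remembered for the next flush
    }

    // Called periodically, or when the DB comes back, to drain the backlog.
    public void flush() {
        K key;
        while ((key = dirtyKeys.poll()) != null) {
            database.write(key, grid.get(key));
        }
    }

    // Hypothetical interface for the real database behind the grid.
    public interface BackingStore<K, V> {
        void write(K key, V value);
    }
}
```

The essential trade-off is that the database becomes eventually consistent with the grid, which is exactly why a DB outage turns into a delayed flush instead of an application outage.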

IMDGs also offer a ton of features for graceful failover, data availability, redundancy, and self-healing topologies. All of this works like magic.

Two great resources for IMDGs are Billy Newport’s blog (http://devwebsphere.com) and Nati Shalom’s blog (http://natishalom.typepad.com).

As many have said, yes, this is slightly ridiculous. There are very few database loads which will scale from one machine to 83 without quite a lot of messing, and frankly most of them are key-value oriented and you wouldn’t want to be using SQL Server anyway.

It’s a very different story with application servers, of course; it’s usually the more the merrier there and you can greatly reduce admin overhead by using a standard image and code pushing method.

“And this is why Microsoft HPC will never be a serious offering. Who would want to pay that kind of licensing for 32, let alone 1000 servers?”

This is why Windows HPC edition is quite a bit cheaper to license. In fact, if you buy your software with support (as large clusters often do), even plain Windows can come out cheaper than a supported Linux distribution.

Good points. I think in the end you could justify either approach. There are some other factors as well:
Scaling up will typically involve fewer man-hours deploying the app. Scaling out will require more time unless you fully automate the deployment process (Azure for the enterprise, anybody?).

Scaling up will involve less network complexity and infrastructure. Scaling out will involve more.

And this is why Microsoft HPC will never be a serious offering. Who would want to pay that kind of licensing for 32, let alone 1000 servers? I maintain a small (14 nodes, 64 cores) cluster for a couple of labs, and there would be no way they could afford that kind of iron if they had to pay to put Windows on there.

Also, scaling up rather than out means that unless you also increase the number and bandwidth of your interconnects (Myrinet or InfiniBand, for example), they will be maxed out while CPU demand is still very small, given most web application workloads.

“But what if you scaled out, instead – Hadoop or MapReduce style, across lots and lots of inexpensive servers?”

None of the calculations take into account the cost of rewriting a large portion of an existing code base so that it can leverage a distributed computing environment instead of a classic 3-tier approach. Not to mention the time to market lost during that rewrite. I doubt Markus was willing to wait a year while his developers rewrote everything. Even if his group did decide to move to distributed computing, he would still need some machine to run everything until the overhaul was completed.

This article has some good points, but it completely ignores the most important part of scalability: figuring out what problem you’re trying to solve. When you scale up, the marginal utility of each core decreases rapidly.
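That diminishing marginal utility per core is just Amdahl’s law in action. As a purely illustrative example, if even 10% of the workload is serial, 32 cores buy you only about an 8x speedup:

```latex
S(n) = \frac{1}{(1-p) + p/n}, \qquad S(32)\big|_{p=0.9} = \frac{1}{0.1 + 0.9/32} \approx 7.8
```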

When you’re solving a lot of “Google-scale” problems over extremely large data sets, scale-up architectures are absurd: you’re thoroughly I/O bound. My company needs to crunch through dozens of terabytes of data on the social media space, comprising billions of multi-KB records. Hadoop and HBase are the only way to tackle this problem.
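For a flavor of what that scan looks like, here’s a minimal Hadoop mapper sketch (the line-oriented record format and the “mentions” filter are made up for illustration). The point is that the work scales by splitting the input across nodes rather than by buying a bigger box:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal Hadoop mapper sketch: each node scans its own slice of the
// multi-terabyte input; the reducer (not shown) just sums the counts.
public class MentionCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text keyword = new Text("mentions"); // illustrative filter

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // One record per line of the (hypothetical) social media dump.
        if (record.toString().contains("mentions")) {
            context.write(keyword, ONE);
        }
    }
}
```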

If you absolutely need a 100% RDBMS, which I believe is what the OP is talking about, then there are companies like Greenplum and Vertica, who recommend $25k boxes or so and can process petabytes of info far more efficiently than a homegrown SQL Server farm.

As for “As everybody knows, open source is only free if your time is free”: we’ve spent $0 on maintaining our Linux farm beyond the salary of one person. The marginal cost of supporting another box is zero, thanks to the amazing variety of cluster management tools in the open source world. UNIX was meant from the beginning to be a highly available, network-enabled environment.

I just wrote an article at http://www.roadtofailure.com on why scale-up is dead for a large portion of the problem space :)

Scaling out isn’t frictionless with open source software either. There’s no zero-configuration scaling package that I’ve seen; you’d have to spend big money on your developers/admins to get it all working together.

Two words: Cray CX1

7U
16 processors
384 GB memory
$77,000+

Geek factor: Priceless

Keep in mind, also, that not every application is originally designed to scale out. Scaling out means you have to design your code and architecture to avoid server affinity.

I have seen projects come to a point where there is a serious performance problem and they want to scale out. Then they conclude that their architecture was never designed to work across multiple machines, and then you have a real problem!
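A minimal sketch of what avoiding server affinity means in practice: session state lives in a shared store keyed by session ID rather than in one web server’s memory, so any node can serve any request. The SessionStore interface here is a hypothetical stand-in for whatever shared cache or database you actually use:

```java
// Hypothetical example: keeping request handling stateless so any node
// can serve any user, instead of pinning users to one server's memory.
public class CartService {

    // Hypothetical interface for a shared store (a cache cluster or a database).
    public interface SessionStore {
        String get(String sessionId);
        void put(String sessionId, String serializedCart);
    }

    private final SessionStore store;

    public CartService(SessionStore store) {
        this.store = store;
    }

    // A server-affine version would read from a local HashMap here, which
    // breaks as soon as the next request lands on a different node.
    public String loadCart(String sessionId) {
        return store.get(sessionId);
    }

    public void saveCart(String sessionId, String serializedCart) {
        store.put(sessionId, serializedCart);
    }
}
```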

What happens to your comparison if you scale out to just 8 systems, giving you the same number of cores as the tricked-out DL785?

That would give similar performance (distributed scaling issues notwithstanding) but at far lower capital cost and with much more reasonable licensing and running costs.

I think Joshua Ochs hits it best when he talks about the right hardware solution for the right application. You can argue MS SQL vs. MySQL clustering, and on what OS and what type of architecture, but the bottom line is that it depends on the application. Everyone wants a fabric network of servers like Google’s or Amazon’s, where added hosts and host failures are seamless to the processing demands. But until Linux or MS builds us that better mousetrap to have and hold as our own, we are stuck with weighing our needs case by case. And don’t point your fingers at those high-end servers until you’ve read the specs on the redundant power supplies, NICs, backbones and RAID arrays; sometimes you do get what you pay for when the application will not scale out. On the other hand, much is to be said for the failover one can achieve with pizza boxes, as you call them. IMHO, the happy medium is scale as you grow, pay as you go, and never pay for more than you need. The old saying “the right tool for the right job” applies to hardware too. Good post, Joshua Ochs.

And when that one monolithic machine dies due to any number of reasons, which one is better then?

HEAR. HEAR.


For $100K you can get 4 SuperMicro 2U Twin² systems, which together give you:
128 Xeon cores
768 GB RAM
16 TB disk space
8U of rack space
a maximum power draw of 4800W
That doesn’t look too bad to me.
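A quick per-node breakdown, assuming the usual 4 nodes per 2U Twin² chassis:

```latex
\begin{aligned}
4~\text{chassis} \times 4~\text{nodes} &= 16~\text{nodes}\\
128~\text{cores} \div 16 &= 8~\text{cores per node (two quad-core Xeons)}\\
768~\text{GB} \div 16 &= 48~\text{GB RAM per node}\\
16~\text{TB} \div 16 &= 1~\text{TB disk per node}\\
4800~\text{W} \div 16 &= 300~\text{W per node, maximum}
\end{aligned}
```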

And this is why 2010 will be the year of Linux.