Scaling Up vs. Scaling Out: Hidden Costs

In My Scaling Hero, I described the amazing scaling story of plentyoffish.com. It's impressive by any measure, but also particularly relevant to us because we're on the Microsoft stack, too. I was intrigued when Markus posted this recent update:


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2009/06/scaling-up-vs-scaling-out-hidden-costs.html

Seems like a ‘scale up’ crowd, but a couple points:

  1. As pointed out, you probably could get away with a lot less than 83 servers to replace this one beefy server. This will cut down on a lot of the costs you are touching on, including the admin headache. Completely agree you have to go open source for this to work though.

  2. Scalability: Yes, it is harder to architect your system to be able to scale this way, but once it is set up it becomes trivial to scale. Want to add 25% more capacity? Just add a couple more machines. At some point your ‘beefy’ setup is at capacity, and you just can’t scale any more, at any price.

  3. Machine failures/uptime: You are really going to have a single machine that costs you $100k? What if you really, really care about failures/uptime? Hot backup? Now we are talking $200k. With the scaled-out setup you don’t have to worry about this as much. If one machine out of your 83 fails, it’s less important: the system will still be up and running (rough availability math below). And it’s not only failures but hardware/OS/software updates/upgrades…
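To put rough numbers on that uptime argument (the per-box availability and the "how many boxes must survive" threshold below are assumptions for illustration, not anything from the post): if each cheap box is up 99% of the time and the site only needs, say, 75 of the 83 alive to keep serving, the cluster as a whole is effectively always up, while the single big box is only ever as available as itself.

```python
from math import comb

def cluster_availability(n, per_box_uptime, min_alive):
    """Probability that at least `min_alive` of `n` independent boxes are up."""
    return sum(
        comb(n, k) * per_box_uptime**k * (1 - per_box_uptime)**(n - k)
        for k in range(min_alive, n + 1)
    )

# Assumed numbers for illustration only.
single_box = 0.99                                        # one beefy server at 99% uptime
cluster = cluster_availability(83, 0.99, min_alive=75)   # tolerate up to 8 dead boxes

print(f"single box:                  {single_box:.6f}")
print(f"83-box cluster (>=75 alive): {cluster:.10f}")
```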

fifty-seventh!

EnterpriseDB also scales out very nicely, at 1/6 the cost of Oracle.

By the way, thank you, Jeff, for fixing the site so it can be accessed without running scripts. Thank you very much.

Douglas McClean: check your dictionary. Spend is both a noun and a verb.

Maintaining the 83 servers with open source isn’t quite as difficult as it seems. You ‘just’ make an install CD, and whenever a box has a serious problem you swap parts until the install comes through again. Getting Windows to reinstall that painlessly may be a bit more tricky and require pretty identical hardware.

Or swap the whole box.
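For what it’s worth, the ‘just make an install CD’ approach is mostly a matter of stamping out one unattended-install answer file per box from a template. A minimal sketch of that idea in Python (the template keys, hostnames, and addresses are placeholders, not a working config):

```python
# Sketch: generate one unattended-install answer file per box from a template.
# The template contents, hostnames, and IPs are placeholders for illustration.
from pathlib import Path

TEMPLATE = """\
d-i netcfg/get_hostname string {hostname}
d-i netcfg/get_ipaddress string {ip}
d-i partman-auto/method string regular
"""

hosts = {f"web{i:02d}": f"10.0.0.{i}" for i in range(1, 84)}  # 83 boxes

out_dir = Path("preseed")
out_dir.mkdir(exist_ok=True)
for hostname, ip in hosts.items():
    (out_dir / f"{hostname}.cfg").write_text(TEMPLATE.format(hostname=hostname, ip=ip))

print(f"wrote {len(hosts)} answer files to {out_dir}/")
```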

Ok I like the spirit of this article, but the dollar comparison doesn’t take into account which one can handle more traffic.

I’d like to see a cost per user/hit/connection/transaction, because it’s very possible that the dollar per unit of measure isn’t linear for either of these two options.
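Agreed that the interesting number is cost per unit of work, not sticker price. With made-up throughput figures (nothing here comes from the post), the comparison would look something like this:

```python
# Toy cost-per-capacity comparison; the sustained request rates are invented for illustration.
options = {
    # name: (total_hardware_cost_usd, requests_per_second_it_can_sustain)
    "one beefy box":  (100_000, 20_000),
    "83 cheap boxes": (100_000, 35_000),
}

for name, (cost, rps) in options.items():
    print(f"{name}: ${cost / rps:,.2f} of hardware per request/second of capacity")
```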

Does MS SQL even scale up to that beefy of a box? Can you have that many instances in a cluster?

“a fixed spend of $100,000”

Spend is not a noun. Did you mean “budget”? :wink:

Douglas McClean on June 24, 2009 9:45 AM

Indeed.

Douglas McClean: check your dictionary. Spend is both a noun and a verb.

mhuyck on June 24, 2009 1:51 PM

dictionary.com doesn’t show a noun definition for “spend”. This looks like a simple error.

Okay, but what if Monty Hall opens a door to reveal a netbook? Do you swap your first door, in hopes of getting the DL785?

“open source is only free if your time is free”

I’m curious about what the original commenter meant by this. If I use SQL Server instead of, say, PostgreSQL, will a Microsoft guy come to my home and install it for me? I agree with the phrase in some other contexts, but I’m not sure what it means in this case. Which advantages would I get for using Microsoft instead of open source? Is the Microsoft tech support really that good?

Note: I know the question sounds like trolling, but I really mean it. It sounds like the kind of scenario where open source is supposed to suck, according to non-open source guys.

Larry’s point about resilience to failure is important, but the other things you need to worry about that nobody seems to have mentioned are:

  1. network infrastructure - the moment your 83 servers aren’t sharing a bus you have to tie them together.

  2. systems management - if 6 of your 83 servers are broken somehow, will you notice? You’ll also need to monitor the network, and your software as well as any vendors’ software (see the quick sketch after this list).

  3. you’d better hope that you can buy bigger hardware before you reach the performance limit of your brute box; it has a hard ceiling somewhere. On the other hand, hitting the ceiling with a many-boxes approach is a squishy affair; you start to notice that you’re getting diminishing returns for adding new boxes quite a while before you get no benefit at all. (actually, Larry mentioned this point)

  4. If you’re going to do the design work to scale to 83 boxes, you may as well fold in the relevant measures to span data centers, at least for failover. That gives you a big reliability boost, at least for many applications.

  5. no question, there is more engineering work in the 83-boxes approach. How many software engineers are you going to have to pay?
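On the "will you notice six dead servers" question: even a dumb reachability sweep goes a long way. A bare-bones sketch using only the Python standard library (the hostnames and port are placeholders):

```python
# Bare-bones liveness sweep: try to open a TCP connection to each box's service port.
# Hostnames and the port are placeholders for illustration.
import socket

HOSTS = [f"web{i:02d}.example.internal" for i in range(1, 84)]
PORT = 80

def is_up(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

down = [h for h in HOSTS if not is_up(h, PORT)]
print(f"{len(HOSTS) - len(down)}/{len(HOSTS)} up; down: {down or 'none'}")
```

A real setup would check application health rather than just an open port, but the point stands: noticing failures across 83 boxes has to be automated, not eyeballed.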

I’m convinced that if Netflix ran the world we would all be happier. What a great combination of vision, UI, and infrastructure.

Does anyone know what their setup is?

James Wrote:

  1. no question, there is more engineering work in the 83-boxes approach. How many software engineers are you going to have to pay?

None, hombre. That’s a hardware problem :wink:

@Martin, while in most cases I do not care that much for open source software, here I agree with you. PostgreSQL is much more user friendly than SQL Server, as long as you don’t want to do performance tuning. Most DBAs that I have met could not performance tune PostgreSQL. Then there is EnterpriseDB, built on top of PostgreSQL, which does not generally need to be performance tuned.

Nobody scales out their databases that way. That’s crazy talk.

http://en.wikipedia.org/wiki/BigTable

Scaling out on low cost hardware provides more computing power at lower cost. People who can conquer the issues of managing and virtualizing the large numbers of machines reap a reward. Then they can, and sometimes do, use this capability to make a big profit in businesses which otherwise could not exist.

As pointed out by the comment above, some systems can’t leverage low cost deployment. This is justified by their ability to support legacy programming models.

What are some possible outcomes:

  1. Low cost deployment takes over and our programming model changes.
  2. Low cost deployment takes over and it is enhanced to support the current SQL programming model.
  3. High cost deployment remains and we keep our current programming model.

I don’t like the current programming model. So I hope for 1.
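To make the "programming model changes" point concrete: with a BigTable-style store the application, not the database, decides where data lives and how it is looked up. A toy sketch of that shift (the hashing scheme and node count are illustrative, not anything from the linked article):

```python
# Toy key-routed store: the application routes each key to the node that owns it.
# Node count and hashing scheme are illustrative only.
import hashlib

NODES = [f"node{i:02d}" for i in range(16)]

def owner(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Instead of: SELECT * FROM profiles WHERE user_id = 42
# you do a direct lookup on whichever node owns the key:
print(owner("profiles:42"))  # prints the node responsible for this key
# Cross-key work (joins, ad-hoc filters) now has to be designed in up front.
```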

“Scaling up will typically involve less man-hours deploying the app. Scaling out will require more time unless you fully automate the deployment process (Azure for the enterprise anybody?)”

http://code.google.com/appengine/docs/python/gettingstarted/uploading.html

I don’t know where outsourcing hardware operation stops making sense. If it does at some point, there is (or was) an opportunity for a vendor like Sun to support setting up an in-house cluster that handles that sort of thing. Assuming, of course, Google does not productize their offering.

Great comments here, very insightful for the most part.

I totally agree, the upfront cost is only one part and definitely not the most significant factor.

The personnel and power costs are going to exceed any software and hardware costs pretty rapidly.

Your licensing costs are way off. I’ll ballpark our Select prices (they’re likely cheaper than most, but ratios would be close).

Sql 2008 Ent with SA: ~$7500/cpu
Server 2008 Ent with SA: ~$450
Server 2008 Std with SA: ~$150
Server 2008 Web with SA: ~$45

I assume you’re hosting a public website - since you’re not authenticating against AD you don’t require a Core CAL, but Sql Server needs CPU licenses.

So, for the HP:

$7500 * 8 CPUs + $450 = $60,450

For the 83, we’ll figure 41 are DB, 41 are web, and the extra is standby (no additional Sql license needed). We can use Svr 2008 Std for the DB servers, and Svr 2008 Web for the web:

($7500 + $145) * 41 = $313,445 (DB)
$45 * 41 = $1845 (Web)
$145 = Standby

Of course, if you’re going to run 41 DB servers, maybe you can get by with Sql Std (~$1800). That brings the DB tier down to a more reasonable $79745. That makes the software within ballpark of each other.

Of course, it’d be quite an odd configuration to have 41 DB servers and 41 Web servers. A more realistic scenario might be 4 quad DB servers ($7780) and 40 web servers ($1800) - which makes the licensing a non-issue. Assume the quad DB servers cost 4 x your Lenovo servers, and I’ve got just enough left over for the SAN I’m going to need to get all those DB servers working together. :wink:

  • I am not a Microsoft licensing expert, but I was a reseller many years ago. However, this is only a blog comment - do not rely on any of this. If you have questions about software licensing, please contact your Microsoft reseller and/or Microsoft Licensing. Or, use Linux.
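Just to keep the arithmetic above honest, here it is spelled out (same ballpark Select prices the commenter quotes; treat them as that commenter’s estimates, not official pricing):

```python
# Ballpark licensing math from the comment above; prices are the commenter's
# rough Select estimates, not official Microsoft pricing.
SQL_ENT_PER_CPU = 7_500
SQL_STD_PER_SRV = 1_800
WIN_ENT = 450
WIN_STD = 145
WIN_WEB = 45

# One big 8-CPU box: SQL Enterprise per CPU + one Windows Enterprise license.
big_box = SQL_ENT_PER_CPU * 8 + WIN_ENT                  # 60,450

# 83 small boxes: 41 DB + 41 web + 1 standby.
db_tier_ent = (SQL_ENT_PER_CPU + WIN_STD) * 41           # 313,445
db_tier_std = (SQL_STD_PER_SRV + WIN_STD) * 41           # 79,745
web_tier = WIN_WEB * 41                                  # 1,845
standby = WIN_STD                                        # 145

print("big box:            ", big_box)
print("83 boxes (SQL Ent): ", db_tier_ent + web_tier + standby)
print("83 boxes (SQL Std): ", db_tier_std + web_tier + standby)
```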

Most comments have already covered my concerns. Here’s some separate ones:

  • Are the power costs of the big iron really only 1,200 W? My home desktop consumes over half that. Also, with a big iron you’ll need a bigger, more powerful UPS to keep it running in the case of power failure. Simpler battery backup works for the scale-out solution (rough power math below).

  • People have been quick to mention sysadmin and developer costs on the scale-out solution. It’s not like these are non-existent for a big iron as well. In fact, sysadmin costs may be just as high on a big iron, as you’ll probably pay a premium to have someone sufficiently qualified to handle such an exotic beast. Meanwhile a sysadmin that can deal with a cluster of standard metal is a dime a dozen.

  • How well can you share a big iron between developers? If you have a team of 12 developers, which would you rather have, one big server or a cluster of 83 small ones? Keep in mind that a bug by any single developer could take down an entire server.
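Rough power math on that first bullet, with the electricity price and the per-small-box wattage purely assumed (only the 1,200 W figure comes from the discussion):

```python
# Back-of-envelope power cost; per-box wattage and electricity price are assumed.
KWH_PRICE = 0.10           # USD per kWh, assumed
HOURS_PER_YEAR = 24 * 365

def annual_cost(watts):
    return watts / 1000 * HOURS_PER_YEAR * KWH_PRICE

big_iron = annual_cost(1_200)        # the quoted 1,200 W figure
cluster = annual_cost(250) * 83      # assume roughly 250 W per small box

print(f"big iron: ${big_iron:,.0f}/yr")
print(f"83 boxes: ${cluster:,.0f}/yr")
```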