Scaling is about software. These days, virtualization is all the rage for scaling. It comes with its own share of issues, but it combines the scale-up and scale-out alternatives in such a way that it becomes a question of licensing. By letting you add low-cost hardware and combine it into larger virtual hardware, it can give you the best of both worlds. It shifts the question from how big the hardware needs to be to how quickly one should add hardware to the cluster. Or: what is the cost of scaling my software licenses? Does the software I am using have per-CPU/core licensing, concurrent licensing, etc.? The software, in most of these cases, becomes the most expensive factor. Energy use and network bandwidth costs typically run a close second and need to be considered. However, for technical decisions in the mid market, software drives hardware decisions…
The cost of scaling out MS software is horrible. It makes me glad that all my stuff runs on open source software; scaling out instantly becomes more attractive.
The more I see MS SQL costs, the more I wonder how many shops could get away with using MySQL as a no-frills RDBMS. Alternatively, how many projects even need an RDBMS?
Document databases are going to get a lot of traction soon I think…
Scaling up versus scaling out is a dead horse so by all means- let’s beat it some more. Certain things scale out, certain things scale up, some things can scale either way.
Management and monitoring of 80 servers is a lot of work. Even with automated, constantly current kickstarts, system management tools like opsware or blade logic or cfengine (or puppet or chef if you are a masochist) it’s still a lot of work.
The OP also forgot the added cost of the network. Port costs are higher, wiring costs are higher, and you need to add cooling, maintenance, and monitoring for those 80 additional ports.
If your app wasn’t written to scale horizontally, it’s going to cost a fortune to scale, mostly in software rewrites.
Splitting your data is a terrible idea for most RDBMSs. If you can’t enforce referential integrity in your database, then you are playing with fire. Certain applications (like Google) can clearly be scaled horizontally. I doubt plentyoffish has the sort of data that can be scaled that way. Can you do it? Sure, put different users on different DBs and things like that, but you lose a lot of the safety that comes with running an RDBMS in the first place.
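The "different users on different DBs" approach mentioned here is usually implemented with a deterministic shard-mapping function. A minimal sketch, with hypothetical shard names and count (not anything plentyoffish actually runs):

```python
import hashlib

# Hypothetical shard map: user id -> one of N database DSNs.
SHARDS = [
    "db1.example.com/users",
    "db2.example.com/users",
    "db3.example.com/users",
]

def shard_for_user(user_id: int) -> str:
    """Deterministically map a user to one shard.

    Note the trade-off discussed above: the database can no longer
    enforce referential integrity across shards, so cross-shard
    foreign keys become the application's problem.
    """
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

The hash keeps the mapping stable across processes and restarts, but note that changing the shard count remaps most users, which is one of the hidden rewrite costs raised elsewhere in this thread.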
At the last datacenter I was in, we had a 1.6 kW/rack limit. Doubling that to 3.2 kW/rack (which is about the current average; yes, you can find larger limits), you’re talking 5 racks for 83 servers, versus 1 for the HP.
Would you need 83 servers? Doubtful, but the problem with databases is usually RAM, not CPU power. You would still need a lot of systems to equal the amount of RAM in that HP server.
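The rack arithmetic above can be checked directly. Assuming a hypothetical draw of roughly 190 W per commodity 1U server (the comment doesn't state a figure) against the 3.2 kW/rack budget:

```python
import math

SERVERS = 83
WATTS_PER_SERVER = 190   # assumed draw for a commodity 1U box
RACK_LIMIT_W = 3200      # the 3.2 kW/rack figure above

# Total power across the fleet, divided into rack-sized power budgets.
total_w = SERVERS * WATTS_PER_SERVER      # 15,770 W
racks = math.ceil(total_w / RACK_LIMIT_W)
print(racks)  # 5 racks of power budget, versus 1 rack for the big HP box
```

At higher per-server draws the count climbs further, which is why the power limit rather than physical rack space often caps density.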
Could you build a cheaper system? Sure. Why? Plenty of Fish makes plenty of money, and the cost of this hardware is a drop in the bucket. Their bandwidth and datacenter costs alone probably dwarf the cost of this server; add in the development costs and it’s a no-brainer. HP servers are generally works of art (iLO 2 is wonderful), the quality is top notch, and the service is awesome. My company is a Linux shop but we still use HP servers.
If you haven’t had to actually tackle the same sort of problem facing plentyoffish, then keep quiet. Plenty of Fish is a big success, and probably not because they made a lot of bad decisions.
I’m curious what you use to build your systems; it sounds intriguing. How do you keep your systems up to date once they are deployed? What tools do you use to monitor them? It sounds like you have a largish setup, and I’d be curious to know what tricks and tips you use.
As an (ex-)sysadmin for a medium-sized business (about 100 employees) with a biggish HPC cluster, I can tell you that scaling out does not cost you a full-time sysadmin. I was the sole admin for this company and spent about three full days each year on the cluster. The rest of the time went to Windows Server, XP, etc.
About 120 PCs + servers: ~300 days/year
248 nodes in a cluster: ~3 days/year
Why doesn’t POF use Microsoft Azure? I would think their cloud platform could be used with POF’s ASP.NET design pretty easily, though I’m not sure of the cost. I believe there is also now an internal platform that can be run locally in their environment. Anyone familiar with this technology?
Great discussion. I would think that Microsoft has scaling out in the works; I just have not seen any distributed storage technology from them yet.
As if scaling up weren’t already becoming irrelevant, I’ve written another article about scalability and the true revolution coming in data warehousing: for the first time in history, Big Data is cheap to process… and vendors are scared. Check it out: http://www.roadtofailure.com
The server isn’t the most expensive part of the overall TCO equation, but you can build a similar 8-way with Tyan parts for much, much less than the DL785 and have a real killer machine. (Look into the VX50.) I think I spec’ed one with 8 procs (six-core Opterons!), 128 GB RAM, 8 Intel X-25 SSDs off an Areca RAID controller, and Windows 2008 Enterprise. The whole thing came to around $20–25K.
Doesn’t this discussion reduce the concept of distributed computing in a mismatched way?
No project exists without a database now, and we’re just going back to the 90s. In the 90s there were good practices like a “day-end” process and archiving (not an all-or-nothing backup!), keeping the transactional data set very small so that all this nonsense data wasn’t being accessed constantly. Unreferenced caches and all-in-one solutions are what’s splitting the scalability (yeah… the programmer only knows the linear version of it!!).
After all, these people are generating work for each other (shame)!!
Nobody scales out their databases that way. That’s crazy talk. You might partition out your databases, but you’d have at most 3 DB servers. And for the web servers, I sure hope you’re assuming Windows Web Server 2008, which is about half as expensive as the full Windows Server 2008: http://store.microsoft.com/microsoft/Windows-Web-Server-2008/product/508FCC29
Can you list out your license cost assumptions so we all win? If you’re right, we’ll know the right answer. If you’re wrong, we’ll all learn when commenter #785 fills us in.
Did you forget that you can’t just decide to “scale out, MapReduce style”? You need to design your software to support that approach right from the start…
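Designing for that approach from the start means expressing the work as independent map and reduce steps so it can later be spread across machines. A toy single-process sketch of the programming model (word counting, the classic example, not any specific framework's API):

```python
from collections import defaultdict

def map_phase(lines):
    # Each line is processed independently -> trivially parallelizable.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Counts for each distinct key can be summed independently too.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase(["scale out", "scale up"]))
# result == {"scale": 2, "out": 1, "up": 1}
```

The point of the comment stands: if your code mixes these phases with shared mutable state, retrofitting this shape is the expensive rewrite mentioned earlier in the thread.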
“As everybody knows, open source is only free if your time is free.”
While I’ve laughed at this in the past because it certainly feels true at times, I have to say it assumes the admin doesn’t know what he’s doing. While it definitely costs time to learn to use much of that software, it’s a fixed cost when you’re talking about scaling out, whereas license fees continue to rack up (and you’d still have to learn Windows anyway, right?).
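The fixed-versus-recurring distinction can be made concrete with a crossover calculation. The figures below are purely hypothetical illustrations, not real prices:

```python
# Hypothetical figures purely to illustrate the crossover, not real prices.
LEARNING_COST = 10_000     # one-off: admin time to learn the open stack
LICENSE_PER_SERVER = 800   # recurring: per-server licence on the paid stack

def open_source_cost(servers: int) -> int:
    # Fixed regardless of fleet size (the "your time" cost, paid once).
    return LEARNING_COST

def licensed_cost(servers: int) -> int:
    # Grows linearly with every box you scale out to.
    return LICENSE_PER_SERVER * servers

# Smallest fleet where the fixed learning cost beats per-server licences.
crossover = next(n for n in range(1, 1000)
                 if open_source_cost(n) < licensed_cost(n))
print(crossover)  # 13 servers: 13 * 800 = 10,400 > 10,000
```

Whatever the real numbers, the shape is the argument: one curve is flat, the other keeps climbing with the server count.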
After running both Linux and Windows in production for about a decade I have to say that it’s been the Windows boxes that keep demanding time, not the Linux ones (I’ve had >1 year uptime on some of those).
That said I do agree with the people saying that the true hidden cost is redevelopment to be able to scale out. I think a lot of people are starting to design things to be able to do that more easily these days but it still costs, and often hurts.
To those who think managing 83 machines is harder than managing (say) 20 machines, you’re doing it wrong.
All our machines can be reinstalled from a netboot image in under 10 minutes. The netboot installs a bootstrap filesystem, downloads and unpacks all the required packages, checks out the subversion repository of the configuration system and runs “make install”.
So long as you don’t make individual changes by hand on a server, but instead update the definition of that class of machines and then update the configuration from subversion, it’s all clean. It takes some discipline, but the results are worth it.
And if a machine fails, you get a similar spec machine from anywhere (lead time, a couple of days at worst, and it’s cheap to keep a spare or two sitting in the cupboard) - transcribe the MAC address into the central config file next to a hostname and a list of classes and plug it in.
Generally I just get the technicians to plug in a machine (on the other side of the world), watch the dhcp server’s logs for a new address being rejected, edit the file and install a new dhcp server config (“make install” in the dhcp server config directory, obviously) and watch while it gets a response and netboots itself. Trivial.
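The MAC-to-hostname workflow described above can be sketched as a small generator that turns a central host file into ISC-dhcpd host blocks. The file format and names here are hypothetical illustrations, not the commenter's actual setup:

```python
# Hypothetical central config: one line per machine,
# "<mac> <hostname> <class1>,<class2>,..."
CENTRAL = """\
00:1a:2b:3c:4d:5e web01 base,webserver
00:1a:2b:3c:4d:5f db01 base,database
"""

def dhcpd_entries(text: str) -> str:
    """Render dhcpd host declarations so a new box netboots by MAC."""
    blocks = []
    for line in text.strip().splitlines():
        mac, host, _classes = line.split()
        blocks.append(
            f"host {host} {{\n"
            f"  hardware ethernet {mac};\n"
            f'  filename "pxelinux.0";\n'
            f"}}"
        )
    return "\n".join(blocks)

print(dhcpd_entries(CENTRAL))
```

The "make install in the dhcp server config directory" step then amounts to regenerating this output and reloading dhcpd, so transcribing a new MAC into the central file is the only manual action.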
If only my home machines were so well set up. Unfortunately, it does take too much work to get to this point. Operating systems are still designed around the single install rather than the repeatable configuration - both in the windows world and the open-source world. And even more so for home use. You run an installer for this instance on this machine, not something that edits your minimal rebuild instruction set and repeatably applies it to the machine so that a “reinstall” is just a replay of the setup log.
Oh, I should add to the “rebuildability” - obviously database machines have files that need to be not-deleted on a reinstall. Ditto file servers. In our environment the file servers are Cyrus IMAP, so the cyrus data and meta files are on their own partitions. Similarly on the database machines, the data is on its own partitions. Both lots get backed up, so in a drive unit failure situation you have to restore from backups as well - but all the other partitions get wiped on reinstall. We also keep a partition that doesn’t get wiped by default because logs go on there - it’s nice not to lose logs over reinstalls.
Whether the importance of data and the value of high availability justify a complete remote replica, a replica of the database only, a good backup procedure, or no reliability provision has nothing to do with scaling up or scaling out.
Different parts of the system, not only different tiers, have different needs; separating them can be a very effective way to avoid resource waste and to scale horizontally without expensive complications. For example overnight batch processing can tolerate relatively slow servers and storage, reducing the size of high performance databases.