Twitter: How Not To Crash Responsibly

Looks like they’re looking for Operations Engineers… http://twitter.com/help/jobs

Operations Engineer

Twitter is seeking a seasoned Operations Engineer to join our Operations team.

Key areas of responsibility

  • Continually improve the performance and scalability of the service

Bonus
  • Experience with Ruby on Rails stack performance tuning

As nice as it is to write “funky community” sites in “bleeding edge” languages and frameworks, some poor chap has to keep the blinking lights flashing 24/7. The ops guys here must be racking up the overtime.

As for informative “site down” messages, when you have a sprawling architecture it can be difficult to determine what has gone wrong and how long it will take to fix. A well-engineered system should have no single points of failure (OK, OK, within economic reason) and a short (and tested) Return to Operations plan for any failure.

A Black Swan is a wonderful analogy. For centuries it was used as an example of a creature that could never be found (“All swans are white”). Then Australia was discovered, along with Cygnus atratus, the black swan …

Reminds me of “Can’t happen” traps in code … that occasionally show up in error messages and logs…

The impossible happens more often than you think?

On webpages it makes some sense not to print out the exact error because:

  1. Since the application is running server-side, the site developers can get the exact error from their log files. No user error reporting is necessary.
  2. Printing out error details is a security risk because hackers will be able to recognize that something that they did was able to crash the application. This could be their starting point for further investigation.

As far as I am concerned, I would say this error page is ok. As a user there is nothing I can do at that point except reload the page or come back later.
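To illustrate the two points above, here is a minimal Python sketch of "log the details server-side, show the user nothing specific". The names `handle_request`, `GENERIC_MESSAGE`, and the short incident reference are my own assumptions for illustration, not anything Twitter is known to do.

```python
import logging
import uuid

logger = logging.getLogger("site.errors")

# Deliberately vague text: the only thing the visitor ever sees.
GENERIC_MESSAGE = "Something is technically wrong. Please try again later."


def handle_request(handler, request):
    """Run handler(request); on failure, log details and return a generic page."""
    try:
        return 200, handler(request)
    except Exception:
        # Hypothetical incident id so a user report can be matched to a log entry.
        incident_id = uuid.uuid4().hex[:8]
        # The full traceback goes to the server log, never to the response body.
        logger.exception("incident %s while handling %r", incident_id, request)
        return 500, f"{GENERIC_MESSAGE} (reference: {incident_id})"
```

The reference id is optional. It leaks nothing about the failure, but it lets support find the matching log line if a user does write in.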

One thing I’d like to see error pages have is an automatic reload. If you leave the page open, it reloads every few minutes with an updated error message and, eventually, the page you originally wanted.

This is especially good for a page users leave open 24x7 (like Gmail or Bloglines) that is always being updated. If you are lucky, a proportion of your users won’t be at their computers during the outage, so they won’t ever see the error.

I really don’t care if a site had a 5 minute outage at 3am when I was asleep. But if your site is still showing the error message when I wake up, and keeps showing it until I hit reload, then you have greatly magnified the impact.
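A rough sketch of what such a self-reloading error page could look like, in Python. The function name `error_page` and the two-minute default are assumptions. The technique itself is just the HTML `meta refresh` tag, plus the standard `Retry-After` header for well-behaved clients.

```python
from html import escape


def error_page(message, reload_seconds=120):
    """Return (headers, html) for an error page that reloads itself."""
    seconds = int(reload_seconds)
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        # Standard HTTP header telling clients when to try again.
        "Retry-After": str(seconds),
        # Keep browsers and proxies from caching the error page itself.
        "Cache-Control": "no-store",
    }
    html = (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        # With no URL in the content attribute, the browser reloads the current URL.
        f'<meta http-equiv="refresh" content="{seconds}">\n'
        "<title>Temporarily unavailable</title>\n"
        "</head><body>\n"
        f"<p>{escape(message)}</p>\n"
        f"<p>This page will reload automatically in {seconds} seconds.</p>\n"
        "</body></html>\n"
    )
    return headers, html
```

This only works if the error page is served at the URL the user originally asked for, rather than via a redirect to a separate error URL. Otherwise the reload just fetches the error page again after the site has recovered.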

Hey, look, Twitter changed their error page! Now it says

Twitter is Coming Back Online

For more information on what is happening and to follow the discussion visit Twitter on Get Satisfaction.

http://getsatisfaction.com/twitter/topics/may_20_twitter_downtime

The Twitter Team

The days of meaningful error messages in UIs are probably dead. Run any security scanning software against an application that gives the user more than what Twitter is giving and you will get alerts. Why does the user care what the problem is?

  1. If the error page gives technical details about what’s going wrong, there’s the potential of giving people information they need to hack into the site. This would be a bad thing.

  2. If it’s scheduled maintenance they should say so. Otherwise, by definition, it’s an unexpected error and they don’t know when it will be fixed.

  3. A separate site giving information about the general status of the main site would be useful, but would have to be both physically somewhere else and also on a separate domain.

“Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?”

My guess would be that if it were scheduled downtime, it would not be described as “something is technically wrong”. So, it’s an error.

“Is Twitter down for everyone, or just me?”

Is there a reliable way of telling that automatically? It has to be very reliable, because a single mistake here will jeopardize all the trust that could be built by telling the user whether it’s just his account.
So it’s better to stay quiet than to lie: log the error and fix it ASAP, so that the user hopefully never has to see this page again.

“Is there a place I can go to check Twitter’s current system health?”

I’d guess no; they would tell you if they had one. Do you want them to explicitly list nonexistent features?

I personally don’t care why there’s an error, I just want it not to happen. I don’t care for explanations; I don’t even remember them 5 minutes later.
So, for me: don’t explain what went wrong, just admit it, solve it, and I am a happy customer.

I don’t think some of you were around when LiveJournal first came out, started growing, and experienced crashes/overloads daily. They ended up establishing a “status.livejournal.com” domain, on a separate network and server, to communicate exactly what technical issue was going on and what the ETA was to resolve it. Granted, you’d have to go to this domain manually, but it was better than not knowing what was going on at all. They would even tell you really specific stuff, like “Hardware RAID Controller failed, swapping out”, etc…

If you ever want to read a cool history from Brad Fitzpatrick, who created LiveJournal, on how he dealt with scaling issues, here goes:

http://www.danga.com/words/2005_oscon/oscon-2005.pdf

500ServerError:
I can’t see why any error page other than 503 Service Unavailable has to be static html.

A 500 handler could very well log which page threw the error and which GET/POST parameters were sent to it.
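As a sketch of that idea, assuming a Python WSGI application: the wrapper below logs the method, path, and query parameters of any request that raises, then answers with a plain 500. `describe_request` and `log_500` are hypothetical names. The sketch logs only the size of a POST body, since bodies may contain passwords, and it only catches exceptions raised while the app is being called, not ones raised later while the response is being iterated.

```python
import logging
import sys
from urllib.parse import parse_qs

logger = logging.getLogger("site.errors")


def describe_request(environ):
    """Collect the request details worth logging from a WSGI environ dict."""
    return {
        "method": environ.get("REQUEST_METHOD", ""),
        "path": environ.get("PATH_INFO", ""),
        "query": parse_qs(environ.get("QUERY_STRING", "")),
        # Size only: the POST body itself is not logged in this sketch.
        "content_length": environ.get("CONTENT_LENGTH", ""),
    }


def log_500(app):
    """Wrap a WSGI app so unhandled exceptions are logged and answered with a 500."""
    def wrapper(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            logger.exception("500 for %s", describe_request(environ))
            body = b"Internal Server Error"
            # Passing exc_info lets the server replace headers the app already set.
            start_response(
                "500 Internal Server Error",
                [("Content-Type", "text/plain"),
                 ("Content-Length", str(len(body)))],
                sys.exc_info(),
            )
            return [body]
    return wrapper
```

Usage would be something like `application = log_500(application)` in the WSGI entry point.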

YouTube is also notorious for bad error management. They always manage to break something with every site update which results in numerous complaints (in video format). Now they’ve broken their comment system with a simple JavaScript error. You can’t report this problem or expect it to get fixed any time soon.

  • Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?
    ~ Is it really so important to know what kind of error it is? The average user knows that it broke; this is generally enough. I don’t think users will be contacting Twitter engineers to provide stack information.

  • If this is scheduled downtime, when will it be over? Can I view the maintenance schedule, or the current status of the maintenance work?
    ~ With the growing pains they are experiencing, would you want to commit yourself to a time? What average user wants details about the maintenance schedule?

  • Is Twitter down for everyone, or just me? Is there a place I can go to check Twitter’s current system health?
    ~ See above

  • Twitter has a reputation for unreliability. Where can I find out about Twitter’s ongoing efforts to improve their reliability?
    ~ To me this is the only sensible point out of the four, Jeff. As a Twitter user, you want to know when you can expect stability and reliability going forward. As consumers and users, we deserve an answer.

But the other points are completely useless for users. Users want working applications. When they don’t work, users get frustrated, annoyed, irritated. I don’t think they care much about the maintenance schedule. Sure, it’s nice to know WHY the service is down, or even down repeatedly, but after a while, despite all the warm-and-fuzzies and pretty error images, the average user will just give up.

It’s obvious the website crashed because the robo cat lost his hand.

For not meaning to pick on them, you’re doing a pretty good job of picking on them…

~Sticky

If you don’t like that error page then I bet you wouldn’t have liked the old Twitter error pages:
http://www.uie.com/brainsparks/2007/06/04/twitters-fairy-doors/

Remember Friendster? I think they must have run into scaling issues, as I remember their site getting incredibly slow and often throwing errors when loading pages (just the default “page can’t be displayed” errors, though). That’s probably the big reason MySpace and Facebook are huge while Friendster is an afterthought.

  • Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?
    ~ Is it really so important to know what kind of error it is? The average user knows that it broke; this is generally enough. I don’t think users will be contacting Twitter engineers to provide stack information.
    ** Is it broken and will be fixed later, or just broken for a moment and I can just reload?

  • If this is scheduled downtime, when will it be over? Can I view the maintenance schedule, or the current status of the maintenance work?
    ~ With the growing pains they are experiencing, would you want to commit yourself to a time? What average user wants details about the maintenance schedule?
    ** Is this an error, or is the service just offline? Should I keep retrying? See above.

  • Is Twitter down for everyone, or just me? Is there a place I can go to check Twitter’s current system health?
    ~ See above
    ** See above …

Do I care why it is not working? No.
Do I want to know whether it is a temporary fault, or just for me (so I can just restart/reload), or a system-wide fault or offline for maintenance (so I should give up and come back later)? Yes!

The Twitter error page is very reasonable. As a user I don’t need to be overwhelmed with details. I could give a rip what happened, but it’s nice to know that the problem is on their end and not mine.

I think it’s an AWESOME error page. If only I could get away with it because my apps had so many users.

Clouds, birdie, dismembered robot? Pure GENIUS!

The whole point is, you DO NOT NEED to know anything more, because you WILL BE BACK, you are a Twitter junkie!

I’m happy with seeing the error page. I am happy when I see ANY page served from twitter.com, because Lord knows how unstable it is! I’ve also gotten a very plain and ugly 500-Internal Server Error from Twitter before. They really need to get their act together.