Twitter: How Not To Crash Responsibly

I would agree with your assessment that their error page does need a small bit more information. I get that same screen once in a while and usually after refreshing the page, the error is gone and I can view the page.

But why should Twitter, or any other site, provide me with technical information about what happened on the back end when an error occurs? They don’t owe me anything. I don’t pay for the service. If I did, I’d probably be complaining also.

As far as logging the error and someone at Twitter fixing the problem, they have a page for jobs (http://twitter.com/help/jobs). You should apply.

@Powerlord I guess I neglected to mention an assumption that all relevant stack traces, querystrings, viewstate, whatever are logged using a global exception handler at the point of the error (before the user is redirected to the static error page). The static error page is static because it’s not doing anything, and there’s nothing left for it to do.
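A minimal sketch of that assumption, written as Python WSGI middleware purely for illustration (the thread itself talks about viewstate, so the original context is likely ASP.NET). The names `GlobalErrorHandler` and `STATIC_ERROR_PAGE` are hypothetical. The handler logs the stack trace and query string first, then hands back a page that does nothing at all.

```python
import logging
import sys

logger = logging.getLogger("app.errors")

# Hypothetical static page body; a real site would serve a prebuilt file.
STATIC_ERROR_PAGE = b"<html><body><h1>Something is technically wrong.</h1></body></html>"


class GlobalErrorHandler:
    """WSGI middleware: log everything, then serve a page that does nothing."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            result = self.app(environ, start_response)
            try:
                # Materialise the body so errors raised while the app is
                # producing output are caught here as well.
                return list(result)
            finally:
                if hasattr(result, "close"):
                    result.close()
        except Exception:
            # Log the stack trace and request details before the user sees anything.
            logger.exception(
                "Unhandled error for %s?%s",
                environ.get("PATH_INFO", ""),
                environ.get("QUERY_STRING", ""),
            )
            headers = [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(STATIC_ERROR_PAGE))),
            ]
            # Passing exc_info lets the server replace headers that were already set.
            start_response("500 Internal Server Error", headers, sys.exc_info())
            return [STATIC_ERROR_PAGE]
```

Wrapping an application is then a one-liner such as `application = GlobalErrorHandler(application)`. Nothing on the error page itself touches the database or any other moving part.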

My point was building on that assumption: the static error pages shouldn’t be doing anything else since they won’t necessarily know why they were served (aside from the obvious) and from their point of view, all bets are off.

The only time timeframes are needed is scheduled maintenance. It’s a planned outage, so it should be straightforward enough. A separate server is a nice idea since you conceivably don’t connect to the same data store. But if you’re on the same server, you’re manually coding up outage files. A good infrastructure would help here (IP load balancer, etc).

Coming up with a magical formula for unplanned outages is silly and a waste of time because you’d be better off fixing the darn problem in the first place.

And finally, managers and coders don’t invest a lot of time in error pages (other than making them nice to look at) because I really don’t expect my users to ever see them. I’ve seen people try to get cute, and at the end of the day, the change is rolled back because the error page failed in an unforeseen way, and caused an error, and redirected back to itself. Drat.

I’m afraid that I have to agree with Aaron in some respects. Whilst Twitter is obviously not essential for anyone I still cannot see why people in tech-circles still go on and on about this website. I have really tried to see some sort of use for Twitter, and I mean really tried!

That being cast aside, it is usually a good idea to have some sort of system health checker so you know whether “all” of a website is down before you keep trying.
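As a rough sketch of what such a health checker could look like (in Python, with the hypothetical function name `run_health_checks` and made-up component names), each component gets a small check function, and the results are rolled up into "ok", "degraded" or "down":

```python
from typing import Callable, Dict


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Run every named check and summarise whether all, some, or none passed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that blows up counts as a failed component.
            results[name] = False

    passed = sum(results.values())
    if passed == len(results):
        status = "ok"
    elif passed == 0:
        status = "down"
    else:
        status = "degraded"
    return {"status": status, "checks": results}
```

A status page or a client could poll this summary and stop retrying while the status is "down".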

Ah, the old “CloudsRobotSeveredRobotHandBird” error. I used to get that all the time when I was a kid. If I recall correctly I believe it was error #3.1415.2.7182818. Haven’t seen it for years.

/ smirk /

I think Twitter will be bought for a bloated price and then disappear because people will realize how utterly useless this service is.

Does the user need to see anything technical?

When my company’s web site has an error, the web page actually sends an email to support with all the technical information necessary.

So while we capture the information behind the scenes, we just show the user a message saying we’re performing system maintenance and should be back momentarily.
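A sketch of that split in Python, not our actual code: the addresses and the names `build_error_report` and `USER_MESSAGE` are made up for illustration. Support gets the full traceback by email, and the user only ever sees the friendly message.

```python
import traceback
from email.message import EmailMessage

SUPPORT_ADDRESS = "support@example.com"  # hypothetical address
USER_MESSAGE = "We're performing system maintenance and should be back momentarily."


def build_error_report(exc: BaseException, url: str) -> EmailMessage:
    """Build the email for support; the user only ever sees USER_MESSAGE."""
    msg = EmailMessage()
    msg["To"] = SUPPORT_ADDRESS
    msg["From"] = "noreply@example.com"  # hypothetical sender
    msg["Subject"] = f"Site error: {type(exc).__name__} at {url}"
    details = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    msg.set_content(f"URL: {url}\n\n{details}")
    return msg
```

Actually sending it could be done with `smtplib.SMTP(host).send_message(msg)`, where the mail host is whatever your environment provides.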

As a side note, I recently started using Twitter. I definitely put myself in the “what’s the point” category before I used it.

Now that I’ve used it, I think it’s a great tool to log what you’re doing when you’re doing it. You can go back later on and analyze what you’ve done, where you’ve spent too much time, etc.

I wish I could get to it via email, as the web site is blocked by WebSense. :frowning:

This can only last for so long. Twitter is the worst social network when it comes to uptime, and to being honest about it. The only thing it has is popular people “backing” it up, like this blog, but that can only last for so long. We’ll see; there are already a few other services that do the same thing Twitter tries to do, only they actually do it.

What’s with the robot giving you the finger?

I think it is a bit presumptuous to be asking for things like access to maintenance schedules and reports, etc. You are using a free, non-critical service, not, say, running your business with a managed hosting contract that has SLAs and the like.

As to the service itself, I find it amusing that people feel the need to prattle about their every insipid thought and activity. More so, that others are eager to lap up such drivel. Granted, you can argue that it is useful for communicating with colleagues and friends, but with email and instant messaging, Skype, etc. is this really necessary? Or as mentioned, that people want a log of their minutiae throughout the day?

I see twitter as a narcissistic feedback loop amongst the self-important, and/or another obsession with “information” that reminds me of the dawn of the Web, where people’s first exposure was one of awe and they would go from site to site for hours on end. In this case, however, there is far less substance to “tweets”.

Eh. Perhaps I am too judgemental, but whatever… People are what they are.

First people touching your screen and now this? Come on Jeff…

Jeff,

Thinking about crashing responsibly is thinking about the problem from the wrong end. When an application dies, it remains dead until fixed/rebooted.

The real question here is why applications die in the first place:

  1. Overzealous memory consumption. Even in the age of “automatic” memory management, a badly written application will crash as soon as enough memory is leaked

  2. Scalability as an afterthought. This one is closely related to problem #1

  3. Ridiculous volume of I/O, stemming from bad practices of n-tier application design (chattiness, crappiness…)

Recoverable errors are a completely different matter of course. Their causes and resolutions can be described and communicated to both users and technical staff.

Not like IE, which restarts itself hoping that the user won’t even notice the crash.

Error conditions are like unit tests. If you cannot imagine a recovery from an error (or at least a decent response), you are probably dealing with a potential “crash scenario.”

I think because you’re a developer, you’re asking for and expecting too much information. Even if they know what the problem is, why do they need to tell you the exact reason? If a hard drive crashed, NIC smoked, network issues, do you expect them to mention these reasons? It doesn’t resonate well with customers.

The most important piece of info I need is an ETA. This way I don’t bother retrying before that time and I have some idea of how severe the problem is. This will also keep people from contacting the company with the nagging question, “When do you expect to be back up?” It’s very irritating and a waste of time when a receptionist or a highly paid engineer has to reply with the same answer over and over.
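For outages where an ETA is actually known, HTTP already has a place for it: a 503 response can carry a `Retry-After` header. A small Python sketch, assuming a hypothetical helper named `maintenance_response` and a known return time:

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def maintenance_response(back_at: datetime, now: datetime):
    """Return (status, headers, body) for a 503 that tells clients when to retry.

    Both datetimes are assumed to be timezone-aware UTC values.
    """
    # Retry-After accepts a number of seconds; never send a negative delay.
    seconds = max(0, int((back_at - now).total_seconds()))
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Retry-After", str(seconds)),
    ]
    eta = format_datetime(back_at, usegmt=True)
    body = f"Down for maintenance. Expected back at {eta}.\n"
    return "503 Service Unavailable", headers, body
```

Browsers mostly ignore the header, but the same ETA goes into the page body for humans, and well-behaved crawlers and API clients can use the header to back off.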

Also, in some cases, one or more engineers are too busy frantically trying to figure out what the problem is and fixing it to update a status or error page.

That error page you’re showing doesn’t imply maintenance. Maintenance is usually scheduled and known, not “something is wrong”.

uh-oh… someone divided by zero.

If they’re logging and reporting errors behind the scenes then the error page is not static. Therefore they should be able to show a customised error page.

The Web2.0 world has to take some lessons from the core hardware world. A software service is today no different from a HW chipset. It is used by as many millions as a chipset is. Just because we can cook up a great idea with half a dozen simple scripts, it doesn’t mean we can escape the responsibility of writing scalable, crash-free code.

Uh, giving too much info on crashes can give hackers too much information!

They should be, and probably are, logging these errors in detail, privately, exactly as I would.

@Josh - A somewhat trite response that indicates you probably don’t know what you’re talking about (or at least how these things work) for web applications. I guess you’re missing the point that static error pages happen /after/ everything else.

  • you’ve already checked input
  • you’ve already accounted for /known/ issues (DB concurrency exceptions, no connection to DB, web services down, data parsed appropriately, checking references in methods, etc).
  • you write good error handling code that only catches specific exceptions
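To make the last bullet concrete, here is a tiny Python sketch (the function `parse_page_number` is hypothetical). It catches only the one failure it knows how to recover from, and anything it did not anticipate is left to propagate up to the global handler.

```python
def parse_page_number(raw: str, default: int = 1) -> int:
    """Parse a ?page= value, recovering only from the failure we anticipate."""
    try:
        page = int(raw)
    except ValueError:
        # Known issue: the user typed something that is not a number.
        return default
    # Deliberately no bare "except": anything else (for example a TypeError
    # because raw is None) is a bug, and the global handler should see it.
    return page if page >= 1 else default
```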

But the unimaginable happens. Timeouts, db connections that die, drives that crash, overloaded web servers, no memory, full disk space, file permissions don’t work, etc.

But you’ve got a global error handler. It’s already intercepted the message and done its best to notify someone in operations. SNMP traps, emails, log files, etc. Yay. Mr 24x7 in operations is on the job and degaussing servers like crazy. Job well done. Have a kit-kat, man!

:stuck_out_tongue:

So what exactly do you want the error page to say? Some technical jargon? A stack trace? That node SN14 decided it’s no longer in sync with cluster NN233? That it’s been fed some bad data? That drive C is full?

Lordy, I guess I can sleep better now. I never did like NN233 (always smelled bad in school).

Tell us, and please be honest, did your mother appreciate the arcane message?
Does she feel better?
Does it help her?

Hmm, that begs the question: How do you customise an error page when you don’t expect an error to happen? I guess you really need a time machine to do that properly. Then you can make a fancy error page that really knows how to display just the right message when, on Jan 14 2020 at 4:45pm PST node SN14 desyncs with cluster NN233. Feel better? I do.

You know what’s too bad? We ran out of time with all this gee whiz time travel and fancy error page stuff and never did fix the code in node SN14 that stops the error from happening in the first place.

Then again, us mere mortals just chuck up a static page that says ‘Sorry’ because we’re busy fixing bugs. Some of us are a bit more graphically inclined (or have a sense of humor) and like to put up cute pages so the customer experience isn’t diminished too much.

Every time I get this error page I misread the beginning as “Thanks for nothing.”

Also, what slays me about almost all of the downtime for the last year is that Twitter almost always attributes it to installing enhancements meant to increase future stability. If this is the case, they need to do much more due diligence in testing and planning for the installation of these enhancements. It also means that the downtime is not, per se, the result of scaling problems or immediate failure of the existing software infrastructure.

Ruby on Rails doesn’t scale well… Develop it with another language. Dare I say it on this blog… Develop it with PHP, it scales.