Twitter is seeking a seasoned Operations Engineer to join our Operations team.
Key areas of responsibility
Continually improve the performance and scalability of the service
Bonus
* - Experience with Ruby on Rails stack performance tuning
As nice as it is to write “funky community” sites in “bleeding edge” languages and frameworks, some poor chap has to keep the blinking lights flashing 24/7. The ops guys here must be racking up the overtime.
As for informative “site down” messages, when you have a sprawling architecture it can be difficult to determine what has gone wrong and how long it will take to fix. A well-engineered system should have no single points of failure (OK, OK, within economic reason) and a short (and tested) Return to Operations plan for any failures.
A Black Swan is a wonderful analogy. For centuries it was used as an example of a creature that could never be found (“All swans are white”); then Australia was discovered, along with Cygnus atratus, the black swan …
Reminds me of “Can’t happen” traps in code … that occasionally show up in error messages and logs…
On webpages it makes some sense not to print out the exact error because:
Since the application is running server-side, the site developers can get the exact error from their log files. No user error reporting is necessary.
Printing out error details is a security risk because hackers will be able to recognize that something that they did was able to crash the application. This could be their starting point for further investigation.
As far as I am concerned, I would say this error page is ok. As a user there is nothing I can do at that point except reload the page or come back later.
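A minimal sketch of that split, in Python rather than Twitter's actual Rails stack: the full details go to the server log and the user only ever sees a fixed, generic message. The names `render_error`, `GENERIC_MESSAGE` and the logger name `app.errors` are made up for illustration, not anything Twitter is known to use.

```python
import logging

# Hypothetical logger name; developers read the details from the server logs.
logger = logging.getLogger("app.errors")

# The only text the user ever sees, regardless of what actually failed.
GENERIC_MESSAGE = "Something is technically wrong. Please reload or come back later."


def render_error(exc):
    """Log the full exception server-side and return a generic response.

    Returns a (status_code, body) pair; the body never contains any text
    taken from the exception itself.
    """
    # exc_info=exc attaches the exception's traceback to the log record.
    logger.error("unhandled error while serving request", exc_info=exc)
    return 500, GENERIC_MESSAGE


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    status, body = render_error(ValueError("db password 'hunter2' rejected"))
    print(status, body)
```

Because the response body is a constant, nothing an attacker provokes can leak into the page, while the log still has everything needed to debug the crash.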
One thing I’d like to see error pages have is an automatic reload. If you leave the page open, it’ll reload every few minutes with an updated error message and, eventually, the page you originally wanted.
Especially good if it’s a page users leave open 24x7 (like Gmail or Bloglines) which is always being updated, because then, if you are lucky, a proportion of your users won’t be at their computers during the outage, so they won’t ever see the error.
I really don’t care if a site had a 5 minute outage at 3am when I was asleep. But if your site is still showing the error message when I wake up, and keeps showing it until I hit reload, then you have just greatly magnified the impact.
The days of meaningful error messages in UIs are probably dead. Run any security scanning software against an application that gives the user more than what Twitter is giving and you will get alerts. Why does the user care what the problem is?
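One way the self-reloading error page could look, as a rough Python sketch. It relies on the standard HTML `<meta http-equiv="refresh">` element and the HTTP `Retry-After` header; the function name `build_error_page` and the 120 second default are my own assumptions. It also assumes the error page is served at the URL the user originally asked for (no redirect to a separate error URL), otherwise the reload would just fetch the error page again.

```python
from html import escape


def build_error_page(message, reload_seconds=120):
    """Build a 503 response whose page reloads itself after reload_seconds.

    Returns (status_code, headers, body). The browser re-requests the same
    URL when the timer fires, so once the site recovers the user gets the
    page they originally wanted without touching anything.
    """
    seconds = int(reload_seconds)
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        # Hint for clients and crawlers about when to try again.
        "Retry-After": str(seconds),
        # Keep caches from holding on to the error page after recovery.
        "Cache-Control": "no-store",
    }
    body = (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="{seconds}">\n'
        "<title>Temporarily unavailable</title>\n"
        "</head><body>\n"
        f"<p>{escape(message)}</p>\n"
        f"<p>This page will reload itself in {seconds} seconds.</p>\n"
        "</body></html>\n"
    )
    return 503, headers, body


if __name__ == "__main__":
    status, headers, body = build_error_page("Something is technically wrong.")
    print(status, headers["Retry-After"])
    print(body)
```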
If the error page gives technical details about what’s going wrong, there’s the potential of giving people information they need to hack into the site. This would be a bad thing.
If it’s scheduled maintenance they should say so. Otherwise, by definition, it’s an unexpected error and they don’t know when it will be fixed.
A separate site giving information about the general status of the main site would be useful, but would have to be both physically somewhere else and also on a separate domain.
“Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?”
My guess would be that if it were scheduled downtime, it would not be called “something is technically wrong”. So, it’s an error.
“Is Twitter down for everyone, or just me?”
Is there a reliable way of telling that automatically? It has to be very reliable, because a single mistake here would jeopardize all the trust that could be built by telling the user whether it’s just his account.
So, better to stay quiet than to lie: log the error and fix it ASAP, so that the user hopefully never has to see this page again.
“Is there a place I can go to check Twitter’s current system health?”
I’d guess no; they would tell us if they had one. Do you want them to explicitly state nonexistent features?
I personally don’t care why there’s an error, I just want it not to happen. I don’t care for explanations; I don’t even remember them 5 minutes later.
So, for me: don’t explain what went wrong, just admit it and solve it, and I am a happy customer.
I don’t think some of you were around when LiveJournal first came out, started growing, and experienced crashes/overloads daily. They ended up establishing a “status.livejournal.com” domain, on a separate network and server, to communicate exactly what technical issue was going on and what the ETA for resolving it was. Granted, you’d have to go to this domain manually, but it was better than not knowing what was going on at all. They would even tell you really specific stuff, like “Hardware RAID Controller failed, swapping out”, etc…
If you ever want to read a cool history from Brad Fitzpatrick, who created LiveJournal, on how he dealt with scaling issues, here goes:
YouTube is also notorious for bad error management. They always manage to break something with every site update which results in numerous complaints (in video format). Now they’ve broken their comment system with a simple JavaScript error. You can’t report this problem or expect it to get fixed any time soon.
Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?
~ Is it really so important to know what kind of error it is? The average user knows that it broke; this is generally enough. I don’t think users will be contacting Twitter engineers to provide stack information.
If this is scheduled downtime, when will it be over? Can I view the maintenance schedule, or the current status of the maintenance work?
~ With the growing pains they are experiencing, would you want to commit yourself to a time? What average user wants details about the maintenance schedule?
Is Twitter down for everyone, or just me? Is there a place I can go to check Twitter’s current system health?
~ See above
Twitter has a reputation for unreliability. Where can I find out about Twitter’s ongoing efforts to improve their reliability?
~ To me this is the only sensible point out of the four, Jeff. As a Twitter user, you want to know when you can expect stability and reliability going forward. As consumers and users, we deserve an answer.
But the other points are completely useless for users. Users want working applications. When they don’t work, users get frustrated, annoyed, irritated. I don’t think they care much about the maintenance schedule. Sure, it’s nice to know WHY the service is down, or even down repeatedly, but after a while, despite all the warm-and-fuzzies and pretty error images, the average user will just give up.
Remember Friendster? I think they must have run into scaling issues, as I remember their site getting incredibly slow and often throwing errors when loading pages (just the default “page can’t be displayed” errors, though). That’s probably the big reason MySpace and Facebook are huge while Friendster is an afterthought.
Is this an ephemeral, temporary error or some kind of scheduled downtime? How do I tell the difference?
~ Is it really so important to know what kind of error it is? The average user knows that it broke; this is generally enough. I don’t think users will be contacting Twitter engineers to provide stack information.
** Is it broken and will be fixed later, or just broken for a moment and I can just reload?
If this is scheduled downtime, when will it be over? Can I view the maintenance schedule, or the current status of the maintenance work?
~ With the growing pains they are experiencing, would you want to commit yourself to a time? What average user wants details about the maintenance schedule?
** Is this an error, or is the service just offline? Should I keep retrying? See above.
Is Twitter down for everyone, or just me? Is there a place I can go to check Twitter’s current system health?
~ See above
** See above …
Do I care why it is not working? No.
Do I want to know whether it is a temporary fault, or just for me (so I can just restart/reload), or a system-wide fault or offline for maintenance (so I should give up and come back later)? Yes!
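To make that distinction concrete, here is a small Python sketch in which the error page says what the user should do without saying why it broke. The three categories and the wording are purely my own assumptions, and the sketch deliberately leaves out the hard part raised earlier in the thread, which is how the server reliably knows which category it is in.

```python
from enum import Enum


class Outage(Enum):
    """Hypothetical outage categories, named after what the user should do."""
    TRANSIENT = "transient"      # one request failed; a reload may well succeed
    SYSTEM_WIDE = "system_wide"  # unplanned outage affecting everyone
    MAINTENANCE = "maintenance"  # planned downtime


# No technical details, only the action the user should take.
ADVICE = {
    Outage.TRANSIENT: "Something went wrong with this request. Reloading the page should fix it.",
    Outage.SYSTEM_WIDE: "The site is down for everyone, not just you. Please come back later.",
    Outage.MAINTENANCE: "The site is offline for scheduled maintenance. Please come back later.",
}


def error_message(kind):
    """Return the user-facing advice for the given outage category."""
    return ADVICE[kind]


if __name__ == "__main__":
    for kind in Outage:
        print(kind.value, "->", error_message(kind))
```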
The Twitter error page is very reasonable. As a user I don’t need to be overwhelmed with details. I could give a rip what happened, but it’s nice to know that the problem is on their end and not mine.
I’m happy with seeing the error page. I am happy when I see ANY page served from twitter.com, because Lord knows how unstable it is! I’ve also gotten a very plain and ugly 500-Internal Server Error from Twitter before. They really need to get their act together.