Terremark vCloud Express Outage: How Not to Do It

I usually don't spend time talking about outages.  We all experience them, either as users or providers.  We make mistakes; hopefully they're correctable and we can get on with our day. But the dance during an outage has been well choreographed by now: the provider acknowledges an issue, users grumble, the provider gets service back up and running, and finally sends out an explanation of what happened, why, and the steps taken to make sure it doesn't happen again.

Earlier this week Terremark's vCloud Express (VCE) service had some sort of outage. Latency and traceroute to my nodes were normal, and there was no load on the server and no memory/CPU/disk utilization to speak of. But any interaction at all was painfully slow. Usually as a user, I wait 5-10 minutes to see if it's a hiccup.  Then I hit VCE's web site. In order of priority, I was looking for

  1. A system status page,
  2. A location to submit a trouble ticket, or
  3. A phone number to call to see what was going on.

I found none of those. Their support link pointed me towards a forum, so I put up a post asking what the story was.  In the resulting thread with their support department, I was told that the forum is, indeed, how they interact with their customers.

Looking over the forum as I write this post several days later, I see a post from their Director of Product Management whitewashing the issue, claiming that the infrastructure was never fully down, that only a small subset of customers were affected, blah blah blah.

No matter how you look at this, this is a flubbed response:

  • Technically, I disbelieve the excuses given.  If a "core networking device" had a CPU spike, more than a few customers would be affected.  When a core networking device has a high CPU load, it's noticeable in traceroutes, and mine were normal. When I have an interactive ssh session open to a server, press return, and it takes minutes to see a response - the service is down. Telling me otherwise is nothing less than BS.
  • Customer-support wise, this is a total failure. Twitter is great for announcing issues to customers.  VCE has my email address - send me the damn weightless apology, don't post it to a forum and wait to see if I find it.  Don't bury the process for submitting a trouble ticket in a wiki that doesn't have an obvious link to it.

Terremark, you got a $20MM investment from VMware for this...even if it was all software licenses, hire somebody other than a high-schooler to run it if you want any type of real success.

I've moved my nodes off Terremark and taken it off my list of cloud services I recommend to clients. In general, I can't really imagine using their service again unless I see a massive improvement in how things are run.