Last Thursday we suffered about 62 hours of unexpected downtime. No matter how you tried to reach Soluto, you got an “Under Maintenance” message.
Many people were hurt by this, and many contacted us complaining that they could not support their business customers, friends, or family members. Two groups were especially affected: business customers using Soluto to provide professional support to customers or co-workers, and those who had taught the people they help to regularly press F8 to ask their PC questions (those questions weren't sent during the downtime).
Emotions ranged from sending us “get well” cards with pink hearts to nasty sarcastic comments. We get it. People using Soluto expect it to be up all the time to serve their own PC needs, their family’s PC needs, or their business customers’ PC needs.
So first and foremost, we’re sorry. This is unacceptable, and it’s certainly not the way we wanted to start 2013.
The downtime was caused by technical issues our cloud service provider experienced. In a nutshell: in order to provide our service, Soluto runs on hundreds of Internet servers that we “rent” in a large data center in the US. On Thursday, that data center went down, and it took us down with it.
We’re analyzing this event and learning from it, and we’re taking serious measures to make sure such events either do not happen again or have a much softer effect on the people using Soluto.
Again, we apologize for this downtime. We hope you’ll stick with us :)
Below we’ve included a much deeper technical analysis of what happened. If you’re into tech you may find it interesting. If you’re not, beware - it may be extremely boring for you:
If you consider yourself a tech geek or even mildly interested in technology, we’d like to explain a bit about what happened, and what measures we’re taking to prevent or reduce the effect of such future events.
Let’s start by talking a bit about Soluto’s high level architecture. Soluto has 4 main pieces:
1. An agent application installed on PCs.
2. A web application through which you can manage PCs.
3. A container for all the data gathered from PCs, so it can be served to users through the web application (there’s no personal data here, only technical aspects of the PCs).
4. A database where data from different PCs is analyzed and crunched together to reach smart conclusions and recommendations about PC issues.
These relationships are roughly illustrated below:
This is naturally a gross simplification. The rectangles titled “specific PC data” and “aggregated data” each comprise tens of different types of servers and hundreds or thousands of different types of data elements, mostly residing in key-value tables and BLOBs. Soluto currently runs on more than 400 servers (and growing) and writes about 100,000,000 data points to the cloud infrastructure every day. When there’s a spike in traffic, we immediately add as many servers as required; on a slow day, we reduce the number of servers.
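The elastic scaling described above can be sketched roughly like this. A minimal sketch only: the per-server capacity and fleet-floor numbers are made up for illustration, and `desired_server_count` is a hypothetical helper, not Soluto's actual provisioning code.

```python
# Hypothetical sketch of spike-driven scaling. The writes_per_server
# capacity and the minimum fleet size are illustrative, not real figures.

def desired_server_count(writes_per_second, writes_per_server, minimum):
    """How many servers we'd provision for the current write load."""
    needed = -(-writes_per_second // writes_per_server)  # ceiling division
    return max(needed, minimum)

# ~100,000,000 data points per day averages to roughly 1,157 writes/second.
average_load = 100_000_000 // 86_400

# On a quiet day the fleet sits at its floor; a 10x spike scales it out.
print(desired_server_count(average_load, writes_per_server=100, minimum=10))
print(desired_server_count(10 * average_load, writes_per_server=100, minimum=10))
```

The point of the sketch is the shape of the policy, not the numbers: capacity follows load up during a spike and back down afterwards, with a floor so the service never scales to zero.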
When designing our architecture, we had to choose a cloud service provider to host our servers. More specifically, we wanted to go with a cloud provider with platform-as-a-service capabilities.
There were two realistic alternatives for us:
1. Amazon Web Services
2. Microsoft Azure
Amazon is the clear leader in this market, established and experienced, used by the likes of Dropbox, Netflix and Instagram. However, we decided to go with Microsoft Azure for various reasons, the most important of which was our belief that we could develop our solution much faster on top of it. Sure, choosing Azure was a risk, because it was a less mature platform than Amazon’s. But we personally knew the people running and leading the technical side of Azure, and we knew they were top people. In addition, we got lots of help from Microsoft through their BizSpark One program: both great pricing and the highest level of support.
This decision paid off big time - we implemented the entire complex server architecture very quickly and it has been serving the people using Soluto for over a year now.
Now is a good time to mention a key point about being a start-up. Our most precious resource is product development time; we can buy everything else. We prioritize our work by the hour, to move as fast as possible to improve our service. Whatever we execute is always measured against what we could have executed instead.
We could obviously have spent time building various mechanisms to make sure we could keep providing our service no matter what happened to Azure (the extreme example would be a fully redundant deployment on Amazon). But that’s not the startup way, because by doing so we wouldn’t have created hundreds of features for our users in the same time. And for well over a year we hadn’t experienced severe downtime, except for a single case of several hours in February, and once a year felt acceptable.
And then came last Thursday. What happened was that the “storage service” in Azure’s main data center went down. The machines running our code were still up, but they could not access the data, and our service is all about access to data. So, for example, when you browsed to your Soluto account, the machine responding to your browser’s request was alive, but it could not fetch your PC’s data. If you clicked on the Soluto tray, your PC’s agent was able to reach our web service, but the web service could not reach your PC’s data. Since we didn’t have any access to the data ourselves, we could not even move parts of it somewhere else.
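That failure mode can be sketched like this: the web tier still answers requests, but every code path that touches storage fails, so all the user ever sees is the maintenance page. The class names and the exception here are hypothetical Python stand-ins, not the actual Azure SDK calls our service makes.

```python
class StorageUnavailableError(Exception):
    """Stand-in for the errors the storage SDK raised during the outage."""

class StorageClient:
    """Hypothetical data-tier client; `available` simulates the outage."""
    def __init__(self, available=True):
        self.available = available

    def fetch_pc_data(self, pc_id):
        if not self.available:
            raise StorageUnavailableError("storage service is down")
        return {"pc_id": pc_id, "boot_time_s": 48}

def handle_account_page(storage, pc_id):
    # The web server itself is up -- it receives and parses the request --
    # but without the data tier there is nothing useful to serve.
    try:
        return {"status": 200, "data": storage.fetch_pc_data(pc_id)}
    except StorageUnavailableError:
        return {"status": 503, "data": None}  # "Under Maintenance"

print(handle_account_page(StorageClient(available=False), "pc-42"))
```

With storage down, every request collapses to the 503 branch, which is exactly why a healthy compute tier didn't help us.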
In the first hour we were not really sure what went wrong, because even Azure’s service dashboard was unavailable (it’s served from the same data center that went down). But as time progressed and we were able to contact people within Microsoft, we understood there was a severe problem with the storage service and that people were working over the weekend to resolve it.
One of the worst things about this downtime is that Microsoft didn’t know how long it would take to resolve the issue, and as a result we didn’t either. Judging from our knowledge of Amazon downtimes, we assumed it would take a couple of hours, at most a day. It took much longer. In retrospect, had we known it would take so long, we would have taken various steps to ease the effect of the downtime for our users, but we were optimistic. Too optimistic.
Some people have asked us “why don’t you back up your data so it’s available in other data centers?”. Well, Azure has an option to pay about 30% more and get what’s called “geo-replication”, which means the data is backed up and can be restored in a different data center. Are you thinking to yourselves “those cheap bastards saved on geo-replication?” - well, you’re wrong. We do pay for it. But the issue is that restoring an entire service from a backup takes Microsoft longer than the downtime we had. We were not aware of that beforehand, and we now treat geo-replication very differently than we used to.
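The trap here is an arithmetic one, and it can be reduced to a single comparison. All the hour figures below are illustrative, not Microsoft's actual restore times or our actual estimates:

```python
# Kicking off a restore from the geo-replica only pays off if the restore
# is expected to finish before the primary data center recovers.
# Both inputs are estimates -- and during the outage, nobody had a good one.

def should_restore_from_replica(estimated_outage_hours, estimated_restore_hours):
    return estimated_restore_hours < estimated_outage_hours

# What we believed: outages last a few hours, so a long restore never wins.
print(should_restore_from_replica(estimated_outage_hours=4,
                                  estimated_restore_hours=24))   # False

# What actually happened: a ~62-hour outage would have justified it.
print(should_restore_from_replica(estimated_outage_hours=62,
                                  estimated_restore_hours=24))   # True
```

Since neither we nor Microsoft knew the outage length while it was happening, the comparison could never be made with confidence, which is why geo-replication alone turned out to be a weaker safety net than we had assumed.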
So what are we going to do?
First, we are going to start migrating some of the more critical elements of our architecture to a redundant solution, some of which will probably reside on both Azure and Amazon. In addition, we’re refactoring some of our service to be storage-independent. That process will take time. As we’re still learning the results and effects of the downtime, we will surely come up with additional improvements in the near future.
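One common way to achieve that kind of storage independence and cross-provider redundancy is to hide each provider behind a shared interface and fall back when one backend fails. This is a sketch of the general pattern, with in-memory stand-ins instead of real Azure or Amazon clients; it is not Soluto's actual design.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Minimal storage interface the service would code against."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def put(self, key, value): ...

class InMemoryStore(BlobStore):
    """Stand-in for an Azure- or S3-backed implementation."""
    def __init__(self, available=True):
        self.available = available
        self._data = {}

    def get(self, key):
        if not self.available:
            raise IOError("backend unavailable")
        return self._data[key]

    def put(self, key, value):
        if not self.available:
            raise IOError("backend unavailable")
        self._data[key] = value

class RedundantStore(BlobStore):
    """Writes to every backend; reads from the first healthy one."""
    def __init__(self, *backends):
        self.backends = backends

    def put(self, key, value):
        for backend in self.backends:
            try:
                backend.put(key, value)
            except IOError:
                pass  # a degraded backend shouldn't block the write

    def get(self, key):
        for backend in self.backends:
            try:
                return backend.get(key)
            except IOError:
                continue  # fall through to the next provider
        raise IOError("all backends unavailable")

azure, amazon = InMemoryStore(), InMemoryStore()
store = RedundantStore(azure, amazon)
store.put("pc-42", {"boot_time_s": 48})
azure.available = False          # simulate the Azure storage outage
print(store.get("pc-42"))        # still served, from the second backend
```

The service layer only ever sees `BlobStore`, so losing one provider degrades a read into a fallback instead of a full outage. The real cost of this pattern is in keeping the backends consistent, which is part of why such a migration takes time.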
We have always been big believers in transparency, and we hope the information here helps clarify the situation. If you have further questions, you’re welcome to contact us at firstname.lastname@example.org
The Soluto Team