Last weekend, Snappic experienced an incident that disrupted service for approximately 8 hours and 10 minutes. Events that had already been downloaded to the iPad and licensed could still run, with their shares going out only after the downtime ended. Events whose resources had not yet been downloaded to an iPad, or whose iPad was not licensed to run them, could not be started. Additionally, no new events could be created, existing events could not be modified, and micro-sites could not be accessed during this time. Ultimately, no data was lost, even for events that ran during the downtime.
All of us at Snappic would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in Snappic and take pride in building high-quality services that you can rely on. We have failed you with this incident, and we are deeply sorry. While we cannot undo the losses to revenue, reputation, or any other factors you experienced, we can explain the events that caused this incident, what we did to fix the situation, the lessons we’ve learned, and the steps we’re taking to better ensure this doesn’t happen again.
Cause of the Issue & the Fix
At approximately 4 PM PST we started receiving notifications from our internal monitoring systems about degraded services. The web services were not responding, and we began investigating. While we were doing so, the server became totally unresponsive, and since it is a dedicated server on the other side of the world, we had no physical access to it to diagnose the issue. We therefore called our hosting provider to request support. They responded that they had restarted the server because it was completely unresponsive, which left us no time to alert our users before the restart.

The reboot did not help, however: it turned out the server's main hard drive, which stored the application and user data, had failed. There was no easy fix for this. Our only option was to set up a new drive in the server, redeploy the application, and restore a backup of all the data that had been on that drive. Once we had consolidated all the backup data, the bulk of the downtime was spent copying it back. After that completed, quite a few deployment issues surfaced that we had to iron out, such as file permission errors, making sure all the white label domains were routed correctly, and SSL certificate problems. We also had to perform some testing. We would have liked to test more thoroughly before putting the server back up, but we also wanted it back as soon as possible so your shares could be processed and you could access the system again. A few remaining SSL issues were still preventing some things from working, but more than 90% of functionality had been restored at this point.
Lessons We Have Learned
Something we failed to do correctly was manage expectations about the downtime, and we are very sorry about the way we went about this. Each time we provided an estimate of how much longer the issue would take to resolve, we fully believed it reflected the expected downtime, but many unforeseen issues kept pushing the estimates back. On top of this, our requests to our hosting provider for timelines on the data restore process were met with inaccurate reports. Again, we are very sorry about how we handled this. As soon as we discovered the hard drive had failed, we should have let you know that we could not provide time estimates and that you should start looking for alternatives. We plan to remedy this by providing a status page, hosted by an external provider, that will report details from our monitoring systems about the current status of our services. We will also be more conservative with timelines, especially in times of crisis. Please understand that our intention was never to mislead you; we genuinely believed we could get the system up and running in the time we quoted.
One request that has come up frequently as a result of this incident is that we be more transparent about changes to the server and the application running on it. Historically, when we received a bug report or had a new feature ready to release, especially a quick and easy fix, we would update the application on the server as soon as it was ready to deploy, so that you could continue with your work or start using new features as soon as possible. Sometimes this required putting the server into maintenance mode, and we did so without notice because the server was typically down for only a few minutes. We understand now that we cannot know up front how this affects our other users, so going forward we will schedule all changes that require maintenance, giving everyone time to prepare. This means bug fixes and features will take a little longer to deploy, but everyone will know about downtime in advance. Any maintenance expected to take less than 15 minutes will be performed at 3 AM PST on Mondays and Tuesdays; if the maintenance is expected to take longer, we will post a notice 24 hours in advance.
One important note is that this policy wouldn’t have affected an issue such as the one that occurred on 12/08 as it was unscheduled and there was no way of anticipating it.
Mitigating the Issue in the Future
As some of you may know, due to the growth we have experienced this year, we have been planning to upgrade our infrastructure so that it can scale automatically as demand increases. Given the availability issues we have had with our current hosting provider, we decided to migrate to a high-availability cloud platform at the same time to meet all our needs. We have chosen Amazon AWS as our cloud hosting provider and have already migrated many of our services to it. This will not only alleviate the availability issues we've been experiencing but also allow us to scale our services up so that they remain responsive and available. Another benefit is AWS's automatic failover for the data you store with us: if any hardware fails, there are multiple replicas that can be switched to instantly, mitigating issues such as the one that occurred on 12/08. Although we have already migrated some services, the main server housing the drive that failed is unfortunately one of those yet to be migrated, as it has to be one of the last pieces we move. The hard drive status checks aggregated by our internal monitoring systems reported no health issues prior to the failure, so there was nothing we could have done to foresee or prevent it. Drive failures do happen and cannot always be predicted, but once we have migrated our infrastructure, the impact will be mitigated by the way AWS handles failover.
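To illustrate the kind of aggregation our monitoring performs, here is a minimal, hypothetical sketch. The names, report format, and thresholds are invented for this example and are not Snappic's actual monitoring code; the point is that a drive can pass every health check right up until the moment it fails outright, which is why redundant storage with automatic failover is the real safeguard.

```python
from dataclasses import dataclass

@dataclass
class DriveReport:
    """A single health report from one drive (hypothetical format)."""
    device: str          # e.g. "/dev/sda"
    smart_passed: bool   # overall SMART self-assessment
    reallocated: int     # reallocated-sector count, a common early warning

def unhealthy_drives(reports, max_reallocated=0):
    """Return the devices that should trigger an alert.

    A drive is flagged if its SMART self-assessment failed or it has
    started reallocating sectors. As we saw on 12/08, a drive reporting
    healthy on every check can still fail without warning.
    """
    return [
        r.device
        for r in reports
        if not r.smart_passed or r.reallocated > max_reallocated
    ]
```

In practice, data like this would come from a tool such as smartctl feeding the monitoring system, rather than being constructed by hand.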
What You Can Do To Ensure Your Events Run During High Load
We are constantly optimizing the system and migrating services, but unfortunately, during times of high load, there may be slow or intermittent access to the system until we have completed the migration. It should slowly get better and better as we move more services over, but until then there are some steps you can take to ensure you can still run your events during these busy periods.
As long as the event resources have been downloaded and the device has a licence, you will still be able to run the event. Shares may go out a little later, but the event will still run. To make sure your events can run even during these times of high load, follow these steps:
- Sync your event to all your devices earlier in the week, at least a day or two before the event. Also fully start the event at least once. This will make sure it has a licence assigned to it, and that it has all the resources to run the event. Remember to re-sync any changes you make after this before the busy period starts.
- When starting your events up front like this, make sure you have enough licences to run all the events. If you don’t, starting one event may revoke the licence for another one. Then when it comes time to start that event, it may take longer than expected to license the device during high load. Contact us in advance if you have any worries about licensing issues.
- Try to keep at least 5GB of space free on the device. When an iOS device is running low on space, the OS can delete certain files from the apps on the device. So even if you have loaded all the resources on the device, the OS could remove them if other apps need the space.
- Try to avoid making changes to your event during times of high load. If the server is slow or access is intermittent, saving changes could take some time, and you would then still need to refresh on the iPad and wait for it to download the updated resources, which could also take a while.