Incident with our servers
Last week, specifically on the night of Friday the 21st, we suffered a serious problem with our servers that made all Festhome services unavailable for several hours. Although all services are now up and running again, we want to explain what happened and the measures we are taking to avoid, as far as possible, another incident like this, the first of this magnitude we have had in 13 years.
What happened?
Around 2 AM, all Festhome services became inaccessible. After a quick evaluation, we realised that the problem came from the main database server, which the rest of the services rely on to save and retrieve the data behind everything that is done on Festhome. Once the data centre operators could physically access the server in question, they informed us that the hard drive was not working, so we started the process of recovering the data from another hard drive and replacing the failed one to bring the service back online. On its own, this is not a very big problem: apart from the time spent offline, which at that point was a little over an hour, there would have been no further impact.

The serious problem arose when we accessed the second disk, which holds the instant backups, and it did not respond either. From that moment on, we worked on trying to recover the data from the hard drives so that we could restore service with all the data, but the hours went by without much progress and we kept finding more and more problems with the drives. We believe that some kind of power failure or similar event fried both hard drives at the same time, since it is extremely rare for two drives to die at once. In these 13 years we had never lost a single piece of information or registration, and we are very ashamed of this episode.
What did we do?
Once the magnitude of the problem was clear, we decided to work in parallel on restoring the service from one of the backups stored outside the server, in case we were unable to recover the data from the failed server itself. About 7 hours after the start of the incident, we had the service ready to be restored, but with the data from the last backup made outside the server, which was from Thursday morning, the 20th.
At this point we had to choose between waiting to recover the data from the corrupted hard drives, so that no submissions or transactions would be lost, and restoring service without Thursday's data. After careful consideration, we decided that too many hours had already passed without service and that it was important for users to be able to keep submitting and watching films; once we had recovered Thursday's data, we would add it back to the new server manually.
This seems to have been the right decision: several days later we are still trying to recover the failed server, but we have less and less hope of actually retrieving the data that was not saved on other servers.
What solutions have we found for now and the future?
Right now, our support colleagues are manually rebuilding the transactions that users tell us they have lost, but this is an imperfect solution. We are counting on our users to notify us of any problems they have experienced so that we can fix them manually and make all the data appear in their accounts as it should. It is a slow process, but for now it seems to be the only one possible.
As for the future, we are going to increase the redundancy of the databases in the live service with better protection against single points of failure, so that if a server suddenly fails the service will continue to function. Thirteen years is a long time without losing data, but if we can do better, we must, so we are also going to increase the frequency of external backups of the data servers, so that if a similar problem occurs in the future we can recover with less data loss.
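For the more technically curious, the sketch below illustrates the general idea behind more frequent external backups: dump the database on a schedule and immediately copy the dump to a machine outside the data centre, so that a failure of the primary server (or its disks) cannot take the backup with it. It is only an illustration; the database engine (PostgreSQL here), host names and paths are placeholders, not a description of our actual infrastructure.

#!/usr/bin/env python3
"""Minimal sketch of an off-site database backup job (illustration only).

Assumes a PostgreSQL database and an external backup host reachable over
SSH; DB_NAME, BACKUP_HOST and the paths below are placeholders.
"""
import datetime
import subprocess

DB_NAME = "app_db"                     # placeholder database name
BACKUP_HOST = "backup.example.com"     # placeholder off-site host
REMOTE_DIR = "/backups/app_db"         # placeholder remote directory


def run_backup() -> None:
    # Timestamped file name so successive backups never overwrite each other.
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_file = f"/tmp/{DB_NAME}-{stamp}.sql.gz"

    # 1. Dump the database and compress it locally.
    with open(dump_file, "wb") as out:
        dump = subprocess.Popen(["pg_dump", DB_NAME], stdout=subprocess.PIPE)
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
        dump.wait()
        if dump.returncode != 0:
            raise RuntimeError("pg_dump failed")

    # 2. Copy the dump to a machine outside the data centre, so a disk
    #    failure on the primary server cannot destroy the backup as well.
    subprocess.run(["scp", dump_file, f"{BACKUP_HOST}:{REMOTE_DIR}/"], check=True)


if __name__ == "__main__":
    run_backup()

Run every few hours (for example from a scheduler) instead of once a day, a job like this shortens the window of data that can be lost if the live server's disks fail, which is roughly the kind of change described above.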
We want to apologise to all our users and thank you for your enormous patience in recent days. It has been many hours without sleep and with frayed nerves, not knowing exactly what was happening. When you have a computer in front of you, things are already difficult at times, but at least you can touch it. With servers hosted in bunkers thousands of kilometres away, which you can only work on through a Matrix-style command line, the uncertainty and the nerves grow exponentially.