Stephen ANNO Support

Joined: 22 Apr 2003 Posts: 517 Location: Vancouver, Canada
Posted: Thu 1 Feb 2007, 7:49 Post subject: WWW-12 Server Failure - Post Mortem

Early Tuesday morning (South African time) we had a primary hard drive failure in Linux server WWW-12. Websites and email services that reside on the server were unavailable for the whole of Tuesday while we were restoring from backups. The first services became available again on Wednesday morning, and related DNS problems were addressed during the day. All services were back to normal by midnight.
The purpose of this post is to respond to your questions:
Why the long delay in restoring service?
Our recovery process was delayed by a series of unforeseen events. The list below is not meant as excuses, but as a statement of fact. We accept that we alone are responsible for the wellbeing of the server and for timely disaster recovery.
Attempts to build a new server with the old backup drive attached were foiled by hardware incompatibilities. A task that should normally take an hour took nearly six hours to complete.
Towards the end of the restore process early on Wednesday morning, the server started to slow and eventually halted under heavy load. The server was under a DoS (denial-of-service) attack and its firewall was using all CPU resources in defence. (I believe that our server was a random target.)
The data centre kindly assisted us by enabling a network firewall to handle most of the defence. The server became manageable again shortly thereafter. At that time we decided to halt the recovery process in favour of getting the server online with those domains that had already been restored (some 100 domains were still not restored).
Configuration problems with the network firewall locked us out of our server management software and also prevented passive FTP connections. We could address some problems in this restricted work mode, but could not complete some critical processes.
The network firewall presented a second problem when it did not disable properly on command. Nearly eight hours after our first request, the network firewall was disabled early Wednesday evening, at which time we resumed the recovery process.
What are we doing to avoid a repeat?
We recently set mirrored hard drives (RAID) as a minimum requirement for all new servers, to ensure speedy recovery from disk failures. Instead of a long outage while reinstalling software and restoring from backup, a faulty mirrored drive can be swapped out in a few minutes. Supplemented with our existing hard drive backups, we feel that we will have adequate redundancy and defence against accidental data loss.
Linux server WWW-12 is a ripe four years old, and has seen a marked increase in load during recent months. We were planning on replacing it in February, and indeed placed an order for two new RAID-equipped servers last week. I feel this episode, a month before the planned upgrade to a more fault-tolerant system, was a stroke of bad luck.
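A mirrored drive only helps if a failed member is noticed and swapped before its partner also fails. As a rough illustration, the sketch below shows how a degraded mirror could be spotted, assuming Linux software RAID (md), which reports array state in /proc/mdstat. The post does not say whether the new servers use software or hardware RAID, so the function name, the sample text, and the md assumption are all hypothetical.

```python
import re

# Assumption: Linux software RAID (md). Hardware RAID controllers report
# their state through vendor tools instead and are not covered here.
ARRAY_HEADER = re.compile(r"^(md\d+)\s*:")
ARRAY_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that are missing a member.

    In /proc/mdstat a status such as "[2/2] [UU]" means both mirror
    members are up, while "[2/1] [U_]" means one member has dropped out.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = ARRAY_STATUS.search(line)
        if status and current is not None:
            expected, active, flags = status.groups()
            if int(active) < int(expected) or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


# Hypothetical sample in the /proc/mdstat layout: md0 is healthy, md1 is degraded.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      78043648 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat from a scheduled job that raises an alert when the returned list is not empty.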
What lies ahead for Linux server WWW-12?
We plan on migrating all domains on WWW-12 to new servers during the weekends of 10, 17 and 24 February. We are working towards zero disruption of services during migration. Details will be announced in the coming weeks.
All services have been restored, and we are treating remaining issues as they are reported. If you detect any problems, please contact us without hesitation.
We apologise for the inconvenience caused by this episode. We also wish to offer a full refund of the January hosting fees to any client that has suffered financial loss in the process. To claim your refund, please write to billing@anno.com by 28 February.
If you have questions or comments, you are most welcome to air your views here in the Forum. If you have an issue of a confidential nature, please contact us via the Helpdesk or phone.
PS: If you were unaware of our communications during this outage, you are probably subscribed to this mailing list using an email address that resides on the affected server. We recommend you use an off-server address (such as your ISP or Gmail) for your subscription; that way we can reach you even if the server is in trouble.
Last edited by Stephen on Wed 21 Feb 2007, 17:39; edited 1 time in total |
matrix Expert Member

Joined: 11 Jun 2003 Posts: 80
Posted: Thu 1 Feb 2007, 9:55

I will repeat what I said before: I like working with Anno, but want to see a plan to avoid a repeat of a day like Tuesday. Going RAID is a good start. And having ordered the new servers is confirmation that you are serious about this.
I know you had serious secondary DNS issues too. What are your plans to guard against failure?
Stephen ANNO Support

Joined: 22 Apr 2003 Posts: 517 Location: Vancouver, Canada
Posted: Thu 1 Feb 2007, 21:50

Quote: RAID is a good start. And having ordered the new servers is confirmation that you are serious about this
Yes, I believe we are on track to better reliability and redundancy all round.
Quote: I know you had serious secondary DNS issues too. What are your plans to guard against failure?
We have discussed that too. The ultimate solution would be two or three servers dedicated to DNS only. We have no definite plans for additional hardware yet. In the meantime we have added a third server to the DNS cluster for live zone mirroring purposes only.
Stephen ANNO Support

Joined: 22 Apr 2003 Posts: 517 Location: Vancouver, Canada
Posted: Tue 6 Feb 2007, 6:12

After some initial performance problems, the restored server is now stable and running smoothly. The initial problem of occasional slowness could be ascribed to high "IO wait", i.e. the CPU sitting idle while it waits for disk operations to complete. That turned out to be a known bug in the specific Linux kernel; we upgraded the kernel on Saturday and have not looked back since.
We are going ahead with the planned server migrations in the next two weeks. Details will follow soon.
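For readers curious how "IO wait" is measured: on Linux it is the fifth counter on the aggregate `cpu` line of /proc/stat, and tools such as top derive a percentage from the change between two readings. The sketch below illustrates that calculation. The function names and sample readings are made up for illustration, and this is not necessarily how the problem was diagnosed on WWW-12.

```python
def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on disk I/O between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [later - earlier for earlier, later in zip(first, second)]
    # Only the first eight counters are summed; the guest counters that
    # newer kernels append are already included in user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line from a live Linux system."""
    with open(path) as handle:
        return handle.readline()


# Hypothetical readings taken a few seconds apart.
BEFORE = "cpu 100 0 50 800 50 0 0 0"
AFTER = "cpu 200 0 100 1250 450 0 0 0"
print(iowait_percent(BEFORE, AFTER))
```

A sustained high value on a server that is not doing heavy disk work is the kind of symptom described above. On a live system, two calls to `read_cpu_line()` a few seconds apart would supply the inputs.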