Announcements about the operation of the forum
The cause of the server's instability has been identified as a failed hard disk. The hosting provider will replace it with a new one. While the disk is being replaced, the forum will be inaccessible. Hopefully this will not last long.
Feb 28, 2007 10:09 AM EST
Update regarding problems on server
Important: A one month credit will be issued to your account as an apology for the past few days' problems. If this is not satisfactory, please let us know what is and we'll do our best to please you.
The problem (a bad RAID array) has been found and will be fixed in the early AM hours tonight. Downtime should be short. There is a small possibility that the repair will fail, in which case a complete rebuild and restore will be needed, which will take longer.
For the past several days, the server has been having sporadic bursts of very high load due to high IOWAIT. Our first course of action is always to check for hardware issues with disks, etc. The usual reporting tools, however, showed no problems with the drives or the RAID array status. Once hardware is eliminated, these types of issues are usually traced back to MySQL queries that do not use an index or to large joins that tie up the server. These cause all other processes and queries to start piling up as they wait for server resources.
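For readers unfamiliar with the IOWAIT figure mentioned above, here is a minimal sketch of how it could be measured on a Linux machine. It assumes the standard `/proc/stat` layout, where the aggregate `cpu` line lists cumulative tick counters and the fifth counter is iowait. The function names are made up for this illustration and are not the tools the hosting provider used.

```python
import time


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only sum the first eight.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    def read_cpu_line():
        with open(path) as stat_file:
            return stat_file.readline()

    first = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(first, read_cpu_line())
```

Calling `sample_iowait()` in a loop and logging the result would show the kind of spikes described here. A sustained high value means processes are mostly stuck waiting on disk, which can come from heavy queries or from failing hardware, so the number alone does not say which.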
After several days of detailed monitoring and disabling of suspect databases, we still had found nothing that we believed could be causing the problems the server was having. We found small issues here and there, which we repaired, but nothing that should have been causing the huge loads.
At several different points (after helping users fix slow queries or other issues), we had hours where we thought all was well with the server again. Still, the high IOWAIT would reappear. After concluding that we were making no progress, we decided to take the server completely offline to check the drive and RAID status through the BIOS rather than through the software tools. It was at this time that we discovered there were issues with the array.
We did a basic repair to get the server back online last night, but a further drive replacement and full repair are still needed. This will be done in the early AM hours of Thursday to minimize disruptions. A tech is shelled into the server all day today to manually watch for any load spikes and cut them off before they cause visible slowdowns. The repairs tonight should be pretty quick. However, it is possible that the rebuild will not work and we will need to completely reinstall and restore the server from backup. Fresh backups are being made throughout the day to ensure that all data is as up to date as possible (normally backups are only run once a day). If a full restore is required, it will take several hours to complete.
We are very sorry for these problems. We use the highest-quality hardware to prevent problems like this, but there are still times like these when downtime occurs. We have been just as frustrated as you over the past few days as we monitored for every possible problem and couldn't seem to find anything wrong on the server to cause the problems. We will be issuing a one month credit to your account as a way of saying we are sorry for the problems. If the one month credit is not satisfactory, please let us know what is and we'll do our best to please you.
We would also like to thank those of you we've had conversations with over the past few days for your kind words of support. It really does help! After several days of problems, it can be very discouraging when no progress is made. For better or worse, we are in the end server geeks who take pride in fast servers and good uptimes, and who get a bit grouchy when problems come up that we aren't able to fix quickly. Your understanding has helped!
After we wrote the above update, we started to receive some reports of data loss. We are investigating this now. All users reporting data loss are being marked and we will restore backups to your account. Please update your ticket if you still have problems. If you have your own backup, you should restore it now for fastest recovery. Because the data loss seems to affect only some accounts, we will need to go through the backups manually to determine which users were affected. This will be a lengthy process to complete.
Your support team