At 019:00 UTC on the 15th of November, 2018, the Monitoring Team received a notification that PHP applications installed on one of our systems in Tokyo (TOKYO 07) were having issues connecting to the MySQL database.
The MySQL error we saw in the log was:
[Warning] Aborted connection 305628 to db: 'db' user: 'dbuser' host: 'hostname' (Got an error reading communication packets)
[Warning] Aborted connection 305627 to db: 'db' user: 'dbuser' host: 'hostname' (Got an error reading comm
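As an illustration, warnings in this format can be tallied per database, user and host with a short script. This is a minimal Python sketch, not the tooling our team used: the regular expression is an assumption based only on the log excerpt quoted above, and the names count_aborted and sample are hypothetical. A truncated entry, like the second one in the excerpt, simply fails to match and is skipped.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the warning format quoted above.
ABORTED_RE = re.compile(
    r"Aborted connection (?P<conn_id>\d+) to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)' "
    r"\((?P<reason>[^)\n]*)\)"
)


def count_aborted(log_text):
    """Count complete 'Aborted connection' warnings per (db, user, host)."""
    counts = Counter()
    for match in ABORTED_RE.finditer(log_text):
        counts[(match["db"], match["user"], match["host"])] += 1
    return counts


# Sample input: the complete warning plus the truncated one from the excerpt.
sample = (
    "[Warning] Aborted connection 305628 to db: 'db' user: 'dbuser' "
    "host: 'hostname' (Got an error reading communication packets)\n"
    "[Warning] Aborted connection 305627 to db: 'db' user: 'dbuser' "
    "host: 'hostname' (Got an error reading comm\n"
)

for key, total in count_aborted(sample).items():
    print(key, total)
```

Grouping by (db, user, host) is one possible choice; it makes it easier to see whether the aborted connections cluster around particular accounts.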
The Infrastructure Team was alerted to the issue within 3 minutes and immediately began investigating.
A quick investigation indicated that this was triggered by an automated upgrade that took place 17 minutes before the incident.
At first, we thought it was being caused by the Softaculous Application Installer, because most of the websites having this issue were ones installed with this 1-click
Because we do not want you to experience any noticeable downtime, we decided to migrate the impacted accounts to a stand-by replica (a process that often takes less than 5 minutes to complete). However, we kept seeing the same error, which indicated that another issue was in play.
Now, whenever a database communication error such as the one above occurs, it increments the status counter for either Aborted_clients or Aborted_connects. These count, respectively, the connections that were aborted because the client died without closing the connection properly, and the failed attempts to connect to the MySQL server.
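To tell which of the two counters is moving, one option is to compare two snapshots of the server status taken a few minutes apart (for example, from the MySQL statement SHOW GLOBAL STATUS LIKE 'Aborted%'). The Python sketch below assumes the snapshots have already been loaded into dictionaries; the function names and the sample values are hypothetical and are not taken from the incident.

```python
COUNTERS = ("Aborted_clients", "Aborted_connects")


def aborted_deltas(before, after):
    """Return how much each Aborted_* counter grew between two snapshots."""
    return {name: int(after[name]) - int(before[name]) for name in COUNTERS}


def fastest_growing(deltas):
    """Name the counter that grew the most, or None if neither moved."""
    name = max(deltas, key=deltas.get)
    return name if deltas[name] > 0 else None


# Hypothetical snapshots taken a few minutes apart (values are made up).
before = {"Aborted_clients": "1200", "Aborted_connects": "35"}
after = {"Aborted_clients": "1260", "Aborted_connects": "35"}

deltas = aborted_deltas(before, after)
print(deltas)
print(fastest_growing(deltas))
```

In this made-up example only Aborted_clients grows, which would point to connections that were established and then dropped rather than to failed connection attempts.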
The possible reasons for both errors are numerous but, for the sake of brevity, we assumed that MySQL was incrementing the Aborted_clients counter, which could mean:
- the client connected successfully but terminated improperly (for example, without closing the connection properly)
- the client stayed idle for longer than the defined wait_timeout or interactive_timeout seconds, after which the MySQL server forcibly closed the connection
- the client terminated abnormally or exceeded the max_allowed_packet limit for queries
To be honest, there are many things that could cause aborted connection errors, which often makes them difficult to diagnose.
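The checklist above can be expressed as a small helper that reports which causes are consistent with what was observed for a given connection. This is only a sketch in Python under stated assumptions: the function name, its parameters and the sample values are hypothetical, and in practice the observations would have to come from application logs and the server's configured wait_timeout and max_allowed_packet.

```python
def candidate_causes(closed_cleanly, idle_seconds, wait_timeout,
                     largest_packet_bytes, max_allowed_packet):
    """List which of the causes above are consistent with the observations."""
    causes = []
    if not closed_cleanly:
        causes.append("client terminated without closing the connection")
    if idle_seconds > wait_timeout:
        causes.append("connection idle longer than wait_timeout")
    if largest_packet_bytes > max_allowed_packet:
        causes.append("packet exceeded max_allowed_packet")
    return causes


# Hypothetical observation: a clean client that idled past an 8-hour timeout.
print(candidate_causes(
    closed_cleanly=True,
    idle_seconds=30_000,
    wait_timeout=28_800,
    largest_packet_bytes=2_048,
    max_allowed_packet=4_194_304,
))
```

An empty result means none of the three listed causes fits, which is a hint to look elsewhere.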
After trying out several pre-planned scenarios, we decided to start investigating each website individually.
Customers were also asked to either add the backup server IP (18.104.22.168) to their DNS record or use it in place of 22.214.171.124.
What We Found:
On deeper investigation, we discovered that the issue was not related to the server configuration in any way.
It took a very long time to go through every website and test each one to figure out what was going on.
In fact, data was restored from our redundant backup systems, using the backup of 2018-11-12 (we often keep data for 30 to 60 days before moving it to cold storage).
This briefly brought all applications back online.
We used the backup from the 12th to ensure that the same issue did not reappear after the restoration.
The Real Culprit:
We use CloudLinux, a hardened Linux OS that was built for enterprise hosting environments.
As part of our security suite, we use Imunify360, a preemptive all-in-one security tool powered by AI and Proactive Defense.
Proactive Defense protects websites running PHP, one of the most widely used programming languages for websites, against zero-day attacks.
It identifies attacks on Linux web servers in real time, then blocks potentially malicious executions automatically and with zero latency.
Proactive Defense uses a unique method of identifying security risks - it analyzes what scripts do rather than what is actually in the code or file.
Well, it turned out that this was actually related to an issue with Proactive Defense.
The Imunify360 PHP extension was having real issues with MySQL connections from various web scripts, including WordPress.
The symptoms we saw (and customers experienced) ranged from slow-loading websites and frequent HTTP 500 errors to MySQL connection errors.
What We Did To Get Things Under Control:
Working with the CloudLinux/Imunify360 (CL/IM360) team, we were able to find a way to make this tool behave the way it ought to.
It wasn't only OCS Hosting Service that was impacted; in fact, every web server worldwide that ran the update within that time frame experienced the same issue.
We understand how inconvenient this was and the impact that website downtime can have on your business. We are very sorry for this experience and hope you will accept our apologies.
To us, any production outage is a serious condition that merits significant introspection to help safeguard our customers' business efforts, and we work to prevent its recurrence so that neither we nor our customers are impacted by a similar one.
But as much as we work diligently to ensure that your system is online 24/7, we also rely on third-party tools like Imunify360 as part of our effort to keep you safe.
And the humbling truth is that no matter how hard we try to avoid this kind of situation, these things do occur.
But when they do, we work swiftly to resolve them, inform you of what happened and work to make sure that it does not recur in the same manner.
Your data is also always kept safe in multiple locations within your data region. As a fail-safe, every machine has a replica that is refreshed every 48 hours to ensure rapid migration if the main machine fails or acts erratically for any reason.
The machine you are now on is actually bigger and better: you will notice the improved speed and a new feature, Malware Cleanup, which enables you to clean up malicious scripts even without the help of our Security Team.
Malware Cleanup is designed to decrease the possibility of data loss and website malfunction after cleanup. It backs up an infected file before cleanup and trims a file instead of removing it.
The backup of an infected file lets a user restore the file to the state it was in before cleanup.
File backups are stored in special folders outside user home directories and shouldn’t be managed manually. Names of these files are not altered.
As we continue to optimize this system to be fast and to meet our SLA, we hope that you will find it a change for the better.
How You Can Help:
1. Please ensure that you have the IP: 126.96.36.199 as part of your DNS record or replace 188.8.131.52 with 184.108.40.206.
2. Keep the IP 220.127.116.11 as your backup IP.
If you are still experiencing this issue with your website, and have made the necessary changes, do let us know at once.
We will be here on stand-by in case you notice any other issues, or have any additional questions or concerns.
Again, we sincerely apologize for any inconvenience you may have experienced and appreciate your patience while we work diligently to resolve any remaining outstanding issues.
Thank you and have a great day | night.
The Infrastructure Team
Thursday, November 15, 2018