MyBlog – recent issues

By Ruth Powell February 25, 2019

We recently migrated the MyBlog service to new infrastructure. The new servers went live at approximately 09:00 on Thursday 7 February. Since the migration we have experienced periods of downtime caused by corruption of the MyBlog database. Symptoms include:

  • The homepage being unavailable:
    The root level site becomes corrupted, after which browsers get caught in an infinite redirect loop that ultimately crashes the web server (Apache)
  • Once the site is restored, we then see one or more of the following:
    • Editing of, and access to, some sites being blocked (caused by the Jetpack plugin crashing)
    • BuddyPress settings being reset to default
    • Pages being posted to the wrong blog – we’ve seen student pages being applied to the root level site
    • Custom theme homepage content being reset – content reverting to the default theme content
    • Debug/error messages printed to screen

As far as we are aware there have been four occurrences of this particular issue:

  • 12/02/2019 22:43:24 to 13/02/2019 11:51:24 (about 13 hours of downtime)
  • 19/02/2019 01:07:24 to 19/02/2019 08:51:24 (about 8 hours of downtime)
  • 19/02/2019 15:01:24 to 19/02/2019 15:16:24 (15 minutes of downtime)
  • 20/02/2019 13:37:24 to 20/02/2019 13:57:24 (20 minutes of downtime)
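
As a quick check on the figures above, the durations can be recomputed from the timestamps. This is only a minimal Python sketch: the timestamps are copied from the list, and the names `OUTAGES` and `downtime` are illustrative rather than part of any tooling we run.

```python
from datetime import datetime, timedelta

FMT = "%d/%m/%Y %H:%M:%S"  # day/month/year, as used in the list above

# (start, end) of each outage, copied from the list above
OUTAGES = [
    ("12/02/2019 22:43:24", "13/02/2019 11:51:24"),
    ("19/02/2019 01:07:24", "19/02/2019 08:51:24"),
    ("19/02/2019 15:01:24", "19/02/2019 15:16:24"),
    ("20/02/2019 13:37:24", "20/02/2019 13:57:24"),
]


def downtime(start: str, end: str) -> timedelta:
    """Return the length of one outage."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


durations = [downtime(start, end) for start, end in OUTAGES]
total = sum(durations, timedelta())  # start the sum from a zero timedelta

for (start, _), length in zip(OUTAGES, durations):
    print(f"{start}: {length}")
print(f"Total: {total}")
```

By this calculation the four outages add up to 21 hours 27 minutes.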

We now have a documented recovery process, so if this occurs again during support hours we can recover fairly quickly. However, part of the process involves restoring content from the old MyBlog server, so any content changed since the migration may be lost.

What we’re doing:

  • Analyzing log files – we can see in the database logs the query that is causing the corruption, but we haven’t yet been able to identify what in the code is issuing it
  • Increased logging levels – this may affect performance, but it will give us more detail on what’s happening (although it does mean we have to wait for another occurrence first)
  • We have created a script that restarts Apache when the server load goes above a certain threshold, which we hope will prevent database damage from occurring
  • Reviewing all installed plugins, as some are installed but not in use
  • Researching the availability of diagnostic plugins
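
The restart script mentioned above is not reproduced here. Purely as an illustration of the idea, a load watchdog might look something like the Python sketch below. The threshold value, the `apache2` service name and the `systemctl` restart command are all assumptions, not details of our actual script, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import subprocess

LOAD_THRESHOLD = 8.0  # assumed value; tune to the server's CPU count
RESTART_CMD = ["systemctl", "restart", "apache2"]  # assumed service name


def should_restart(load_1min: float, threshold: float = LOAD_THRESHOLD) -> bool:
    """True when the 1-minute load average is above the threshold."""
    return load_1min > threshold


def restart_apache() -> None:
    """Restart Apache; raises CalledProcessError if the command fails."""
    subprocess.run(RESTART_CMD, check=True)


def check_once(get_load=os.getloadavg, restart=restart_apache,
               threshold: float = LOAD_THRESHOLD) -> bool:
    """Read the load once and restart Apache if it is too high.

    Returns True if a restart was triggered.
    """
    load_1min = get_load()[0]  # os.getloadavg() -> (1 min, 5 min, 15 min)
    if should_restart(load_1min, threshold):
        restart()
        return True
    return False
```

A scheduler such as cron could call `check_once()` every minute or so. Passing the load reader and the restart action in as parameters keeps the decision logic testable without touching a live server.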

Next steps:

  • If we can’t find the root cause of the problem in the logs and it persists, we will have to start disabling plugins, but given the sporadic nature of the problem it could take a long time to isolate the one responsible
  • We’ll then need to start taking daily backups (exports) of the database so we can do a full restore; however, a restore will lose any changes made since the backup was taken
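
A daily export of this kind could be scripted along the following lines. This sketch assumes a MySQL database and the `mysqldump` tool; the database name `myblog` and the backup directory are placeholders, not our real configuration.

```python
import subprocess
from datetime import date
from pathlib import PurePosixPath
from typing import List


def backup_command(db_name: str, day: date, backup_dir: str) -> List[str]:
    """Build a mysqldump command that writes a date-stamped export."""
    target = PurePosixPath(backup_dir) / f"{db_name}-{day.isoformat()}.sql"
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={target}",
        db_name,
    ]


def run_daily_backup(db_name: str = "myblog",  # placeholder name
                     backup_dir: str = "/var/backups/myblog") -> None:
    """Export today's copy of the database.

    Credentials are expected to come from the MySQL client
    configuration (for example ~/.my.cnf), not from this script.
    """
    subprocess.run(backup_command(db_name, date.today(), backup_dir), check=True)
```

Date-stamped file names make it easy to see which day's state a restore would return to, and therefore how much recent work would be lost.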

The DL Support team have been asked to prioritize these investigations over all other work (unless there is an emergency with another service), so it may take us longer than usual to resolve other support tickets.
