How I Accidentally Torched My Wife’s Blog, and How I (Mostly) Recovered It
Let’s rewind a year. My wife’s first published book is in editing. We discuss her need for a professional author’s website. I take a look at what I’m paying Arvixe for Windows hosting, think about how irritating it’s been, how many things I can’t control, and how generally awful their service has been.
I’m a computer guy. I do random things, they generally work. Hardware, software — whatever. Rocks that do very fast math make sense to me. But… I wanted off Arvixe, and this was an excuse to make it a Project.
First I moved all our domains off Arvixe to NameSilo. Between their pricing and their functionality, I was happy. Specifically, because they didn’t offer hosting, no one was ever going to try to convince me to use their hosting services. Great! Then I started poking at VPS solutions: a step up from shared hosting, with nearly as much control as physical hardware for substantially less cost. I priced out a bunch of configurations at different vendors and in the end went with RamNode (I have an entire set of data on prices and functionality, but it’s now a year out of date, aka worthless). I will admit that RamNode’s documentation left something to be desired, but I was happy with what I managed to set up. With NameSilo providing nameservers, I didn’t need to worry about any functionality RamNode had in that regard; all I needed was a VPS.
And then I learned how to configure a VPS. I started with a LEMP stack. Yes. LEMP. Not LAMP. Instead of Apache, I used NGINX. Instead of MySQL, I used MariaDB. And, because I didn’t know better, instead of PHP I used HHVM. And I got it all working. One site per configuration file, each symlinked from sites-enabled back to sites-available. I set up duply to back up both the web content and the database. I scheduled crontab jobs to run a duply incremental backup six days a week, and a full backup every Sunday. I backed up all the NGINX configurations on the same schedule. I even checked that the backups were working, and that the logs had no errors.
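None of this was exotic. For anyone curious, the mechanics were roughly the following, with example.com standing in for a real domain and Debian-style default paths (yours may differ):

# one nginx config file per site, activated by symlinking it into sites-enabled
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/example.com
sudo nginx -t && sudo systemctl reload nginx

# create the duply profile (makes /root/.duply/blogs with its conf and exclude files)
duply blogs create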
Over time, I found out that HHVM broke NextGEN Gallery for WordPress, so I swapped HHVM out for fastcgi-php on all the sites. I stripped out the common configuration and put it into reusable snippets. I set up Let’s Encrypt SSL certificates for every site. When my father-in-law passed away, I set up a new domain with an email address, hosting, a blog, and content in under an hour, including a brand-spankin’ new SSL certificate. Because I had set everything up with scriptlets in the past, it was easy. I even threw Cloudflare in the middle.
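I no longer have the exact snippets, but the reusable pieces looked roughly like this. The snippet name, domain, and PHP-FPM socket path are illustrative (the socket in particular varies by distro and PHP version); the certificate paths are certbot’s defaults:

# /etc/nginx/snippets/wp-php.conf (hypothetical name for a shared snippet)
location ~ \.php$ {
    include snippets/fastcgi-php.conf;
    fastcgi_pass unix:/run/php/php-fpm.sock;   # adjust to your PHP-FPM socket
}

# /etc/nginx/sites-available/example.com
server {
    listen 443 ssl;
    server_name example.com;
    root /var/www/example.com;
    index index.php;

    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / { try_files $uri $uri/ /index.php?$args; }
    include snippets/wp-php.conf;
}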
Then I ran into a problem. I was trying to configure a MediaWiki installation (should be minor, right?) and discovered I didn’t know the root user password for the MariaDB installation. Well, damnit. I tried everything I could think of and came up empty. But hey: if I uninstalled MariaDB, wiped the data directories MariaDB used, installed MariaDB again, configured it fresh (with the same WordPress-intended username and password), then reimported the same tables from the existing SQL dump collected by the duply scripts, I’d be golden! I’d know the root user password, I could create a new user and new tables for MediaWiki, and the data would still be there for all the other sites. Perfect! I’d be using the backup strategy I so carefully put in place!
Note, gentle reader, that if that last sentence filled you with dread, you are WISE.
Now, I didn’t do this blindly. I double checked that the backups were there, the backup logs were clean, and that the blogs.dump.sql (such a great name!) was there from the last run.
cd /var/www
ls blogs.dump.sql
Perfect!
So I uninstall MariaDB, wipe its data directory, back up its configuration, reinstall MariaDB, make sure to pay attention to the root password this time, recreate the WordPress-expected user, and import blogs.dump.sql. Total time: under five minutes. The Uptime Robot alarm didn’t even trigger. I checked the sites, and everything still looked good.
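Reconstructed from memory, the whole thing was something like this (Debian-style package commands; the user name and password are placeholders, and the GRANT is broader than it needs to be):

# step zero should have been: check the dump's timestamp, not just its existence
sudo cp -a /etc/mysql /root/mysql-conf-backup          # back up the config
sudo apt-get purge mariadb-server && sudo rm -rf /var/lib/mysql
sudo apt-get install mariadb-server                    # note the root password this time!

# recreate the user WordPress expects, then reimport the dump
mysql -u root -p -e "CREATE USER 'wp_user'@'localhost' IDENTIFIED BY 'changeme'; GRANT ALL PRIVILEGES ON *.* TO 'wp_user'@'localhost';"
mysql -u root -p < /var/www/blogs.dump.sql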
I then had to spend entirely too much time getting test.office-monkey.com to work, as default ports, default servers, and mistakes while reloading nginx configs annoyed me just enough that I kept getting it wrong. But that was a side project. The site was working.
Until two days later, when my lovely wife told me she couldn’t access the admin console any more. Weirdly, it took her to the “signup for a new network site” page — not a 404, not a domain registration error. Then I just started making mistake after mistake…
I did two things concurrently: I pulled up the host’s default URL to check the configuration for her domain, and I opened Cloudflare. While fighting with MediaWiki, I had spent a LOT of time clicking Purge Everything to deal with wrong pages being served. So without thinking too much about it… I purged everything for my wife’s blog. Right after I clicked it, I tabbed over to the host URL and saw nothing was there. You can’t cancel a purge.
Every monitor on her blog tripped. I took a harder look at blogs.dump.sql:
ls -l
blogs.dump.sql was dated from last July. It most definitely did not have anything from her professional website — except the placeholder post.
I immediately started scrambling. I grabbed the entire contents of the SuperCache directory, which had most (but not all) of her pages cached. I grabbed it for my blog as well, but then I made mistake number two: I turned on SuperCache for every blog in the network. Turns out, that clears the existing cache (bye, every other site I was hosting!).
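If you ever find yourself in the same spot: WP Super Cache stores its pages as plain HTML on disk, so the first move is to copy them somewhere the plugin can’t touch. Something like the following, where the web root is an assumption about your layout:

# the supercache directory holds one folder per hostname, full of rendered HTML
cp -a /var/www/example.com/wp-content/cache/supercache /root/supercache-rescue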
At some point in the middle of this, I went upstairs to tell her I may have just lost everything from her blog. She was surprisingly calm.
Now, on the plus side, all images, themes, and plugins were still there. Posts were not. BUT!
The SuperCache files? They had hints. The URL of a CSS file told me which theme was in use. A filename told me which icon she had used. Between those, I was able to get the site to “look right” again. Then I started copying and pasting: I’d grab the raw HTML from a SuperCache page, paste it into a new post, then set the “Publish Later” date to the date she had originally published it. That actually got me about 90% of the way there. But then I discovered that not all the posts were cached.
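Figuring out exactly which posts I did have was just a matter of walking the rescued cache directory (same assumed layout as above; each cached page is an index.html under a folder named after its URL path):

find /root/supercache-rescue/example.com -name 'index.html' | sort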
But before I had started on this series of unfortunate events, I had poked at her Cloudflare analytics data, trying to get some idea of when taking her site down for maintenance would be least disruptive. While spelunking through it, I had noticed something everyone wants to see for a professional website: web crawlers.
So I started searching every search engine I could think of for the magic “cached page” result. Bing and Google helped. Baidu did not. And if you want to be depressed, visit https://en.wikipedia.org/wiki/Web_search_engine and take a look at how many are inactive. DuckDuckGo and Dogpile weren’t really helpful, either. But via judicious copy and paste, I was able to recover all but three blog posts. I even recovered her pages in addition to her posts, and linked them all into a menu that matched the previous layout (yay, SuperCache!). Interestingly, the Wayback Machine had indexed her site but hadn’t saved anything from it. Jerks.
All told, less than two hours after I screwed the pooch, I had her blog back up at 98% — and all the posts that were missing were from almost a year ago.
Unfortunately, that was one blog out of several.
My personal blog (this one!) was down. So was her old personal blog. So was her dad’s memorial.
I postponed those while I went to make sure everything was being backed up. I checked that the duply jobs were scheduled, and then that they had run. Everything looked perfect.
So why hadn’t the database dumps been updated in OVER A YEAR?
So, here’s the undocumented secret that caused all this mess (that, and me not looking before I leaped, or running ls -l to confirm the dump file had actually been updated recently):
10 1 * * * /usr/bin/duply /root/.duply/blogs incr > /var/logs/blogs.incr.log
Doesn’t look like much could be wrong with it, right?
Here’s the dirty part: duply {name} incr doesn’t run pre.
So everything was being correctly backed up, and versioned, and old versions cleaned up weekly, but the mysqldump command in /root/.duply/blogs/pre was never run. I was particularly proud of that script: I had set up PHP to pull the username and password out of the WordPress config files so the settings weren’t maintained in two places. But without pre being run, the duply backups had only backed up the files. Admittedly, all the files, but none of the post data.
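For the record, the pre script was just a small shell script that dropped a dump into the directory duply was already backing up. Mine pulled the credentials via PHP; the grep-based version below is a simplified sketch, and the wp-config.php path is an assumption:

#!/bin/bash
# /root/.duply/blogs/pre -- dump the database into the backed-up web root
WP_CONFIG=/var/www/example.com/wp-config.php

# read the credentials out of wp-config.php so they live in only one place
DB_NAME=$(grep "DB_NAME" "$WP_CONFIG" | cut -d "'" -f 4)
DB_USER=$(grep "DB_USER" "$WP_CONFIG" | cut -d "'" -f 4)
DB_PASS=$(grep "DB_PASSWORD" "$WP_CONFIG" | cut -d "'" -f 4)

mysqldump -u "$DB_USER" -p"$DB_PASS" "$DB_NAME" > /var/www/blogs.dump.sql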
So, first thing, I figured out how to fix that. I switched to a daily “duply {name} backup” combined with the severely underdocumented full-if-older setting. I added a weekly purge command to wipe anything older than the last two full backups. And I confirmed that duply {name} backup always runs pre.
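As best I can reconstruct it, the pieces looked roughly like this. The variable names come from duply’s standard conf template (check the comments in your own conf file), and the log paths are just examples:

# /root/.duply/blogs/conf (excerpt)
MAX_FULLBKP_AGE=1W          # "full-if-older": force a full backup weekly
DUPL_PARAMS="$DUPL_PARAMS --full-if-older-than $MAX_FULLBKP_AGE "
MAX_FULL_BACKUPS=2          # keep two full chains when purging

# crontab: daily backup (which always runs pre), weekly purge of older chains
# (the purge command is spelled purge-full in some duply versions)
10 1 * * * /usr/bin/duply /root/.duply/blogs backup > /var/log/blogs.backup.log
30 2 * * 0 /usr/bin/duply /root/.duply/blogs purgeFull --force > /var/log/blogs.purge.log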
Then I tried to figure out what to do…