Sorry for the late reply.
Thanks for offering to help.
Normally, money is not much of an issue: I try to be conservative with it, but on paper the server I moved to on June 30th is more than powerful enough to run the site, and still cheap enough for me to afford without thinking twice. AAO shouldn't need more expensive hosting - and really, all my monitoring shows that the site uses only a small fraction of the server's processing power and memory.
The issue here is that I have run out of ideas as to what causes these crashes.
- Ever since they started occurring, I have never been able to find any logs corresponding to a crash - everything seems perfectly normal until nothing runs anymore.
- They keep occurring on the new server, even though I changed the system architecture and the versions of most of the software involved, doubled the RAM and tripled the processing power, so it shouldn't be a hardware issue - or so I'd hope.
- They occur seemingly at random: sometimes a long time after the previous reboot, sometimes a very short time (e.g. I had to briefly reboot the server on July 9th, but on July 10th we had yet another severe crash... and nothing since I rebooted it on July 11th). Monitoring does not show any problem with memory usage, so it doesn't look like a memory leak, or more generally a RAM issue.
- The new server is a VM, not a physical machine, so it cannot be an overheating processor (otherwise it would take down the whole hypervisor, affecting dozens of other clients, and the host would reboot it).
- The kernel is configured to reboot automatically in case of a kernel panic, yet it doesn't... So the crash seems severe enough to bypass even the most basic kernel recovery mechanisms.
- I also have other sites running on other servers with the same setup (in fact, one is identical to the previous AAO server), and they don't have this issue. They get much less traffic, but aside from that they are pretty much identical... So this would seem to indicate that the problem is specific to AAO, either because of the amount of traffic or because of something in the site code deployed on AAO itself.
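Since the regular logs show nothing around a crash, one thing that could at least pin down when the machine freezes and what load and memory looked like just before is a small heartbeat script that writes one line every few seconds and forces it to disk. This is only a rough sketch and not something that is currently deployed: the function names, the log path and the 10-second interval are placeholders I made up, and it assumes a Linux system exposing /proc/meminfo and /proc/loadavg.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def heartbeat_line(meminfo_text, loadavg_text, now=None):
    """Build one log line: UTC timestamp, 1-minute load, available memory."""
    mem = parse_meminfo(meminfo_text)
    load_fields = loadavg_text.split()
    load1 = load_fields[0] if load_fields else "?"
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return "{} load1={} mem_available_kb={}".format(
        stamp, load1, mem.get("MemAvailable", "?")
    )


def append_durably(path, line):
    """Append a line and fsync it, so it survives a sudden freeze."""
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def read_text(path):
    with open(path) as handle:
        return handle.read()


def run(log_path, interval_seconds=10):
    """Loop forever, writing one heartbeat per interval (hypothetical defaults)."""
    while True:
        line = heartbeat_line(read_text("/proc/meminfo"), read_text("/proc/loadavg"))
        append_durably(log_path, line)
        time.sleep(interval_seconds)


# Example (placeholder path): run("/var/log/heartbeat.log")
```

After a crash, the last line of that file gives the latest moment the system was still able to write to disk, which can then be compared against traffic and access logs for the same minute.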
But I'm still stuck on what kind of application bug (if the problem is in the AAO site or forum code) could crash the server so severely that it doesn't even go through a kernel panic... And on why it didn't occur on the previous server (the one before the April 2017 move), where everything was pretty much the same - and we had more traffic back then as well.
So yeah, right now I'm keeping a close eye on it, but it's still a mystery to me :-/
And no, so far it doesn't look like there is any risk of data loss. In any case, I have weekly backups being exported, so I should be able to recover most of the content if something goes wrong.
If knowledge can create problems, it is not through ignorance that we can solve them.
( Isaac Asimov )