Over the course of the last year, we were seeing all 3 of our on-premises application servers periodically spike in memory, followed by a CPU rise that would grind the affected server to a halt. When this started happening, we'd have to pull that server out of the pool, restart it to get resources back to normal, then return it to the pool. This would have been simple if it weren't happening with our only client that didn't want to be on the AWS version of our software, where we had more robust logging, replication, failover, etc., but I digress...

Our normal debugging methods for understanding the resource exhaustion failed us for months. We crawled many a log, searched through Java heap dumps for hours and hours looking for memory leaks or other stranger artifacts, and reverse-engineered possible failure states that could produce this behavior. We eventually just ran cron jobs on those 3 boxes that restarted them every hour, since it was steadily getting worse: we would see total mem/CPU exhaustion after only 3-4 days, rather than the multiple-month gaps we had at the beginning. These periodic restarts were obviously just a bandaid, and we were getting serious pressure from our client to get to the bottom of this, as it had resulted in a blip in service twice in rapid succession, and was very visible to the Chief (our clients were police departments).

One day, right after the problem had surfaced again, as we were kicking around more possibilities for how this could happen, another engineer and I were randomly clicking through our New Relic metrics, which had so far yielded nothing clearly relevant to our issue. A few requests into the list of long-running requests from the last day, we saw one we had never seen in there before. It was a POST to an endpoint that our job servers hit once per day to clean up old files that were used as transient copies of forms that had to be printed from our application. These transient copies were saved to our file server, then downloaded by the user to send to a printer. Each had an expiration date set on a corresponding record in MySQL, which the cleanup job would query to determine which records to delete. The job would then delete both the MySQL record and the actual file on the file server in a transaction.

Crawling through code, we noticed 2 obvious problems: the code for batching the deletion requests was buggy, so all of the file records ended up getting loaded into memory at once, no matter how many there were; and the work was being done on the application server rather than the job servers, so any high resource usage caused by this daily file cleanup job would take resources from the box supporting the application users were actually clicking around on. Both were problematic and immediately fixable, but why had this suddenly become an issue when, in the 2 years prior, the behavior never seemed to have happened?

After crawling through our historical data on command status and completion, I found that 1 run of that file cleanup job had failed in early March of 2017. Since that one failure, every single run had failed (which goes to show that our error reporting in the on-prem deployment really needed some love). Each day around 11pm, the job would either time out, or get far enough into its process that it loaded many days' worth of expired files into memory, causing a cascading resource grab that made CPU spike as the garbage collector ran around trying to make room for all of those file records.
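For illustration, the effective shape of that cleanup logic was roughly the following. This is a minimal Java sketch with hypothetical table, column, and class names (transient_files, TransientFileCleanup), not our actual code; the point is the unbounded list: every expired row gets pulled into memory before any deletion happens, so a year of failed runs means a year of file records on the heap.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class TransientFileCleanup {

    // Ran on the application server; after a year of failed runs, this list holds
    // every expired record at once and the heap fills before deletion finishes.
    public void cleanUp(Connection conn) throws Exception {
        List<ExpiredFile> expired = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, file_path FROM transient_files WHERE expires_at < NOW()");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                expired.add(new ExpiredFile(rs.getLong("id"), rs.getString("file_path")));
            }
        }

        // Delete the MySQL record and the file on the file server together.
        conn.setAutoCommit(false);
        try {
            for (ExpiredFile file : expired) {
                try (PreparedStatement del = conn.prepareStatement(
                        "DELETE FROM transient_files WHERE id = ?")) {
                    del.setLong(1, file.id());
                    del.executeUpdate();
                }
                Files.deleteIfExists(Paths.get(file.path()));
            }
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }

    private record ExpiredFile(long id, String path) {}
}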
Sometimes the JVM just puked on this job and shut it down. Other times, the memory/CPU rise slowly ticked up for hours, until suddenly the servers would lock up on users.

After piecing together this picture of what was going on, we disabled the file cleanup job, fixed the batching problem, made the job able to take a timestamp as a param for the upper bound of the deletion query, and moved the workload onto the job servers. Then we very meticulously deleted a year's worth of stale files through periodic runs of the file cleanup job with steadily increasing upper bounds, keeping the number of files deleted in any one run to a reasonable amount. Finally, I was able to re-enable the file cleanup job on an hourly basis, so it would rarely have to chug through more than a thousand files at a time.
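And here is a rough sketch of what the reworked job looked like in spirit, again with made-up names and an assumed batch size rather than our real values: the deletion query is capped by the timestamp parameter and processed in small batches, so only one batch of records is ever in memory at a time.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class BatchedTransientFileCleanup {

    private static final int BATCH_SIZE = 500; // assumed value, not our real one

    // deleteBefore is the timestamp param mentioned above: the upper bound of the
    // deletion query, which let us backfill a year of stale files a slice at a time.
    public void cleanUp(Connection conn, Instant deleteBefore) throws Exception {
        while (deleteOneBatch(conn, deleteBefore) > 0) {
            // keep looping until nothing is left under the upper bound
        }
    }

    private int deleteOneBatch(Connection conn, Instant deleteBefore) throws Exception {
        // Only BATCH_SIZE records are ever held in memory, not every expired row.
        List<ExpiredFile> batch = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id, file_path FROM transient_files"
                        + " WHERE expires_at < ? ORDER BY expires_at LIMIT " + BATCH_SIZE)) {
            ps.setTimestamp(1, Timestamp.from(deleteBefore));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    batch.add(new ExpiredFile(rs.getLong("id"), rs.getString("file_path")));
                }
            }
        }

        // Same record-plus-file delete as before, but scoped to one small batch.
        conn.setAutoCommit(false);
        try {
            for (ExpiredFile file : batch) {
                try (PreparedStatement del = conn.prepareStatement(
                        "DELETE FROM transient_files WHERE id = ?")) {
                    del.setLong(1, file.id());
                    del.executeUpdate();
                }
                Files.deleteIfExists(Paths.get(file.path()));
            }
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
        return batch.size();
    }

    private record ExpiredFile(long id, String path) {}
}

Running something shaped like this hourly from the job servers, and walking the upper bound forward gradually during the backfill, is what kept each run small and kept the work off the application boxes.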