A Not-So-August August

I took over a new codebase at the start of the month. Yay! Well, it was for about the first day. The codebase has the highest WTF Factor™ I have come across. I have literally used up my expletives quota in the last 4 weeks. Part of the reason why rails is so easy to pickup is its “convention over configuration”. Once you’ve picked it up, you can usually jump into any rails codebase and be immediately at home. Each project will usually break from that convention in some way, but as long as enough care is taken, a new dev can be onboarded very quickly. This one, this one is special and not in a good way.

Getting started

The repo README? Never got an update after rails new. Nice, we get to setup blind! But at least there’s a Dockerfile, no docker-compose file though. Which of course was updated never, still runs as the root user and does some apt-key, gpg magic to install nodejs and yarn. That wasn’t going to be much use. Luckily I had a project with docker files I had updated recently. Copied those over then tried to build the image, over and over and …

Lesson learnt - before install, bundler processes all gems in Gemfile, even those in --withouted groups

I kept running into a missing gem error. The gem in question is a local gem - under lib/ - which was of course not available to the particular build step. What threw me for a loop was that the gem was listed in an :omit group, which was explicitly omitted from install with a bundle config --local without 'production omit'. Wish I could say I took less than an hour before I figured out what bundler was doing. The solution? Wrapping the line in the Gemfile with a check for if the folder exists. Hacky, but image built and on to the next challenge.

I created the database, ran migrations, loaded the seeds and built the assets. All good. Then I ran the server, and checked the seeds file for the development user logins and nothing! Not a deal breaker, just need to register. Even better to get a feel of the process a user follows to get started.

I filled in the registration form, hit submit and 500. Jumped to the logs and WTF! A network error? How? Why? Yes my internet had dropped but, what? I checked where the error was ocurring and I immediately knew this code was going to be a pain to work with. The plan, attached to the default subscription on an account was being fetched from the payment provider! That’s despite having a plans model and plans being the only objects created in the seeds file. If I was smarter, I would have quit the project there and then!

Into the code

Tests? Hah! There was an error on running rake test, and I don’t mean an error on a test case! The test_helper would not even load. Hopes of upgrading from rails 7.0 quickly evaporated. If you’re the type to learn a codebase through tests, you feel my pain. The last batch of model tests were written in 2022. The only other tests are those generated by rails.

Light controllers? Yeah, no! If they were any heavier, there would be a shift in the earth’s rotation axis. And as if all that computation is not enough, we’re literally doing math in the views! The code had gems like this;

<% commissions_total = 0 %>
<% total_amount = 0 %>
<%# plus a few more
<% invoice_aggregates.each do |row| %>
  <% commissions_total += row.amount %>
  <% total_amount += row.amount %>
  <%# table rendering
<% end %>

And no, not once or twice.

The main reason for the above? Reports. I have not yet gotten SQL aggregates down to a science yet. It is an art, with a lot of trial and error before the figures finally add up. And for a business, that is the whole point, figures must add up. Problem was, they mostly didn’t. I for one would rather have no report than an inaccurate one. Most of the backing code followed a similar process;

Run an aggregate query on the main model(say, invoices) - maybe grouped by user
For every user in the previous step, run a query to retrieve their commissions
Build a hash of the data from the first 2 steps
Finally calculate totals in the view You should have seen the number of queries in the development logs! Scary stuff.

The never ending job

About a week before I took on the project, my lead was troubleshooting an issue. The server was down. This had happened before and he knew where to look; the logs had filled up the drive! Best part, you couldn’t get a root console into the server to delete the damn logs. While trying to help, I managed to get a non-root ssh session going. We were able to clear up some space from old capistrano releases and the yarn build cache. That allowed him to start the root session and truncate the logs. A restart, and the server was back up.

At the time I still had the luxury of not having to think too hard about the problem. Happy days! Then it reoccurred on my first Friday with the project! Why is it always Friday? I followed the same script as before, to get us back live, then went digging for the why. Log rotation was configured - zip logs daily and keep 5 - no limit to log file size, but how could the logs for a single week top 30GB?

On tailing the log, I saw a sidekiq job logging every few seconds. Weird, we don’t have that kind of traffic. So I went to search for where the job was called. It was being enqueued after a subscription is updated. Nothing wrong there. Then I checked the code for the job, and there it was. The job made an update to the subscription status, but there was no guard to ensure that its changes did not enqueue the job again! Basically, the job ended up in a recursive call, only terminating when the server ran out of resources.

It was only after I had fixed the issue that I checked the sidekiq console; over 300 million jobs processed and about 500k failures! There were two more cron jobs that were running too often for the tasks they were performing. I adjusted those schedules and now no more server crashes, or 100% CPU utilization and the disk is consistently below 30% usage. My first major win!

To be continued… Maybe