Fixing One Bug Leads to Another

One thing that frequently frustrates me in my work is sloppiness. I say “work” because it can refer to anything, but here I’ll talk about sloppy code.

Recently, someone from my previous team asked me to help fix an issue in a system while its maintainer was unavailable. At first, I discussed it with my team lead and didn’t get involved, but after a week the problem was still there, and they asked for my help again. So I finally decided to give it a try.

The story

The issue was that one of the microservices was consuming messages twice from a Kafka topic. A Kafka client consumes a message in this order:

  1. Read a new message
  2. Process the message
  3. Commit the message

With an at-least-once delivery model, each message is delivered to you at least once, so duplicates can appear in your system whenever a message is delivered twice or more. A typical scenario here: your application reads a message, updates some business entity in the database, and fails before committing the message. The same message then comes back again.
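To make the at-least-once behaviour concrete, here is a minimal sketch of that read/process/commit loop. It assumes the kafka-python client; the topic name, connection details, and process_message handler are placeholders, not the actual service’s code.

    from kafka import KafkaConsumer

    def process_message(value: bytes) -> None:
        # Placeholder for the real business logic (e.g. update a database row).
        print("processing", value)

    consumer = KafkaConsumer(
        "orders",                          # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="order-service",
        enable_auto_commit=False,          # commit manually, after processing
    )

    for message in consumer:               # 1. read a new message
        process_message(message.value)     # 2. process the message
        consumer.commit()                  # 3. commit the offset
        # A crash between steps 2 and 3 means the offset is never committed,
        # so the same message is delivered again on the next poll.

That gap between steps 2 and 3 is exactly where the duplicates in this story come from.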

To overcome this issue, the application has to make the consumer idempotent, using the idempotent consumer pattern:

  1. Read the message
  2. Begin a database transaction
  3. INSERT INTO PROCESSED_MESSAGE (subscriberId, messageID) VALUES (subscriberId, message.ID)
  4. Update one or more business objects
  5. Commit the transaction
  6. Acknowledge the message

After starting the database transaction, the message handler inserts the message’s ID into the PROCESSED_MESSAGE table. Since (subscriberId, messageID) is the PROCESSED_MESSAGE table’s primary key, the INSERT will fail if the message has already been processed successfully. The message handler can then abort the transaction and acknowledge the message.
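A minimal sketch of this pattern, assuming PostgreSQL accessed through psycopg2. The processed_message table and its (subscriber_id, message_id) key come from the description above; the connection string and the business UPDATE are hypothetical placeholders.

    import psycopg2
    from psycopg2 import errors

    conn = psycopg2.connect("dbname=app user=app")    # placeholder connection

    def handle(subscriber_id: str, message_id: str, payload: dict) -> bool:
        """Process a message at most once per (subscriber_id, message_id).
        Returns True when the message can be acknowledged."""
        try:
            with conn:                                # one transaction: commit on success,
                with conn.cursor() as cur:            # rollback if anything raises
                    # Record the message first; the primary key on
                    # (subscriber_id, message_id) rejects a second attempt.
                    cur.execute(
                        "INSERT INTO processed_message (subscriber_id, message_id) "
                        "VALUES (%s, %s)",
                        (subscriber_id, message_id),
                    )
                    # Update the business objects in the same transaction.
                    cur.execute(
                        "UPDATE orders SET status = %s WHERE id = %s",   # hypothetical update
                        (payload["status"], payload["order_id"]),
                    )
            return True        # committed: acknowledge the message
        except errors.UniqueViolation:
            return True        # already processed: just acknowledge it

Acknowledging the message in both branches is what makes the consumer idempotent: the database, not the handler, is the record of which messages have already been seen.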

So I checked whether the code was idempotent, and fortunately, it was. But with several problems:

Sloppy code. The steps were in the wrong order: the INSERT into PROCESSED_MESSAGE was the last step, while it should be the first action inside the transaction. You want to assume the message is processed, so that if any error occurs, the transaction fails, the INSERT is rolled back, and you don’t have to manually decide where in the code’s flow to mark the message as read. Also, the messageID was not unique, so two concurrent inserts of the same message would not cause any error, while they should. It’s always better to crash than to be inconsistent; the latter takes much more time to detect and recover from.
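The uniqueness part of the fix boils down to a composite primary key, so a second insert of the same message fails loudly instead of silently succeeding. A sketch of such a migration, with assumed table and column names, using a psycopg2 connection like the one above:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")    # placeholder connection
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS processed_message (
                subscriber_id TEXT NOT NULL,
                message_id    TEXT NOT NULL,
                processed_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
                -- The composite primary key makes a concurrent duplicate
                -- insert fail instead of quietly creating a second row.
                PRIMARY KEY (subscriber_id, message_id)
            )
            """
        )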

Not knowing the business domain of your application. The third problem: the message ID was generated from the whole message body serialized to a string, literally a text field in PostgreSQL used for deduplication and searching! It is inefficient and expensive to make your database match full text for deduplication, and it also breaks if the format of the same message changes (for example, after a schema update). I can guess this happens because developers can be far from the business domain, but you should always understand why you are writing a piece of code. By asking the product manager questions like “which unique values identify this message?”, you can use that information to do the deduplication for that use case.
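For example, a deduplication key can be derived from the business identifiers inside the message rather than from its full serialized body. The field names below (order_id, event_type) are hypothetical; the real ones would come from that conversation with the product manager.

    import json

    def dedup_key(raw: bytes) -> tuple[str, str]:
        # Parse the message and keep only the fields that make it unique
        # for the business, instead of storing the whole body as text.
        body = json.loads(raw)
        return (body["order_id"], body["event_type"])   # hypothetical fields

A key like this stays stable when the message schema gains new fields, which is exactly where the full-body approach falls apart.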

I fixed the algorithm and changed the table schema to use a better deduplication key, after discussing it with the business people to discover the unique values in this message. Once everything was done, I started testing the code; while doing that, I checked the logs and realized that the application was consuming messages from two weeks ago. That should not happen if we are committing the messages correctly. I dug deeper and found that the duplication bug was part of a more significant issue: the Kafka consumer was not committing messages correctly, so it produced duplicates even after a message was processed successfully. Even worse, the application did not fail when a commit failed; it silently ignored the error. Until that moment, the deduplication logic had not just been doing deduplication; it had been serving as the Kafka offset manager!

After realizing that it was not easy to fix and needed proper investigation, I handed it back to the original maintainer. But out of curiosity, I did look into why the issue was happening and ended up with this hypothesis. Kafka has a max poll interval configuration (max.poll.interval.ms) that determines how long the consumer may go between polls (reading new messages from the topic). If a consumer reaches this limit before polling again, it is considered unhealthy and replaced with a new consumer. Meanwhile, the old consumer only has a limited time to commit its pending offsets, and if it cannot, those messages are not considered done and will be consumed again.
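This is only a hypothesis, but the two knobs involved can at least be made explicit in the consumer. A sketch with kafka-python and placeholder names and values: give the consumer enough max_poll_interval_ms for its slowest batch, and never swallow a failed commit.

    from kafka import KafkaConsumer
    from kafka.errors import CommitFailedError

    def process_message(value: bytes) -> None:
        print("processing", value)        # placeholder for the real handler

    consumer = KafkaConsumer(
        "orders",                         # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="order-service",
        enable_auto_commit=False,
        max_poll_interval_ms=300_000,     # max gap between polls before the consumer
                                          # is considered failed and the group rebalances
    )

    for message in consumer:
        process_message(message.value)
        try:
            consumer.commit()
        except CommitFailedError:
            # The group probably rebalanced and this consumer lost its partitions.
            # Crash loudly instead of silently letting the message be redelivered.
            raise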

The point here is that the second bug was buried under the first issue, and I surfaced it by fixing the first one. Kind of a rabbit hole, and who knows whether I would have faced a Hydra had I kept going?

The Hydra, a sea monster: as soon as one head was cut off, two more heads would emerge from the fresh wound.

Yak shaving

The more precise term for this situation is yak shaving: having to fix something else before you can work on the current issue. It is essential to be ready to go down this path before actually starting. For me, it has always been a frustrating experience: you can’t get the original task done, estimations become incorrect, and you have no clue where you stand until you resolve the final issue.

What did I learn

Some bugs seem easy to fix, like this one, even when you and everyone else have no idea why the issue is happening. From now on, when a request like this comes to me, I’ll ask for time to investigate the issue and understand how it happens. That might take longer; in this case, I only had a couple of days to work on it, so I never had a real chance to fix the issue. I’m not saying every bug fix should go this way; if you have worked with a system, know why something is wrong, and are sure about its behavior from experience, go ahead and work on the fix. But in my case, a lot had changed in the system after I left the team, so it was no longer the system I used to know.

Also, don’t forget the chance of ending up yak shaving. Investigate enough to see whether anything else is wrong besides the current issue. As The Pragmatic Programmer puts it:

Is the problem being reported a direct result of the underlying bug, or merely a symptom?

It’s tempting to help others with something you’re sure you can handle. But the bottom line is that when someone asks for help, they expect the issue to be resolved, and help that doesn’t get there isn’t really help at all.