A guy from support added me in a ticket about an issue with a customer. Their API calls were failing. Some issue with incorrect API keys.
I was out of office at the time and the only other guy that could help was fired last week.
Until I came back, dozen messages arrived in my mail and chat:
- “The main issue is db replication”
- “This is a small data inconsistency issue”
- “We have seen this again in the past three times”
- “Just delete and recreate the replicas in other regions”
- “Customer escalated”
- “Please do it asap”
Well, I will certainly do NOT delete a set of production dbs, just because someone says so. Especially in this case which their setup was completely unknown to me.
I asked for evidence. Logs, dashboards, RCAs of the issues. Nothing.
Started checking myself. A dev guy helped me reproduce it and check logs from DB.
The error was something like: “Duplicate key xxxxxx in yyyyyy when executing query: INSERT INTO db ….”
Hmmm. This was from the replica. An INSERT query in the replica.
Check again. Oh, we have read AND write replicas.
Across regions.
Meh.