What Fintech Taught Me About User Trust

@safarslife·September 5, 2024

At Uzum, we process a lot of payments. Not "a lot" in the startup sense where you're proud of your first thousand transactions - a lot in the sense that when something breaks in the payment flow, the support queue fills up within minutes and the ops team is calling before you've even seen the error in Grafana. That kind of scale changes how you think about trust.

The thing about payments is that the failure modes are asymmetric in a way that most product domains aren't. If the product recommendation engine serves bad results, users get annoyed and scroll past. If a payment fails silently - the money leaves the user's account but the order doesn't confirm - you've created a situation where the user has lost money and has no idea what happened. That's not just bad UX. That's a trust-destroying event, and the recovery path is painful for everyone involved.

What "idempotency" actually means for users

Early in my time working on payment features, I had to get comfortable with a concept that sounds academic but has very real product consequences: idempotency. A payment operation is idempotent if you can safely retry it without creating duplicate charges. This matters because networks fail. The user taps "Pay," the request goes out, the payment processor processes it, and then the response gets lost somewhere between the processor and your server. Your system doesn't know if the payment succeeded. Do you retry?

If your payment endpoint isn't idempotent, retrying creates a duplicate charge. The user gets charged twice. If it is idempotent - if you're passing a unique idempotency key with each request and the processor deduplicates on that key - retrying is safe. The processor recognizes the key, returns the original result, and the user gets charged once.
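The deduplication side of this can be sketched in a few lines. This is a minimal illustration, not any real processor's API: the in-memory dict stands in for what would be a database with a unique constraint on the key, and all the names are hypothetical.

```python
# Minimal sketch of server-side deduplication on an idempotency key.
# The dict stands in for durable storage with a unique constraint.
import uuid

class PaymentProcessor:
    def __init__(self):
        self._results = {}  # idempotency_key -> original charge result

    def charge(self, idempotency_key, user_id, amount):
        # Seen this key before? Return the original result, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = {
            "charge_id": str(uuid.uuid4()),
            "user_id": user_id,
            "amount": amount,
            "status": "succeeded",
        }
        self._results[idempotency_key] = result
        return result

processor = PaymentProcessor()
key = str(uuid.uuid4())  # the client generates one key per logical payment
first = processor.charge(key, user_id=42, amount=1000)
retry = processor.charge(key, user_id=42, amount=1000)  # lost-response retry
assert retry["charge_id"] == first["charge_id"]  # one charge, not two
```

The important property is that the retry path and the first-attempt path return the same object, so the client never has to know which one it hit.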

This is a technical detail that most PMs never think about. But it directly determines what happens to users in a failure scenario that occurs constantly at scale. We had a period where our mobile client was retrying on network timeout without idempotency keys. The duplicate charge rate was low - maybe 0.1% of transactions - but at our volume that was hundreds of users per day getting charged twice. The fix was a backend change and a client update. The root cause was that nobody had thought through the retry behavior when the feature was specced.

⚠️

Without idempotency keys on retries, a 0.1% duplicate charge rate sounds negligible. At scale, it's hundreds of users per day losing money and trust simultaneously.

The error message problem

There's a specific failure mode in fintech that I've come to think of as the "limbo state" - the user's money has moved but the transaction hasn't resolved to a clear success or failure. This happens when a payment is processing asynchronously and the status update is delayed, or when a webhook from the payment processor arrives out of order, or when there's a race condition between the payment confirmation and the order creation.

The technical solution involves things like event ordering guarantees and idempotent state machines. But the product solution - the thing that determines whether the user panics or waits calmly - is the error message. "Payment processing" with a spinner is fine for three seconds. After thirty seconds, the user needs to know what's happening. After two minutes, they need a clear status and a way to check.
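One way to picture the "idempotent state machine" part: make state transitions monotonic, so a webhook that arrives late or out of order can never move a payment backwards out of a resolved state. A toy version, with illustrative state names and ranks:

```python
# Sketch of a monotonic payment state machine: updates can only move
# a payment forward, so out-of-order webhooks can't un-resolve it.
STATE_RANK = {"created": 0, "processing": 1, "succeeded": 2, "failed": 2}

class Payment:
    def __init__(self):
        self.state = "created"

    def apply_update(self, new_state):
        # Ignore stale or duplicate updates that arrive out of order.
        if STATE_RANK[new_state] <= STATE_RANK[self.state]:
            return False
        self.state = new_state
        return True

p = Payment()
p.apply_update("succeeded")   # final status happens to arrive first
p.apply_update("processing")  # the delayed 'processing' webhook lands late
assert p.state == "succeeded"  # stale update was ignored
```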

We spent real time on this. Not on the happy path - on the failure taxonomy. What are the distinct states a payment can be in? What does the user need to know in each state? What action can they take? "Your payment is being processed - this can take up to 5 minutes for card payments. You'll receive a confirmation SMS when it completes" is a completely different user experience than "Something went wrong. Please try again." Both might be technically accurate. Only one keeps the user from calling their bank.
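A failure taxonomy like this eventually becomes a table in the codebase: each state maps to user-facing copy and the one action the user can take. The states, timings, and wording below are illustrative, not our actual model:

```python
# Hypothetical failure taxonomy: each payment state maps to the message
# the user sees and the action available to them.
from enum import Enum

class PaymentState(Enum):
    PROCESSING = "processing"              # normal async processing, < 30s
    DELAYED = "delayed"                    # still unresolved after 30s
    SUCCEEDED = "succeeded"
    FAILED_RETRIABLE = "failed_retriable"  # e.g. processor timeout
    FAILED_FINAL = "failed_final"          # e.g. card declined

USER_COPY = {
    PaymentState.PROCESSING: ("Processing your payment...", None),
    PaymentState.DELAYED: (
        "Your payment is being processed - this can take up to 5 minutes "
        "for card payments. You'll receive a confirmation SMS when it "
        "completes.",
        "check_status",
    ),
    PaymentState.SUCCEEDED: ("Payment confirmed.", None),
    PaymentState.FAILED_RETRIABLE: (
        "We couldn't complete your payment. You have not been charged.",
        "retry_payment",
    ),
    PaymentState.FAILED_FINAL: (
        "Your bank declined this payment. Please try another card.",
        "change_method",
    ),
}

def message_for(state):
    return USER_COPY[state]

text, action = message_for(PaymentState.DELAYED)
assert action == "check_status"
```

The point of making it a table is that every state must have an entry - you can't ship a new failure state without deciding what the user sees.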

Trust is a database, not a variable

The mental model I've developed is that user trust in a financial product works like a database with very slow writes and very fast deletes. You accumulate it over dozens of successful transactions, each one adding a small increment. A single bad experience - a duplicate charge, a payment that disappears, a support ticket that takes a week to resolve - can delete most of what you've built.

This asymmetry should drive how you prioritize. At Uzum, we have a rule that any change touching the payment flow requires a full regression cycle, not a smoke test. Stakeholders push back on this constantly because it adds time to every release. My answer is always the same: the cost of a regression in the payment flow is not a bug report. It's users losing money and losing trust, and the support cost and churn that follows. The regression cycle is cheap by comparison.

What I look for now

When I review specs or PRs that touch anything financial, I'm looking for a few specific things. How does the system behave when the payment processor is slow? What happens if the webhook arrives twice? What's the retry behavior, and is it safe? What does the user see during each failure state?
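The "webhook arrives twice" question has the same answer as the payment retry: deduplicate on a unique event ID before doing anything with side effects. A minimal sketch, assuming each webhook carries an `event_id` (hypothetical field names throughout):

```python
# Sketch of an idempotent webhook handler: duplicate deliveries of the
# same event must not confirm the order twice.
processed_events = set()     # stand-in for a durable dedup store
order_confirmations = []

def handle_payment_webhook(event):
    event_id = event["event_id"]
    if event_id in processed_events:
        return "duplicate_ignored"  # safe: second delivery changes nothing
    processed_events.add(event_id)
    order_confirmations.append(event["order_id"])  # confirm the order once
    return "processed"

evt = {"event_id": "evt_1", "order_id": "ord_9", "status": "paid"}
handle_payment_webhook(evt)
handle_payment_webhook(evt)  # the processor redelivers the same event
assert order_confirmations == ["ord_9"]  # confirmed exactly once
```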

These aren't edge cases. At scale, they're regular occurrences. A payment processor that's 99.9% reliable will fail on 0.1% of transactions - and at Uzum's volume, that's not a rounding error. Building for the happy path and hoping the failure cases are rare enough to ignore is a strategy that works until it doesn't, and when it stops working, it stops working loudly.

The fintech mindset I've carried forward is simple: design for failure first, because failure is guaranteed. The question is only whether you've thought through what happens when it occurs.