"Has to" is pretty strong. You folks have chosen to work that way, but other peo...

akerl_ · on April 27, 2014

I hope you're trolling, but rolling out high-impact code as described in this article and comment thread without serious review and QA is akin to business suicide, as demonstrated by the article.

hft_throwaway · on April 27, 2014

The code was QAed, but they didn't test old and new versions against each other. Version A could accept a flag and run obsolete logic that would lose control of its orders, but Version A never sent that flag, so the problem never happened. Version B sent this flag, and the Version B receiver would send RPI orders with it. Put a Version B sender and a Version A receiver together and you end up with a disaster.

From a systems perspective, my takeaways on this are:

-Don't re-use a message for a semantically different purpose in a distributed system where you're running different software versions (even in cases where you don't plan to, really, since you may roll back or end up running the wrong code by mistake)

-Version your messages so anything that changes their meaning can only be accepted by a receiver that follows that protocol

-QA old and new builds against one another
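To make the last two takeaways concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `OrderMessage` fields, the `Receiver` class, the version numbers); it is not how any real trading system is built. It only shows the shape of the idea: the receiver rejects any message version it was not written for, and a small matrix test runs every sender version against every receiver build.

```python
from dataclasses import dataclass


class UnsupportedVersion(Exception):
    """Raised when a receiver sees a protocol version it does not implement."""


@dataclass(frozen=True)
class OrderMessage:
    version: int        # hypothetical protocol version; bump it when a field's meaning changes
    symbol: str
    quantity: int
    flag: bool = False  # the meaning of this field is defined per version


class Receiver:
    """Hypothetical receiver build that only accepts versions it explicitly supports."""

    def __init__(self, supported_versions):
        self.supported_versions = frozenset(supported_versions)

    def handle(self, msg):
        # Refuse instead of guessing what an unknown version's fields mean.
        if msg.version not in self.supported_versions:
            raise UnsupportedVersion(f"cannot handle protocol version {msg.version}")
        return f"accepted v{msg.version} order for {msg.symbol}"


def cross_version_matrix(sender_versions, receivers):
    """Send one message per sender version to every receiver build and record the outcome."""
    results = {}
    for version in sender_versions:
        msg = OrderMessage(version=version, symbol="XYZ", quantity=100, flag=True)
        for name, receiver in receivers.items():
            try:
                receiver.handle(msg)
                results[(version, name)] = "ok"
            except UnsupportedVersion:
                results[(version, name)] = "rejected"
    return results


if __name__ == "__main__":
    builds = {"old": Receiver({1}), "new": Receiver({1, 2})}
    for pair, outcome in sorted(cross_version_matrix([1, 2], builds).items()):
        print(pair, outcome)
```

The point of the matrix is that a new sender talking to an old receiver fails loudly with a rejection rather than being silently reinterpreted by old logic.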

If you really want to look at the root cause of this, it's cultural. Trading desks don't want to spend development time on things that don't generate PnL. Traders want to try lots of ideas so many features are built that don't get used. Code cleanup gets put on the back burner. Developers do sketchy stuff like re-purposing a message field because it's annoying or time-consuming to deploy a new format. If traders aren't developers themselves, they may underestimate the risk of pressuring operations & devs to work more quickly.

Things like this are probably the biggest risk faced by automated traders, and the good shops take it very seriously. I've never been scared of any loss due to poor trading, but losses due to software errors can be astonishing and happen faster than you can stop them.

wpietri · on April 27, 2014

I'm not trolling at all.

Let me take his points:

> QA has to review it

QA review is one approach to quality, but it's far from the only one. In Lean Manufacturing, heavy QA is seen as wasteful, covering up for upstream problems. Their approach is to eliminate the root problems. That let Toyota kick the asses of the US car manufacturers in the 80s.

>documentation has to be written/updated

This to me smells of a phasist approach, with disconnected groups of specialists. Some people work with cross-functional teams, so that everything important (e.g., both code and user documentation) is updated at the same time.

>marketing may need to write a press release, sales and customers may need to be notified

This is confusing releasing code with making features active for most users. You can do them together, but it's not the only way. Feature flags and gradual rollouts are two other options.
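As a rough illustration of separating deploy from release, here is a minimal Python sketch of a percentage-based feature flag. The function names and the hashing scheme are assumptions made for this example, not any particular flag library's API. The code ships to everyone, but only users whose stable bucket falls under the rollout percentage see the feature.

```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """True if this user falls inside the current rollout percentage.

    Buckets are deterministic, so raising rollout_percent only ever adds users;
    nobody who already has the feature loses it as the rollout grows.
    """
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        enabled = sum(is_enabled("new-checkout", u, percent) for u in users)
        print(f"{percent}% rollout: {enabled} of {len(users)} users enabled")
```

Setting the percentage back to 0 turns the feature off without a redeploy, which is what makes this useful for recovering quickly.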

More broadly, in this case rolling out the code with serious review and QA was also business suicide. The "do more QA" approach is trying to increase MTBF, with the goal of nothing bad happening ever. But there's another approach: to minimize MTTR (or, more accurately, to minimize impact of issues). Shops like that are much better at recovering from issues. Rather than trying to pretend they will never make mistakes, they assume they will and work to be ready for it.

MartinCron · on April 27, 2014

The biggest problem that I see with having heavy QA as a gateway to release (from a lean production perspective) is that it tends to encourage deploying large batches of changes at once. When something goes wrong (which it will), which one of the n changes (or combination of changes) caused the problem? How can you roll back/roll forward a fix to just the one problem?