Erich Schlaikjer explores the virtues and complexities of technology, and how good technology is ultimately what allows Cantab to create profits for our investors.
When analysts enumerate the virtues of systematic funds, the first benefit in the list is usually the diversification that CTAs provide to your portfolio. Liquidity often comes next, followed by controlled volatility.
Actual systematic features are discussed last, if at all.
Indeed, the word "systems" is often used to mean "models", forgetting about the technological implementation. This is a big oversight in a world obsessed with software companies like Google and Facebook.
For Cantab, tech does not just mean IT for operations. Tech is crucial to our front office as well. Tech (done right) is what makes systematic investing an intellectual property business, growing more valuable as the code base increases.
Tech is alpha, and moreover, it is alpha that does not walk out the door or wake up one morning with an irrational hunch.
Software is about managing complexity. Complexity arises from diversifying models across assets, sources of return, implementations, and timeframes.
Imagine a simple daily trending model on one asset. It might have two moving parts, a red line and a blue line.
Increase the model count to twenty, and likewise the asset count, and you have 800 moving parts (20 models × 20 assets × 2 parts).
Trade ten times a day, and there are 8,000 moving parts. Already this is beyond the safety limit of any spreadsheet.
Then add a complex weighting algorithm, or perhaps a few thousand cash equities, and you have a problem that requires very careful software engineering.
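The "red line and blue line" above can be sketched as a dual moving-average crossover. This is a minimal illustration, not Cantab's model; the window lengths are arbitrary.

```python
# A toy daily trending model with two moving parts: a fast and a slow
# trailing moving average. Window lengths here are illustrative.

def moving_average(prices, window):
    """Trailing moving average; None until enough history exists."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

def trend_signal(prices, fast=5, slow=20):
    """+1 while the fast line is above the slow line, -1 below, else 0."""
    fast_ma = moving_average(prices, fast)
    slow_ma = moving_average(prices, slow)
    signals = []
    for f, s in zip(fast_ma, slow_ma):
        if f is None or s is None:
            signals.append(0)
        elif f > s:
            signals.append(1)
        elif f < s:
            signals.append(-1)
        else:
            signals.append(0)
    return signals
```

Two moving parts per asset per model: multiply that across models, assets and intraday updates and the bookkeeping explodes exactly as described above.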
One can build small or build large. A spreadsheet is small, and can be very effective for one task and one person. But spreadsheets are not easy to share, and do not combine into a seamless system.
Build a large system for one task in procedural code, and you end up with a monolith. As time passes, as new requirements are added, as coders come and go, the monolith grows impossible to understand, and eventually gets thrown away.
A more enlightened approach is to choose one language, or a compatible pair like Python and C++, and build a set of libraries that can be combined in many ways. Such an open platform is powerful, but difficult to design.
Line count is a measure of complexity, if not quality. At Cantab, we manage a few million lines of our own code, not including third party libraries (for example Python itself, and the libraries NumPy and SciPy, add up to about 1.7 million lines).
In the graph below you can see the size of our Python code base more than tripling since 2010.
The chart over at Information is Beautiful is fascinating.
You can see how little code the Space Shuttle used (400k lines) as compared to the Mars Rover (5 million lines). Arcs indicate the code bloat of Windows from 4.5 million to 50 million lines.
In a well-run firm, line count should sometimes go down as modules are refactored and redundancy is eliminated.
Another way to look at the importance of software management in finance is to look at how bad software loses money.
In recent memory (2012) Knight Capital, the largest trader in US equities at the time, with a market share of 17% on NYSE and NASDAQ, lost $460 million in a few hours thanks to a botched software install.
The whole SEC report is here, and a chilling summary is extracted on
this Python blog.
Nanex has some heart-stopping graphs showing the test algorithm buying at the offer and selling at the bid as fast as ever it could, milliseconds apart.
Back in 1993, the London Stock Exchange lost up to £500 million in the old-fashioned but still popular way of writing a big system that never actually worked, Taurus.
And to pick a depressingly easy target, the UK's National Health Service Connecting for Health software project cost up to £20 billion and ceased to exist in 2013.
The Good and the Bad
When you read a fund presentation which makes grand claims about its reward per unit risk, at least you can check the track record (modulo Cursed by Randomness).
But how can you check similar claims about "robust" or "world-class" technology? There are a few questions that you can ask.
Is the system open and shared? Does it leverage the work of others, or does new work require the reinvention of some wheels? A shared IDE is a good sign,
making the scientists more productive. Everyone coding in a different language could be a very bad sign.
Are all changes to code and data logged? Everyone should use a version control system for code. A virtuosic team will extend this principle to data as well, such that it is always possible to tell who changed any piece of code or data, at what time and on what computer.
Is there a rigorous testing and install mechanism? Frequent improvements to the system are a great thing, but also dangerous without testing: see Knight Capital above! Installs should be gated by large test suites, and any production failure should generate a new test to prevent it in the future.
At Cantab, we test on two different operating systems (Windows and Linux). And of course we test that every strategy produces the same signals over history, after any code change.
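The signal-stability check can be sketched as a "golden signals" regression test. The strategy function and stored signals below are stand-ins, not our actual code.

```python
# A sketch of a "golden signals" regression test: after any code change,
# every strategy must reproduce its historical signals exactly.

def strategy(prices):
    # Toy stand-in strategy: +1 if the price rose day-on-day, else -1.
    return [1 if b > a else -1 for a, b in zip(prices, prices[1:])]

def check_signals_unchanged(strategy_fn, history, golden_signals):
    """True iff the strategy reproduces the signals captured earlier."""
    return strategy_fn(history) == golden_signals

history = [100, 101, 99, 102]
golden = [1, -1, 1]   # signals recorded before the code change
assert check_signals_unchanged(strategy, history, golden)
```

In practice the "golden" signals live in version control alongside the code, so any divergence is caught at install time, not in production.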
Can you roll back changes quickly? It should be easy to revert all or some of the software to a previous revision, if the unforeseen occurs. I once tried to pay my son's college tuition online,
only to be told that the fantastic new system was broken and they never had a plan for reverting to the old system. This is a terribly common mistake. And, alas, I still had to pay.
Of course, the primary coding principle is to "create profits for our investors". But there are many other technical desiderata, some of them covered by Cantab's Partner, Dr Tom Howat,
in this interview.
There are a lot of PhD astrophysicists and mathematicians out there, but a surprising number of them say things like "I only learned enough Fortran to write my thesis". To us, this is like an English degree holder saying that they cannot read,
but have a few books on tape. Coding skills are the modern scientific literacy, as well as an art. Even the brainiest quant should know (and care) that you do not need to sort an array of numbers to find their maximum and minimum.
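The max-and-min point deserves spelling out: a sort costs O(n log n), while a single O(n) pass suffices.

```python
# Finding the maximum and minimum of n numbers takes one O(n) pass;
# sorting first (O(n log n)) does needless work.

def min_and_max(xs):
    """Single pass over the data; no sorting required."""
    lo = hi = xs[0]
    for x in xs[1:]:
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi
```

(Python's built-in `min` and `max` are each a linear pass too; the point is knowing why that matters at scale.)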
An Example of Computer Science in Action
This wouldn't be a quantitative blog without some fancy words like Bayesian or heteroskedasticity.
But how can a techie compete? Maybe with a directed acyclic graph, among other things.
The World Wide Web is a directed graph, i.e. a bunch of things pointing at other things. Google first became famous for its clever PageRank algorithm
on the Web graph.
A directed acyclic graph (or DAG) is a graph with no cycles in it. A spreadsheet's power arises from its DAG: if you change a cell, then other cells that use it, even indirectly, will also change.
A clever investment bank might use a DAG for complex derivative pricing, especially if they integrated it with a purpose-built language. In which case a risk report showing sensitivities to spot or interest rates
would be a handful of lines of code, even if the underlying portfolio is monstrously large and nonlinear. See "managing complexity" above!
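A minimal sketch of the idea, assuming nothing about any real bank's system: cells form a DAG, reads pull values through the dependency graph, and a bump-and-reprice sensitivity really is a handful of lines.

```python
# A toy spreadsheet-style DAG: each cell is a value or a formula over
# other cells; reads recompute lazily through the dependency graph, so
# changing one input updates everything downstream.

class Cell:
    def __init__(self, value=None, formula=None, inputs=()):
        self.value = value        # leaf cells hold a value
        self.formula = formula    # derived cells hold a function
        self.inputs = inputs

    def get(self):
        if self.formula is None:
            return self.value
        return self.formula(*(c.get() for c in self.inputs))

def bump_sensitivity(output, input_cell, bump=1.0):
    """Finite-difference sensitivity: bump one input, re-read the output."""
    base = output.get()
    input_cell.value += bump
    bumped = output.get()
    input_cell.value -= bump
    return (bumped - base) / bump

spot = Cell(value=100.0)
strike = Cell(value=95.0)
payoff = Cell(formula=lambda s, k: max(s - k, 0.0), inputs=(spot, strike))

print(payoff.get())                    # 5.0
print(bump_sensitivity(payoff, spot))  # 1.0: in the money, delta of one
```

The same `bump_sensitivity` works unchanged however large and nonlinear the graph between `spot` and `payoff` becomes, which is the whole point.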
Here at Cantab we love time-series, and we have a lot of them: of prices, volumes, open interest, depth, trades, news items, economic numbers, network latencies, computer memory usage, assets, employee count,
lines of code, you name it. Almost every number in every report that we generate at Cantab is stored as a hyperlinked time series. Accessing and manipulating time series is critical to how we make money. A quant should not have to know where or how a time series is stored, so a persistent name for each series is a powerful model. So is a
domain specific language that makes it easy to add or divide time series, or to run correlations, regressions or profitability studies on them.
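A toy version of what such a model might look like, assuming only that a persistent name maps to timestamped values; the class and series names are invented for illustration.

```python
# A toy named time-series type whose arithmetic aligns on timestamps
# (an inner join on the dates both series share), so "price / earnings"
# reads like the maths rather than like storage plumbing.

class Series:
    def __init__(self, name, points):
        self.name = name
        self.points = dict(points)   # timestamp -> value

    def _combine(self, other, op, opname):
        common = sorted(self.points.keys() & other.points.keys())
        return Series(f"({self.name} {opname} {other.name})",
                      {t: op(self.points[t], other.points[t]) for t in common})

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b, "+")

    def __truediv__(self, other):
        return self._combine(other, lambda a, b: a / b, "/")

price = Series("price.XYZ", {1: 10.0, 2: 12.0, 3: 11.0})
earnings = Series("earnings.XYZ", {2: 2.0, 3: 2.0})
pe = price / earnings   # aligned on the shared timestamps 2 and 3
```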
Combining time series has some important subtleties, like how you deal with missing or misaligned data,
especially for asynchronous tick data. A good time-series model will make alignment issues easy but explicit. It will also counter the quant tendency to read everything into memory before starting analysis, which quickly becomes inefficient.
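One standard way to make alignment explicit for asynchronous ticks is an "as-of" join: for each left timestamp, take the most recent right value at or before it. A minimal sketch:

```python
# "As-of" alignment for asynchronous ticks: for each left timestamp,
# take the latest right value at or before it (None if none exists yet).
# Crucially, the look-back is explicit, never a silent forward-fill.

import bisect

def asof_join(left_times, right_times, right_values):
    """right_times must be sorted ascending."""
    right_times = list(right_times)
    out = []
    for t in left_times:
        i = bisect.bisect_right(right_times, t) - 1
        out.append(right_values[i] if i >= 0 else None)
    return out
```

Because each lookup streams over sorted timestamps, the same idea scales to tick data far too large to "read everything into memory" first.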
Finally, a back-testing system is also a DAG: the outputs (desired positions) depend on the inputs (prices or whatever). Multiple strategies might share inputs, and so by the miracle of "subgraph elimination",
which merges identical pieces of the graph, the system can read data more efficiently, resulting in speed-ups that are orders of magnitude. Speeding up backtests means more experiments can be run, whether it be new potential strategies,
an enhanced risk management algorithm or a portfolio allocation algorithm. More experiments means more information, leading to an enhanced understanding of a systematic portfolio and ultimately to improvements - and more money into the pockets of our investors.
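The payoff of subgraph elimination can be sketched with nothing more than a shared cache: identical pieces of the graph are computed once, however many strategies depend on them. The data and strategies below are stand-ins.

```python
# Shared-input caching, the essence of "subgraph elimination": the first
# request for a piece of data does the expensive read; every later
# consumer in the backtest DAG reuses the same node.

cache = {}
reads = []   # records every expensive data read that actually happens

def load_prices(asset):
    if asset not in cache:
        reads.append(asset)                  # the expensive read, once
        cache[asset] = [100.0, 101.0, 99.5]  # stand-in for real data
    return cache[asset]

def strategy_a(asset):
    prices = load_prices(asset)
    return sum(prices) / len(prices)

def strategy_b(asset):          # a second strategy sharing the input
    prices = load_prices(asset)
    return max(prices) - min(prices)

strategy_a("XYZ")
strategy_b("XYZ")
assert reads == ["XYZ"]         # one read served every consumer
```

A real system merges whole identical subgraphs, not just leaf reads, but the economics are the same: shared work is done once.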
In our experience, programmers like to argue about whether high-frequency back-testing systems should be identical with the live trading systems. Parsimony and code-leveraging would argue yes, but the counter-argument is that the two have quite different behaviours.
One focuses on processing huge amounts of data as quickly as possible. The other focuses on nimbly reacting to single-tick updates. One thing that excited us about Python (and C++) was that we were able to use generators
to handle both scenarios with the same code, leading once again to leverage and speed.
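The generator trick can be shown in miniature: write the tick-consuming logic once, then feed it either a whole history (backtest) or one tick at a time (live). The running-EWMA signal and its `alpha` are purely illustrative.

```python
# One generator serves both worlds: it consumes any iterable of prices,
# yielding an updated signal per tick, whether the source is a batch
# history on disk or a live feed arriving one tick at a time.

def ewma_signal(ticks, alpha=0.1):
    """Yield a running exponentially weighted moving average per tick."""
    ewma = None
    for price in ticks:
        ewma = price if ewma is None else alpha * price + (1 - alpha) * ewma
        yield ewma

# Backtest mode: drain a whole history in one go.
history = [100.0, 102.0, 101.0]
backtest = list(ewma_signal(history))

# Live mode: the very same code reacts nimbly to single-tick updates.
live = ewma_signal(iter(history))
first = next(live)   # the signal after just one tick
```

Because the generator never materialises the whole stream, the batch path stays fast on huge histories and the live path stays responsive, with one code path to test and maintain.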
If we were to go on to talk about execution algorithms, we might talk about state machines, provability and formal verification.
All this is a far cry from hiring a quant, and handing her a DVD full of text-format data and a copy of Matlab.
We like to call what we are building our "cathedral of code". Everyone here is an artisan, or maybe an artist. This is our tool for implementing the fancy statistics for predicting the markets
and reducing risk. We will never stop improving it, and I hope we have communicated some of our enthusiasm about it all.