Data Science's Role In Capital Markets

Originally written for Street Contxt

Data is a hot topic, a monetizable commodity, and a ray of hope for business reinvention. As the online ecosystem and the worldwide economy have scaled, so too has the sum total of data in existence. This data provides the basis for statistical modelling that can perform analysis, craft projections, surface patterns, and answer questions.

In recent years there has been a markedly increased emphasis on “looking to the data” where ambiguity is concerned, and a growing call for enterprises to leverage the enormous scale of their data stores to chart the best way forward.

Capital markets is by nature a highly data-driven industry, but newly emerging technologies and methods of data manipulation mean that even the world of finance has some catching up to do. The highest-growth firms are focused on high-frequency trading, using mathematical models to determine which securities are good bets and which should be avoided. Through probability and automation, many small short-term transactions add up to high yields. This is all powered by data, computation, a rigorous understanding of statistics, and the ability to tell a story with numbers.
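As a toy sketch of how that compounding works (every number here is invented, and costs, slippage, and risk limits are ignored), consider a strategy with a tiny per-trade edge repeated thousands of times:

import random

random.seed(0)

# A hypothetical strategy: each trade moves capital by about 0.1%
# and wins slightly more often than it loses.
capital = 1.0
win_prob, gain, loss = 0.52, 0.001, 0.001

for _ in range(10_000):
    capital *= (1 + gain) if random.random() < win_prob else (1 - loss)

print(f"Capital after 10,000 small trades: {capital:.2f}x")

Even a 52% win rate, unremarkable on any single trade, grows capital noticeably once it is applied at scale; that is the statistical logic behind high-frequency strategies.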

The Data-Driven Economy

Roughly 2.5 quintillion bytes of data are produced each day; a quintillion is a one followed by eighteen zeros. It’s hard to conceptualize that quantity of information, especially if it’s unstructured. Structured data has a schematic representation, such as a labelled table of dates, credit card numbers, and stock information. It contains implied relationships, making it easy to format and query. Unstructured data is qualitative, such as text and imagery, and it’s more difficult to derive relationships from. Unstructured data comprises roughly 80% of all data produced.
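As a small illustrative sketch (the table contents and the text snippet below are invented), structured data drops straight into a schema, while unstructured data has to be mined for meaning:

import pandas as pd

# Structured: a labelled table with an implied schema; every row has the
# same fields, so it is easy to filter, join, and aggregate.
trades = pd.DataFrame({
    "date": ["2018-03-01", "2018-03-02"],
    "ticker": ["ACME", "GLOBEX"],          # hypothetical symbols
    "close_price": [101.25, 47.80],
    "volume": [1_200_000, 830_000],
})
print(trades.groupby("ticker")["volume"].sum())

# Unstructured: free-form text with no schema. Sentiment, topics, and
# entities have to be extracted before it can be analyzed.
earnings_snippet = ("Management noted headwinds in the second half, but "
                    "reiterated full-year guidance and flagged strong demand in Asia.")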

Then there’s data to be extracted from formats such as company earnings calls, which requires natural language processing. NLP is a prominent area of machine learning, allowing for rapid, thorough analysis of long-form written content. Rather than employing a team of junior analysts to comb through earnings calls (with a hundred distractions and a lack of sleep compromising their work), a computer program can do the job in a fraction of the time. With cost-saving measures top of mind for firm managers across the board, applying machine learning in this context is a no-brainer.
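A minimal sketch of that first step, assuming a recent scikit-learn is available (the call excerpts below are invented), turns each transcript into a weighted bag-of-words vector that downstream models can score:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical excerpts from two earnings calls.
calls = [
    "Revenue grew ahead of guidance and margins expanded despite currency headwinds.",
    "We are cutting guidance for the year as demand weakened and inventories built up.",
]

# TF-IDF weights each term by how distinctive it is across the corpus,
# a common precursor to classification, clustering, or similarity scoring.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(calls)

terms = vectorizer.get_feature_names_out()
for i in range(len(calls)):
    row = features[i].toarray().ravel()
    top = row.argsort()[::-1][:3]
    print(f"Call {i + 1} key terms:", [terms[j] for j in top])

A production pipeline would go further (sentiment models, entity extraction, topic models), but the idea is the same: convert free text into numbers a model can reason about.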

Machine learning, a branch of artificial intelligence, is a class of algorithms that improve with experience. The output of a machine learning algorithm is, in effect, a mathematical formula that relates the value of a desired unknown variable to a set of inputs. Machine learning is found extensively in consumer-facing technology, and is being used in enterprise settings for a great variety of scientific and business breakthroughs.
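The simplest version of that “formula” is a linear model. A toy sketch with invented data, using scikit-learn, shows the algorithm recovering the coefficients that relate the inputs to the unknown quantity:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: the unknown variable y depends on two observed inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# "Experience" is the training data; fitting estimates the formula
# y ≈ w1*x1 + w2*x2 + b from the examples.
model = LinearRegression().fit(X, y)
print("Learned coefficients:", model.coef_)    # approximately [3.0, -1.5]
print("Intercept:", model.intercept_)          # approximately 0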

If data were easy to manipulate, every capital markets firm would be plugging in the numbers and receiving answers to its most pressing problems. The infrastructure needed to store and model data on a large scale is expensive, and quantitative analysis problems can require extensive trial and error before hitting on the right outcomes.

The Need For Data Experts

Part of the reason for this challenge is that there are only so many data science and machine learning experts circulating in the job market. Preeminent candidates tend to be hired quickly by major tech companies, or by defense and intelligence organizations. There is an enormous skills gap in the current labor market for quantitative analysts. This deficit is exacerbated by inexperienced data scientists who don’t understand the deeper meaning of their own models, or who are only capable of analyzing data on a superficial level.

The phrase “correlation does not equal causation” bears repeating here. There have been numerous machine learning experiments in which algorithms seemed to hit on highly accurate formulas, but upon further inspection turned out to be flukes. One such example is a neural network, built by a student at the University of California, Irvine, to distinguish dogs from wolves. The network appeared to be working perfectly until it was discovered that it had only succeeded in recognizing one thing: snow. It had found a correlation between wolves and snow, but it had not actually learned to differentiate between canine species.
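A toy sketch of the same failure mode (every feature and number below is invented): a classifier leans on a background cue that happens to track the label in the training set, and falls apart as soon as that cue stops lining up.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
label = rng.integers(0, 2, size=n)             # 1 = wolf, 0 = dog

# A weakly informative feature of the animal itself...
animal = label + rng.normal(scale=2.0, size=n)
# ...and a background "snow" artifact that almost perfectly tracks the label.
snow = label + rng.normal(scale=0.1, size=n)

X_train = np.column_stack([animal, snow])
clf = LogisticRegression().fit(X_train, label)
print("Training accuracy:", clf.score(X_train, label))      # looks excellent

# New photos where the background no longer lines up with the species.
X_test = np.column_stack([animal, rng.permutation(snow)])
print("Accuracy once the snow cue breaks:", clf.score(X_test, label))  # falls sharply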

These types of errors are easy to make, but hard to detect and remediate. Without extensive knowledge of mathematics, statistics, and machine learning, a data scientist will have difficulty verifying their results and repeating them. That’s where the “science” part of the equation comes into play.

Level Playing Field

The projected result of successful data science and algorithmic trading deployments is the creation of a transparent and level playing field in capital markets. This may be many years down the road, but progress has already been made in that direction.

If the prime objective of markets is to transfer capital to companies with a positive trajectory, then algorithms can help statistically identify which assets are safe investments. As firms pull from a wider variety of data sources and tune their algorithms more finely, investing will come to a more even keel. In short, the goal of data science is to transform data into information. As more quantitative models are created, there will be fewer peaks and valleys in portfolio values, and investors will be able to hedge with a higher degree of certainty.

Capital Markets Use Cases

Quants are central pillars of capital markets organizations, building price and risk models in order to execute trades on structured products. Firms would be remiss, however, to leverage data only for their immediate trading needs. Data and machine learning can be incorporated into operational strategy at many entry points:

At this point the original article describes the technology behind Street Contxt's flagship application, which I will omit here.
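To make the earlier mention of price models concrete, here is a minimal sketch of the Black-Scholes formula for a European call, one of the canonical models a quant desk builds on (the inputs below are invented, and real desks layer far more on top of this):

from math import exp, log, sqrt
from statistics import NormalDist

def black_scholes_call(spot, strike, rate, vol, t):
    """Black-Scholes price of a European call option (no dividends)."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    N = NormalDist().cdf
    return spot * N(d1) - strike * exp(-rate * t) * N(d2)

# Hypothetical inputs: $100 stock, $105 strike, 2% rate, 20% vol, six months to expiry.
print(round(black_scholes_call(100, 105, 0.02, 0.20, 0.5), 2))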

Accurate, useful data science models are iterative works. Sometimes a problem can appear to be solved when in reality only the tip of the iceberg has been addressed. Getting to know a problem, analyzing the impact of every aspect, and looking critically for blind spots is essential for producing statistical models that deliver verifiable, repeatable results.