Designing an AI Investment Framework

Investment approaches using artificial intelligence can inherit construction bias. In this post, we discuss the fundamentals needed to set sane grounds for a machine learning driven methodology.

Mission statement

To guide the development of the framework, an initial step is to circumscribe its mission. That mission stems from the research hypothesis that forms the core of our purpose:

We can consistently add value versus a passive investment strategy by systematizing the investment process of a fundamental portfolio manager through the combination of artificial intelligence and proprietary financial know-how.

The validation of this hypothesis facilitates the identification and assembly of the required building blocks, which are presented in the following schema:


This post focuses on the first two layers: data gathering and feature creation. Once these roots are in place, the modeling tasks can be performed, a topic to be covered next.

Investment universe and benchmark selection

To guide the downstream data collection process, the investment universe first needs to be defined. Based on an agreed selection of the asset class, geography and sectors, a pool of assets forming the benchmark portfolio is picked. The inclusion rules should be specific, consistent and unbiased: the benchmark should always reflect those rules.

For the benchmark to meet the specificity criterion, an unambiguous definition is needed. For example:

Every first day of the month, select the top 100 companies by market cap in the Indian market and build an equally weighted portfolio from them.

Such an approach eliminates the survivorship bias that arises from taking the companies that exist today and looking back at their history. It is similar to using the index constituents at every point in time, except that the reference portfolio is equally weighted rather than weighted by market capitalization. A rigorous application of the specified rule should provide consistency.
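As a minimal sketch of the rule above, the benchmark can be rebuilt at each rebalancing date from a snapshot of the companies listed on that date. The function name `build_benchmark`, the snapshot layout and the tickers below are assumptions made for illustration, not part of the framework itself.

```python
def build_benchmark(market_caps, rebalance_date, top_n=100):
    """Return equal weights for the top_n tickers by market cap on rebalance_date.

    market_caps maps a date to a dict of ticker -> market cap. Each snapshot is
    assumed to hold every company listed on that date, including companies that
    were delisted later (this is what keeps survivorship bias out).
    """
    caps = market_caps[rebalance_date]
    # Largest market cap first; ties are broken by ticker so the rule is deterministic.
    ranked = sorted(caps, key=lambda ticker: (-caps[ticker], ticker))
    members = ranked[:top_n]
    if not members:
        return {}
    weight = 1.0 / len(members)
    return {ticker: weight for ticker in members}


# Hypothetical snapshots: BBB is in the universe in January but gone in February.
snapshots = {
    "2020-01-01": {"AAA": 500.0, "BBB": 300.0, "CCC": 100.0},
    "2020-02-01": {"AAA": 450.0, "CCC": 350.0, "DDD": 200.0},
}
jan = build_benchmark(snapshots, "2020-01-01", top_n=2)
feb = build_benchmark(snapshots, "2020-02-01", top_n=2)
```

Because membership is decided from the snapshot of each date, BBB still belongs to the January benchmark even though it no longer exists afterwards.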

Data collection

AI techniques are particularly effective at integrating heterogeneous data types, so creativity in finding new sources is often rewarded.

Beyond financial data, many other forms of information can be valuable. Notably:

  • Company specific news
  • Investor reports
  • Web search trends
  • Customer surveys

Experience in assessing companies’ performance provides valuable insight into potential gaps in data sources and guides the search for new ones. It also helps in curating information to avoid being flooded with redundant signals.


The chronology of events is an important consideration in data collection. A properly designed investment strategy only uses the data that was available at each historical point in time, an obvious principle that eliminates the risk of look-ahead bias.

For example, a company’s financial results for the end of the year are not known on Dec 31st, but typically a few months later, on its earnings release day. Most data providers offer point-in-time databases, in which financial results only become known after the earnings release day.

Another application of the concept arises when data is subject to subsequent revision. Given that the market reacts to the initially published information, it’s on that first version of the information that the model should be built.

Table 1 – Point-in-time macroeconomic data reorganization

Date measured    Date published    Date usable by strategy
March 31st       April 4th         After April 4th

The following question can serve as a guiding principle on the appropriate versioning of the information: At which point in time would the information be known?
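The point-in-time principle can be sketched as a lookup that answers this question for a given date. The record layout, the function name `known_as_of`, the year and the values below are all assumptions made for illustration. Only records published strictly before the query date are visible, and when a figure has been revised, the first published version is kept.

```python
from datetime import date


def known_as_of(records, as_of):
    """Return the first-published value of each (series, measured) pair
    that was already public strictly before as_of."""
    first_seen = {}
    for rec in records:
        if rec["published"] >= as_of:
            continue  # not public yet on as_of: using it would be look-ahead
        key = (rec["series"], rec["measured"])
        # Keep the earliest publication, i.e. the version the market first reacted to.
        if key not in first_seen or rec["published"] < first_seen[key]["published"]:
            first_seen[key] = rec
    return {key: rec["value"] for key, rec in first_seen.items()}


# Hypothetical macroeconomic series: measured March 31st, published April 4th,
# then revised a month later.
records = [
    {"series": "cpi", "measured": date(2021, 3, 31),
     "published": date(2021, 4, 4), "value": 2.6},
    {"series": "cpi", "measured": date(2021, 3, 31),
     "published": date(2021, 5, 4), "value": 2.7},  # later revision
]
on_release_day = known_as_of(records, date(2021, 4, 4))
next_day = known_as_of(records, date(2021, 4, 5))
after_revision = known_as_of(records, date(2021, 6, 1))
```

The strict inequality mirrors the "After April 4th" column of Table 1. A strategy with intraday data could relax it by comparing timestamps instead of dates.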

Finally, the ability to execute should be considered as well. For example, if the market reacts to a given information release within 5 minutes, but the internal ingestion and execution process takes longer, such an opportunity should be considered out of scope for the strategy.

Although event chronology comes with several caveats for data preparation, it’s also a source of opportunity: time-dependent features can be integrated to enrich the context.

Validation process

Past events are easy to describe in light of current information. A dangerous pitfall for an asset manager is to overlook that part of the returns is merely due to volatility, and to act on noise that is wrongly perceived as information.

In order to develop a robust framework that isn’t a self-fulfilling prophecy of the past but provides a reliable indication of future performance, an evolutive learning test is performed.
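The text does not detail how the evolutive learning test is run. One plausible reading, assumed here, is a walk-forward evaluation with an expanding training window: the model is trained on all periods up to a given date, tested on the next period, and the window then grows by one period. The function name `expanding_window_splits` and the parameter `min_train` are hypothetical.

```python
def expanding_window_splits(periods, min_train=3):
    """Yield (train_periods, test_period) pairs in chronological order.

    Each test period comes strictly after every period used for training,
    so the model is never evaluated on data it could have seen.
    """
    ordered = sorted(periods)
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]


months = ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05"]
splits = list(expanding_window_splits(months, min_train=3))
```

With five months and a minimum of three training months, this produces two splits: train on January to March and test on April, then train on January to April and test on May.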

Missing data

The presence of missing values is common, particularly for information not subject to regulation or for open source data that is less strictly curated. As models need to operate on numerical values, some treatment is needed.

A common approach is imputation, that is, substituting an approximated value. A vanilla replacement using the mean or the median of the non-missing records can be sound, though fancier alternatives using predictive models built from the available features can also be used. Some techniques, such as trees, can handle missing values natively. Nonetheless, it remains good practice to feed the algorithm with information about the existence of missing values.

As is the case for any other variable transformation, imputation should be done cross-sectionally, using the data from the period in question. If a data point is missing for the month of July for stock ABC, use the generic replacement metric computed from the universe for the month of July.
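The following sketch illustrates cross-sectional median imputation together with a missing-value indicator. The row layout, the function name `impute_cross_section` and the feature name `pe` are assumptions chosen for the example. The median is only one possible replacement metric.

```python
from statistics import median


def impute_cross_section(rows, feature):
    """Fill missing values of `feature` with the median of the same period
    and add a 0/1 flag recording that the value was originally missing."""
    observed_by_period = {}
    for row in rows:
        if row[feature] is not None:
            observed_by_period.setdefault(row["period"], []).append(row[feature])

    filled = []
    for row in rows:
        was_missing = row[feature] is None
        new_row = dict(row)  # leave the input rows untouched
        if was_missing:
            observed = observed_by_period.get(row["period"])
            # If the whole cross-section is missing, there is nothing to impute from.
            new_row[feature] = median(observed) if observed else None
        new_row[feature + "_missing"] = int(was_missing)
        filled.append(new_row)
    return filled


# Hypothetical universe: ABC has no value in July.
rows = [
    {"period": "2020-07", "ticker": "ABC", "pe": None},
    {"period": "2020-07", "ticker": "DEF", "pe": 10.0},
    {"period": "2020-07", "ticker": "GHI", "pe": 20.0},
    {"period": "2020-08", "ticker": "ABC", "pe": 50.0},
]
filled = impute_cross_section(rows, "pe")
```

ABC's July value is filled from the July cross-section only (DEF and GHI), so its own August value never leaks backwards in time.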

Data Normalization

Many algorithms, notably neural networks, are sensitive to scale and distribution. That is, storing a piece of information in meters rather than centimeters (scale) or on a log scale (distribution) will affect their behavior. Handling such sensitivity is the objective of normalization.

Many tricks can be applied. Examples include standardization (subtracting the mean and dividing by the standard deviation), scaling (squeezing values within the [0, 1] range) and quantile normalization. The latter addresses both scale and distribution sensitivity by mapping features onto a uniform distribution. Normalization should always be done in a way that respects the data chronology.
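As an illustration, quantile normalization can be sketched as a rank transform applied within each period, so that no value from a later date influences an earlier one. The function names `quantile_normalize` and `normalize_by_period`, and the mid-rank convention `(rank - 0.5) / n`, are choices made for this example rather than a prescribed method.

```python
def quantile_normalize(values):
    """Map values to the open interval (0, 1) according to their rank.

    Tied values share the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the group of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [(rank - 0.5) / n for rank in ranks]


def normalize_by_period(panel):
    """Apply the transform separately to each period's cross-section."""
    return {period: quantile_normalize(values) for period, values in panel.items()}


# Hypothetical feature values for four stocks; 1000.0 is an outlier.
july = quantile_normalize([30.0, 10.0, 20.0, 1000.0])
panel = normalize_by_period({"2020-07": [30.0, 10.0], "2020-08": [5.0, 5.0]})
```

The outlier ends up at 0.875, only one rank step above its neighbor, which shows how the transform removes both the scale and the shape of the original distribution.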

It’s worth noting that some algorithms are robust to scale and distribution shift, most notably tree-based methods such as Random Forest and Gradient Boosted Trees.

Long term planning

The methodology depicted above, despite forming a sane ground for a systematic investment approach, still relies on several strong assumptions.

For instance, several subjective decisions were taken, notably for the selection of the investment universe. Also, in the above scenario, updates were performed on a monthly basis. This can be reasonable from a fundamental investment perspective, but there are obviously other forces that drive the returns at both finer and coarser time scales.

Moving fast is important, but it shouldn’t impair future preparedness. Building a data architecture in a planned manner provides grounds for quick development and sustains the rapid pace of innovation in machine learning.
