Time bars vs dollar bars

2022-09-29

Theory

Time bars are bars built out of ticks (trades, quotes, whatever) where each bar covers a predefined time interval, e.g. 5 minutes.

Dollar bars are described in the book Advances in Financial Machine Learning (AIFML) by de Prado. The idea is to have each bar contain a fixed amount of information. I understand the information itself to be latent, but it can be proxied by the number of ticks, the dollar value of the trades, volume, etc.

Allegedly, time bars contain a wildly varying amount of information because information does not arrive at a constant rate. Because of this, time bars have poor statistical properties, and garbage in, garbage out: models trained on them will be poor too.

I want to compare time bars vs dollar bars. How bad can it be?

Data

I collected 90,433,081 TRXUSDT trades using the Binance API. I chose TRXUSDT because it had just the right amount of liquidity for my purposes.

The collected trades are so-called taker trades. Each taker trade has one taker (i.e. the aggressor) and one or more makers. A taker order that takes out multiple price levels results in multiple taker trades, so a single taker order produces one or more taker trades. Or something. That's my interpretation of the Binance docs and data.

The trades occurred during the [2021-04-18, 2022-07-10] period.

The first trade has ID 62478576 and the last has ID 152911656. The difference between the IDs is 152911656 - 62478576 = 90,433,080; since the range is inclusive on both ends, we add one to get 90,433,081 IDs. That equals the number of trades collected, and there are no duplicate IDs. Therefore there are no missing trades: we have a contiguous sequence.
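This contiguity check can be sketched in a few lines (the function name is mine, not from the actual collection script):

```python
def check_contiguous(first_id, last_id, n_trades):
    """Verify that trade IDs form a contiguous sequence with no gaps.

    The ID range is inclusive on both ends, so the expected count is
    last_id - first_id + 1 (the +1 avoids the off-by-one error).
    Assumes no duplicate IDs, which was verified separately.
    """
    expected = last_id - first_id + 1
    return n_trades == expected

# The numbers from this dataset:
assert check_contiguous(62478576, 152911656, 90433081)
```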

Next I built the time bars and dollar bars from these trades. Dollar bars were parameterized by min_value: once a bar's accumulated dollar value reached min_value, the bar was closed. Virtually all of the resulting dollar bars have excess value because the last trade does not land on min_value exactly; it overshoots, sometimes by a little, sometimes by a lot. With a different bar-building approach the bar's value could equal min_value exactly, but that would require splitting a trade across two bars, which is less elegant in my opinion. Time bars were parameterized by duration.
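The dollar-bar loop can be sketched as follows. This is a minimal sketch, not the actual pipeline: the function name, the (timestamp, price, qty) trade layout, and the output dict are my own choices.

```python
def build_dollar_bars(trades, min_value):
    """Group (timestamp, price, qty) trades into dollar bars.

    A bar is closed once its accumulated dollar value (price * qty)
    reaches min_value. Trades are never split, so the closing trade
    typically overshoots min_value -- the "excess value" discussed above.
    """
    bars, current, value = [], [], 0.0
    for ts, price, qty in trades:
        current.append((ts, price, qty))
        value += price * qty
        if value >= min_value:
            prices = [p for _, p, _ in current]
            bars.append({
                "open": current[0][1],
                "high": max(prices),
                "low": min(prices),
                "close": current[-1][1],
                "value": value,
            })
            current, value = [], 0.0
    return bars  # a trailing partial bar in `current` is dropped
```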

Analysis

The return_bps is the bar's return in basis points; I prefer looking at 3.45 over 0.000345. We're not compounding here, so I am not using log returns, although log returns might be closer to Normal than simple returns.
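For reference, the simple-return version in basis points (the function name mirrors the column name; the helper itself is mine):

```python
def return_bps(open_price, close_price):
    """Simple (non-log) return of a bar, expressed in basis points."""
    return (close_price / open_price - 1.0) * 10_000
```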

All the bars had the mean return confidence interval overlapping zero:

Activity duration is the difference between bar's last trade time and first trade time. If you're aiming to have 30 to 50 dollar bars per day, you'd need dollar_bars_2m:

Using the mean value of the bars we can find which time bars and dollar bars are comparable in terms of value. Here we see that 12-hour bars are comparable to 80-million-dollar bars:

To determine which dollar_bars are too small we can look at the percentage of bars whose value exceeds min_value by at least 5%. A high percentage means the individual trades are too large relative to min_value. The red line is drawn at 1%. Bars with a min_value of 2 million or less frequently have big excess value:
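The excess-value percentage can be computed like this (the function name and the 5% default are illustrative, matching the threshold used above):

```python
def pct_big_excess(bar_values, min_value, threshold=0.05):
    """Percentage of bars whose value exceeds min_value by at least
    `threshold` (e.g. 0.05 = 5% overshoot)."""
    big = sum(1 for v in bar_values if v >= min_value * (1 + threshold))
    return 100.0 * big / len(bar_values)
```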

According to AIFML one reason to use dollar bars is that their returns resemble IID Gaussian better than those of time bars. If returns are IID then their volatility should be constant. However this does not seem to be the case, neither for time bars nor for dollar bars. In our case time bars may even be less heteroscedastic than dollar bars, as shown by time_bars_6h vs dollar_bars_40m:

Time bars allegedly undersample during busy trading and oversample during lulls. This means the captured information varies more compared to dollar bars. Let's assume that the number of trades in a bar proxies the amount of captured information. Then it does look like time bars indeed sample more irregularly than dollar bars, because their coefficient of variation is higher:
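The coefficient of variation used here is the standard deviation divided by the mean (a plain-Python sketch; the function name is mine):

```python
def coefficient_of_variation(counts):
    """Std-dev divided by mean: a scale-free measure of how much the
    per-bar trade count (our information proxy) varies across bars."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n  # population variance
    return var ** 0.5 / mean
```

Being scale-free is what makes it fair to compare bars of very different sizes, e.g. 6-hour bars against 2-million-dollar bars.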

The histograms confirm that time bars oversample during lulls: their frequency is high at low trade counts, while dollar bars with higher min_value are bell-shaped:

Skewness-wise, time bars are pretty much the same as dollar bars if we ignore dollar bars with low value:

Excess kurtosis is nonzero, indicating that returns have fatter tails than a Gaussian:

The lag-1 autocorrelation story does look better for dollar bars, but only slightly:
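For reference, lag-1 autocorrelation of a return series (a plain-Python sketch rather than whatever library routine was actually used):

```python
def lag1_autocorr(xs):
    """Lag-1 autocorrelation: covariance of the series with its
    one-step-shifted self, normalized by the series variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var
```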

The Jarque-Bera p-value for all bar types is zero, indicating that all of them have non-Gaussian return_bps:
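The Jarque-Bera statistic is computed directly from sample skewness and kurtosis (a plain-Python sketch; the actual analysis presumably used a stats library, which also converts the statistic into a p-value):

```python
def jarque_bera_stat(xs):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4), where S is
    sample skewness and K is sample kurtosis. Under the Gaussian null
    it is asymptotically chi-squared with 2 degrees of freedom; large
    values (tiny p-values) reject Gaussianity."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
```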

Conclusion

Dollar bars are better at capturing a consistent amount of information per bar. However, their returns do not appear to be more Gaussian than those of time bars.

Nevertheless, I wouldn't stop using time bars. For one, prediction with non-linear models, such as decision trees, does not require Normality. Second, you're probably not using return_bps as the target variable anyway.

I'd let backtesting decide which bars are better. The answer will most likely vary per instrument.