Three things everyone should know about analyzing time-series data

Rich Pang
2025-07-07

Data leakage through correlated training and test data

We usually fit models on training data and evaluate their performance on held-out test (or validation) data, which helps us avoid overfitting. Ideally, the test data is sampled independently from the training data but from the same distribution. With time-series, however, your data arrives sequentially, and except in special cases the timepoints are not independent. How do you divide a time-series dataset into training and test data?

The wrong thing to do is select a random set of timepoints to use as training data and use the rest as test data. This is because when signals change on timescales longer than your sampling interval, neighboring timepoints may be very similar, breaking the independence assumption. The test data then ends up correlated with the training data, and so good prediction of the test data can in fact be an artifact of overfitting to the training data. So what do we do instead?
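
To see how badly a random split can mislead you, here is a minimal sketch (assuming numpy and scikit-learn; the random-walk signal and nearest-neighbor model are purely illustrative, not a recommended pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# A slow random walk sampled at a high rate: neighboring timepoints are
# nearly identical.
n = 2000
t = np.arange(n).reshape(-1, 1)           # use time itself as the predictor
y = np.cumsum(rng.normal(size=n))

model = KNeighborsRegressor(n_neighbors=1)

# Random split: every test point sits right next to a training point.
train, test = train_test_split(np.arange(n), test_size=0.2, random_state=0)
print("random split R^2:", model.fit(t[train], y[train]).score(t[test], y[test]))

# Block split: the test block is far from all training points.
train, test = np.arange(0, 1600), np.arange(1600, n)
print("block split  R^2:", model.fit(t[train], y[train]).score(t[test], y[test]))
```

The random split typically reports an R^2 close to 1, because each test point has a near-duplicate in the training set; the block split, which actually probes generalization to an unseen stretch of the series, is usually far worse, often even negative.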

There are various solutions, with different advantages and disadvantages. First, if you have independently sampled repeats of the entire time-series (for instance, many "trials", each corresponding to a sequence of measurements), then you can use a subset of those repeats as training data and the rest as test data. However, if you have only one long continuous time-series, then you'll need a way to break it up into approximately independent samples. One way of doing this is to estimate the autocorrelation function of the data and see if it has a characteristic timescale, e.g. by fitting an exponential and estimating its time constant tau. Timepoints separated by much more than tau can then be treated as approximately independent samples. As long as each test timepoint is sufficiently far from all training timepoints, it can be reasonable to treat the two sets as approximately independent.
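
Here is roughly how that second approach might look in code (a sketch assuming numpy and scipy; the exponential fit and the choice of a five-tau gap are illustrative, not prescriptive):

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorrelation(x, max_lag):
    """Empirical autocorrelation of a 1-D signal, normalized to 1 at lag 0."""
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf[:max_lag + 1] / acf[0]

def estimate_tau(x, max_lag=200):
    """Estimate the autocorrelation timescale by fitting exp(-lag / tau)."""
    lags = np.arange(max_lag + 1)
    acf = autocorrelation(x, max_lag)
    (tau,), _ = curve_fit(lambda lag, tau: np.exp(-lag / tau), lags, acf, p0=[10.0])
    return tau

def gap_split(n_samples, tau, train_frac=0.8, n_tau_gap=5):
    """Contiguous train and test blocks separated by a gap of several tau."""
    gap = int(np.ceil(n_tau_gap * tau))
    n_train = int(train_frac * (n_samples - gap))
    return np.arange(n_train), np.arange(n_train + gap, n_samples)
```

Timepoints in the resulting test block are then several autocorrelation times away from every training point.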

Note, however, that the second approach requires the signal to have a well-defined and relatively short autocorrelation time, which may not be the case for, e.g., strongly nonstationary time-series. Unfortunately, sometimes your data is simply too nonstationary to cleanly break into independent samples. This doesn't mean you can't model it, but test-set prediction accuracy might not be a good metric.

Multicollinearity and unstable filter weights

One common technique in analyzing time-series data is to fit filters that predict one timepoint, say y(t), from a set of predictor timepoints, say x(t-1), x(t-2), ..., x(t-T). That is, one assigns a weight w_1, ..., w_T to each timepoint of the predictor series, multiplies the predictors x(t-1), ..., x(t-T) by these weights, sums the results, and uses the sum to predict y(t), possibly after an additional nonlinearity or transformation.
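
Concretely, this amounts to building a design matrix of lagged copies of x and fitting one weight per lag. Here is a minimal linear sketch (no output nonlinearity), assuming numpy and a made-up decaying ground-truth filter:

```python
import numpy as np

def lag_matrix(x, n_lags):
    """Design matrix: the row for time t holds the predictors x(t-1), ..., x(t-n_lags)."""
    n = len(x)
    return np.column_stack([x[n_lags - k : n - k] for k in range(1, n_lags + 1)])

rng = np.random.default_rng(0)
n_lags = 20
x = rng.normal(size=1000)
X = lag_matrix(x, n_lags)

# Toy ground truth: y(t) = sum_k w_k x(t-k) + noise, with a decaying filter.
true_w = np.exp(-np.arange(1, n_lags + 1) / 5.0)
y = X @ true_w + rng.normal(scale=0.1, size=len(X))

# Recover the filter weights w_1, ..., w_T by least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat   # weighted sum of past x used to predict y(t)
```

With white-noise x, as here, the weights are recovered cleanly; the trouble described next begins when x itself is strongly autocorrelated.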

However, when the x(t) are highly autocorrelated, i.e. x(t) is very similar to x(t-1), there is a fundamental ambiguity in how to assign the weights. If the ground truth is y(t) = 3x(t-1), then y(t) can also be predicted well by 3x(t-2), or by 1.5x(t-1) + 1.5x(t-2), etc. If there is any noise in the data, then without additional constraints there can be huge instabilities when trying to learn these weights, making the estimates highly variable and unreliable. This is true EVEN if they predict independent test data well. In other words, this is NOT a consequence of overfitting. This problem is called multicollinearity.
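
To see the instability, and to see that it is not overfitting, here is a small simulation sketch (assuming numpy and scipy; the smoothing width, noise level, and sample sizes are arbitrary illustrative choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

n_lags, n_train, n_test = 10, 2000, 1000
n = n_train + n_test + n_lags

# A slowly varying predictor: adjacent lags are correlated at ~0.95 or higher.
x = gaussian_filter1d(np.random.default_rng(99).normal(size=n), sigma=20)
X = np.column_stack([x[n_lags - k : n - k] for k in range(1, n_lags + 1)])
signal = 3 * x[n_lags - 1 : n - 1]              # ground truth: y(t) = 3 x(t-1)

for seed in range(3):                           # three noise realizations
    y = signal + np.random.default_rng(seed).normal(scale=0.2, size=len(signal))
    w, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    resid = y[n_train:] - X[n_train:] @ w
    r2 = 1 - resid.var() / y[n_train:].var()
    print(np.round(w, 1), " test R^2:", round(r2, 2))
# Typical outcome: the fitted weights jump around by several units from run
# to run (the true filter is [3, 0, ..., 0]), yet every fit predicts the
# held-out block comparably well, so the instability is not overfitting.
```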

Alas, there is no "correct" solution for how to handle multicollinearity, since there is simply a fundamental ambiguity in which predictor timepoints matter most. The best you can do is consciously choose an approach, while being aware of the consequences of that choice. One common option, for instance, is to add a regularization term to your loss function, such as an L2 ("Ridge") or L1 ("Lasso") penalty. A good rule of thumb is that L2 "balances", i.e. assigns similar weights to predictors with similar variance and predictive power, whereas L1 "sparsifies", assigning nonzero weights to only a small number of predictors and setting the rest identically to zero. (Fun fact: Ridge regression was invented NOT to combat overfitting, as we often learn in machine-learning classes, but to handle multicollinearity.)
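
For instance, here is how OLS, Ridge, and Lasso might behave on the same kind of collinear lag matrix (a rough sketch assuming numpy, scipy, and scikit-learn; the penalty strengths are illustrative, not tuned):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n_lags, n = 10, 2000

# Strongly autocorrelated predictor, same flavor as above.
x = gaussian_filter1d(rng.normal(size=n + n_lags), sigma=20)
X = np.column_stack([x[n_lags - k : n + n_lags - k] for k in range(1, n_lags + 1)])
y = 3 * x[n_lags - 1 : n + n_lags - 1] + rng.normal(scale=0.2, size=n)

for name, model in [("OLS  ", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.005, max_iter=10_000))]:
    print(name, np.round(model.fit(X, y).coef_, 2))
# Typically: OLS returns large, erratic weights; Ridge spreads similar,
# modest weights across the correlated lags; Lasso keeps a few lags and
# sets the rest exactly to zero.
```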

Other methods include building your filters from basis functions, or performing a dimensionality reduction of your data before fitting your model. You can read up on more details about multicollinearity here. But remember: the key thing is to be aware of how you're handling multicollinearity, rather than choosing a method blindly and then accidentally mistaking a consequence of your method for a property of your data.
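
As one sketch of the basis-function idea (the choice of Gaussian bumps, their number, and their width are all arbitrary here; raised cosines are another common option), assuming numpy and scipy:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
n_lags, n_basis, n = 50, 5, 2000

# Toy data: slow predictor, target generated by a smooth filter over the lags.
x = gaussian_filter1d(rng.normal(size=n + n_lags), sigma=10)
X = np.column_stack([x[n_lags - k : n + n_lags - k] for k in range(1, n_lags + 1)])
true_w = np.exp(-np.arange(1, n_lags + 1) / 10.0)
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Gaussian bumps tiled over the lags.
lags = np.arange(1, n_lags + 1)
centers = np.linspace(1, n_lags, n_basis)
B = np.exp(-0.5 * ((lags[:, None] - centers[None, :]) / (n_lags / n_basis)) ** 2)

# Fit n_basis coefficients instead of n_lags weights, then expand back to a
# smooth filter estimate.
c, *_ = np.linalg.lstsq(X @ B, y, rcond=None)
w_hat = B @ c
```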

As you can see, most of the problems in analyzing real-world time-series data come from temporal correlations.

Statistical testing

How do you argue that the result of an analysis you performed on your time-series data was unlikely to have emerged by chance? While there are entire books of statistical tests for different situations, the hard truth is that for many modern datasets, especially time-series, none of them directly apply. Once again, this is largely because such data rarely consist of clean i.i.d. (independent and identically distributed) samples, which is the baseline assumption of almost every statistical test.

What is true, however, is that all statistical testing boils down to basically one key question: what is the probability that an effect size greater than or equal to the one you computed from your real data would have emerged from a "null dataset"? Allen Downey, computer science professor and author of Probably Overthinking It, wrote a great blog post on this called There is only one test!

The real question, then, is not which statistical test to use, but what kind of null dataset to create. While there is no universal answer, there are at least some good rules of thumb. First, a "null dataset" really means an ensemble or collection of datasets, each formatted equivalently to your real data but created using a different random number generator seed. From each instantiation, compute the effect size using the same analysis pipeline as for your real data, then build a histogram of these null effect sizes. The rule of thumb is to create enough null instantiations that you can accurately estimate how far your real effect size lies from the bulk of the null distribution of effect sizes.
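
In code, the whole recipe is just a loop. Here is a generic sketch assuming numpy, where effect_fn and make_null are placeholders for your own analysis pipeline and null-generation procedure:

```python
import numpy as np

def null_ensemble_p_value(data, effect_fn, make_null, n_null=1000, seed=0):
    """Fraction of null instantiations with effect size >= the real one.

    effect_fn(data)      -> scalar effect size (same pipeline as the real data)
    make_null(data, rng) -> one randomized instantiation of the null dataset
    """
    rng = np.random.default_rng(seed)
    real_effect = effect_fn(data)
    null_effects = np.array([effect_fn(make_null(data, rng))
                             for _ in range(n_null)])
    # The +1's keep the estimated p-value away from exactly zero.
    p = (1 + np.sum(null_effects >= real_effect)) / (1 + n_null)
    return real_effect, null_effects, p
```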

The second key rule of thumb is to create a null dataset that retains as many features of your data as possible, except the specific effect you're looking for. For example, suppose you're interested in whether two time-series x(0:T) and y(0:T) are more correlated than expected by chance, but each has a slow autocorrelation timescale. Slow autocorrelations can easily produce a spuriously high cross-correlation even when x and y are completely independent. In this case, to create the null dataset you'll want to retain the autocorrelations but break the cross-correlation. This means that the wrong thing to do is shuffle the timepoints within x or y, since that destroys the autocorrelations. Instead, it makes more sense to randomly circularly shift x and y relative to one another, with each instantiation of your null dataset corresponding to a different random shift. This keeps the autocorrelations but breaks the cross-correlation, giving you a much more meaningful null distribution.
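
Here is what that circular-shift null might look like in code (a sketch assuming numpy, using the Pearson correlation at zero lag as the effect size):

```python
import numpy as np

def circular_shift_null(x, y, n_null=1000, seed=0):
    """Null distribution of corr(x, y) built from random circular shifts of y.

    Rolling y preserves each series' autocorrelation structure but destroys
    any true alignment between x and y.
    """
    rng = np.random.default_rng(seed)
    real = np.corrcoef(x, y)[0, 1]
    shifts = rng.integers(1, len(y), size=n_null)
    null = np.array([np.corrcoef(x, np.roll(y, s))[0, 1] for s in shifts])
    p = (1 + np.sum(np.abs(null) >= np.abs(real))) / (1 + n_null)
    return real, null, p
```

Comparing absolute values makes this a two-sided test; drop the np.abs calls if you only care about positive correlations.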