Three things everyone should know about analyzing time-series data
Rich Pang · 2025-07-07
Correlated training and test data
We often fit models on training data and evaluate their performance on held-out test (or validation) data to avoid overfitting. Ideally, the test data is sampled independently from the training data but from the same distribution. With time-series data, however, measurements arrive sequentially, and except in special cases the timepoints are not independent. How do we divide a time-series dataset into training and test data?
The wrong thing to do is to select a random set of timepoints to use as training data and use the rest as test data. This is because when signals vary on timescales longer than your sampling interval, neighboring timepoints can be very similar. This makes the test data correlated with the training data, and so good prediction of the test data can in fact still be an artifact of overfitting to the training data. So what do we do instead?
There are various solutions, with different advantages and disadvantages. If we have independently sampled repeats of the entire time-series, e.g., many "trials", each corresponding to an independent sequence of measurements, then we can use a subset of those repeats as training data and the rest as test data.
However, if we have only one long continuous time-series, then we'll need a way to break it up into approximately independent samples. One approach is to estimate the autocorrelation function of the data and check whether it has a characteristic timescale, e.g. by fitting an exponential and estimating its decay timescale τ. Then it may be fair to assume timepoints separated by much more than τ can be treated as approximately independent samples.
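As a rough sketch of how this might look in practice (using NumPy and SciPy; the signal x, the sampling interval dt, and the choice of max_lag are assumptions for illustration, not part of any particular dataset):

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorr(x, max_lag):
    """Empirical autocorrelation of a 1-D signal at lags 0..max_lag-1."""
    x = x - x.mean()
    acf = np.array([np.mean(x[:len(x) - k] * x[k:]) for k in range(max_lag)])
    return acf / acf[0]  # normalize so acf[0] == 1

def estimate_tau(x, dt, max_lag):
    """Fit an exponential decay exp(-t / tau) to the autocorrelation function."""
    acf = autocorr(x, max_lag)
    lags = np.arange(max_lag) * dt
    (tau,), _ = curve_fit(lambda t, tau: np.exp(-t / tau), lags, acf, p0=[10 * dt])
    return tau
```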
Additionally, it is good practice not to use a single training/test split, but rather a large number (e.g. > 30) of training/test splits, since there can be substantial variance in the test-set prediction error. Estimating the mean test-set prediction error over many training/test splits helps reduce this variance.
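Putting these two ideas together, one possible sketch (the model object with fit/predict methods, the predictor matrix X, the target y, and tau_samples, the autocorrelation time in samples, are all hypothetical placeholders):

```python
import numpy as np

def block_splits(n_samples, block_len, n_splits, test_frac=0.25, seed=0):
    """Yield train/test index sets built from contiguous blocks of block_len samples.

    block_len should be several times the autocorrelation time (in samples),
    so that different blocks are approximately independent.
    """
    rng = np.random.default_rng(seed)
    n_blocks = n_samples // block_len
    blocks = [np.arange(i * block_len, (i + 1) * block_len) for i in range(n_blocks)]
    n_test = max(1, int(test_frac * n_blocks))
    for _ in range(n_splits):
        test_blocks = set(rng.choice(n_blocks, size=n_test, replace=False).tolist())
        test_idx = np.concatenate([blocks[b] for b in sorted(test_blocks)])
        train_idx = np.concatenate([blocks[b] for b in range(n_blocks) if b not in test_blocks])
        yield train_idx, test_idx

# Hypothetical usage: average the test error over many splits.
# errors = []
# for train_idx, test_idx in block_splits(len(y), block_len=10 * tau_samples, n_splits=50):
#     model.fit(X[train_idx], y[train_idx])
#     errors.append(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))
# mean_test_error = np.mean(errors)
```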
Note that constructing training/test splits by chopping up our time-series requires that the signal have a well-defined autocorrelation time that is substantially shorter than the experiment duration, which may not always be true, especially for strongly nonstationary time-series. Sometimes our data is simply too nonstationary to cleanly break into independent samples. This doesn't mean we can't model it, but test-set prediction accuracy might not be a good performance metric.
Multicollinearity and unstable filter weights
One common technique in analyzing time-series data is to fit filters that predict one timepoint, say y(t), from a set of predictor timepoints, say x(t-1), x(t-2), ..., x(t-T). This means that we assign a weight to each timepoint of the predictor series, w(1), ..., w(T), weight the predictors x(t-1), ..., x(t-T) by these weights, sum them up, then use the sum to predict y(t), either directly or after passing it through a nonlinearity or other transformation.
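In its simplest (linear) form this is just y(t) ≈ w(1)x(t-1) + ... + w(T)x(t-T). A minimal sketch of fitting such a filter by ordinary least squares (NumPy only; x, y, and the filter length T are assumed to be given):

```python
import numpy as np

def lagged_design_matrix(x, T):
    """Row t of the result is [x(t-1), x(t-2), ..., x(t-T)].

    The first T timepoints are dropped because their history is incomplete.
    """
    n = len(x)
    return np.column_stack([x[T - k : n - k] for k in range(1, T + 1)])

def fit_filter_ols(x, y, T):
    """Fit filter weights w(1), ..., w(T) by ordinary least squares."""
    X = lagged_design_matrix(x, T)
    w, *_ = np.linalg.lstsq(X, y[T:], rcond=None)
    return w
```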
However, when the x(t) are highly correlated in time, i.e. x(t) is similar to x(t-1), then there is a fundamental ambiguity in how to assign the weights. If the ground truth is y(t) = 3x(t-1), then y(t) can also be predicted well by 3x(t-2), or by 1.5x(t-1) + 1.5x(t-2), etc. If there is any noise in the data, then without additional constraints there can be huge instabilities when trying to learn these weights, making estimates of them highly variable and unreliable. This is true EVEN if they predict independent test data well. In other words, this is NOT a consequence of overfitting. This problem is called multicollinearity.
Unfortunately, there is no "correct" solution for handling multicollinearity, since there is simply a fundamental ambiguity in which predictor timepoints matter most. The best we can do is to choose an approach and stay aware of the consequences of our choice. One common option is to add a regularization term to the loss function, such as an L2 ("Ridge") or L1 ("Lasso") penalty. A good rule of thumb is that L2 "balances", i.e. assigns similar weights to predictors with similar variance and predictive power, whereas L1 "sparsifies", assigning nonzero weights to only a subset of predictors and setting the rest identically to zero. (Fun fact: Ridge regression was invented NOT to combat overfitting, as we often learn in machine-learning classes, but in fact to handle multicollinearity.)
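Here is a toy illustration of both the instability and one way to tame it (a sketch using NumPy and scikit-learn; the AR(1) predictor, the ground truth y(t) = 3x(t-1) + noise, and all parameter values, including the penalty strengths, are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Slowly varying predictor: an AR(1) process, so x(t) is very close to x(t-1).
n, T = 500, 5
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.999 * x[t - 1] + rng.normal(scale=0.1)

X = np.column_stack([x[T - k : n - k] for k in range(1, T + 1)])  # lags 1..T
y = 3.0 * x[T - 1 : n - 1] + rng.normal(scale=0.5, size=n - T)    # true filter: all weight on lag 1

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
w_ridge = Ridge(alpha=10.0).fit(X, y).coef_
w_lasso = Lasso(alpha=0.1).fit(X, y).coef_

print("OLS:  ", np.round(w_ols, 2))    # tends to smear weight across lags and vary between noise realizations
print("Ridge:", np.round(w_ridge, 2))  # tends to spread similar, moderate weights over correlated lags
print("Lasso:", np.round(w_lasso, 2))  # tends to put most weight on a few lags and zero out the rest
```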
Other methods include building the filter (the weight vector w(1), ..., w(T)) from basis functions, or performing a dimensionality reduction of the data before fitting the model. You can read up on more details about multicollinearity here. The key thing is to be aware of how we're handling multicollinearity, rather than choosing a method blindly and then accidentally mistaking a consequence of the method for a property of the data.
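For instance, a sketch of the basis-function idea (the Gaussian bumps, their number, and their width are arbitrary choices for illustration; raised cosines are another popular option):

```python
import numpy as np

def gaussian_bases(T, n_bases, width):
    """Return a (T, n_bases) matrix whose columns are smooth bumps spanning lags 1..T."""
    lags = np.arange(1, T + 1)
    centers = np.linspace(1, T, n_bases)
    return np.exp(-0.5 * ((lags[:, None] - centers[None, :]) / width) ** 2)

# Hypothetical usage with the lagged design matrix X and target y from before:
# fit a few basis coefficients c instead of T free weights, then map back to the filter.
# B = gaussian_bases(T, n_bases=4, width=T / 4)
# c = np.linalg.lstsq(X @ B, y, rcond=None)[0]
# w_smooth = B @ c   # the filter is now constrained to be smooth across lags
```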
As you can see, most of the problems in analyzing real-world time-series data stem from temporal correlations.
Statistical testing
How do we argue that the result of an analysis we performed on time-series data was unlikely to have emerged by chance? While there are entire books of statistical tests to use in different situations, the hard truth is that for many modern datasets, especially for naturalistic time-series, none of these directly apply. Once again, this is because such data seldom contain clean i.i.d. (independently and identically distributed) samples, which is the baseline assumption of almost every statistical test.
What is true, however, is that all statistical testing boils down to basically one key question: what is the probability that an effect size greater than or equal to the one computed from the data would have emerged from a "null dataset"? Allen Downey, computer science professor and author of Probably Overthinking It, wrote a great blog post on this called There is only one test!
The real question, then, is not which statistical test to use, but what kind of null dataset to create. While there is no universal answer, there are at least some good rules of thumb.
The key rule of thumb is to create a null dataset that retains as many features of the data as possible, excluding the specific effect you're looking for. For example, suppose we're interested in whether two time-series x(0:T) and y(0:T) are more correlated than chance, but each has a slow autocorrelation timescale. Slow autocorrelations can easily lead to a spuriously high cross-correlation even when x and y are completely independent (exercise for the reader). In this case, to create the null dataset we'll want to retain the autocorrelations but break the cross-correlations. This means that the wrong thing to do is shuffle the timepoints within x or y, since this destroys the autocorrelations. Instead, it makes more sense to circularly shift x and y relative to one another by a random amount, with a different random shift for each instantiation of the null dataset. This will keep the autocorrelations (excluding edge effects) but break the cross-correlations, giving us a more meaningful null distribution.
In general we will want to create a large number N of samples of the null data, each generated with a different RNG seed. We then run the same analysis we ran on our real data on each null dataset to create a distribution of effect sizes from the null data. To produce a P-value we then use the null distribution of effect sizes to estimate the probability that an effect size larger than or equal to that measured in the real data would have emerged by "chance", i.e. from the null data in which the phenomenon of interest is absent by construction. In general there may be more than one reasonable type of null dataset to construct, and we can think of each type of null dataset as a different type of control.
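A minimal sketch of this recipe for the cross-correlation example above (NumPy only; x and y are assumed to be equal-length 1-D arrays, and the Pearson correlation at zero lag stands in for whatever effect size you actually care about):

```python
import numpy as np

def correlation(x, y):
    """Effect size: Pearson correlation at zero lag."""
    return np.corrcoef(x, y)[0, 1]

def circular_shift_pvalue(x, y, n_null=1000, seed=0):
    """P-value from a null ensemble that keeps autocorrelations but breaks cross-correlations."""
    rng = np.random.default_rng(seed)
    observed = abs(correlation(x, y))
    null = np.empty(n_null)
    for i in range(n_null):
        shift = rng.integers(1, len(x))            # random circular shift of y relative to x
        null[i] = abs(correlation(x, np.roll(y, shift)))
    # Probability that a null effect is at least as large as the observed one
    # (the +1's keep the estimate away from exactly zero).
    return (1 + np.sum(null >= observed)) / (1 + n_null)
```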