Getting started with a dataset

Rich Pang
2025-07-14

Have a dataset in your hands you're not sure what to do with? It's great to follow your nose, but if you find yourself running in circles, the following may help expedite your journey.

Build a mental model of the data

The better your mental model of the data, the easier it is to simulate analyses in your head, at the bus stop or in the shower, to see which ones are more or less likely to produce something interesting vs hit dead ends.

Plot a bunch of typical and atypical examples from your data

Knowing what your raw (or lightly pre-processed) data looks like is crucial for interpreting downstream analysis results. Which examples represent "typical" system behavior that would help someone else get a feel for what you're working with? Which "atypical" examples are likely caused by artifacts vs something truly interesting? Make sure there is clear signal in the data, or if not, that you have strong reason to believe a clear signal can be extracted with the proper methods.
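For example, a minimal matplotlib sketch along the lines below can help. It assumes the data have been loaded into a hypothetical `trials` array of shape (n_trials, n_timepoints), and uses per-trial variance as one crude, illustrative way to separate "typical" from "atypical" examples:

```python
# Minimal sketch: plot a handful of typical and atypical trials side by side.
# `trials` is a hypothetical (n_trials, n_timepoints) array; the placeholder
# data and the variance-based selection are purely illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
trials = rng.standard_normal((200, 500)).cumsum(axis=1)  # placeholder data

per_trial_var = trials.var(axis=1)
typical = np.argsort(np.abs(per_trial_var - np.median(per_trial_var)))[:3]
atypical = np.argsort(per_trial_var)[-3:]  # most extreme by this crude metric

fig, axs = plt.subplots(2, 3, figsize=(9, 4), sharex=True, sharey=True)
for ax, idx in zip(axs[0], typical):
    ax.plot(trials[idx])
    ax.set_title(f"typical trial {idx}")
for ax, idx in zip(axs[1], atypical):
    ax.plot(trials[idx])
    ax.set_title(f"atypical trial {idx}")
fig.tight_layout()
plt.show()
```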

Learn as much as you can about how the data was collected

Datasets hide a ton of experimental context. The better you understand the experimental context, the better you will understand your data.

Compute all the basic stats

Before you do anything sophisticated, compute all the basic histograms and correlations. How many trials are there? How many timepoints per trial? How big are the signals you're looking at? What are the basic timescales of the system? Which features are real vs artifacts? If you feel the urge to do anything remotely sophisticated, don't do it yet, but write down the idea and why it might be interesting.
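As a rough illustration, a sketch like the following covers several of these basics at once: counts, signal amplitudes, histograms, a correlation, and a crude timescale estimate. The `trials` array and `meta` table are hypothetical placeholders, not a prescribed format:

```python
# Minimal sketch of "all the basic stats", assuming the data have been loaded
# into a trials x timepoints array plus a per-trial metadata table.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
trials = rng.standard_normal((200, 500))                     # placeholder signals
meta = pd.DataFrame({"condition": rng.choice(["A", "B"], 200),
                     "reaction_time": rng.gamma(2.0, 0.2, 200)})

print("n_trials:", trials.shape[0], "| timepoints per trial:", trials.shape[1])
print("signal amplitude (5th-95th pct):", np.percentile(trials, [5, 95]).round(2))
print("trials per condition:\n", meta["condition"].value_counts())

# Basic histograms and a simple correlation between two per-trial quantities.
fig, axs = plt.subplots(1, 3, figsize=(10, 3))
axs[0].hist(trials.mean(axis=1), bins=30)
axs[0].set_xlabel("trial mean")
axs[1].hist(meta["reaction_time"], bins=30)
axs[1].set_xlabel("reaction time (s)")
axs[2].scatter(trials.std(axis=1), meta["reaction_time"], s=8)
axs[2].set_xlabel("trial std")
axs[2].set_ylabel("reaction time (s)")
r = np.corrcoef(trials.std(axis=1), meta["reaction_time"])[0, 1]
axs[2].set_title(f"r = {r:.2f}")
fig.tight_layout()
plt.show()

# A rough autocorrelation gives a first handle on the system's timescales.
sig = trials[0] - trials[0].mean()
acf = np.correlate(sig, sig, mode="full")[len(sig) - 1:] / (sig @ sig)
plt.plot(acf[:100])
plt.xlabel("lag (samples)")
plt.ylabel("autocorrelation")
plt.show()
```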

Make an executive summary of the data

Collect all your data examples, histograms, and correlations, as well as all your notes about crucial experimental context. Write down everything that's an obvious artifact or confound. Save it all as a PDF, then print it out. This is your basic guide to the data that will serve as reference for yourself and anyone else who wants to work with the data later.

Internalize the executive summary

Memorize all the most important numbers and obvious features of the data.

Write down everything obviously interesting

Before you do anything else quantitative, make a list of everything that seems interesting about the data.

Interesting means not predictable

Either via your own intuition or according to existing literature and dominant scientific thinking. When it comes to publishing, it's the way of thinking in the field that should set the stage. It's quite possible for you to think some feature of your data is completely sensible, but the field thinks otherwise. Spend time reading and getting up to speed about what typical experts in your field would expect to see in the data. Sometimes the literature doesn't explicitly state the field's expectations, so watch talks (or talk to experts) to find out how people are really thinking beyond what they write in their papers.

Make predictions about your data

Go through every element of your executive summary again, making predictions about what an expert in the field would expect to see. Would the field predict that typical data examples look the way they do? What shapes would the field predict for the histograms? What correlations would experts predict? For each prediction, note whether it holds up or is violated by the data. Results that match your predictions are great sanity checks, and help convey soundness. The violations, however, are the interesting bits, so long as they're not caused by uninteresting confounds.

Data can be made interesting through model comparison

If there's nothing obviously interesting (unlikely), either select a new dataset or take a model comparison approach. Sometimes the data is high quality but the most interesting phenomena are hidden below the surface. A tried-and-true approach to making data interesting is to identify two mutually exclusive models that are both reasonable in the field, and ask which one better explains the data (or which models explain which aspects of the data). This lets you "triangulate" where the data live in a space of models. If this is your approach, think carefully about which reasonable models to compare, and how you would compare them against the data.
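As one hedged illustration of what such a comparison can look like in practice, the sketch below fits two simple candidate models (exponential vs lognormal) to a set of inter-event intervals and compares them by AIC; the variable names and data-generating choice are purely for demonstration:

```python
# Hedged sketch of a model comparison: given observed inter-event intervals,
# ask whether an exponential or a lognormal model better explains them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
intervals = rng.lognormal(mean=-1.0, sigma=0.7, size=500)  # placeholder data

# Fit each candidate model by maximum likelihood (location fixed at 0).
expon_params = stats.expon.fit(intervals, floc=0)
lognorm_params = stats.lognorm.fit(intervals, floc=0)

ll_expon = stats.expon.logpdf(intervals, *expon_params).sum()
ll_lognorm = stats.lognorm.logpdf(intervals, *lognorm_params).sum()

# AIC penalizes each model by its number of fitted parameters.
aic_expon = 2 * 1 - 2 * ll_expon      # scale only
aic_lognorm = 2 * 2 - 2 * ll_lognorm  # shape + scale

print(f"AIC exponential: {aic_expon:.1f}")
print(f"AIC lognormal:   {aic_lognorm:.1f}  (lower is better)")
```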

Quantify the interesting phenomena

Statistics allow us to summarize how the interesting phenomena we have observed extend throughout the dataset, while conveying additional, precise information beyond what is observable with the naked eye. The most straightforward path to a publishable result is to find something interesting (see above) and quantify it so that your audience can get a sense of how the phenomenon permeates your dataset.
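One simple, broadly applicable way to do this is to compute a per-trial effect size and report it with a bootstrap confidence interval, as in the sketch below; `effect_a` and `effect_b` are hypothetical placeholders for whatever per-trial metric you extracted:

```python
# Minimal sketch: quantify how a phenomenon extends across the dataset, here
# as a difference of means between two conditions with a bootstrap CI.
import numpy as np

rng = np.random.default_rng(3)
effect_a = rng.normal(1.0, 0.5, size=120)   # metric in condition A, per trial
effect_b = rng.normal(1.3, 0.5, size=110)   # metric in condition B, per trial

observed_diff = effect_b.mean() - effect_a.mean()

# Resample trials with replacement to estimate uncertainty in the difference.
n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    boot_diffs[i] = (rng.choice(effect_b, size=effect_b.size).mean()
                     - rng.choice(effect_a, size=effect_a.size).mean())
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"difference of means: {observed_diff:.2f} "
      f"(95% bootstrap CI: [{ci_low:.2f}, {ci_high:.2f}])")
```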

Do simple analyses

It can be tempting to use sophisticated methods to quantify your results, but it's usually easier and more robust to do the simple thing first. Besides, when it comes to writing the paper, the simple methods will serve as an important reference point, either to guide the more sophisticated methods or to clarify why the simple method is insufficient. If you want to do something more sophisticated, make sure you can really justify why it's a necessary approach. Waiting a bit before starting a sophisticated approach can also help streamline the eventual tasks involved, since your unconscious may give you a lot of hints along the way before you actually sit down and start coding.

Make paper-ready figures, not just plots

One tried-and-true figure archetype includes one or a few examples of the interesting phenomenon, a schematic showing how you quantify an effect size or metric from the raw data, and a familiar type of plot (e.g. histograms, correlations, etc.) showing the effect size/metric across your dataset. Make sure to have clear text labels so that all elements of the figure are as unambiguous as possible.
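A skeletal version of that archetype, with placeholder data and a stand-in panel where a drawn schematic would go, might be laid out like this:

```python
# Skeletal figure archetype: A) example of the phenomenon, B) placeholder for
# a schematic of how the effect size is computed, C) effect size across the
# dataset. All data here are placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
example = rng.standard_normal(300).cumsum()
effect_sizes = rng.normal(0.4, 0.2, size=150)

fig, (ax_a, ax_b, ax_c) = plt.subplots(1, 3, figsize=(10, 3))

ax_a.plot(example)
ax_a.set(title="A  Example trial", xlabel="time (samples)", ylabel="signal")

ax_b.axis("off")  # in the real figure this panel would hold a drawn schematic
ax_b.text(0.5, 0.5, "B  Schematic of\neffect-size computation",
          ha="center", va="center")

ax_c.hist(effect_sizes, bins=25)
ax_c.axvline(0, ls="--", color="k")
ax_c.set(title="C  Effect size across trials", xlabel="effect size",
         ylabel="number of trials")

fig.tight_layout()
fig.savefig("figure_1.pdf")  # vector format keeps text crisp in the paper
```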

Well-put-together figures are also crucial for presentations, even when you're not yet at the stage of writing up the paper. It may seem like unnecessary work to make nice schematics and add all the annotations for a lab meeting or committee meeting, but doing so will be extremely useful for your audience, since they all come with priors on what they expect to see in figures in their field. This can make the difference between having your presentation get derailed for an hour in a completely irrelevant direction, vs everyone understanding what you've done and providing genuinely helpful feedback.

Write why your results are interesting

For each interesting phenomenon, explain the context, and articulate what prediction or predictions an expert in the field would have made, ideally with rationale and citations. Then show how your results either violate the predictions or were not obviously predictable. Then explain the significance: how the thinking in the field should be revised in light of what you've seen in the data.

Stay flexible

New questions may arise as the process continues. Research is highly iterative, and interesting data will be full of surprises. Keep a continually growing list of new ideas or curious observations, but hold off on diving in immediately. Letting ideas marinate for a while before execution can be extremely useful for pursuing them more thoughtfully and skillfully.

Beware the twilight zone

The twilight zone is when one applies a complex method to the data, not knowing exactly what to expect, then tries to interpret the results. Dimensionality reduction and clustering techniques are a common example. While these can sometimes be sound and powerful, and are often touted as letting the data speak for itself rather than going in with overly biased hypotheses, it is easy to mistake a feature of the method for a feature of the data. For instance, quantities like "number of clusters" are often a consequence of a parameter in the clustering algorithm (e.g. one specifying cluster size), rather than a reflection of the "true" number of clusters or hidden states in a dataset.

The best way to verify results obtained in the twilight zone is to run the same methods on artificial datasets that have the same format as your real data but in which more of the ground truth is known; these serve as controls. Understanding what your method returns for the artificial data is fundamental for interpreting the results when you apply it to the real data. This will give your results much more meaning and soundness, and will be crucial to convincing your audience that your findings mean what you think they do.
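As a hedged sketch of what such a control can look like, the snippet below runs one common clustering pipeline (KMeans plus a silhouette-score heuristic, chosen only for illustration) on two synthetic controls: one with a known number of clusters and one with no cluster structure at all:

```python
# "Twilight zone" control: run the same clustering pipeline on synthetic data
# with known ground truth before trusting the cluster count it reports on
# your real data. The pipeline here is just one common, illustrative choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)

def best_k(data, k_range=range(2, 9)):
    """Return the k maximizing the silhouette score, a common heuristic."""
    scores = {k: silhouette_score(data, KMeans(k, n_init=10,
                                               random_state=0).fit_predict(data))
              for k in k_range}
    return max(scores, key=scores.get)

# Control 1: ground truth of 3 well-separated clusters.
clustered = np.vstack([rng.normal(loc, 0.3, size=(100, 2))
                       for loc in ([0, 0], [3, 0], [0, 3])])
# Control 2: no cluster structure (a single Gaussian blob).
unstructured = rng.normal(0, 1, size=(300, 2))

print("clustered control    -> best k:", best_k(clustered))
print("unstructured control -> best k:", best_k(unstructured))
# If the pipeline reports a confident k on the unstructured control, treat the
# same output on the real data with suspicion.
```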