Tallinn University of Technology

In the digital world, there is often a default assumption that the more data we collect and store, the better. In reality, however, large volumes of data do not necessarily mean better analysis, smarter decisions, or more efficient systems. On the contrary, indiscriminate data collection can make systems complex, expensive, and difficult to manage, while also increasing their environmental footprint. Associate Professor Kristina Vassiljeva from the Department of Computer Science explains why the goal should not be the maximum amount of data, but rather meaningful and informative datasets, and how to make conscious decisions about which data should be stored and which should not.

Kristina Vassiljeva, photo: TalTech

How do you approach managing data volume and digital space in your personal work as well as in the design of systems or solutions?

I work daily with system modeling and analysis using machine learning and artificial intelligence methods. To model a system, enough data is needed to observe its dynamics — that is, how the system changes in different situations. If the data is too uniform and the values hardly change, then we cannot learn the system’s real behavior.

At the same time, more data is not always better. With neural networks, a dataset that is large but redundant or unbalanced can still lead to overfitting: the model “learns” the specific details of the training dataset, including random noise. The result is high accuracy on the existing data but a poor ability to generalize when faced with new data.
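The effect is easy to demonstrate on toy data. Below is a minimal sketch, assuming NumPy; the data and models are purely illustrative, not from the interview. An overly flexible model fits the noise in its training points exactly and then performs worse on fresh data from the same process, while a simple model generalizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a simple linear trend plus random measurement noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test + rng.normal(0, 0.2, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model on the given data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A flexible model (degree-9 polynomial) can memorize the training
# points, noise included; a straight line matches the true trend.
flexible = np.polyfit(x_train, y_train, deg=9)
simple = np.polyfit(x_train, y_train, deg=1)

print("flexible model, train error:", mse(flexible, x_train, y_train))
print("flexible model, test error: ", mse(flexible, x_test, y_test))
print("simple model,   test error: ", mse(simple, x_test, y_test))
```

The flexible model's training error is near zero precisely because it has absorbed the noise; its error on new data is markedly worse, which is the generalization failure described above.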

Good modeling does not mean the maximum amount of data, but rather an informative dataset that covers the important operating modes of the system and supports reliable generalization.

What have you gained from preferring smaller data volumes and thoughtful storage?

The main benefit is clarity. A smaller amount of data means that information is easier to find and understand. When there are fewer files and datasets, you don’t have to spend time sorting through dozens of versions or searching for important information among large amounts of irrelevant material. This speeds up decision-making and reduces the chance of mistakes.

There is also “technical debt”: the more disorganized and poorly considered data accumulates, the more complex the system becomes, and the more time and money must later be spent putting it in order. A smaller, well-considered dataset keeps systems clearer and easier to manage.

Storage and backups are not free — cloud space, servers, and maintenance cost money. The more data there is, the higher the costs today and in the future.

In addition, every piece of data also has an environmental impact. Data storage and processing take place on servers that consume electricity and require cooling. Therefore, every stored and processed byte actually uses resources — even if we do not notice it in everyday life.

Data accumulation is often justified with the thought “maybe we’ll need it someday.” How do you decide what is worth storing and what is not?

When deciding, I rely on three simple questions:

  • Is this information truly needed for making a specific decision, conducting an analysis, or presenting a report?

  • If we do not store it now, would collecting or recreating it later be difficult and time-consuming? This is especially important in modeling and forecasting, because forecasts are always based on historical data. If the necessary history has not been stored, it cannot be reconstructed later.

  • Does a law or contract require that this information be retained?

If the answer to all three questions is “no,” then it does not make sense to keep the dataset.

The thought “maybe we’ll need it someday” alone is not a sufficient reason. Every piece of stored information later means additional work, costs, and responsibility, so storing it should always have a clear and justified purpose.
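The three-question rule above can be written down as a tiny Python sketch; the function and argument names are illustrative, not from any real retention system:

```python
def should_keep(needed_for_decision: bool,
                hard_to_recreate: bool,
                legally_required: bool) -> bool:
    """Keep a dataset only if at least one of the three questions
    is answered 'yes'; 'maybe we'll need it someday' does not count."""
    return needed_for_decision or hard_to_recreate or legally_required

# All three answers are 'no' -> the dataset is not worth keeping.
print(should_keep(False, False, False))
```

The point of the sketch is the `or`: a single "yes" justifies storage, but the absence of all three leaves no justification at all.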

What is one principle or practical recommendation for those who respond to growing data volumes simply by buying more cloud storage?

Cloud space is not a strategy — it is a technical tool.

Before increasing storage capacity, it is worth thinking about:

  • which data is actually needed and for what purpose,
  • how often it should be collected (every second, every minute, or once a day),
  • how long it truly needs to be stored,
  • and whether some of the data can be processed immediately and only summaries stored.

Often it is not necessary to retain the entire raw dataset; aggregated indicators that support decisions and analysis may be sufficient.
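To make this concrete, here is a small Python sketch; the sensor, sampling rate, and field names are hypothetical. Raw readings taken every five seconds are reduced to hourly minimum, maximum, mean, and count, and only those summaries would be kept:

```python
import statistics
from collections import defaultdict

# Hypothetical raw stream: (timestamp_in_seconds, temperature)
# readings sampled every five seconds over two hours.
raw = [(t, 20.0 + (t % 60) * 0.01) for t in range(0, 7200, 5)]

# Group readings into hourly buckets (3600 seconds per hour).
buckets = defaultdict(list)
for ts, value in raw:
    buckets[ts // 3600].append(value)

# Retain only aggregated indicators instead of the full raw stream.
summary = {
    hour: {
        "min": min(vals),
        "max": max(vals),
        "mean": round(statistics.mean(vals), 3),
        "count": len(vals),
    }
    for hour, vals in buckets.items()
}

print(len(raw), "raw readings ->", len(summary), "hourly summaries")
```

Here 1,440 raw readings collapse into two summary records; whether that trade-off is acceptable depends, as the interview stresses, on which decisions the data must support.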

The goal should not be the largest possible amount of data, but rather a dataset that is meaningful, manageable, and helps make better decisions.