Handling Large Datasets in Spring Boot: A Quick How-to Guide

When running queries in a Spring Boot application, you have to care not only about the logic and structure of the data, but also about the memory limits of the system. The standard answer is paginated queries. There is an alternative, though, useful for certain scenarios: Java Streams. In this article, you will see how to make JPA repositories return streams and how to use them.

The Problem

Using Spring Data repository classes is the usual way to perform queries in Spring, coupled with JPA. For instance, you can define a standard query like the following one, where User is an entity mapped to some database table:

List<User> users = userRepository.findAll();


JPA with Hibernate will execute the query and load every row into memory.


With a large number of rows, memory usage can grow to the point that the application raises an OutOfMemoryError. Furthermore:

  • Garbage collection will be slower
  • The response time grows, because the business logic cannot start until the whole result set is loaded

Query Pagination Solution

The most common option to deal with large datasets is to use paginated queries:

Page<User> page = userRepository.findAll(PageRequest.of(0, 100));


With this approach, you can traverse the dataset back and forth in chunks of a predefined size. This is a good approach for scenarios such as:

  • APIs
  • User Interfaces
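
The typical paging loop walks the dataset one chunk at a time until an empty page comes back. The sketch below shows the pattern over a plain in-memory list so it stays self-contained; in a real application, findPage would be replaced by repo.findAll(PageRequest.of(page, size)) and the loop condition by page.hasNext():

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PagingDemo {

    // Stand-in for repo.findAll(PageRequest.of(page, size)):
    // returns the requested fixed-size slice, or an empty list past the end.
    static List<String> findPage(List<String> data, int page, int size) {
        int from = page * size;
        if (from >= data.size()) {
            return List.of();
        }
        return data.subList(from, Math.min(from + size, data.size()));
    }

    public static void main(String[] args) {
        List<String> users = IntStream.rangeClosed(1, 250)
                .mapToObj(i -> "User_" + i)
                .collect(Collectors.toList());

        int page = 0;
        int processed = 0;
        List<String> chunk;
        // Process one chunk of 100 at a time until the pages run out
        while (!(chunk = findPage(users, page, 100)).isEmpty()) {
            processed += chunk.size();
            page++;
        }
        System.out.println(processed); // 250
    }
}
```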

Stream Solution

What if you don’t need to browse the dataset chunk by chunk, but instead want to start processing continuously as the data comes in? For this scenario, Spring Data provides a specific feature: a repository query method can return a stream instead of a simple list, like in the following example:

public interface UserRepository extends JpaRepository<User, Long> {

    @Query("SELECT u FROM User u")
    Stream<User> streamAllUsers();
}


You can then use the repository method, like in the following example:

@Transactional(readOnly = true)
public void processUsers() {
    try (Stream<User> stream = userRepository.streamAllUsers()) {
        stream.forEach(user -> {
            // process one user at a time
        });
    }
}


Internally, the result is fetched in chunks through a JDBC cursor, and only a small subset is held in memory at any time. Your method can start processing immediately.


Stream results are a good fit for scenarios such as:

  • Batch jobs
  • Data migration
  • ETL pipelines
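
In those scenarios, a stream result plugs directly into a transform-and-load pipeline. The sketch below uses a plain in-memory Stream so it runs standalone; in a real job the source would be userRepository.streamAllUsers(), and the transform step (here a hypothetical string normalization) would be whatever your pipeline needs:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EtlSketch {

    // Hypothetical transform step: normalize a raw record
    static String transform(String raw) {
        return raw.trim().toUpperCase();
    }

    public static void main(String[] args) {
        // In-memory stand-in for a repository stream; still closed
        // with try-with-resources, as a repository stream must be
        try (Stream<String> source = Stream.of(" alice ", "bob", " carol")) {
            List<String> loaded = source
                    .map(EtlSketch::transform)     // transform
                    .filter(s -> !s.isEmpty())     // clean
                    .collect(Collectors.toList()); // load (e.g., a batch write)
            System.out.println(loaded); // [ALICE, BOB, CAROL]
        }
    }
}
```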


Avoid streaming when:

  • You need random access (e.g., sorting in memory)
  • You reuse data multiple times
  • You need the full dataset for calculations


Streaming shifts the system from memory-bound to I/O-bound.

Important Points to Remember

The code that uses the stream result must be inside a transaction:

Without one, the database connection can be closed before the stream is fully consumed. So you have to mark the method with the proper annotation:

@Transactional(readOnly = true)


Always close the stream:

Streams hold database resources that must be properly released. The try-with-resources block in the earlier example does this automatically.


Avoid letting the persistence context grow:

Even with streaming, Hibernate keeps every loaded entity in the persistence context, so memory can still build up.

One option to avoid this is to detach each entity after it is processed:

@PersistenceContext
EntityManager em;

stream.forEach(user -> {
    process(user);
    em.detach(user); // prevents memory buildup
});


Another option is to clear the persistence context periodically:

AtomicInteger counter = new AtomicInteger();

stream.forEach(user -> {
    process(user);

    if (counter.incrementAndGet() % 100 == 0) {
        em.clear();
    }
});


Use a fetch size (critical for real streaming):

Without it, some drivers (PostgreSQL, for example) still load the whole result set at once. You can set it globally with:

spring.jpa.properties.hibernate.jdbc.fetch_size=100
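
Besides the global property, the fetch size can also be set per query with a JPA query hint on the repository method. A sketch, assuming Hibernate as the provider (the hint name org.hibernate.fetchSize is Hibernate-specific, and the QueryHint annotation comes from jakarta.persistence in Spring Boot 3, javax.persistence in older versions):

```java
public interface UserRepository extends JpaRepository<User, Long> {

    // Sets the JDBC fetch size for this query only,
    // overriding the global hibernate.jdbc.fetch_size property
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "100"))
    @Query("SELECT u FROM User u")
    Stream<User> streamAllUsers();
}
```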

Implementation

As an example, we can summarize the above discussion with a Spring Boot application that, on startup, inserts a large number of rows into a table and then runs some business logic on the generated data:

@SpringBootApplication
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Bean
    CommandLineRunner run(UserRepository repo, UserService service) {
        return args -> {
            // Generate test data
            for (int i = 1; i <= 10000; i++) {
                repo.save(new User("User_" + i));
            }

            System.out.println("=== DATA GENERATED ===");

            // Process using streaming
            service.processUsers();
        };
    }
}


processUsers runs a repository method that returns a stream:

public interface UserRepository extends JpaRepository<User, Long> {

    @Query("SELECT u FROM User u")
    Stream<User> streamAllUsers();
}


And is defined as:

    @Transactional(readOnly = true)
    public void processUsers() {
        AtomicInteger counter = new AtomicInteger();

        try (Stream<User> stream = repository.streamAllUsers()) {
            stream.forEach(user -> {
                System.out.println(user.getName());

                if (counter.incrementAndGet() % 100 == 0) {
                    em.clear();
                }
            });
        }
    }


Note that the method is transactional, the stream is handled in a try-with-resources block so it is closed automatically when done, and the persistence context is cleared periodically through the Hibernate entity manager.

Conclusion

Java Streams are a viable alternative to paginated queries for handling large datasets in those scenarios in which continuous processing is required. You can easily handle them in a Spring Boot application using the feature provided by Spring Data repositories. You can find an example on GitHub.
