The 40x performance improvement in this example is the combined result of a series of smaller improvements.

While working on a small project, I wanted to check what could be improved. As usual, I ran the IntelliJ profiler from time to time to see the hottest methods. That was part of my development workflow, alongside TDD.

2x speedup – remove String.format() usage (technical optimization)

git commit

Remove usage of String.format("Move index %s from %s to %s", indexFrom, listFrom, listTo);

It’s not surprising that it was slow, but discovering it is not obvious. Imagine you’re introduced to a new, big project: there is no way you would eyeball the code and say this is a bottleneck (hot method).
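One way to keep such a message without paying for String.format() on every call is plain concatenation, optionally deferred behind a Supplier so the string is built only when someone actually reads it. This is a minimal sketch, not the content of the commit: the class and method names are hypothetical, and it assumes the message is only needed for diagnostics.

```java
import java.util.List;
import java.util.function.Supplier;

public class MoveMessage {

    // Original style: the format string has to be interpreted on every call
    public static String formatted(int indexFrom, List<Integer> listFrom, List<Integer> listTo) {
        return String.format("Move index %s from %s to %s", indexFrom, listFrom, listTo);
    }

    // Plain concatenation produces the same text without format-string handling
    public static String concatenated(int indexFrom, List<Integer> listFrom, List<Integer> listTo) {
        return "Move index " + indexFrom + " from " + listFrom + " to " + listTo;
    }

    // Deferred variant: nothing is built unless the caller invokes get()
    public static Supplier<String> lazy(int indexFrom, List<Integer> listFrom, List<Integer> listTo) {
        return () -> concatenated(indexFrom, listFrom, listTo);
    }

    public static void main(String[] args) {
        var from = List.of(1, 2, 3);
        var to = List.of(4, 5);
        System.out.println(formatted(1, from, to));
        System.out.println(concatenated(1, from, to));
        System.out.println(lazy(1, from, to).get());
    }
}
```

Whether concatenation or complete removal is the better choice depends on whether anything still reads the message; the profiler is what tells you if the cost is worth keeping.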

10x speedup – replace streams with for each (technical optimization).

git commit

Replace

return calculators.stream().mapToInt(c -> c.calculate(listA, listB)).sum();

With

  int sum = 0;
  for (var calculator : calculators) {
      sum += calculator.calculate(listA, listB);
  }
  return sum;
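Before swapping one variant for the other, it is worth pinning down that both return the same value. The sketch below does that with two made-up calculators; the Calculator interface is a hypothetical stand-in for the project's type, inferred only from the calculate(listA, listB) call above.

```java
import java.util.List;

public class SumOfCalculators {

    // Hypothetical stand-in for the project's calculator type
    public interface Calculator {
        int calculate(List<Integer> listA, List<Integer> listB);
    }

    // Stream-based variant, as in the original code
    public static int withStream(List<Calculator> calculators, List<Integer> listA, List<Integer> listB) {
        return calculators.stream().mapToInt(c -> c.calculate(listA, listB)).sum();
    }

    // Loop-based variant, as in the replacement
    public static int withLoop(List<Calculator> calculators, List<Integer> listA, List<Integer> listB) {
        int sum = 0;
        for (var calculator : calculators) {
            sum += calculator.calculate(listA, listB);
        }
        return sum;
    }

    // Runs both variants on the same made-up input and returns {streamResult, loopResult}
    public static int[] demo() {
        List<Calculator> calculators = List.of(
                (a, b) -> Math.abs(a.size() - b.size()),
                (a, b) -> a.get(0) + b.get(0));
        var listA = List.of(1, 2, 3);
        var listB = List.of(10, 20);
        return new int[] {
                withStream(calculators, listA, listB),
                withLoop(calculators, listA, listB)};
    }

    public static void main(String[] args) {
        int[] results = demo();
        System.out.println("stream=" + results[0] + " loop=" + results[1]);
    }
}
```

A check like this only proves equivalence, not speed; the speedup itself should be confirmed by profiling again after the change.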

Bottleneck (hot method) found

2x speedup – add sum caching (technical optimization).

git commit

Replace

  private int sum(List<Integer> list) {
      int result = 0;
      for (var i : list) {
          result += i;
      }
      return result;
  }

With a sum caching java.util.List decorator, so that the caller can use

  return Math.abs(listA.sum() - listB.sum());

  public class SumCachingList implements SummingList {
      private final List<Integer> decorated;

      public SumCachingList(List<Integer> decorated) {
          this.decorated = decorated;
      }

      private int sum;
      private boolean sumCalculated = false;

      @Override
      public int sum() {
          if (!sumCalculated) {
              calculateSum();
              sumCalculated = true;
          }
          return sum;
      }

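      private void calculateSum() {
          // Start from zero so a recalculation after invalidation does not add to a stale value
          sum = 0;
          for (var i : decorated) {
              sum += i;
          }
      }

      @Override
      public boolean add(Integer integer) {
          // Make sure the cache is populated before the list changes, then adjust it
          sum = sum() + integer;
          return decorated.add(integer);
      }

      @Override
      public boolean remove(Object o) {
          // Populate the cache before removal; computing it afterwards would subtract the element twice
          var sumBefore = sum();
          var removed = decorated.remove(o);
          if (removed) {
              sum = sumBefore - (Integer) o;
          }
          return removed;
      }

      @Override
      public boolean removeIf(Predicate<? super Integer> filter) {
          var anyRemoved = decorated.removeIf(filter);
          if (anyRemoved) {
              // The removed elements are unknown here, so invalidate and recalculate lazily
              sumCalculated = false;
          }
          return anyRemoved;
      }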

      // ...
  }
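A cache like this is only safe if it always agrees with a full recalculation, including after mixed add, remove and removeIf calls. The sketch below checks that invariant. It uses a reduced, hypothetical CachedSum class that exposes only the four operations it exercises instead of the full List interface, so it is an illustration of the check rather than the project's decorator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class CachedSumCheck {

    // Reduced, hypothetical variant of the decorator: only the operations exercised below
    public static class CachedSum {
        private final List<Integer> decorated;
        private int sum;
        private boolean sumCalculated = false;

        public CachedSum(List<Integer> decorated) {
            this.decorated = decorated;
        }

        public int sum() {
            if (!sumCalculated) {
                sum = 0; // recompute from scratch, ignoring any stale value
                for (var i : decorated) {
                    sum += i;
                }
                sumCalculated = true;
            }
            return sum;
        }

        public boolean add(Integer value) {
            sum = sum() + value;
            return decorated.add(value);
        }

        public boolean remove(Integer value) {
            int before = sum();
            // Cast to Object so the element is removed, not the item at that index
            boolean removed = decorated.remove((Object) value);
            if (removed) {
                sum = before - value;
            }
            return removed;
        }

        public boolean removeIf(Predicate<? super Integer> filter) {
            boolean anyRemoved = decorated.removeIf(filter);
            if (anyRemoved) {
                sumCalculated = false; // recalculated lazily on the next sum()
            }
            return anyRemoved;
        }
    }

    // Reference value the cache must always agree with
    public static int freshSum(List<Integer> list) {
        int result = 0;
        for (var i : list) {
            result += i;
        }
        return result;
    }

    // Mixes all three mutations and returns {cachedSum, freshSum}
    public static int[] scenario() {
        var backing = new ArrayList<>(List.of(1, 2, 3, 4));
        var cached = new CachedSum(backing);
        cached.remove(2);             // backing is now [1, 3, 4]
        cached.removeIf(i -> i > 3);  // backing is now [1, 3], cache invalidated
        cached.add(10);               // backing is now [1, 3, 10]
        return new int[] {cached.sum(), freshSum(backing)};
    }

    public static void main(String[] args) {
        int[] result = scenario();
        System.out.println("cached=" + result[0] + " fresh=" + result[1]);
    }
}
```

The cases worth covering are a mutation before the first sum() call and a mutation right after an invalidating removeIf, because those are the paths where the cached value and the list contents can drift apart.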

After the above optimizations, the flamegraph looks like this

Big picture optimisation VS technical optimisation

OK, now you see that you can discover bottlenecks using an async-profiler recording. The above example is a rather easy one. For huge systems with tons of legacy code that is hard to maintain and change, it might be really hard.

Even discovering slow parts of big systems is hard, but that’s a topic for another blog post.

Technical optimizations are often easier to find.

Technical optimization

You don’t necessarily have to understand what the application is doing. You can read straight from the flamegraph that adding a cache or optimizing loops is going to help.

Big picture optimisation

You need to understand what your application is doing. You can optimize by organizing processing in a different way, e.g. discovering that you don’t have to fetch and process some data in order to display a particular piece of the frontend.

Run integration tests with profiler regularly

Profiling should be part of your development process, just like running integration tests.

Ideally, the integration test scenario should be as similar as possible to the production scenario. This way you’d discover potential performance improvements or issues before production deployment.
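One way to make this routine is to wrap the test scenario in a recording, so that every run leaves a JFR file behind. The sketch below uses the JDK's built-in Flight Recorder API (jdk.jfr) rather than async-profiler, which is a swapped-in technique chosen because it needs no external agent. The class name, the output location and the 10 ms sampling period are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

public class ProfiledScenario {

    // Runs the scenario while a flight recording is active and writes the result to a .jfr file
    public static Path recordWhileRunning(Runnable scenario) throws IOException {
        // Hypothetical output location: a fresh temp directory per run
        Path output = Files.createTempDirectory("profiles").resolve("integration-test.jfr");
        try (Recording recording = new Recording()) {
            // jdk.ExecutionSample is the JDK's method sampling event
            recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
            recording.start();
            scenario.run();
            recording.stop();
            recording.dump(output);
        }
        return output;
    }

    public static void main(String[] args) throws IOException {
        Path file = recordWhileRunning(() -> {
            // Placeholder workload standing in for a real integration test scenario
            long total = 0;
            for (int i = 0; i < 5_000_000; i++) {
                total += i % 7;
            }
            System.out.println("workload result: " + total);
        });
        System.out.println("recording written to " + file);
    }
}
```

The resulting file can be opened in a tool that reads JFR recordings, which keeps the analysis step the same regardless of which profiler produced the file.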

Know the difference between async profiler sampling modes: `Wall Clock` and `CPU` usage only

Wall Clock (Total Time)

If you’re interested in the real latency of your system, including

IO, e.g.

connection polling
DB transaction locks
reads/writes

synchronization, e.g.

waiting for critical section access
waiting for tasks in a thread pool

use Wall Clock sampling in async-profiler. This mode also collects samples from threads in the SLEEPING state.

Beware that it affects the performance of the measured process more than sampling only ACTIVE threads. Why? Because there are more threads in the measured JVM to iterate over at every sampling interval.

CPU only

In other words, without Wall Clock mode, the profiler samples only JVM threads in the ACTIVE state.

In the examples above you can see the hot method `httpClient.newCall(request).execute()`

Time spent measured with CPU Time is 120 ms
Time spent measured with Wall Clock (Total Time) is 9170 ms

Would you optimize CPU time in this case? Probably not. You’d first focus on IO as it’s the bottleneck.

What to optimize needs to be chosen case by case. To be able to make this choice, you need both CPU Time and Total Time in the flamegraph, and that is available only when using wall clock mode sampling.
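The gap between the two numbers can be reproduced without a profiler by measuring one blocking call both ways. The sketch below is an illustration under assumptions: Thread.sleep stands in for a blocking HTTP call, and the class and method names are hypothetical. It relies on ThreadMXBean.getCurrentThreadCpuTime(), which is part of the JDK but may be unsupported or disabled on some JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWallClock {

    // Returns {wallClockMillis, cpuMillis} for running the task on the current thread
    public static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();
        task.run();
        long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
        long wallNanos = System.nanoTime() - wallStart;
        return new long[] {wallNanos / 1_000_000, cpuNanos / 1_000_000};
    }

    // Stand-in for a blocking IO call: the thread waits, it does not compute
    public static void simulatedBlockingCall() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] result = measure(CpuVsWallClock::simulatedBlockingCall);
        System.out.println("wall clock ms: " + result[0]);
        System.out.println("cpu ms:        " + result[1]);
    }
}
```

A waiting thread accumulates wall clock time but almost no CPU time, which is why a CPU-only profile can make an IO-bound method look cheap.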

Example of 40x speedup using async-profiler recording analysis (JFR file)

Mikolaj Grzaslewicz
