
Example of 40x speedup using async-profiler recording analysis (JFR file)

The 40x performance improvement in this example is the result of a few improvements combined.

While working on a small project, I was trying to check what could be improved. As usual, I ran the IntelliJ profiler from time to time to see the hottest methods. That was part of my development workflow, alongside TDD.

2x speedup – remove String.format() usage (technical optimization)

git commit

Remove usage of String.format("Move index %s from %s to %s", indexFrom, listFrom, listTo);

It’s not surprising that it was slow, but discovering it is not obvious. Imagine you’ve just joined a big, unfamiliar project – there is no way you would eyeball the code and point at this bottleneck (hot method).
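As a self-contained sketch of the change (the method names slow/fast and the String parameters are illustrative stand-ins for the project's own types):

```java
public class MoveMessage {
    // Original variant: String.format parses the format string on every call.
    static String slow(int indexFrom, String listFrom, String listTo) {
        return String.format("Move index %s from %s to %s", indexFrom, listFrom, listTo);
    }

    // Replacement: plain concatenation, no format-string parsing overhead.
    static String fast(int indexFrom, String listFrom, String listTo) {
        return "Move index " + indexFrom + " from " + listFrom + " to " + listTo;
    }

    public static void main(String[] args) {
        System.out.println(fast(3, "A", "B")); // Move index 3 from A to B
    }
}
```

Both variants produce the same message; only the cost of building it differs.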

10x speedup – replace streams with a for-each loop (technical optimization)

git commit

Replace

return calculators.stream().mapToInt(c -> c.calculate(listA, listB)).sum();

With

  int sum = 0;
  for (var calculator : calculators) {
      sum += calculator.calculate(listA, listB);
  }
  return sum;

Bottleneck (hot method) found
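The two variants above can be made runnable as a self-contained sketch (the Calculator interface and sample inputs below are stand-ins for the project's own types):

```java
import java.util.List;

// Stand-in for the project's calculator abstraction.
interface Calculator {
    int calculate(List<Integer> a, List<Integer> b);
}

public class SumVariants {
    // Original variant: stream pipeline with a boxed-lambda mapToInt step.
    static int withStream(List<Calculator> calculators, List<Integer> a, List<Integer> b) {
        return calculators.stream().mapToInt(c -> c.calculate(a, b)).sum();
    }

    // Replacement: plain for-each loop, no stream pipeline overhead.
    static int withLoop(List<Calculator> calculators, List<Integer> a, List<Integer> b) {
        int sum = 0;
        for (var calculator : calculators) {
            sum += calculator.calculate(a, b);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Calculator> calcs = List.of(
                (a, b) -> a.size() + b.size(),
                (a, b) -> a.get(0) - b.get(0));
        var a = List.of(10, 20);
        var b = List.of(3);
        // Both variants must agree on the result.
        System.out.println(withStream(calcs, a, b) + " " + withLoop(calcs, a, b)); // 10 10
    }
}
```

The behavior is identical; in the hot path, the loop simply avoids the allocation and iteration machinery of the stream pipeline.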

2x speedup – add sum caching (technical optimization)

git commit

Replace

  private int sum(List<Integer> list) {
      int result = 0;
      for (var i : list) {
          result += i;
      }
      return result;
  }

With a sum caching java.util.List decorator – the hot path then becomes return Math.abs(listA.sum() - listB.sum());

  public class SumCachingList implements SummingList {
      private final List<Integer> decorated;

      public SumCachingList(List<Integer> decorated) {
          this.decorated = decorated;
      }

      private int sum;
      private boolean sumCalculated = false;

      @Override
      public int sum() {
          if (!sumCalculated) {
              calculateSum();
              sumCalculated = true;
          }
          return sum;
      }

      private void calculateSum() {
          for (var i : decorated) {
              sum += i;
          }
      }

      @Override
      public boolean add(Integer integer) {
          sum = sum() + integer;
          return decorated.add(integer);
      }

      @Override
      public boolean remove(Object o) {
          var removed = decorated.remove(o);
          if (removed) {
              sum = sum() - (Integer) o;
          }
          return removed;
      }

      @Override
      public boolean removeIf(Predicate<? super Integer> filter) {
          var anyRemoved = decorated.removeIf(filter);
          if (anyRemoved) {
              sumCalculated = false;
          }
          return anyRemoved;
      }

      // ...
  }
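A condensed, self-contained sketch of the same idea (the class name CachedSumList and the walks counter are illustrative additions, not the project's code) shows that repeated sum() calls don't re-traverse the list:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the caching idea: compute lazily once, then maintain incrementally.
class CachedSumList {
    private final List<Integer> decorated = new ArrayList<>();
    private int sum;
    private boolean sumCalculated = false;
    int walks = 0; // counts full traversals, to demonstrate the cache works

    int sum() {
        if (!sumCalculated) {
            walks++;
            sum = 0;
            for (var i : decorated) sum += i;
            sumCalculated = true;
        }
        return sum;
    }

    void add(int value) {
        sum = sum() + value; // keep the cache consistent instead of invalidating it
        decorated.add(value);
    }
}

public class CachedSumDemo {
    public static void main(String[] args) {
        var list = new CachedSumList();
        list.add(1);
        list.add(2);
        list.add(3);
        System.out.println(list.sum());  // 6
        System.out.println(list.sum());  // 6, served from the cache
        System.out.println(list.walks);  // 1 – the list was only walked once
    }
}
```

Note the trade-off visible in the full decorator above: add/remove update the cache incrementally, while removeIf only invalidates it, so the next sum() pays for one full traversal.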

After the above optimizations, the flamegraph looks like this:

Big picture optimization vs technical optimization

OK, now you can see how to discover bottlenecks using an async-profiler recording. The example above is a rather easy one. For huge systems with tons of legacy code that is hard to maintain and change, it can be really hard.

Even discovering slow parts of big systems is hard, but that’s a topic for another blog post.

Technical optimizations are often easier to find.

Technical optimization

You don’t necessarily have to understand what the application is doing. You can read straight from the flamegraph that adding a cache or optimizing loops is going to help.

Big picture optimization

You need to understand what your application is doing. You can optimize by organizing processing in a different way, e.g. discovering that you don’t have to fetch and process some data in order to display a particular piece of the frontend.

Run integration tests with profiler regularly

Profiling should be part of your development process, just like running integration tests.

Ideally, the integration test scenario should be as similar as possible to a production scenario. This way you’d discover potential performance improvements or issues before a production deployment.
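One way to wire this in is to attach async-profiler as a JVM agent when the integration tests run. The library path, output file name, and duration below are illustrative – check the options supported by your async-profiler version:

```shell
# Attach async-profiler at JVM startup and record a wall-clock JFR file
# (paths and file names are illustrative).
java -agentpath:/opt/async-profiler/lib/libasyncProfiler.so=start,event=wall,file=it-run.jfr \
     -jar app-integration-tests.jar

# Or attach to an already running JVM by PID for a 60-second recording
# (recent releases ship the asprof launcher; older ones use profiler.sh).
asprof -e wall -d 60 -f it-run.jfr <pid>
```

The resulting JFR file can then be opened in IntelliJ IDEA or JDK Mission Control for flamegraph analysis.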

Know the difference between async-profiler sampling modes: Wall Clock and CPU only

Wall Clock (Total Time)

If you’re interested in real latency of your system, including

IO, e.g.

  • connection polling
  • DB transaction locks
  • reads/writes

synchronization, e.g.

  • waiting for critical section access
  • waiting for tasks in a thread pool

use Wall Clock sampling in async-profiler. This mode also collects samples from threads in the SLEEPING state.

Beware: it affects the performance of the measured process more than sampling only ACTIVE threads does. Why? Because the profiler has more threads to iterate over every sampling interval.

CPU only

In other words, without Wall Clock mode, the profiler samples only JVM threads in the ACTIVE state.
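As a sketch, the two modes differ only in the event you ask async-profiler for (flags as in recent async-profiler releases; older versions ship profiler.sh instead of asprof):

```shell
# CPU mode: samples only threads actually running on a core.
asprof -e cpu -d 30 -f cpu.jfr <pid>

# Wall-clock mode: also samples threads sleeping or blocked on IO and locks,
# at the cost of more profiling overhead.
asprof -e wall -d 30 -f wall.jfr <pid>
```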

In the examples above, you can see the hot method httpClient.newCall(request).execute():

  • Time spent measured with CPU Time is 120 ms
  • Time spent measured with Wall Clock (Total Time) is 9170 ms

Would you optimize CPU time in this case? Probably not. You’d first focus on IO as it’s the bottleneck.

What to optimize needs to be chosen case by case. To make that choice, you need both CPU Time and Total Time in the flamegraph – and that is available only when using Wall Clock sampling.

Resources

Mikolaj Grzaslewicz

Performance explainer. You can hire me to help you and your developer team improve your product's performance. Passionate, highly experienced Java/Kotlin engineer. Highlights: JVM (Java/Kotlin) performance, website performance, frequent deployment, solving the right problem (are you sure microservices will help you? :-) ), and code quality impacting the cost of mid/long-term project maintenance.