Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][client] Add schema cache to improve performance #23808

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

yunmaoQu
Copy link

@yunmaoQu yunmaoQu commented Jan 3, 2025

close #23707

Motivation

Schema creation (e.g., Schema.AVRO(SomeClass.class)) is fairly CPU intensive. It would be useful it there would be a weak reference cache for caching the schema instance.

Modifications

  • Add SchemaCache implementation using WeakHashMap for schema instance caching
  • Add cache configuration and metrics for monitoring
  • Add cleanup strategy for expired cache entries
  • Modify Schema creation methods (AVRO/JSON/PROTOBUF) to use cache
  • Add cloning mechanism to maintain schema immutability

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

N/A

- Add SchemaCache implementation
- Add cache configuration and metrics
- Add cleanup strategy
- Modify Schema creation methods
- Add unit tests and performance tests

This closes apache#23707
Copy link

github-actions bot commented Jan 3, 2025

@yunmaoQu Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jan 3, 2025
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several inconsistencies in this PR. For example, the class names and the class file names don't match. Please test this PR in your own fork first to ensure that it passes tests.
It seems that this PR contains a lot of features related to the schema caching. Instead of adding a lot of features, it would be better to keep the implementation to the minimum.
I'm surprised by competing implementations for implementing the schema cache. There's currently already an open PR #23777.

@yunmaoQu
Copy link
Author

yunmaoQu commented Jan 3, 2025

ok,i test all and could you review it and give me some suggestions

@lhotari
Copy link
Member

lhotari commented Jan 3, 2025

ok,i test all and could you review it and give me some suggestions

Instead of adding more code to test everything, please reduce to a minimal implementation. This means to remove features to track cache metrics. That's not something that is needed. For the cache implementation, I'd suggest using a ConcurrentMap created with Guava's MapMaker. Instead of adding yet another abstraction, I'd suggest modifying the PulsarClientImplementationBinding interface and adding a new interface method <T extends com.google.protobuf.GeneratedMessageV3> Schema<T> newProtobufSchema(Class<T> clazz). Then we could keep the cache as an implementation level detail.

example of minimal implementation for newProtobufSchema using Guava's MapMaker with weak keys:

    private static final ConcurrentMap<Class<?>, Schema<?>> PROTOBUF_CACHE = new MapMaker().weakKeys().makeMap();

    public <T extends com.google.protobuf.GeneratedMessageV3> Schema<T> newProtobufSchema(Class<T> clazz) {
        return (Schema<T>) PROTOBUF_CACHE.computeIfAbsent(clazz,
                k -> ProtobufSchema.of(SchemaDefinition.builder().withPojo(clazz).build())).clone();
    }

There shouldn't be a need to ever clear the cache since it's bounded by the number of classes with strong references. It won't consume a significant amount of memory in the first place.

@yunmaoQu
Copy link
Author

yunmaoQu commented Jan 3, 2025

OK.Should i implement it based on the pre commit or what?

@lhotari
Copy link
Member

lhotari commented Jan 3, 2025

OK.Should i implement it based on the pre commit or what?

That's something you can decide. Please read my previous message and draw your conclusions.

@walkinggo
Copy link

walkinggo commented Jan 4, 2025

It looks like we're working on similar tasks. I've already created a pull request #23777 to complete this task. Should we work together to finish it, or what do you suggest? @yunmaoQu

@yunmaoQu
Copy link
Author

yunmaoQu commented Jan 4, 2025

Yes. We can work it together.@walkinggo

@yunmaoQu
Copy link
Author

yunmaoQu commented Jan 5, 2025

@lhotari

OK.Should i implement it based on the pre commit or what?

That's something you can decide. Please read my previous message and draw your conclusions.

I implement a minimal version. Could you review it and give me some suggestion. Thanks for your previous guide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs
Projects
None yet
3 participants