-
Notifications
You must be signed in to change notification settings - Fork 674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17525: Text Embedder Query Parser #2809
SOLR-17525: Text Embedder Query Parser #2809
Conversation
I'll keep polishing it and finalise the documentation, but I think it's ready for review! |
Just as a reminder, currently the check fails with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.
solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc
Outdated
Show resolved
Hide resolved
Let me elaborate here: Given that, I am not that familiar with the file store, so if it can help in having a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November) |
I would love to get a walk through on this exciting feature at the next community meetup... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through and mostly commented on the code. Having said that, I don't know that I grok the high level vision, since that's hard to convey with just a PR. I would like to learn more about what langchain4j brings to the party.. Do we see it being used more widely in Solr? To expose more thigns in Solr to Langchain4j? Or just as an api to some embedding models?
solr/modules/llm/src/test-files/solr/collection1/conf/stopwords.txt
Outdated
Show resolved
Hide resolved
solr/modules/llm/src/test-files/solr/collection1/conf/synonyms.txt
Outdated
Show resolved
Hide resolved
} | ||
Files.delete(mstore); | ||
} | ||
if (!solrconfig.equals("solrconfig-llm.xml")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be some magic? isn't the way we set things up in a way so that you control which solrconfig you are using? Maybe I don't know the RestTestBase enough, but genrally I just specify what configs etc to use...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned up the tests a bit, before I resolve the comment, take a look if it's any better now!
} | ||
|
||
/** produces a model encoded in json * */ | ||
public static String getModelInJson(String name, String className, String params) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we use different approaches to building JSON elsewhere? One of my goals for Solr code bases is to have more consistency across all the areas... I know that isn't as helpful to the creator of the new code, like in this case, but in two years, when someone else has to come along and understand things, then it really pays off!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any pointer? In the meantime I keep this open and look up myself right now if I can find anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest not using StringBuilder here; it's needlessly verbose. I think IntelliJ will switch it for you.
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
|
||
public class TestLlmBase extends RestTestBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is RestTestBase
our future? @dsmiley didn't you have somewhat of a vision for where we are going with our hierarchy of testing classes? I am intrigued to learn more abut RestTestBase
, it's new to me. However, if it isn't part of our vision for how we handle testing, then I am nervous about depending on it for another new module!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave this comment open for @dsmiley to comment.
I took inspiration from the Learning To Rank module and it made possible with a decent effort to test the embedding model store, but I'm open to refactor it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used a ton so I sympathize with continuing to use it until there's renewed energy in more test renovation. Even then, it would likely stay. It's not deprecated even though its parent class is.
assertJPut(ManagedEmbeddingModelStore.REST_END_POINT, model, "/responseHeader/status==0"); | ||
// success | ||
final String multipleModels = | ||
"[{ name:\"testModel3\", class:\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we do string building for json, which I think is our dominant pattern... personally not having all the .appends
makes it more readable, even with the escaping around the double quotes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm I am missing the point here, can you tell me how you think this can be improved with an example? happy to do it then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
He was thinking-out-loud, not telling you to do differently or even implied. If I may do the same, I'll say I look forward to us using text blocks, now available on main due to Java 21. I suppose you will back-port this to 9.x so I suppose you won't bother using Java 21 features yet.
@@ -242,9 +242,9 @@ client.add(Arrays.asList(d1, d2)); | |||
|
|||
== Query Time | |||
|
|||
Apache Solr provides two query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, and the `vectorSimilarity` query parser. | |||
Apache Solr provides three query parsers that work with dense vector fields, that each support different ways of matching documents based on vector similarity: The `knn` query parser, the `vectorSimilarity` query parser and the `embed` query parser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be thinking about adding some call out text of "Here is when you use knn, here is when you use vectorSimilarity, and here is when you use embed". I'd personally like to see our ref guide move beyond being just a reference to look things up and also explain more the "why".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, take a look and if enough feel free to resolve this conversation
solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/schema/DenseVectorField.java
Outdated
Show resolved
Hide resolved
solr/modules/llm/src/java/org/apache/solr/llm/store/package-info.java
Outdated
Show resolved
Hide resolved
solr/modules/llm/src/java/org/apache/solr/llm/store/rest/package-info.java
Outdated
Show resolved
Hide resolved
Our codebase is vast; I sympathize with almost nobody knowing about the file-store (I'd count me, @noblepaul , @chatman as knowing). As a project maintainer, I'm just concerned about growing duplicative mechanisms. Heck, maybe the blob-store is yet another, albeit we've identified that one as one to go. An early design review might have elucidated such an option before your sunk cost here Alessandro. @cpoerschke I tagged you for your review because I believe you have a good familiarity of LTR for comparison to its model store to see if you also think it makes sense to clone that one, but you might not know about the file store that I am proposing for this use-case. |
Thanks all for the feedback, in order:
|
When you look at the model store, it would be great to hear where it comes up short and we could look at improving it. What would make it an "easy" decision to adopt!?? |
I'm taking a look at the file-store from a code perspective and it's harder than expected, I'll keep digging but @dsmiley do you have any code references or tests that show the usage of this component? Are we talking about: org.apache.solr.filestore ? |
I was able to find: org.apache.solr.filestore.DistribFileStore -> implements FileStore it's not clear how they differ and there's no documentation I could easily find. Anyway, I'm progressing this route and I should be able to have additional insights tomorrow.
N.B. these are just pragmatic observations to improve it, no offence to whoever designed and implemented this (I have not even checked) |
As I was progressing on the route of the FileStore, a thought came to my mind: The embeddingModelStore currently handles the instantiation part, so when you access a model from a query parser or an update request processor (next on my to-do list), you get the object, ready to be used from the store, with no need to instantiate the object again (that could be expensive). From what I'm seeing If I understood correctly, the FileStore will only store the configuration file for the model, so can easily access it but if we want the singleton mechanism for the model object we need to implement it somewhere, if not every time we use the query parser we need to instantiate the model, from the Json stored in the FileStore. So... I'm not convinced anymore that I should spend effort in that direction for this specific use case, using the fileStore and deleting the embedding model store was seducing, but if I need to implement an additional mechanism to handle models to be singleton on top of it, I don't see much benefit especially from development time perspective. Please correct me if I missed something or if you believe the current implementation won't work in certain scenarios. Don't take it as provocative, it's just a genuine perspective of someone with limited time to dedicate to the project, if I was paid full-time on this I wouldn't have any problem in pursuing nice to haves. |
My partial feedback after 3-4 hours working on that:
This is a genuine not provocative feedback from someone with a decent experience with Solr and Solr codebase, I assume that a new developer would struggle even more in using it? |
solr/modules/llm/src/java/org/apache/solr/llm/search/TextEmbedderQParserPlugin.java
Outdated
Show resolved
Hide resolved
Co-authored-by: David Smiley <[email protected]>
Co-authored-by: David Smiley <[email protected]>
Co-authored-by: David Smiley <[email protected]>
Co-authored-by: David Smiley <[email protected]>
f0e1d26
to
5148c68
Compare
Thank you very much for all your help! @dsmiley @malliaridis @cpoerschke and @epugh! I'll create another Pull Request for the 9.x backport as it involves some gradle stuff, I would love to get some help from Christos on that :) |
* added support to store Large Language Models external endpoints config for text to vector encoding * added a query parser that can use such external models to encode text to vector and run a KNearest Neighbour search --------- Co-authored-by: Christine Poerschke <[email protected]> Co-authored-by: Christos Malliaridis <[email protected]> Co-authored-by: David Smiley <[email protected]> (cherry picked from commit e8803e6)
* SOLR-17525: Text Embedder Query Parser (#2809) * added support to store Large Language Models external endpoints config for text to vector encoding * added a query parser that can use such external models to encode text to vector and run a KNearest Neighbour search --------- Co-authored-by: Christine Poerschke <[email protected]> Co-authored-by: Christos Malliaridis <[email protected]> Co-authored-by: David Smiley <[email protected]> (cherry picked from commit e8803e6)
* SOLR-17525: Text Embedder Query Parser (#2809) * added support to store Large Language Models external endpoints config for text to vector encoding * added a query parser that can use such external models to encode text to vector and run a KNearest Neighbour search --------- Co-authored-by: Christine Poerschke <[email protected]> Co-authored-by: Christos Malliaridis <[email protected]> Co-authored-by: David Smiley <[email protected]> (cherry picked from commit e8803e6)
https://issues.apache.org/jira/browse/SOLR-17525
Description
Scope of this issue is to integrate a new module able to use LLM (through managed services) to enhance aspect of Apache Solr.
Specifically this first Pull Request relates to handle embedding models and automatic text vectorisation in Solr.
Solution
The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
The are several aspects I would like feedback on:
To do that I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
It works but I have no idea if it's acceptable or the best way to do it
Tests
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.