Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17023: Use Modern NLP Models via ONNX and Apache OpenNLP with Solr #1999

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

epugh
Copy link
Contributor

@epugh epugh commented Oct 10, 2023

https://issues.apache.org/jira/browse/SOLR-17023

Description

Code and BATS test that demonstrates downloading a sentiment classification model from Huggingface, converting it to Onnx model, uploading the model to Solr's FileStore, and then configuring a collection and associated pipeline to use the model for enriching content.

Solution

New DocumentCategorizerUpdateProcessorFactory that interacts with the model.

Tests

BATS test. Check this PR out and run: ./gradlew integrationTests --tests test_opennlp.bats

Checklist

Please review the following and check all that apply:

  • [x ] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • [x ] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [ x] I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide
  • Look at GPU support
  • Integrate with Films demo.
  • Can we have BATS tests that depend on external resources, or do I check in the ONNX model instead? (make a stupid one???)
  • Write a unit test?
  • Provide both a label but ALSO a score!

@epugh epugh marked this pull request as draft October 10, 2023 20:08
@epugh
Copy link
Contributor Author

epugh commented Oct 10, 2023

Check it out @jzonthemtn

Comment on lines 22 to 28
configurations.all {
resolutionStrategy {
force 'org.apache.opennlp:opennlp-tools:2.2.0'
force 'org.apache.opennlp:opennlp-dl:2.2.0'
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need for force a resolution? This shouldn't be necessary at all...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need apache/lucene#448 to be merged to get us to 2.2, or we ahve to work around Lucene... Is there a better way? This is what Jeff and I hacked in..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try org.apache.opennlp:opennlp*=2.2.0 in versions.props and remove this from here. It should force the upgrade to 2.2.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of errors when I took out the resolutionStrategy:

* What went wrong:
Could not determine the dependencies of task ':solr:modules:analysis-extras:compileJava'.
> Could not resolve all task dependencies for configuration ':solr:modules:analysis-extras:compileClasspath'.
   > Could not resolve org.apache.opennlp:opennlp-tools:2.2.0.
     Required by:
         project :solr:modules:analysis-extras
         project :solr:modules:analysis-extras > org.apache.opennlp:opennlp-dl:2.2.0
      > Cannot find a version of 'org.apache.opennlp:opennlp-tools' that satisfies the version constraints:
           Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.opennlp:opennlp-tools:2.2.0'
           Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:core:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
           Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:solrj:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
           Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.lucene:lucene-analysis-opennlp:9.8.0' (compile) --> 'org.apache.opennlp:opennlp-tools:1.9.1'

also, feel free to push up changes to this branch ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to run ./gradlew --write-locks to make sure it takes into account versions.props changes.

versions.props Outdated
@@ -70,3 +70,4 @@ org.semver4j:semver4j=5.2.1
org.slf4j:*=2.0.9
org.xerial.snappy:snappy-java=1.1.10.5
software.amazon.awssdk:*=2.20.155
org.apache.opennlp:opennlp-dl=2.2.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try org.apache.opennlp:opennlp*=2.2.0

@epugh
Copy link
Contributor Author

epugh commented Oct 10, 2023

Thanks @risdenk , your suggestions worked. One slight wrinkle that I don't know how to deal with is that the com.microsoft.onnxruntime:onnxruntime_gpu:1.14.0 file doesn't work on OSX, only linux and windows. So if you look at the BATS test, you'll see some code where I replace it with com.microsoft.onnxruntime:onnxruntime:1.14.0. I don't really know how to handle that going forward.. I mean, swapping a magic jar is probably a bad idea ;-).

@epugh
Copy link
Contributor Author

epugh commented Oct 10, 2023

./gradlew integrationTests --tests test_opennlp.bats

@epugh epugh requested a review from risdenk October 10, 2023 21:16
@epugh
Copy link
Contributor Author

epugh commented Oct 10, 2023

> Task :solr:packaging:integrationTests
Running BATS tests with Solr base port 53065
1..1
ok 1 Check lifecycle of sentiment classification in 20000ms

This is after uploading the model to the PackageStore and reference it in there!

versions.props Outdated Show resolved Hide resolved
@test "Check lifecycle of sentiment classification" {

# GPU versions is linux and windows only, not OSX. So swap jars.
rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just always run on cpu? Why need to test if there is a capable GPU or not? Could be on a VM without a GPU too even w/ linux and windows...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the purposes of this, I totally agree.. I think I need some Gradle help to figure this out ;-( I've banged at it with no luck.

Copy link
Contributor

@jzonthemtn jzonthemtn Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just CPU is definitely a better choice here. And just a note, ONNX Runtime does not currently support Mac silicon so this will only work on x64 machines.

OpenNLP brings in the onnxruntime-gpu dependency. I will open a PR in OpenNLP to use onnxruntime instead for all of the stated reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is a workaround, please let me know...!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Release is done. New artifacts should be available soon ;-) (not depending on gpu anymore)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!!

rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar
# restore
#curl --insecure -o ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime-1.14.0.jar https://repo1.maven.org/maven2/com/microsoft/onnxruntime/onnxruntime/1.14.0/onnxruntime-1.14.0.jar
cp /Users/epugh/Documents/projects/solr-epugh/onnxruntime-1.14.0.jar ${SOLR_TIP}/modules/analysis-extras/lib/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using your path...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah... I was in the back of David Smiley's car in the wilds of New Brunswick Canada while I was working on this and didn't want to burn up my data plan ;-). Likewise I need to figure out a better apporoach with the model since it's 600 mb.... I don't know if I can find a toy one that is super small, or if I add docs to download it from hugging facet etc. Longer term, I doin't even think this bats test gets committed the way it is because of all the external dependencies.. Though.. if I had a super small model that I could check into Git, maybe that owuld be okay? Like we have a binary package for some of the package tests..

@epugh
Copy link
Contributor Author

epugh commented Oct 11, 2023

@risdenk @jzonthemtn so, any ideas how to force the onnxruntime instead of the onnxruntime_gpu? I regenerated the licenses, and noticed that it created a "onnxruntime_gpu-1.14.0.jar.sha1".

@epugh
Copy link
Contributor Author

epugh commented Oct 11, 2023

also, i am seeing two errors on looking up licenses...

Execution failed for task ':solr:modules:analysis-extras:validateJarLicenses'.
> Certain license/ notice files are missing:
    - License file missing ('com.microsoft.onnxruntime:onnxruntime_gpu:1.14.0'), expected it at: /Users/epugh/Documents/projects/solr-epugh/solr/licenses/onnxruntime_gpu-LICENSE-[type].txt, where [type] can be any of [ASL, BSD, BSD_LIKE, CDDL, CPL, EPL, MIT, MPL, PD, SUN, COMPOUND].
    - License file missing ('org.apache.opennlp:opennlp-dl:2.2.0'), expected it at: /Users/epugh/Documents/projects/solr-epugh/solr/licenses/opennlp-dl-LICENSE-[type].txt or /Users/epugh/Documents/projects/solr-epugh/solr/licenses/opennlp-LICENSE-[type].txt, where [type] can be any of [ASL, BSD, BSD_LIKE, CDDL, CPL, EPL, MIT, MPL, PD, SUN, COMPOUND].


@epugh
Copy link
Contributor Author

epugh commented Oct 11, 2023

@jzonthemtn did open https://issues.apache.org/jira/browse/OPENNLP-1515, but that will require a release of OpenNLP before we can piggy back on it

@epugh
Copy link
Contributor Author

epugh commented Oct 11, 2023

I think the first draft is done, and it "works". There are some caveats and things to be figured out, like the onnxruntime_gpu versus onnxruntime issue. Based on my work so far, we definitly need tutorials, there are too many moving pieces. The PackageStore is great, at least as far as I've worked with it, but it does expect to load the data in memory as it moves it around, so we need to figure that out. Right now we only get sentiment labels back, and we really need sentiment scores, i.e a number. I don't know if that can be a 0 to 1, of if it has to be 1,2,3,4,5 labels. I'd like to experiment with the saem thing as a copyField into a text analyzer, so you copy "review" into "review_sentiment_score" and get a number there. Or, do we have a streaming expression you run that updates the review_sentiment_score for all docs that don't have it with an atomic update?

@epugh epugh changed the title SOLR-17023: Use Modern NLP Models from Apache OpenNLP with Solr SOLR-17023: Use Modern NLP Models via ONNX and Apache OpenNLP with Solr Oct 26, 2023
@cpoerschke
Copy link
Contributor

Looking at these build failures and the error message generated, it appears that it may be caused by us using Java 11 and OpenNLP being compiled with Java 17?? ... Is this a deal breaker for this PR?

Hmm, interesting. So we have:

So if the classes here were built as an independent plugin (with minimum Java17) and then deployed (with the relevant dependencies) into a Solr setup running Java17 with the original Solr artefacts (built with Java11) -- I wonder if that would work?

Also noting that apache/lucene#579 bumped Lucene to Java17 on main branch i.e. presumably then a future Lucene10 will be minimum Java17 version.

@rzo1
Copy link

rzo1 commented Jan 18, 2024

So if the classes here were built as an independent plugin (with minimum Java17) and then deployed (with the relevant dependencies) into a Solr setup running Java17 with the original Solr artefacts (built with Java11) -- I wonder if that would work?

I guess this should work.

@epugh
Copy link
Contributor Author

epugh commented Jan 18, 2024

Feels like what we should be doing is having Solr 10 target Java 17 since Lucene 10 will require it, and then this code goes on Solr 10, but not on Solr 9. This lets us have some more time to experiment with out dealing with the headaches of supporting an official release in the 9.x line (backcompat and the rest)....??

@cpoerschke
Copy link
Contributor

Feels like what we should be doing is having Solr 10 target Java 17 since Lucene 10 will require it, and then this code goes on Solr 10, but not on Solr 9. This lets us have some more time to experiment with out dealing with the headaches of supporting an official release in the 9.x line (backcompat and the rest)....??

I concur. Also a nice motivation for targeting Java 17 i.e. specific example of functionality that it would unlock. And in the meantime "independent plugin" approaches remain a possibility in the community, perhaps even in the https://github.com/apache/solr-sandbox if someone wanted to pursue that (haven't checked how that is built, just kinda "name dropping" solr-sandbox here).

@epugh
Copy link
Contributor Author

epugh commented Feb 13, 2024

Lucene 9 requires older version of Java than the minimum required version that OpenNLP requires. That means that this PR is pending a release of Lucene 10, and the adoption of Lucene 10 by Solr. It would be interesting to think about if there was a way for Solr main branch to somehow depend on the Lucene main branch release that jumps the minimum Java versions all around, and would allow this PR to be merged.

Resolved Conflicts:
	versions.lock
	versions.props
@github-actions github-actions bot added dependencies Dependency upgrades tool:build labels Feb 13, 2024
@@ -49,6 +49,7 @@ org.apache.httpcomponents:httpmime=4.5.14
org.apache.kerby:*=1.0.1
org.apache.logging.log4j:*=2.21.0
org.apache.lucene:*=9.9.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporarily within this pull request (pre-merge) we could change this to a Lucene 10 prerelease based on the "Update Lucene prerelease" steps in https://github.com/apache/solr/blob/main/help/dependencies.txt ...

... might be worth waiting though until Lucene 9.10 is out and https://issues.apache.org/jira/browse/SOLR-17157 has upgraded Solr to use it i.e. then solr/main will be closer to lucene/main than it is right now.

@cpoerschke
Copy link
Contributor

... It would be interesting to think about if there was a way for Solr main branch to somehow depend on the Lucene main branch release that jumps the minimum Java versions all around, and would allow this PR to be merged.

Technically I guess Solr main could continue to depend on whatever Lucene version and just jumping up the minimum Java version for Solr main to 17 would be sufficient? With all the ups-and-downs of main and branch_9x having different minimums.

@risdenk
Copy link
Contributor

risdenk commented Feb 13, 2024

@cpoerschke #1510 might be helpful here. I have a few wip prs for newer jdks

Copy link

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Apr 17, 2024
Copy link

github-actions bot commented Oct 7, 2024

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

@github-actions github-actions bot added the closed-stale Closed after being stale for 60 days label Oct 7, 2024
@github-actions github-actions bot closed this Oct 7, 2024
@epugh
Copy link
Contributor Author

epugh commented Oct 7, 2024

this is still valid ;-)

@epugh epugh reopened this Oct 7, 2024
@epugh epugh added exempt-stale Prevent a PR from going stale and removed stale PR not updated in 60 days cat:index closed-stale Closed after being stale for 60 days labels Oct 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Dependency upgrades exempt-stale Prevent a PR from going stale tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants