
Tutorial for using Asymmetric models #3258

Open
brianf-aws wants to merge 9 commits into main

Conversation

brianf-aws
Contributor

Description

This PR adds the tutorial for generating embeddings with a locally hosted asymmetric model. The tutorial uses a Docker container, to help users take advantage of multi-node clusters with dedicated ML nodes.

Related Issues

Resolves #3255

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

After replicating the local model embeddings, I am able to provide a high-level overview of what the tutorial entails.

Signed-off-by: Brian Flores <[email protected]>
Provides more context to each step

Signed-off-by: Brian Flores <[email protected]>
Expands on the context of the previous commit and improves grammar and structure

Signed-off-by: Brian Flores <[email protected]>
Signed-off-by: Brian Flores <[email protected]>
@brianf-aws brianf-aws marked this pull request as ready for review December 6, 2024 21:36
@brianf-aws
Contributor Author

There is a flaky test in the CI; can I get a retry, please? The failure:

org.opensearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:33403/], URI [/_plugins/_ml/models/ooAcnpMBnp1fQxLKssuf], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"}],"type":"status_exception","reason":"Model cannot be deleted in deploying or deployed state. Try undeploy model first then delete"},"status":400}
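For anyone who hits this locally: the error message itself points at the fix. Using the ml-commons model APIs, the cleanup would roughly be (the model ID below is just the one from the failing test):

```
POST /_plugins/_ml/models/ooAcnpMBnp1fQxLKssuf/_undeploy
DELETE /_plugins/_ml/models/ooAcnpMBnp1fQxLKssuf
```

In the CI's case this is a test-ordering flake rather than something to fix by hand.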


## Step 1: Spin Up a Docker OpenSearch Cluster

To run OpenSearch in a local development environment, you can use Docker and a pre-configured `docker-compose` file.
Collaborator


Is this a requirement? I think steps 2-6 can also be done if you run an OpenSearch cluster locally without Docker.

Contributor Author

@brianf-aws brianf-aws Dec 6, 2024


That's true. I chose a Docker setup since there aren't many tutorials using it. It also made writing the tutorial easier, because I didn't have to register and deploy the model again each time I went back to Docker.
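For readers following the Docker route, a minimal single-node `docker-compose.yml` sketch looks like the following. The image tag, password placeholder, and settings are illustrative, not the tutorial's actual file, which may define multiple nodes with dedicated ML node roles:

```yaml
# Illustrative sketch only; the tutorial's actual compose file may differ.
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.18.0
    environment:
      - discovery.type=single-node
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=<your-strong-password>
    ports:
      - "9200:9200"
```

Running `docker compose up -d` then brings the cluster up on `localhost:9200`.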

@mingshl
Collaborator

mingshl commented Dec 6, 2024

Can you please add configuring the k-NN index using ML inference ingest processors, and also search using ML inference request processors? That way we can give a full tutorial on how to use this model during ingest and search.
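For what that could look like: ml-commons does provide an `ml_inference` ingest processor. A rough sketch follows; the pipeline name, field names, and output JSON path are illustrative assumptions, not taken from this PR:

```json
PUT /_ingest/pipeline/asymmetric-embedding-pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "<model ID from registration>",
        "input_map": [
          { "text_docs": "passage_text" }
        ],
        "output_map": [
          { "passage_embedding": "$.inference_results[0].output[0].data" }
        ]
      }
    }
  ]
}
```

A matching `ml_inference` search request processor could then rewrite queries at search time.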


### b. Zip the Model Files

In order to upload the model to OpenSearch, you must zip the necessary model files (`model.onnx`, `sentencepiece.bpe.model`, and `tokenizer.json`). The `model.onnx` file is located in the `onnx` directory of the cloned repository.
Collaborator


Is ONNX the only format provided for this model? Can we add a PyTorch format tutorial here?

Contributor Author


Hey Zane! I can see why you ask. I'll clarify that this is only for ONNX; I haven't used PyTorch models, which is why I wrote it that way.

Collaborator


Got it. My suggestion is that we add both ONNX and PyTorch cases so that users can choose between them based on their use case.

@brianf-aws
Contributor Author

In order for this tutorial to work for all users following it, the following PR has to be merged to avoid the MLInput being null:
#3281

Copy link
Contributor

@kolchfa-aws kolchfa-aws left a comment


Some suggestions to clarify the text. In general, use sentence case capitalization and refer to the user as "you". Thanks!

This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch, which you can then use to run semantic search. The setup runs in a Docker container; the example model used in this tutorial is the multilingual
`intfloat/multilingual-e5-small` model from Hugging Face.
You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
Contributor


Suggested change
You will learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
In this tutorial, you'll learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.
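To sketch the registration step the excerpt refers to, a local ONNX model register request might look roughly like the following. The URL, hash placeholder, and `model_config` values are illustrative assumptions rather than the tutorial's exact request; e5 models expect "query: "/"passage: " prefixes on input text:

```json
POST /_plugins/_ml/models/_register
{
  "name": "intfloat/multilingual-e5-small",
  "version": "1.0.0",
  "model_format": "ONNX",
  "function_name": "TEXT_EMBEDDING",
  "model_content_hash_value": "<SHA-256 of the zip>",
  "url": "http://localhost:8080/model.zip",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers",
    "query_prefix": "query: ",
    "passage_prefix": "passage: "
  }
}
```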

Signed-off-by: Brian Flores <[email protected]>

To download the model, use the following steps:

1. Install Git Large File Storage (LFS) if you haven’t already:
Collaborator


Seems like we should release this model from our pre-trained model repository so that customers can easily register it?

Contributor Author


What is the process to introduce the model (to the model repository)?

Collaborator


We can discuss this more offline. But this is an example of the model tracing workflow from opensearch-py-ml.

Signed-off-by: Brian Flores <[email protected]>
---

## Step 1: Spin up a Docker OpenSearch cluster
## Step 1: Start OpenSearch locally
Collaborator


Do we care whether the customer starts OpenSearch locally or with a public IP? In our blueprint example, we set up OpenSearch locally.

Contributor Author


Ideally locally, because I have a step where I have them use Python to serve the model, which runs on localhost:8080.

Collaborator


If I have OpenSearch running at, let's say, 8.8.8.8, and I then use 8.8.8.8:8080, won't that work?

Contributor Author


I'm not an expert, but if two processes run on the same address (like localhost) on different ports, it should be fine. Also, the Python server is a one-time step; it's just a means of getting OpenSearch to download the model.

Collaborator


It should work. If we set up an OpenSearch cluster on an EC2 host, we can access it through the public URL and port. So I would suggest rephrasing the heading to "Start OpenSearch", and in line 22 we can say:

Run OpenSearch and ensure the following steps are completed. In this example, we set up the OpenSearch cluster locally.

Contributor Author


Makes sense, thank you for the suggestion (addressed in commit b11cbe1).
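The one-off Python file server discussed in this thread can be sketched as follows. It assumes `model.zip` sits in the current directory; the port and purpose come from the discussion above:

```shell
# Serve the current directory over HTTP on localhost:8080 so OpenSearch can
# download the model zip from http://localhost:8080/model.zip during
# registration. Run from the directory containing model.zip.
python3 -m http.server 8080 &
SERVER_PID=$!
# ...register the model in OpenSearch, then stop the server:
# kill "$SERVER_PID"
```

Once registration completes, the server is no longer needed.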


### b. Zip the model files

To upload the model to OpenSearch, you must zip the necessary model files (`model.onnx`, `sentencepiece.bpe.model`, and `tokenizer.json`). The `model.onnx` file is located in the `onnx` directory of the cloned repository.
Collaborator


All three of these files are located in the `onnx` directory, not just `model.onnx`.

Contributor Author


Addressed in latest commit! c43e9a5
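The zipping step discussed above can be sketched like this. Since the `zip` CLI is not always installed, Python's stdlib `zipfile` command-line interface works as a substitute; the `touch` line only creates empty placeholders for illustration, whereas in practice you would use the real files from the cloned repository's `onnx` directory:

```shell
# Placeholder files for illustration only; replace with the real model files.
touch model.onnx sentencepiece.bpe.model tokenizer.json

# Package the three files into model.zip for registration
# (equivalent to: zip model.zip model.onnx sentencepiece.bpe.model tokenizer.json)
python3 -m zipfile -c model.zip model.onnx sentencepiece.bpe.model tokenizer.json
```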

git lfs install
```

2. Clone the model repository:
Collaborator


Rather than installing Git LFS or cloning the whole repository, maybe we can do this instead?

Create a directory for the files:

mkdir multilingual-e5-model
cd multilingual-e5-model

Download the files using `wget` or `curl`, with the raw file links from the Hugging Face repository:

wget https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/onnx/model.onnx
wget https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/tokenizer.json
wget https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/sentencepiece.bpe.model

Alternatively, use curl:

curl -O https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/onnx/model.onnx
curl -O https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/tokenizer.json
curl -O https://huggingface.co/intfloat/multilingual-e5-small/resolve/main/sentencepiece.bpe.model

Contributor Author


I hear you on this, but let's suppose there is a new update to the model. The user would have to delete the files and make separate requests again. I also think separate requests could cause issues, since these are big files users are working with. It's also possible that the endpoint changes or someone moves directories during a cleanup; we'd lock ourselves into maintaining the tutorial against their changes.

Collaborator

@dhrubo-os dhrubo-os Dec 27, 2024


> lets suppose there is a new update on the model. user would have to delete and make a separate request.

Yeah, if they want to use the updated model, don't they need to do the same in your case? They'd need to pull the updates from Git.

> its possible the endpoint can change or someone changes directories for a clean up. We lock ourselves into maintaining a tutorial for their changes

I think this is true for both cases: if they change the file name, we need to update the doc either way, right?

My thought process is that, this way:

  1. The customer doesn't need to install Git and Git LFS.
  2. The customer doesn't need to download unnecessary packages that they no longer need.

This is not a blocker, but a suggestion.


Successfully merging this pull request may close these issues.

[Documentation] Tutorial for using Asymmetric models
6 participants