Include summary of data streams in /search endpoint responses #1252

Open
jsoriano opened this issue Nov 20, 2024 · 9 comments
Labels
Team:Ecosystem (label for the Packages Ecosystem team)

Comments

@jsoriano
Member

Security is implementing a feature that builds RAG searches over our integrations for use with the LLM. For this they need the package names, plus the data stream names and titles. This information is already available in the registry, but data streams are only included in responses of the /package endpoint, so one request per package is needed. The information should be updated automatically. As this is intended to run on each deployment, that can mean too many queries for a couple of fields.

For now they are building their own index from the integrations repository, but this is not an option for GA either.

The most direct solution is to include this information in the /search responses, the same way policy templates are included. This could be made opt-in through a parameter, so that current queries are not affected.

The added content would be a subset of what the /package response contains. For each package, something like this:

    ...
    "data_streams": [
        {
            "type": "logs",
            "dataset": "apache.access",
            "title": "Apache access logs"
        },
        {
            "type": "logs",
            "dataset": "apache.error",
            "title": "Apache error logs"
        },
        {
            "type": "metrics",
            "dataset": "apache.status",
            "title": "Apache status metrics"
        }
    ],
    ...

If more data is needed in the future, maybe we could prepare a "full index" that can be downloaded, but that looks like overkill at this point.
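
As a rough illustration of the proposal above, the following TypeScript sketch reduces a full package entry to the three summary fields and attaches them only when the caller opts in. Only the `type`, `dataset` and `title` fields come from the proposal; the type names, the `summarizeDataStreams` and `toSearchItem` helpers, and the `includeDataStreams` flag are hypothetical, and the registry itself may model this differently.

```typescript
// Hypothetical shapes; only `type`, `dataset` and `title` come from the proposal.
interface DataStreamSummary {
  type: string;
  dataset: string;
  title: string;
}

// A full data stream entry carries more metadata than the summary needs.
interface FullDataStream extends DataStreamSummary {
  [extra: string]: unknown;
}

interface FullPackage {
  name: string;
  version: string;
  data_streams?: FullDataStream[];
}

interface SearchItem {
  name: string;
  version: string;
  // Optional, so responses without the opt-in keep their current shape.
  data_streams?: DataStreamSummary[];
}

// Keep only the three summary fields of each data stream.
function summarizeDataStreams(streams: FullDataStream[]): DataStreamSummary[] {
  return streams.map(({ type, dataset, title }) => ({ type, dataset, title }));
}

// Build a search item; data streams are attached only when requested.
function toSearchItem(pkg: FullPackage, includeDataStreams = false): SearchItem {
  const item: SearchItem = { name: pkg.name, version: pkg.version };
  if (includeDataStreams && pkg.data_streams) {
    item.data_streams = summarizeDataStreams(pkg.data_streams);
  }
  return item;
}
```

Making the flag default to `false` mirrors the intent that existing queries keep returning exactly what they return today.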

cc @P1llus

@P1llus
Member

P1llus commented Nov 20, 2024

This looks exactly like what we are looking for, yes!

@jsoriano
Member Author

@kpollich I discussed the requirements for this with @P1llus. It will be needed for 8.18.0.

@kpollich
Member

Thanks for flagging, @jsoriano - happy to get this scheduled. Looks like this isn't too much effort to surface this data in the EPR API?

@jsoriano
Member Author

> Looks like this isn't too much effort to surface this data in the EPR API?

Yes, I think so. I think we can avoid reindexing in the public EPR because this information is already indexed, so it will probably be a change only in package-registry.

@P1llus
Member

P1llus commented Nov 21, 2024

@kpollich @jsoriano Do you think we could modify the PackageClient in the Fleet code to support this optional format? Ideally it would be behind a flag or similar, so that it does not require changes to existing code elsewhere.

The reason is that we would like to keep using the PackageClient, as some other Kibana plugins do, because we do not have access to details such as whether the user has set a custom EPR endpoint address.

I believe the only change would be an additional flag argument to getPackages() and updating the PackageListItem[] type with the optional fields?
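
A minimal sketch of that idea, assuming a client whose listing method takes an options object. This is not the actual Fleet `PackageClient` API: `PackageClientSketch`, `PackageListItemSketch`, `GetPackagesOptions`, the `includeDataStreams` flag, the injected `SearchFn`, and the `data_streams` query parameter name are all assumptions made for illustration.

```typescript
interface DataStreamInfo {
  type: string;
  dataset: string;
  title: string;
}

// Hypothetical list item; the new field is optional so existing consumers are unaffected.
interface PackageListItemSketch {
  name: string;
  title: string;
  data_streams?: DataStreamInfo[];
}

interface GetPackagesOptions {
  includeDataStreams?: boolean;
}

// Stand-in for whatever performs the request to the registry /search endpoint.
type SearchFn = (query: Record<string, string>) => Promise<PackageListItemSketch[]>;

class PackageClientSketch {
  private readonly search: SearchFn;

  constructor(search: SearchFn) {
    this.search = search;
  }

  // Callers that pass no options get the same query as before.
  async getPackages(options: GetPackagesOptions = {}): Promise<PackageListItemSketch[]> {
    const query: Record<string, string> = {};
    if (options.includeDataStreams) {
      query["data_streams"] = "true"; // hypothetical parameter name
    }
    return this.search(query);
  }
}
```

Because the request function is injected, the client can keep resolving the registry address (including a custom one) wherever it already does, and callers only see the extra optional flag.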

@kpollich
Member

> I believe the only change would be an additional flag argument to getPackages() and updating the PackageListItem[] type with the optional fields?

I think these changes are necessary but we also might need to update the flow where installed packages are loaded from Elasticsearch instead of EPR to store + fetch the same data stream data. cc @nchaulet to keep me honest.

@nchaulet
Member

Yes, I think making a change to the PackageClient seems good; there will probably be a special case there for uploaded packages loaded from ES.

Could the response size be an issue? The search endpoint is not paginated, and including all data streams of all packages will probably create a huge response. What will happen when we add more packages?

@P1llus
Member

P1llus commented Nov 25, 2024

I was hoping that this would not be the default behavior for now, so that fewer changes to the current behavior are needed, and we could work on the performance side of it going forward. I think pagination would be good.

We are not including ALL the metadata from the data streams, so I believe it's only a few extra lines per data stream. That is still some overhead, but far less than the full metadata would be.

@jsoriano
Member Author

jsoriano commented Dec 3, 2024

> Could the response size be an issue? The search endpoint is not paginated, and including all data streams of all packages will probably create a huge response. What will happen when we add more packages?

> I think pagination would be good.

Support for pagination in the /search endpoint would make sense; with the current design, responses are only going to grow. I created issue #1256 for this.
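
To make the pagination idea concrete, here is a small offset/limit sketch over an in-memory package list. Nothing in this thread defines how /search pagination would actually work, so the `offset` and `limit` parameters, the `Page` shape, and the `paginate` helper are assumptions for illustration only.

```typescript
// Hypothetical page envelope; not an existing registry response format.
interface Page<T> {
  items: T[];
  total: number;
  // Offset of the next page, or null when this is the last page.
  nextOffset: number | null;
}

function paginate<T>(all: T[], offset: number, limit: number): Page<T> {
  if (!Number.isInteger(offset) || offset < 0) {
    throw new RangeError("offset must be a non-negative integer");
  }
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError("limit must be a positive integer");
  }
  const items = all.slice(offset, offset + limit);
  const end = offset + items.length;
  return {
    items,
    total: all.length,
    nextOffset: end < all.length ? end : null,
  };
}
```

A client would keep requesting pages, passing the returned `nextOffset` as the next `offset`, until it receives `null`. This bounds the size of each response even as packages and their data stream summaries are added.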

@kpollich added the Team:Ecosystem label Dec 13, 2024