Include summary of data streams in `/search` endpoint responses #1252

jsoriano · 2024-11-20T19:21:09Z

Security is implementing functionality that builds RAG searches over our integrations, to be used with the LLM. For this they need the package names, plus the data stream names and titles. This information is already available in the registry, but data streams are only included in responses of the `/package` endpoint, so one request per package is needed. The information should be updated automatically. As this is intended to run on each deployment, that can mean too many queries for a couple of fields.

For now they are building their own index from the integrations repository, but this is not an option for GA either.

The most direct solution is to include this information in the `/search` responses, the same way policy templates are included. This could be enabled optionally through a parameter, so that current queries are not modified.
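
A minimal sketch of how a client could opt in through a query parameter while leaving existing queries untouched. The parameter name `data_streams` and the helper `buildSearchPath` are assumptions made for illustration; the issue does not specify the final name. The `package` filter is only used as an example of an existing query.

```typescript
// Hypothetical parameter name: the issue only says "a parameter".
const DATA_STREAMS_PARAM = "data_streams";

// Builds the path and query string for a /search request.
// Existing callers that do not pass includeDataStreams get the same
// path as before, so current queries are not modified.
function buildSearchPath(
  params: Record<string, string>,
  includeDataStreams: boolean = false
): string {
  const query: Record<string, string> = { ...params };
  if (includeDataStreams) {
    query[DATA_STREAMS_PARAM] = "true";
  }
  const queryString = Object.keys(query)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(query[key])}`)
    .join("&");
  return queryString ? `/search?${queryString}` : "/search";
}

// Usage: one opt-in request instead of one /package request per package.
const legacyPath = buildSearchPath({ package: "apache" });
const summaryPath = buildSearchPath({ package: "apache" }, true);
```

Keeping the flag off by default means the larger payload is only paid for by consumers that ask for it.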

The content added would be a subset of the content in the `/package` response. For each package, something like this:

    ...
    "data_streams": [
        {
            "type": "logs",
            "dataset": "apache.access",
            "title": "Apache access logs"
        },
        {
            "type": "logs",
            "dataset": "apache.error",
            "title": "Apache error logs"
        },
        {
            "type": "metrics",
            "dataset": "apache.status",
            "title": "Apache status metrics"
        }
    ],
    ...
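
The proposed subset can be sketched as a type plus a projection from a fuller data stream entry. The names `DataStreamSummary`, `FullDataStream` and `summarizeDataStreams` are hypothetical and only mirror the three fields in the example above; the real registry structures may differ.

```typescript
// Hypothetical summary shape, limited to the three fields in the example.
interface DataStreamSummary {
  type: string;
  dataset: string;
  title: string;
}

// Assumed, simplified view of a data stream entry in a per-package
// response: the three summary fields plus arbitrary extra metadata.
interface FullDataStream extends DataStreamSummary {
  [extra: string]: unknown;
}

// Keeps only the summary fields, dropping the rest of the metadata.
function summarizeDataStreams(streams: FullDataStream[]): DataStreamSummary[] {
  return streams.map(({ type, dataset, title }) => ({ type, dataset, title }));
}

// Usage with an entry carrying one extra, illustrative field.
const summaries = summarizeDataStreams([
  { type: "logs", dataset: "apache.access", title: "Apache access logs", release: "ga" },
]);
```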

If more data is needed in the future, maybe we could prepare a "full index" that can be downloaded, but that looks like overkill at this point.

cc @P1llus


P1llus · 2024-11-20T19:22:26Z

This looks exactly like what we are looking for, yes!

jsoriano · 2024-11-20T19:32:53Z

@kpollich I discussed the requirements for this with @P1llus. It will be needed for 8.18.0.

kpollich · 2024-11-20T19:41:01Z

Thanks for flagging, @jsoriano - happy to get this scheduled. Looks like this isn't too much effort to surface this data in the EPR API?

jsoriano · 2024-11-21T09:53:31Z

> Looks like this isn't too much effort to surface this data in the EPR API?

Yes, I think so. I think we can avoid reindexing in the public EPR because this information is already indexed, so it will probably be a change only in package-registry.

P1llus · 2024-11-21T11:06:22Z

@kpollich @jsoriano, do you think we could modify the `PackageClient` in the Fleet code to support this optional format? Ideally it would be behind a flag or something similar, so that it does not require changes to existing code elsewhere.

The reason is that we would like to keep using the `PackageClient`, as some other Kibana plugins do, because we do not have access to things such as whether the user has set a custom EPR endpoint address.

I believe the only change would be an additional flag argument to getPackages() and updating the PackageListItem[] type with the optional fields?
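
A rough sketch of what the flag and the optional fields could look like. All types here are simplified, hypothetical stand-ins and not Fleet's actual definitions; the option name `includeDataStreams` and the helper `toPackageListItems` are assumptions. The real `getPackages()` is asynchronous, so only the mapping step is shown.

```typescript
// Hypothetical summary shape, mirroring the fields proposed in this issue.
interface DataStreamSummary {
  type: string;
  dataset: string;
  title: string;
}

// Simplified stand-in for an entry of a registry search response.
interface RegistrySearchResult {
  name: string;
  title: string;
  version: string;
  data_streams?: DataStreamSummary[];
}

// Simplified stand-in for PackageListItem with the new optional field.
interface PackageListItem {
  name: string;
  title: string;
  version: string;
  data_streams?: DataStreamSummary[];
}

// Hypothetical options object carrying the extra flag.
interface GetPackagesOptions {
  includeDataStreams?: boolean;
}

// Maps registry results to list items. Data streams are attached only
// when the caller opts in, so existing callers see unchanged items.
function toPackageListItems(
  results: RegistrySearchResult[],
  options: GetPackagesOptions = {}
): PackageListItem[] {
  return results.map((result) => {
    const item: PackageListItem = {
      name: result.name,
      title: result.title,
      version: result.version,
    };
    if (options.includeDataStreams && result.data_streams) {
      item.data_streams = result.data_streams;
    }
    return item;
  });
}
```

Because the field is optional on the type, existing consumers of the list items would not need changes.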

kpollich · 2024-11-21T14:05:48Z

> I believe the only change would be an additional flag argument to getPackages() and updating the PackageListItem[] type with the optional fields?

I think these changes are necessary but we also might need to update the flow where installed packages are loaded from Elasticsearch instead of EPR to store + fetch the same data stream data. cc @nchaulet to keep me honest.

nchaulet · 2024-11-21T15:28:50Z

Yes, I think making a change to the `PackageClient` seems good. There will probably be a special case there for uploaded packages loaded from ES.

Could the response size be an issue? The search endpoint is not paginated, and having all data streams of all packages will probably create a huge response. What will happen when we add more packages?

P1llus · 2024-11-25T15:07:15Z

I was hoping that this would not be the default behavior for now, so that fewer changes to the current behavior are needed, and we could work on the performance side of it moving forward. I think pagination would be good.

We are not including ALL the metadata from the data streams, so I believe it is only a few extra lines per data stream. That is still some extra, but not as much as the rest of the metadata would be.

jsoriano · 2024-12-03T11:52:44Z

> Could the response size be an issue? The search endpoint is not paginated, and having all data streams of all packages will probably create a huge response. What will happen when we add more packages?

> I think pagination would be good.

Support for pagination in the `/search` endpoint would make sense; as it is designed now, responses are only going to grow. I created issue #1256 for this.

kpollich added the Team:Ecosystem label (Packages Ecosystem team) on Dec 13, 2024
