
GrowThePie API pulls #1157

Open · chuxinh wants to merge 4 commits into main

Conversation

chuxinh (Collaborator) commented Dec 12, 2024

Description

Migrating GTP utils to BigQuery as part of the Superchain Health Dashboard Pipeline (#1093).

GrowThePie API documentation: here

Currently getting:

  • Fundamentals
  • Chain Metadata

Questions

  • Do we currently also pull the contract labelling? If not, we can follow up with them about getting an API endpoint.
  • The data pull currently gets everything in one go; is there any way to make it more streamlined, e.g. only fetching the latest results incrementally?
  • How should we run and test it? Trying it out in a notebook for now.

lithium323 (Collaborator) commented:

Thanks for this @chuxinh, this is fantastic! I'll defer to @MSilb7 for the question around contract labeling.

> The data pull currently gets everything in one go; is there any way to make it more streamlined, e.g. only fetching the latest results incrementally?

I guess this depends on the API and whether it has a filter parameter; I'll take a closer look. For defillama we do have to pull all of the history, but to save on memory usage and write time we only write out the most recent 7 days.

> How should we run and test it? Trying it out in a notebook for now.

The best way to test it end-to-end is to run the CLI command. When I run the CLI command for testing, I manually change the write location to DataLocation.LOCAL so the results are written to my local filesystem; here is where I change that:

    def write(
        self,
        dataframe: pl.DataFrame,
        sort_by: list[str] | None = None,
    ):
        return write_daily_data(
            root_path=self.root_path,
            dataframe=dataframe,
            sort_by=sort_by,
            # Override the location value here. To write to the local file system
            # use DataLocation.LOCAL
            location=DataLocation.GCS,
        )

That said, I think this is not a very robust approach, so what I'm going to do is automatically set the location to local when the CLI detects that it is not running from GitHub Actions or Kubernetes. 99% of the time that is what we want to do anyway. I'll get back to you here with that change.
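
A minimal sketch of what that detection could look like (the function name is hypothetical; it relies on the standard GITHUB_ACTIONS and KUBERNETES_SERVICE_HOST environment variables, and DataLocation is the enum from the snippet above):

    import os

    # Hypothetical sketch: GitHub Actions sets GITHUB_ACTIONS=true and Kubernetes
    # injects KUBERNETES_SERVICE_HOST into pods, so the absence of both suggests
    # a local run. DataLocation is the repo's enum used in write() above.
    def default_write_location():
        running_in_ci = (
            os.environ.get("GITHUB_ACTIONS") == "true"
            or "KUBERNETES_SERVICE_HOST" in os.environ
        )
        return DataLocation.GCS if running_in_ci else DataLocation.LOCAL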

Will take a closer look at the PR and leave comments there.

chuxinh (Collaborator, Author) commented Dec 13, 2024

Thanks @lithium323!

> I guess this depends on the API and whether it has a filter parameter; I'll take a closer look. For defillama we do have to pull all of the history, but to save on memory usage and write time we only write out the most recent 7 days.

I don't think they have any filter there; so far it's just access to their JSON files, but you can check it out here.

I also made some changes so the data pull runs locally. I came across some issues with the partitioning, which requires a dt column, but everything else seems to work:

    summary_df = summary_df.rename({"date": "dt"})

lithium323 (Collaborator) commented on the diff:

    GrowThePie.FUNDAMENTALS_SUMMARY.write(
        dataframe=summary_df,

Suggested change:

    -   dataframe=summary_df,
    +   dataframe=most_recent_dates(summary_df, n_dates=FUNDAMENTALS_LAST_N_DAYS),

Here we could write only the most recent data (to avoid many unnecessary writes), where FUNDAMENTALS_LAST_N_DAYS could be just the last 7 days.
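
A minimal sketch of what such a helper could look like in polars (the actual most_recent_dates in the repo may differ; this assumes a dt column holding the partition date):

    import polars as pl

    # Hypothetical sketch of the most_recent_dates helper suggested above: keep
    # only rows whose "dt" falls within the n_dates most recent distinct dates.
    def most_recent_dates(df: pl.DataFrame, n_dates: int) -> pl.DataFrame:
        recent = df.get_column("dt").unique().sort(descending=True).head(n_dates)
        return df.filter(pl.col("dt").is_in(recent))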

lithium323 (Collaborator) commented:

> I don't think they have any filter there; so far it's just access to their JSON files, but you can check it out here.

I saw they have a non-full endpoint, limited to 365 days, but it does not include Ethereum, so maybe we don't want to use it. The full endpoint is not too bad; it returns quite quickly. To avoid writing out too many dataframes each time, we could filter the results to the last N days before writing; we do that for defillama as well.

I'll run the CLI locally to reproduce the partition issues without dt and will get back to you. Since you'll be OOO, I will also look into merging the PR and running it in GitHub Actions.
