Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add package caching #6

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Add package caching #6

wants to merge 4 commits into from

Conversation

me-and
Copy link
Contributor

@me-and me-and commented Jan 9, 2023

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum for speed, since we're just checking for unexpected changes in the cache, and we don't need cryptographic security.

I was originally working on this for my own builds, e.g. at https://github.com/me-and/Cygwin-inih/blob/fb7534276feeb2a5217df5013c679a370ee6013e/.github/workflows/ci.yml, but it occurred to me that this might be more useful here than in my own repos.

To speed up Cygwin installations, allow users to store the downloaded
packages using GitHub's caching, rather than needing the packages to be
downloaded from the Cygwin mirrors on every run.

This speeds up the installation when there's a good cache hit, at the
expense of adding some additional complexity, and using the limited
cache space (although with 10GB assigned to every repository on GitHub,
I can't imagine most folk are getting close to the limit!).

To avoid unnecessary cache churn, the new steps check if the cache has
actually meaningfully changed before storing a new cache.  This uses
b2sum for speed, since we're just checking for unexpected changes in the
cache, and we don't need cryptographic security.
@me-and me-and marked this pull request as ready for review January 9, 2023 23:47
README.md Outdated
@@ -26,6 +26,7 @@ Parameters
| site | http://mirrors.kernel.org/sourceware/cygwin/ | Mirror site to install from
| check-sig | true | Whether to check the setup.ini signature
| add-to-path | true | Whether to add Cygwin's `/bin` directory to the system `PATH`
| package-cache-key | '' | The key to use for caching downloaded packages.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning behind making this a string, rather than a boolean? I think that the cache-keys are scoped to the repository, so the only possible collision is with other cache action uses in the same workflow?

It would be nice to be able to make this on by default in future, without the user having to choose a key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The store of caches is per-repository, and the actual restore behaviour is scoped even more tightly than that, with restrictions about what caches are allowed to be restored on what branch. But yes, this is something I debated; I went with letting the user decide mostly so they wouldn't be surprised by caches appearing with unexpected names.

Having been playing around with this a bit more, though, I think I'm going to suggest a different approach: hard-code the cache key name, but make the behaviour configurable with a cache-behaviour setting or similar, that takes values of disabled (get current cache-free behaviour), enabled (full new behaviour), or saveonly or restoreonly to do the obvious things. The latter are useful when – as with a lot of my cygport jobs – you need to run the action multiple times in the repository.

(As we discussed in #5, the ideal solution might be to swap from a composite action to a JS one, since that would allow post-run steps, which could be responsible for saving the cache after it has been used several times, but that's obviously a much bigger job!)

@jon-turney
Copy link
Member

Thanks! This is awesome!

I vagely recall trying to do this initially but running into some problem, so thanks for adding this!

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

Some numbers showing that restoring the cache is actually typically faster than fetching the packages again would help give me confidence that this is a good change.

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

Unfortunately, setup allows it's package cache to grow without bound. Something which has needed fixing for literal decades...

I don't think that's blocking for this PR, since, yes, the likelhood of hitting the cache limit is small, but something you should be aware of.

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum.

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

@me-and
Copy link
Contributor Author

me-and commented Jan 10, 2023

Thanks! This is awesome!

I vagely recall trying to do this initially but running into some problem, so thanks for adding this!

np! This is about my fifth attempt at getting it working without getting stuck and giving up in disgust…

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

Some numbers showing that restoring the cache is actually typically faster than fetching the packages again would help give me confidence that this is a good change.

Yep, very fair. I have some rough anecdata that says it helps, but I've not really stress tested it to prove it makes a difference for installations that are big enough for it to be substantial. I should be able to have a go at that shortly :)

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

Unfortunately, setup allows it's package cache to grow without bound. Something which has needed fixing for literal decades...

I don't think that's blocking for this PR, since, yes, the likelhood of hitting the cache limit is small, but something you should be aware of.

I had been thinking about that. GitHub evicts caches that aren't used for 7+ days, but if a repo were getting rebuilds more regularly than that, the cache would just keep getting larger and larger. I think that's a good reason for it to not be default behaviour for this action (yet), and I should probably note it explicitly in the documentation, but I think it mostly means that anyone using this action might need to manually flush the cache every so often, to force a new smaller cache to be generated.

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum.

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

They're all provided as part of the Git Bash / Git for Windows installation on the GitHub Windows runners. I've not actually checked how this works for self-hosted runners, and I could believe it's not guaranteed to work there. I'll add some notes to that effect…

Add an option to restore a cache without creating a new one, or to
create a cache without restoring an old one.  This is useful in the
scenario where the action is called multiple times in a run, as it
allows the cache to only be restored on the first call, and for a new
cache to only be written on the last call.

This option replaces the `package-cache-key` option; as discussed in cygwin#6
that option isn't very useful.

Additionally, add some more scopes to the cache name.  Without this
change, a second attempt to create a cache in the same action run (e.g.
because one step installs some packages, then another step installs some
more) will fail because of the cache name collision.

Also update the README and tests, although the tests are only getting a
very quick update for this function, as I'm going to do a much more
comprehensive test update separately.
@me-and me-and marked this pull request as draft January 10, 2023 22:46
@me-and me-and force-pushed the cache branch 3 times, most recently from c9be7a5 to 79832e3 Compare January 10, 2023 22:53
This should hopefully show how much of a difference, if any, caching
packages makes.  There's no automated comparison here, the test just
runs one install without a cache, then a second install with a cache,
then a human needs to look at what difference it makes.
me-and added a commit to me-and/cygwin-install-action that referenced this pull request Jan 11, 2023
Add an option to restore a cache without creating a new one, or to
create a cache without restoring an old one.  This is useful in the
scenario where the action is called multiple times in a run, as it
allows the cache to only be restored on the first call, and for a new
cache to only be written on the last call.

This option replaces the `package-cache-key` option; as discussed in cygwin#6
that option isn't very useful.

Additionally, add some more scopes to the cache name.  Without this
change, a second attempt to create a cache in the same action run (e.g.
because one step installs some packages, then another step installs some
more) will fail because of the cache name collision.

Also update the README and tests, although the tests are only getting a
very quick update for this function, as I'm going to do a much more
comprehensive test update separately.
me-and added a commit to me-and/cygwin-install-action that referenced this pull request Jan 11, 2023
Add an option to restore a cache without creating a new one, or to
create a cache without restoring an old one.  This is useful in the
scenario where the action is called multiple times in a run, as it
allows the cache to only be restored on the first call, and for a new
cache to only be written on the last call.

This option replaces the `package-cache-key` option; as discussed in cygwin#6
that option isn't very useful.

Additionally, add some more scopes to the cache name.  Without this
change, a second attempt to create a cache in the same action run (e.g.
because one step installs some packages, then another step installs some
more) will fail because of the cache name collision.

Also update the README and tests, although the tests are only getting a
very quick update for this function, as I'm going to do a much more
comprehensive test update separately.
me-and added a commit to me-and/cygwin-install-action that referenced this pull request Jan 11, 2023
Add an option to restore a cache without creating a new one, or to
create a cache without restoring an old one.  This is useful in the
scenario where the action is called multiple times in a run, as it
allows the cache to only be restored on the first call, and for a new
cache to only be written on the last call.

This option replaces the `package-cache-key` option; as discussed in cygwin#6
that option isn't very useful.

Additionally, add some more scopes to the cache name.  Without this
change, a second attempt to create a cache in the same action run (e.g.
because one step installs some packages, then another step installs some
more) will fail because of the cache name collision.

Also update the README and tests, although the tests are only getting a
very quick update for this function, as I'm going to do a much more
comprehensive test update separately.
me-and added a commit to me-and/cygwin-install-action that referenced this pull request Jan 11, 2023
Add an option to restore a cache without creating a new one, or to
create a cache without restoring an old one.  This is useful in the
scenario where the action is called multiple times in a run, as it
allows the cache to only be restored on the first call, and for a new
cache to only be written on the last call.

This option replaces the `package-cache-key` option; as discussed in cygwin#6
that option isn't very useful.

Additionally, add some more scopes to the cache name.  Without this
change, a second attempt to create a cache in the same action run (e.g.
because one step installs some packages, then another step installs some
more) will fail because of the cache name collision.

Also update the README and tests, although the tests are only getting a
very quick update for this function, as I'm going to do a much more
comprehensive test update separately.
@jon-turney
Copy link
Member

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

They're all provided as part of the Git Bash / Git for Windows installation on the GitHub Windows runners. I've not actually checked how this works for self-hosted runners, and I could believe it's not guaranteed to work there. I'll add some notes to that effect…

Aha! I hadn't noticed that all this happens before we modify that PATH, so it'll never be using cygwin tools to do these things, it's assuming they are already provided by something in the VM image.

@me-and
Copy link
Contributor Author

me-and commented Jan 12, 2023

On the performance topic: initial numbers seem to show using a cache saves very roughly 4 minutes off a Cygwin installation that would otherwise take about 10 minutes in total. Restoring a cache of a few hundred MB seems to take only a few seconds, which clearly beats the download time from the Cygwin mirrors.

Currently the hold-up is the reason I raised #5: if Cygwin is in the PATH, the caching code will still try to use the Git for Windows tar, but will fail. Using the Git for Windows tar is hardcoded in the GitHub cache code. Removing Cygwin from PATH just for the caching steps doesn't seem to work, I don't think "you can't use caching if you add Cygwin to PATH" is a reasonable limitation, so I'm currently trying to find other ways to get the caching to work…

@jon-turney
Copy link
Member

jon-turney commented Jan 13, 2023

Currently the hold-up is the reason I raised #5: if Cygwin is in the PATH, the caching code will still try to use the Git for Windows tar, but will fail. Using the Git for Windows tar is hardcoded in the GitHub cache code. Removing Cygwin from PATH just for the caching steps doesn't seem to work, I don't think "you can't use caching if you add Cygwin to PATH" is a reasonable limitation, so I'm currently trying to find other ways to get the caching to work…

I could live with "if you want to use this action multiple time in the same workflow, you must use package-cache: disabled or add-path: false"?

@me-and
Copy link
Contributor Author

me-and commented Jan 13, 2023

I'm hoping it might be something that can get fixed fairly soon; it looks like the problem was added relatively recently, in actions/cache@9b0be58822, and based on actions/toolkit#1311 I'm not the only person to be seeing similar problems.

Another option – which I've not tested properly but is next on my idea list – is to try just using the parent of that commit in actions/cache. I created a simple test case to demonstrate the issue at https://github.com/me-and/repro-cygwin-cache-woe, and it looks like the parent commit might let the caching work as currently designed, but without this issue.

Otherwise, I agree: just documenting it as a limitation is probably going to be the best option for now!

-------

If you're likely to do regular builds, you might want to store the packages
locally rather than needing to download them from the Cygwin mirrors on every
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure "locally" is the best word to use here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants