Add package caching #6

me-and · 2023-01-09T23:43:52Z

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum for speed, since we're just checking for unexpected changes in the cache, and we don't need cryptographic security.

I was originally working on this for my own builds, e.g. at https://github.com/me-and/Cygwin-inih/blob/fb7534276feeb2a5217df5013c679a370ee6013e/.github/workflows/ci.yml, but it occurred to me that this might be more useful here than in my own repos.

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run. This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!). To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum for speed, since we're just checking for unexpected changes in the cache, and we don't need cryptographic security.

jon-turney · 2023-01-10T12:44:25Z

README.md

@@ -26,6 +26,7 @@ Parameters
 | site        | http://mirrors.kernel.org/sourceware/cygwin/ | Mirror site to install from
 | check-sig   | true                                         | Whether to check the setup.ini signature
 | add-to-path | true                                         | Whether to add Cygwin's `/bin` directory to the system `PATH`
+| package-cache-key | ''                                     | The key to use for caching downloaded packages.


What's the reasoning behind making this a string, rather than a boolean? I think that the cache-keys are scoped to the repository, so the only possible collision is with other cache action uses in the same workflow?

It would be nice to be able to make this on by default in future, without the user having to choose a key.

The store of caches is per-repository, and the actual restore behaviour is scoped even more tightly than that, with restrictions about what caches are allowed to be restored on what branch. But yes, this is something I debated; I went with letting the user decide mostly so they wouldn't be surprised by caches appearing with unexpected names.

Having been playing around with this a bit more, though, I think I'm going to suggest a different approach: hard-code the cache key name, but make the behaviour configurable with a cache-behaviour setting or similar, that takes values of disabled (get current cache-free behaviour), enabled (full new behaviour), or saveonly or restoreonly to do the obvious things. The latter are useful when – as with a lot of my cygport jobs – you need to run the action multiple times in the repository.

(As we discussed in #5, the ideal solution might be to swap from a composite action to a JS one, since that would allow post-run steps, which could be responsible for saving the cache after it has been used several times, but that's obviously a much bigger job!)

jon-turney · 2023-01-10T12:44:33Z

Thanks! This is awesome!

I vagely recall trying to do this initially but running into some problem, so thanks for adding this!

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

Some numbers showing that restoring the cache is actually typically faster than fetching the packages again would help give me confidence that this is a good change.

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

Unfortunately, setup allows it's package cache to grow without bound. Something which has needed fixing for literal decades...

I don't think that's blocking for this PR, since, yes, the likelhood of hitting the cache limit is small, but something you should be aware of.

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum.

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

me-and · 2023-01-10T14:38:40Z

Thanks! This is awesome!

I vagely recall trying to do this initially but running into some problem, so thanks for adding this!

np! This is about my fifth attempt at getting it working without getting stuck and giving up in disgust…

To speed up Cygwin installations, allow users to store the downloaded packages using GitHub's caching, rather than needing the packages to be downloaded from the Cygwin mirrors on every run.

Some numbers showing that restoring the cache is actually typically faster than fetching the packages again would help give me confidence that this is a good change.

Yep, very fair. I have some rough anecdata that says it helps, but I've not really stress tested it to prove it makes a difference for installations that are big enough for it to be substantial. I should be able to have a go at that shortly :)

This speeds up the installation when there's a good cache hit, at the expense of adding some additional complexity, and using the limited cache space (although with 10GB assigned to every repository on GitHub, I can't imagine most folk are getting close to the limit!).

Unfortunately, setup allows it's package cache to grow without bound. Something which has needed fixing for literal decades...

I don't think that's blocking for this PR, since, yes, the likelhood of hitting the cache limit is small, but something you should be aware of.

I had been thinking about that. GitHub evicts caches that aren't used for 7+ days, but if a repo were getting rebuilds more regularly than that, the cache would just keep getting larger and larger. I think that's a good reason for it to not be default behaviour for this action (yet), and I should probably note it explicitly in the documentation, but I think it mostly means that anyone using this action might need to manually flush the cache every so often, to force a new smaller cache to be generated.

To avoid unnecessary cache churn, the new steps check if the cache has actually meaningfully changed before storing a new cache. This uses b2sum.

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

They're all provided as part of the Git Bash / Git for Windows installation on the GitHub Windows runners. I've not actually checked how this works for self-hosted runners, and I could believe it's not guaranteed to work there. I'll add some notes to that effect…

Add an option to restore a cache without creating a new one, or to create a cache without restoring an old one. This is useful in the scenario where the action is called multiple times in a run, as it allows the cache to only be restored on the first call, and for a new cache to only be written on the last call. This option replaces the `package-cache-key` option; as discussed in cygwin#6 that option isn't very useful. Additionally, add some more scopes to the cache name. Without this change, a second attempt to create a cache in the same action run (e.g. because one step installs some packages, then another step installs some more) will fail because of the cache name collision. Also update the README and tests, although the tests are only getting a very quick update for this function, as I'm going to do a much more comprehensive test update separately.

This should hopefully show how much of a difference, if any, caching packages makes. There's no automated comparison here, the test just runs one install without a cache, then a second install with a cache, then a human needs to look at what difference it makes.

Add an option to restore a cache without creating a new one, or to create a cache without restoring an old one. This is useful in the scenario where the action is called multiple times in a run, as it allows the cache to only be restored on the first call, and for a new cache to only be written on the last call. This option replaces the `package-cache-key` option; as discussed in cygwin#6 that option isn't very useful. Additionally, add some more scopes to the cache name. Without this change, a second attempt to create a cache in the same action run (e.g. because one step installs some packages, then another step installs some more) will fail because of the cache name collision. Also update the README and tests, although the tests are only getting a very quick update for this function, as I'm going to do a much more comprehensive test update separately.

jon-turney · 2023-01-11T15:17:01Z

Presumably there's some analysis been done that all the commands used are safe because they are in 'base' installation, but a comment to that effect might be worthwhile.

They're all provided as part of the Git Bash / Git for Windows installation on the GitHub Windows runners. I've not actually checked how this works for self-hosted runners, and I could believe it's not guaranteed to work there. I'll add some notes to that effect…

Aha! I hadn't noticed that all this happens before we modify that PATH, so it'll never be using cygwin tools to do these things, it's assuming they are already provided by something in the VM image.

me-and · 2023-01-12T20:14:04Z

On the performance topic: initial numbers seem to show using a cache saves very roughly 4 minutes off a Cygwin installation that would otherwise take about 10 minutes in total. Restoring a cache of a few hundred MB seems to take only a few seconds, which clearly beats the download time from the Cygwin mirrors.

Currently the hold-up is the reason I raised #5: if Cygwin is in the PATH, the caching code will still try to use the Git for Windows tar, but will fail. Using the Git for Windows tar is hardcoded in the GitHub cache code. Removing Cygwin from PATH just for the caching steps doesn't seem to work, I don't think "you can't use caching if you add Cygwin to PATH" is a reasonable limitation, so I'm currently trying to find other ways to get the caching to work…

jon-turney · 2023-01-13T13:42:44Z

Currently the hold-up is the reason I raised #5: if Cygwin is in the PATH, the caching code will still try to use the Git for Windows tar, but will fail. Using the Git for Windows tar is hardcoded in the GitHub cache code. Removing Cygwin from PATH just for the caching steps doesn't seem to work, I don't think "you can't use caching if you add Cygwin to PATH" is a reasonable limitation, so I'm currently trying to find other ways to get the caching to work…

I could live with "if you want to use this action multiple time in the same workflow, you must use package-cache: disabled or add-path: false"?

me-and · 2023-01-13T15:31:16Z

I'm hoping it might be something that can get fixed fairly soon; it looks like the problem was added relatively recently, in actions/cache@9b0be58822, and based on actions/toolkit#1311 I'm not the only person to be seeing similar problems.

Another option – which I've not tested properly but is next on my idea list – is to try just using the parent of that commit in actions/cache. I created a simple test case to demonstrate the issue at https://github.com/me-and/repro-cygwin-cache-woe, and it looks like the parent commit might let the caching work as currently designed, but without this issue.

Otherwise, I agree: just documenting it as a limitation is probably going to be the best option for now!

jon-turney · 2023-01-13T18:17:33Z

README.md

+-------
+
+If you're likely to do regular builds, you might want to store the packages
+locally rather than needing to download them from the Cygwin mirrors on every


I'm not sure "locally" is the best word to use here.

me-and marked this pull request as ready for review January 9, 2023 23:47

jon-turney reviewed Jan 10, 2023

View reviewed changes

me-and marked this pull request as draft January 10, 2023 22:46

me-and force-pushed the cache branch 3 times, most recently from c9be7a5 to 79832e3 Compare January 10, 2023 22:53

me-and added 2 commits January 10, 2023 23:04

Add much more comprehensive cache testing

0066f67

me-and force-pushed the cache branch from 79832e3 to 76f43ba Compare January 10, 2023 23:05

jon-turney reviewed Jan 13, 2023

View reviewed changes

me-and mentioned this pull request Jan 19, 2023

[feature] Make cygwin-packages dir configurable #7

Open

me-and mentioned this pull request Jul 19, 2023

cygwin install action doesn't make a github cache #10

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add package caching #6

Add package caching #6

me-and commented Jan 9, 2023

jon-turney Jan 10, 2023

me-and Jan 10, 2023

jon-turney commented Jan 10, 2023

me-and commented Jan 10, 2023

jon-turney commented Jan 11, 2023

me-and commented Jan 12, 2023

jon-turney commented Jan 13, 2023 •

edited

Loading

me-and commented Jan 13, 2023

jon-turney Jan 13, 2023

Add package caching #6

Are you sure you want to change the base?

Add package caching #6

Conversation

me-and commented Jan 9, 2023

jon-turney Jan 10, 2023

Choose a reason for hiding this comment

me-and Jan 10, 2023

Choose a reason for hiding this comment

jon-turney commented Jan 10, 2023

me-and commented Jan 10, 2023

jon-turney commented Jan 11, 2023

me-and commented Jan 12, 2023

jon-turney commented Jan 13, 2023 • edited Loading

me-and commented Jan 13, 2023

jon-turney Jan 13, 2023

Choose a reason for hiding this comment

jon-turney commented Jan 13, 2023 •

edited

Loading