
Configurable buffer size and flat compression for file entries for SBT #375

Closed · wants to merge 2 commits

Conversation


@jacum jacum commented Nov 2, 2024

We noticed that in CI pipelines running sbt, a substantial part of the time and resources is spent compressing and decompressing artefacts.

This is very pronounced for the remote cache feature, which is a great way of sharing artefacts between several jobs in the same CI pipeline; however, actually storing and fetching those artefacts is a heavy process.

It is aggravated by the fact that CI pipelines are usually constrained in CPU and memory, and gzip compression is quite demanding on both. Our CI jobs could easily spend as much as half of their time pulling from and pushing to the remote cache.
Since storage is cheap, the reduced artefact size doesn't really add much value.

The proposed minimal change doesn't alter existing behaviour by default, but adds a way to override the file-copy buffer size and to disable compression for the file entries created by sbt.


jacum commented Nov 2, 2024

Signed the CLA, please revalidate.

```diff
@@ -46,7 +46,10 @@ object IO {
   val temporaryDirectory = new File(System.getProperty("java.io.tmpdir"))

   /** The size of the byte or char buffer used in various methods. */
-  private val BufferSize = 8192
+  private val BufferSize =
+    Option(System.getProperty("sbt.io.buffer.size")).map(_.toInt).getOrElse(8192)
```
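One thing worth noting about this pattern: `_.toInt` is evaluated at class initialization, so a malformed `-Dsbt.io.buffer.size` value would throw `NumberFormatException` there. A minimal sketch of a more defensive variant (object and method names are mine, not sbt's):

```scala
import scala.util.Try

// Sketch only, not sbt's actual code: fall back to the default when the
// property is absent, non-numeric, or non-positive, instead of letting
// NumberFormatException escape from static initialization.
object BufferSizeSketch {
  val DefaultBufferSize = 8192

  def bufferSize(prop: String = "sbt.io.buffer.size"): Int =
    Option(System.getProperty(prop))
      .flatMap(s => Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultBufferSize)
}
```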
Member

See https://github.com/sbt/sbt/blob/bcf5ded572711854b9ba28cec1e30114b428380b/main/src/main/scala/sbt/internal/SysProp.scala#L81-L92, which is a mini style guide for system properties. I'd suggest the name sbt.io.bufferbyte?

In general, I wonder if we should copy-paste SysProp.scala to sbt.internal.io or something.

Author

Thanks for paying attention!

I believe that, as is, it complies with the style guidelines as articulated in SysProp.scala:

// System property style:
// 1. use sbt. prefix
// 2. prefer short nouns
// 3. use dot for namespacing, and avoid making dot-separated English phrase
// 4. make active/enable properties, instead of "sbt.disable."

and it also matches existing values such as

  def taskTimings: Boolean = getOrFalse("sbt.task.timings")
  def taskTimingsOnShutdown: Boolean = getOrFalse("sbt.task.timings.on.shutdown")
  def taskTimingsThreshold: Long = long("sbt.task.timings.threshold", 0L)
  def taskTimingsOmitPaths: Boolean = getOrFalse("sbt.task.timings.omit.paths")

That being said, it's all cosmetics; it's up to you as the maintainer to define the final naming.

Author

> copy-paste SysProp.scala to sbt.internal.io

Looks like a good suggestion; however, this was far out of the scope of my humble hack.


eed3si9n commented Nov 2, 2024

> We noticed that in CI pipelines running sbt, substantial part of the time and resources are spent to compress and uncompress the artefacts.

While that might be true:

  1. Would calling setMethod turn off the compression?
  2. Why not call setLevel on ZipOutputStream?
  3. Would turning off the compression speed up the overall build time?

In general, parameterizing only via system properties might end up creating untested code paths or, worse, a build tool that behaves differently depending on the machine setup. For behavior that only affects performance or UX it might be OK, but compression level hits a bit different, because creating a JAR is a major job of sbt.
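For reference on questions 1 and 2, a sketch using plain `java.util.zip` (object and method names are mine, not sbt's code): `setLevel(Deflater.NO_COMPRESSION)` keeps entries DEFLATED but skips the compression work, while `setMethod(ZipEntry.STORED)` additionally requires the entry's size and CRC-32 to be known before the entry is written.

```scala
import java.util.zip.{CRC32, Deflater, ZipEntry, ZipOutputStream}

object ZipSketch {
  // DEFLATED entries at level 0: no call-site changes needed,
  // the deflater simply stops compressing.
  def disableCompression(zos: ZipOutputStream): Unit =
    zos.setLevel(Deflater.NO_COMPRESSION)

  // STORED entries: the ZIP format puts size and CRC-32 in the local
  // file header, so both must be computed up front.
  def writeStored(zos: ZipOutputStream, name: String, data: Array[Byte]): Unit = {
    val entry = new ZipEntry(name)
    entry.setMethod(ZipEntry.STORED)
    entry.setSize(data.length.toLong)
    val crc = new CRC32
    crc.update(data)
    entry.setCrc(crc.getValue)
    zos.putNextEntry(entry)
    zos.write(data)
    zos.closeEntry()
  }
}
```

The up-front size/CRC requirement is why level-0 DEFLATE is usually the less invasive way to "turn off" compression in streaming code.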


Friendseeker commented Nov 2, 2024

@jacum Not sure if this is 100% relevant, but sbt's Zinc serializes incremental-compilation dependency data, called Analysis, with its own GZIP compression stream (which can potentially get very large). Just want to confirm that the overhead is indeed in compression done by sbt IO rather than in Zinc's own Analysis compression.

> We noticed that in CI pipelines running sbt, substantial part of the time and resources are spent to compress and uncompress the artefacts.

Would it be possible to share some profiling data? If it takes too much hassle to present the data in a clean way, just the raw data and a very brief description of profiling setup can still be really valuable.
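If the overhead does turn out to be in a GZIP stream like Zinc's, one wrinkle is that `java.util.zip.GZIPOutputStream` exposes no compression-level parameter in its constructors; a subclass can reach the protected `Deflater` to trade ratio for speed. A sketch under that assumption (class name is mine, this is not Zinc's code):

```scala
import java.io.OutputStream
import java.util.zip.{Deflater, GZIPOutputStream}

// Sketch: DeflaterOutputStream holds its Deflater in the protected field
// named `def` (a Scala keyword, hence the backticks), so a subclass can
// lower the level right after construction. Subsequent writes are then
// deflated at BEST_SPEED while remaining a valid gzip stream.
final class FastGzipOutputStream(out: OutputStream)
    extends GZIPOutputStream(out) {
  `def`.setLevel(Deflater.BEST_SPEED)
}
```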


eed3si9n commented Nov 2, 2024

Yeah, this is also a good point: we recently introduced parallel gzipping, so maybe that could be affecting some environments in a negative way.


jacum commented Nov 3, 2024

Thanks for paying attention!

All we wanted to achieve for now is to prevent sbt pushRemoteCache from applying any gzip to the jars being created as cache artefacts.

Unlike jars published to repositories, the cached jars are short-lived anyway (they only exist within the scope of the pipeline), and the tons of CPU and pipeline runtime spent on g(un)zipping definitely aren't worth a few saved megabytes of cache size.

Hope to be able to share some figures with you soon.


Friendseeker commented Nov 3, 2024

> Thanks for paying attention!
>
> All we wanted to achieve now is to prevent sbt pushRemoteCache from applying any gzip on the jars being created as cache artefacts.
>
> Unlike jars published to repositories, the cached jars are short living anyway - in the scope of the pipeline - and trading tons of cpu and pipeline runtime spent on g(un)zipping isn't definitely worth a few saved megabytes of the cache size.
>
> Hope to be able to share some figures with you soon.

Will look into it. I think maybe we can add a settings key on the sbt side, called either enableCacheCompression or cacheCompressionLevel, that allows disabling any GZIP compression (sbt side & Zinc side).

I definitely see how this can be a pervasive issue. I was experimenting with some CI optimizations as part of my recent SWE internship, and the potato CI was absolutely CPU-starved, both in single-threaded performance and in the number of physical threads... Being able to avoid any CPU-heavy workflow has potential for major speed-ups in such cases, and compression can definitely play a major role.
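The settings-key idea could be surfaced roughly like this; a hypothetical build.sbt sketch only (neither `cacheCompressionLevel` nor any wiring for it exists in sbt today):

```scala
// build.sbt sketch (hypothetical key, not part of sbt):
// 0 would map to Deflater.NO_COMPRESSION, 1-9 to the usual deflate levels.
val cacheCompressionLevel = settingKey[Int](
  "Deflate level for remote-cache artifacts; 0 disables compression"
)

Global / cacheCompressionLevel := 0
```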


jacum commented Nov 3, 2024

@Friendseeker your insights are much appreciated.
We run quite a lot of CI workloads, and sbt does seem to be the CPU hog.
A gzip-free local cache should definitely relieve this source of load.
