
Configurable buffer size and flat compression for file entries for SBT #375

Closed · wants to merge 2 commits

Conversation


@jacum jacum commented Nov 2, 2024

We noticed that in CI pipelines running sbt, a substantial part of the time and resources is spent compressing and decompressing artefacts.

This is very pronounced for the remote cache feature, which is a great way of sharing artefacts between several jobs in the same CI pipeline; however, actually storing and fetching those artefacts is a heavy process.

It is aggravated by the fact that CI pipelines are usually constrained in CPU and memory, and gzip compression is quite demanding on both. Our CI jobs could easily spend as much as half of their time pulling from and pushing to the remote cache.
Since storage is cheap, the reduced artefact size doesn't really add much value.

The proposed minimal change doesn't alter existing behaviour by default, but adds a way to override the file-copy buffer size and to disable compression for the file entries created by sbt.


jacum commented Nov 2, 2024

Signed the CLA, please revalidate.

```diff
@@ -46,7 +46,10 @@ object IO {
   val temporaryDirectory = new File(System.getProperty("java.io.tmpdir"))

   /** The size of the byte or char buffer used in various methods. */
-  private val BufferSize = 8192
+  private val BufferSize =
+    Option(System.getProperty("sbt.io.buffer.size")).map(_.toInt).getOrElse(8192)
```
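One thing worth noting about this pattern: `_.toInt` is evaluated at class initialization, so a malformed `-Dsbt.io.buffer.size` value would throw `NumberFormatException` there. A minimal sketch of a more defensive variant (object and method names are mine, not sbt's):

```scala
import scala.util.Try

// Sketch only, not sbt's actual code: fall back to the default when the
// property is absent, non-numeric, or non-positive, instead of letting
// NumberFormatException escape from static initialization.
object BufferSizeSketch {
  val DefaultBufferSize = 8192

  def bufferSize(prop: String = "sbt.io.buffer.size"): Int =
    Option(System.getProperty(prop))
      .flatMap(s => Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(DefaultBufferSize)
}
```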
Member

See https://github.com/sbt/sbt/blob/bcf5ded572711854b9ba28cec1e30114b428380b/main/src/main/scala/sbt/internal/SysProp.scala#L81-L92, which is a mini style guide for system properties. I'd suggest the name sbt.io.bufferbyte?

In general, I wonder if we should copy-paste SysProp.scala to sbt.internal.io or something.

Author

Thanks for paying attention!

I believe that, as is, it complies with the style guidelines as articulated in SysProp.scala:

// System property style:
// 1. use sbt. prefix
// 2. prefer short nouns
// 3. use dot for namespacing, and avoid making dot-separated English phrase
// 4. make active/enable properties, instead of "sbt.disable."

and it also matches existing values such as

  def taskTimings: Boolean = getOrFalse("sbt.task.timings")
  def taskTimingsOnShutdown: Boolean = getOrFalse("sbt.task.timings.on.shutdown")
  def taskTimingsThreshold: Long = long("sbt.task.timings.threshold", 0L)
  def taskTimingsOmitPaths: Boolean = getOrFalse("sbt.task.timings.omit.paths")

That being said, it's all cosmetics; it's up to you as the maintainer to define the final naming.

Author

> copy-paste SysProp.scala to sbt.internal.io

Looks like a good suggestion; however, this was far out of the scope of my humble hack.


eed3si9n commented Nov 2, 2024

> We noticed that in CI pipelines running sbt, substantial part of the time and resources are spent to compress and uncompress the artefacts.

While that might be true:

  1. Would calling setMethod turn off the compression?
  2. Why not call setLevel on ZipOutputStream?
  3. Would turning off the compression speed up the overall build time?

In general, parameterizing only via system properties might end up creating untested code paths or, worse, a build tool that behaves differently depending on the machine setup. For behavior that only affects performance or UX it might be OK, but compression level hits a bit different, because creating a JAR is a major job of sbt.
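For reference on questions 1 and 2, a sketch using plain `java.util.zip` (object and method names are mine, not sbt's code): `setLevel(Deflater.NO_COMPRESSION)` keeps entries DEFLATED but skips the compression work, while `setMethod(ZipEntry.STORED)` additionally requires the entry's size and CRC-32 to be known before the entry is written.

```scala
import java.util.zip.{CRC32, Deflater, ZipEntry, ZipOutputStream}

object ZipSketch {
  // DEFLATED entries at level 0: no call-site changes needed,
  // the deflater simply stops compressing.
  def disableCompression(zos: ZipOutputStream): Unit =
    zos.setLevel(Deflater.NO_COMPRESSION)

  // STORED entries: the ZIP format puts size and CRC-32 in the local
  // file header, so both must be computed up front.
  def writeStored(zos: ZipOutputStream, name: String, data: Array[Byte]): Unit = {
    val entry = new ZipEntry(name)
    entry.setMethod(ZipEntry.STORED)
    entry.setSize(data.length.toLong)
    val crc = new CRC32
    crc.update(data)
    entry.setCrc(crc.getValue)
    zos.putNextEntry(entry)
    zos.write(data)
    zos.closeEntry()
  }
}
```

The up-front size/CRC requirement is why level-0 DEFLATE is usually the less invasive way to "turn off" compression in streaming code.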


Friendseeker commented Nov 2, 2024

@jacum Not sure if this is 100% relevant, but sbt's Zinc serializes incremental-compilation dependency data, called Analysis, with its own GZIP compression stream (which can potentially get very large). Just want to confirm that the overhead is indeed in compression done by sbt IO rather than in Zinc's own Analysis compression.

> We noticed that in CI pipelines running sbt, substantial part of the time and resources are spent to compress and uncompress the artefacts.

Would it be possible to share some profiling data? If it takes too much hassle to present the data in a clean way, just the raw data and a very brief description of profiling setup can still be really valuable.
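If the overhead does turn out to be in a GZIP stream like Zinc's, one wrinkle is that `java.util.zip.GZIPOutputStream` exposes no compression-level parameter in its constructors; a subclass can reach the protected `Deflater` to trade ratio for speed. A sketch under that assumption (class name is mine, this is not Zinc's code):

```scala
import java.io.OutputStream
import java.util.zip.{Deflater, GZIPOutputStream}

// Sketch: DeflaterOutputStream holds its Deflater in the protected field
// named `def` (a Scala keyword, hence the backticks), so a subclass can
// lower the level right after construction. Subsequent writes are then
// deflated at BEST_SPEED while remaining a valid gzip stream.
final class FastGzipOutputStream(out: OutputStream)
    extends GZIPOutputStream(out) {
  `def`.setLevel(Deflater.BEST_SPEED)
}
```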


eed3si9n commented Nov 2, 2024

Yeah, this is also a good point: we recently introduced parallel gzipping, so maybe that could be affecting some environments in a negative way.


jacum commented Nov 3, 2024

Thanks for paying attention!

All we wanted to achieve for now is to prevent sbt pushRemoteCache from applying any gzip to the jars being created as cache artefacts.

Unlike jars published to repositories, the cached jars are short-lived anyway (they only exist within the scope of the pipeline), and the tons of CPU and pipeline runtime spent on g(un)zipping definitely aren't worth a few saved megabytes of cache size.

Hope to be able to share some figures with you soon.


Friendseeker commented Nov 3, 2024

> Thanks for paying attention!
>
> All we wanted to achieve now is to prevent sbt pushRemoteCache from applying any gzip on the jars being created as cache artefacts.
>
> Unlike jars published to repositories, the cached jars are short living anyway - in the scope of the pipeline - and trading tons of cpu and pipeline runtime spent on g(un)zipping isn't definitely worth a few saved megabytes of the cache size.
>
> Hope to be able to share some figures with you soon.

Will look into it. I think maybe we can add a settings key on the sbt side, called either enableCacheCompression or cacheCompressionLevel, that allows disabling any GZIP compression (sbt side & Zinc side).

I definitely see how this can be a pervasive issue. I was experimenting with some CI optimizations as part of my recent SWE internship, and the potato CI was absolutely CPU-starved, both in single-threaded performance and in the number of physical threads... Being able to avoid any CPU-heavy workflow has potential for major speed-ups in such cases, and compression can definitely play a major role.
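The settings-key idea could be surfaced roughly like this; a hypothetical build.sbt sketch only (neither `cacheCompressionLevel` nor any wiring for it exists in sbt today):

```scala
// build.sbt sketch (hypothetical key, not part of sbt):
// 0 would map to Deflater.NO_COMPRESSION, 1-9 to the usual deflate levels.
val cacheCompressionLevel = settingKey[Int](
  "Deflate level for remote-cache artifacts; 0 disables compression"
)

Global / cacheCompressionLevel := 0
```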


jacum commented Nov 3, 2024

@Friendseeker your insights are much appreciated.
We run quite a lot of CI workloads, and sbt does seem to be the CPU hog.
A gzip-free local cache should definitely relieve this source of load.
