checksum: Use XXH3 algorithm instead of md5
XXH3 is substantially faster than md5.

Test:
- Used the Linux source code at v6.2.12
- Had a taskfile configured with all source directories as `sources`
- Ran on an AMD Ryzen 3900XT CPU
- NVMe storage
- Host system was Linux Fedora 38 using kernel 6.4.13
- Tests were run in two configurations:
  - Full page cache: the test task was run several times in a row until the results became consistent, indicating that the entire source tree was in the page cache. The test was then run 5 more times and those results were averaged. This is the best case, and most of the time go-task will be close to it.
  - Empty page cache: `sync; echo 3 > /proc/sys/vm/drop_caches` was run after each run to empty the page cache. The test was run 5 times and the results were averaged. This is the worst case, such as when the user runs go-task right after booting their machine.

Results:
md5:
  Full: 184.65ms
  Empty: 483.34ms

XXH3:
  Full: 90.29ms (-51.1% vs. md5)
  Empty: 398.70ms (-17.5% vs. md5)

BLAKE3 (another option: it provides cryptographic hashes at slightly lower performance than XXH3, but go-task likely does not benefit from cryptographically secure hashes):
  Full: 106.21ms (-42.4% vs. md5)
  Empty: 407.78ms (-15.6% vs. md5)
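
The percentages above are relative changes in run time against the md5 baseline, i.e. (new - md5) / md5. A minimal sketch that recomputes them from the averaged timings listed above (`relativeChange` is a name introduced here for illustration):

```go
package main

import "fmt"

// relativeChange returns the percentage change of got relative to base.
// A negative value means got took less time than base.
func relativeChange(base, got float64) float64 {
	return (got - base) / base * 100
}

func main() {
	// Averaged timings in milliseconds, copied from the results above.
	const md5Full, md5Empty = 184.65, 483.34

	fmt.Printf("XXH3   full:  %.2f%%\n", relativeChange(md5Full, 90.29))
	fmt.Printf("XXH3   empty: %.2f%%\n", relativeChange(md5Empty, 398.70))
	fmt.Printf("BLAKE3 full:  %.2f%%\n", relativeChange(md5Full, 106.21))
	fmt.Printf("BLAKE3 empty: %.2f%%\n", relativeChange(md5Empty, 407.78))
}
```

Recomputed from the rounded averages, the BLAKE3 full-cache figure comes out at roughly -42.5% rather than the quoted -42.4%. The small difference may reflect truncation or unrounded inputs in the original calculation. The other three figures match to one decimal place.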
ReillyBrogan committed Sep 8, 2023
1 parent 84ad005 commit f7aa4fb
Showing 12 changed files with 26 additions and 17 deletions.
2 changes: 1 addition & 1 deletion docs/docs/taskfile_versions.md
@@ -29,7 +29,7 @@ These are some major changes done on `v3`:
 - A global `method:` was added to allow setting the default method, and Task's
   default changed to `checksum`
 - Two magic variables were added when using `status:`: `CHECKSUM` and
-  `TIMESTAMP` which contains, respectively, the md5 checksum and greatest
+  `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest
   modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
@@ -21,7 +21,7 @@ These are some major changes done on `v3`:
 - Added support for `.env` like files
 - Added `label:` setting to task so one can override how the task name appear in the logs
 - A global `method:` was added to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the md5 checksum and greatest modification timestamp of the files listed on `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added `dir:` option to `includes` to allow choosing on which directory an included Taskfile will run:
@@ -21,7 +21,7 @@ These are some major changes done on `v3`:
 - Added support for `.env` like files
 - Added `label:` setting to task so one can override how the task name appear in the logs
 - A global `method:` was added to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the md5 checksum and greatest modification timestamp of the files listed on `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added `dir:` option to `includes` to allow choosing on which directory an included Taskfile will run:
@@ -21,7 +21,7 @@ These are some major changes done on `v3`:
 - Added support for `.env` like files
 - Added `label:` setting to task so one can override how the task name appear in the logs
 - A global `method:` was added to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the md5 checksum and greatest modification timestamp of the files listed on `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added `dir:` option to `includes` to allow choosing on which directory an included Taskfile will run:
@@ -21,7 +21,7 @@ These are some major changes done on `v3`:
 - Added support for `.env` like files
 - Added `label:` setting to task so one can override how the task name appear in the logs
 - A global `method:` was added to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the md5 checksum and greatest modification timestamp of the files listed on `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added `dir:` option to `includes` to allow choosing on which directory an included Taskfile will run:
@@ -21,7 +21,7 @@ sidebar_position: 5
 - Added support for `.env` files
 - Added the `label:` parameter, making it possible to override the task name in the logs
 - A global `method:` parameter was added to set the default method, and Task's default was changed to `checksum`
-- Added 2 magic variables used in `status:`, `CHECKSUM` and `TIMESTAMP`, which contain, respectively, the md5 checksum and the greatest modification timestamp of the files listed in `sources:`
+- Added 2 magic variables used in `status:`, `CHECKSUM` and `TIMESTAMP`, which contain, respectively, the XXH3 checksum and the greatest modification timestamp of the files listed in `sources:`
 - In addition, the `TASK` variable is always available with the name of the current task
 - CLI variables are always treated as global variables
 - Added the `dir:` option to `includes` to choose the directory in which the Taskfile will run:
@@ -21,7 +21,7 @@ These are some major changes done on `v3`:
 - Added support for `.env` like files
 - Added `label:` setting to task so one can override how the task name appear in the logs
 - A global `method:` was added to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the md5 checksum and greatest modification timestamp of the files listed on `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP` which contains, respectively, the XXH3 checksum and greatest modification timestamp of the files listed on `sources:`
 - Also, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added `dir:` option to `includes` to allow choosing on which directory an included Taskfile will run:
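@@ -21,7 +21,7 @@ The Taskfile `version:` keyword accepts a semantic string, so `2`, `
 - Support for `.env`-like files
 - Added the `label:` setting, which can override how the task name is displayed in the logs
 - Added a global `method:` to allow setting the default method, and Task's default changed to `checksum`
-- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP`, which contain, respectively, the md5 checksum and the greatest modification timestamp of the files listed in `sources:`
+- Two magic variables were added when using `status:`: `CHECKSUM` and `TIMESTAMP`, which contain, respectively, the XXH3 checksum and the greatest modification timestamp of the files listed in `sources:`
 - In addition, the `TASK` variable is always available with the current task name
 - CLI variables are always treated as global variables
 - Added a `dir:` option to `includes` to allow choosing the directory in which an included Taskfile will run: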
2 changes: 2 additions & 0 deletions go.mod
@@ -13,6 +13,7 @@ require (
 	github.com/sajari/fuzzy v1.0.0
 	github.com/spf13/pflag v1.0.5
 	github.com/stretchr/testify v1.8.4
+	github.com/zeebo/xxh3 v1.0.2
 	golang.org/x/exp v0.0.0-20230212135524-a684f29349b6
 	golang.org/x/sync v0.3.0
 	golang.org/x/term v0.11.0
@@ -22,6 +23,7 @@ require (

 require (
 	github.com/davecgh/go-spew v1.1.1 // indirect
+	github.com/klauspost/cpuid/v2 v2.0.9 // indirect
 	github.com/mattn/go-colorable v0.1.13 // indirect
 	github.com/mattn/go-isatty v0.0.17 // indirect
 	github.com/pmezard/go-difflib v1.0.0 // indirect
5 changes: 5 additions & 0 deletions go.sum
@@ -12,6 +12,8 @@ github.com/go-task/slim-sprig v0.0.0-20210107165309-348f09dbbbc0/go.mod h1:fyg78
 github.com/google/go-cmp v0.5.9 h1:O2Tfq5qg4qc4AmwVlvv0oLiVAGB7enBSJ2x2DqQFi38=
 github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0=
 github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4=
+github.com/klauspost/cpuid/v2 v2.0.9 h1:lgaqFMSdTdQYdZ04uHyN2d/eKdOMyi2YLSvlQIBFYa4=
+github.com/klauspost/cpuid/v2 v2.0.9/go.mod h1:FInQzS24/EEf25PyTYn52gqo7WaD8xa0213Md/qVLRg=
 github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
 github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
 github.com/mattn/go-colorable v0.1.13 h1:fFA4WZxdEF4tXPZVKMLwD8oUnCTTo08duU7wxecdEvA=
@@ -41,6 +43,9 @@ github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/
 github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
 github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
 github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
+github.com/zeebo/assert v1.3.0 h1:g7C04CbJuIDKNPFHmsk4hwZDO5O+kntRxzaUoNXj+IQ=
+github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
+github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
 golang.org/x/exp v0.0.0-20230212135524-a684f29349b6 h1:Ic9KukPQ7PegFzHckNiMTQXGgEszA7mY2Fn4ZMtnMbw=
 golang.org/x/exp v0.0.0-20230212135524-a684f29349b6/go.mod h1:CxIveKay+FTh1D0yPZemJVgC/95VzuuOLq5Qi4xnoYc=
 golang.org/x/sync v0.3.0 h1:ftCYgMx6zT/asHUrPw8BLLscYtGznsLAnjq5RH9P66E=
18 changes: 10 additions & 8 deletions internal/fingerprint/sources_checksum.go
@@ -1,14 +1,15 @@
 package fingerprint

 import (
-	"crypto/md5"
 	"fmt"
 	"io"
 	"os"
 	"path/filepath"
 	"regexp"
 	"strings"

+	"github.com/zeebo/xxh3"
+
 	"github.com/go-task/task/v3/internal/filepathext"
 	"github.com/go-task/task/v3/taskfile"
 )
@@ -35,16 +36,16 @@ func (checker *ChecksumChecker) IsUpToDate(t *taskfile.Task) (bool, error) {
 	checksumFile := checker.checksumFilePath(t)

 	data, _ := os.ReadFile(checksumFile)
-	oldMd5 := strings.TrimSpace(string(data))
+	oldHash := strings.TrimSpace(string(data))

-	newMd5, err := checker.checksum(t)
+	newHash, err := checker.checksum(t)
 	if err != nil {
 		return false, nil
 	}

-	if !checker.dry && oldMd5 != newMd5 {
+	if !checker.dry && oldHash != newHash {
 		_ = os.MkdirAll(filepathext.SmartJoin(checker.tempDir, "checksum"), 0o755)
-		if err = os.WriteFile(checksumFile, []byte(newMd5+"\n"), 0o644); err != nil {
+		if err = os.WriteFile(checksumFile, []byte(newHash+"\n"), 0o644); err != nil {
 			return false, err
 		}
 	}
@@ -65,7 +66,7 @@ func (checker *ChecksumChecker) IsUpToDate(t *taskfile.Task) (bool, error) {
 		}
 	}

-	return oldMd5 == newMd5, nil
+	return oldHash == newHash, nil
 }

 func (checker *ChecksumChecker) Value(t *taskfile.Task) (any, error) {
@@ -89,7 +90,7 @@ func (c *ChecksumChecker) checksum(t *taskfile.Task) (string, error) {
 		return "", err
 	}

-	h := md5.New()
+	h := xxh3.New()
 	for _, f := range sources {
 		// also sum the filename, so checksum changes for renaming a file
 		if _, err := io.Copy(h, strings.NewReader(filepath.Base(f))); err != nil {
@@ -105,7 +106,8 @@ func (c *ChecksumChecker) checksum(t *taskfile.Task) (string, error) {
 		f.Close()
 	}

-	return fmt.Sprintf("%x", h.Sum(nil)), nil
+	hash := h.Sum128()
+	return fmt.Sprintf("%x%x", hash.Hi, hash.Lo), nil
 }

 func (checker *ChecksumChecker) checksumFilePath(t *taskfile.Task) string {
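
One detail of the new return statement worth illustrating is how `fmt`'s `%x` verb treats the two 64-bit halves: it prints each without leading zeros, so the resulting string can be shorter than 32 hex digits. The sketch below uses only the standard library and made-up values. `hexUnpadded` and `hexPadded` are hypothetical helpers, and the padded variant is an alternative shown for comparison, not what the commit does.

```go
package main

import "fmt"

// hexUnpadded mirrors the "%x%x" formatting used in the diff above:
// each half is printed without leading zeros.
func hexUnpadded(hi, lo uint64) string {
	return fmt.Sprintf("%x%x", hi, lo)
}

// hexPadded zero-pads each half to 16 hex digits, so the result is always
// 32 characters long. (Alternative for comparison, not the commit's code.)
func hexPadded(hi, lo uint64) string {
	return fmt.Sprintf("%016x%016x", hi, lo)
}

func main() {
	// Made-up halves, chosen so the low half has leading zero digits.
	hi, lo := uint64(0x1), uint64(0x23)

	fmt.Println(hexUnpadded(hi, lo)) // prints "123"
	fmt.Println(hexPadded(hi, lo))   // prints "00000000000000010000000000000023"

	// Without padding, different (hi, lo) pairs can format identically.
	fmt.Println(hexUnpadded(0x1, 0x23) == hexUnpadded(0x12, 0x3)) // prints "true"
}
```

For an up-to-date check that only compares the stored string with the newly computed one, a varying length is harmless in practice. Fixed-width formatting matters mainly if the value is expected to be a canonical 128-bit hex digest.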
2 changes: 1 addition & 1 deletion task_test.go
@@ -840,7 +840,7 @@ func TestStatusVariables(t *testing.T) {
 	require.NoError(t, e.Setup())
 	require.NoError(t, e.Run(context.Background(), taskfile.Call{Task: "build"}))

-	assert.Contains(t, buff.String(), "a41e7948dcd321db412ce61d3d5c9864")
+	assert.Contains(t, buff.String(), "3e464c4b03f4b65d740e1e130d4d108a")

 	inf, err := os.Stat(filepathext.SmartJoin(dir, "source.txt"))
 	require.NoError(t, err)