All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Log much less from `falconeri_worker` by default, and make it configurable. This fixes an issue where the newer tracing code was causing the worker to log far too much.
- This version hard-coded a very low logging level. It was yanked because the low logging level would have made it impossible to debug falconeri issues discovered in the field, and because it was never fully released.
- Prevent key constraint error when retrying failed datums (Issue #33). But see Issue #36; we still don't do the right thing when output files are randomly named.
- Reduce odds of birthday paradox collision when naming jobs (Issue #35).
- Hard-code PostgreSQL version to prevent it from getting accidentally upgraded by Kubernetes.
- Use correct file name to upload release assets (again).
- Use correct file name to upload release assets.
- Attempted to fix binary builds on Linux (yet again).
- Attempted to fix binary builds on Linux (again).
- Attempted to fix binary builds on Linux. Not even trying on the Mac.
- Work around issue where `--field-selector` didn't find all running pods, resulting in accidental worker terminations.
- Fix `job_timeout` conversion to `ttlActiveSeconds` in the Kubernetes YAML.
This release adds a "babysitter" process inside each `falconerid`. We use this to monitor jobs and datums, and detect and/or recover from various types of errors. Updating an existing cluster should be fine, but it's likely to spend a minute or two detecting and marking problems with old jobs. So please exercise appropriate caution.
We plan to stabilize a `falconeri` 1.0 with approximately this feature set. It has been in production for years, and the babysitter was the last missing critical feature.
- If a worker pod disappears off the cluster while processing a datum, detect this and set the datum to `status = Status::Error`. This is handled automatically by a "babysitter" thread in `falconerid`.
- Add support for `datum_tries` in the pipeline JSON. Set this to 2, 3, etc., to automatically retry failed datums. This is also handled by the babysitter.
- Periodically check to see whether a job has finished without being correctly marked as such. This is mostly intended to clean up existing clusters.
- Periodically check to see whether a Kubernetes job has unexpectedly disappeared, and mark the corresponding `falconeri` job as having failed.
- Add trace spans for most low-level database access.
- We now correctly update `updated_at` on all tables that have it.
- Wrote some basic developer documentation to supplement the `justfile`s.
- Allow specifying `--falconerid-log-level` for `falconeri deploy`. This uses standard `RUST_LOG` syntax, as described in the CLI help.
- Cleaned up tracing output a bit.
- Switched to using `rustls` for HTTPS. Database connections still indirectly require OpenSSL thanks to `libpq`.
- Attempt to fix TravisCI binary releases.
- Don't show interactive progress bar when uploading outputs.
- Support `job_timeout` in pipeline schemas. This allows you to specify when an entire job should be stopped, even if it isn't done. Values include "300s", "2h", "2d", etc.
- Add much better tracing support when `RUST_LOG=trace` is passed.
- We updated most of our dependencies, including Rust libraries and our Docker base images. This shouldn't affect normal use.
- Set `ttlSecondsAfterFinished` to 1 day so that old jobs don't hang around forever on the backplane wasting storage.
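
The `job_timeout` values listed above ("300s", "2h", "2d") pair a whole number with a unit suffix. As a rough illustration only, here is a minimal Python sketch of how such a string could be converted to seconds. It assumes just the three suffixes that appear in this changelog; the function name `timeout_to_seconds` and the unit table are hypothetical, and this is not falconeri's actual parser, which may accept other forms.

```python
import re

# Assumption: only the suffixes shown in the changelog examples are handled.
UNIT_SECONDS = {"s": 1, "h": 60 * 60, "d": 24 * 60 * 60}


def timeout_to_seconds(value: str) -> int:
    """Convert a string such as "300s", "2h" or "2d" into whole seconds."""
    match = re.fullmatch(r"(\d+)([shd])", value)
    if match is None:
        # Anything outside the assumed "<digits><unit>" shape is rejected.
        raise ValueError(f"unsupported job_timeout value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]
```

Under these assumptions, `timeout_to_seconds("2h")` yields 7200, and a value without a recognized suffix raises `ValueError` rather than being silently guessed.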