feat(server/objects): new column to track object size #3750

Draft: iainsproat wants to merge 9 commits into main
Conversation

@iainsproat (Contributor) commented Dec 30, 2024

Description & motivation

By calculating and storing each object's size when it is uploaded, we keep the data available for later analysis of the size of stored objects.
This enables future features such as ranking projects by stored size, tracking the rate of change of stored object size, computing average object size, and so on.
NB the size is approximate: the object is JSON-stringified and the calculation assumes one string character equals one byte.
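For illustration, a minimal sketch of that approximation (the function name is hypothetical, not the actual server code):

```ts
// Approximate the stored size by JSON-stringifying the object and treating
// each string character as one byte. Multi-byte UTF-8 characters are
// undercounted, which is acceptable when only the order of magnitude matters.
const estimateObjectSizeBytes = (obj: Record<string, unknown>): number =>
  JSON.stringify(obj).length
```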

Changes:

  • adds a new column to the objects table (see the migration sketch below)
  • the sizeBytes column is a big integer, is nullable, and defaults to null
  • calculates the object size (approximately) on upload
  • removes unused methods
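A minimal sketch of what the column addition could look like, assuming a Knex-style migration; the actual migration in this PR may be structured differently:

```ts
import type { Knex } from 'knex'

export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable('objects', (table) => {
    // Nullable bigint defaulting to null, so existing rows need no immediate backfill
    table.bigInteger('sizeBytes').nullable().defaultTo(null)
  })
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable('objects', (table) => {
    table.dropColumn('sizeBytes')
  })
}
```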

To-do before merge:

  • is a migration the best place for the backfill of data? It could cause the startup to fail for large object tables.
    • would it be better as an async process, as it is not required for the operation of the server?
    • would it be better as an external service? (though this might cause more work with more docker images & k8s manifests etc.)
  • should we calculate the actual string size accurately, using TextEncoder to count the bytes? Is that necessary, or is our approximation ok? (a small comparison sketch follows this list)
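As a point of comparison for that last question, an illustrative sketch of the gap between the one-character-one-byte approximation and an exact UTF-8 byte count:

```ts
const json = JSON.stringify({ name: 'façade', note: '日本語' })

// Approximation used in this PR: one character ≈ one byte
const approxBytes = json.length

// Exact UTF-8 byte count via TextEncoder (Buffer.byteLength(json, 'utf8') is equivalent in Node)
const exactBytes = new TextEncoder().encode(json).length

// approxBytes undercounts here: 'ç' takes 2 bytes and each CJK character takes 3
console.log({ approxBytes, exactBytes })
```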

Screenshots:

Validation of changes:

Checklist:

  • My pull request follows the guidelines in the Contributing guide.
  • My pull request does not duplicate any other open Pull Requests for the same update/change.
  • My commits are related to the pull request and do not amend unrelated code or documentation.
  • My code follows a similar style to existing code.
  • I have added appropriate tests.
  • I have updated or added relevant documentation.

References

linear bot commented Dec 30, 2024

  • this will cause slow startup for un-backfilled databases with large object tables
@iainsproat (Contributor, Author) commented Jan 2, 2025

The migration takes too long, so we need to move the backfill to a different process: either a background worker on the monolith, or a separate microservice.
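A rough sketch of what an incremental backfill could look like as a background worker, assuming Knex and the nullable sizeBytes column from this PR; batching by id is illustrative and the real objects table may be keyed differently:

```ts
import type { Knex } from 'knex'

// Backfill sizeBytes in small batches so the work runs alongside normal traffic
// instead of blocking server startup inside a migration.
export async function backfillObjectSizes(db: Knex, batchSize = 1000): Promise<void> {
  for (;;) {
    const result = await db.raw(
      `UPDATE "objects"
       SET "sizeBytes" = octet_length("data"::text)
       WHERE "id" IN (
         SELECT "id" FROM "objects" WHERE "sizeBytes" IS NULL LIMIT ?
       )`,
      [batchSize]
    )
    // With the pg driver, raw() resolves to the node-postgres result, which exposes rowCount
    if (!result.rowCount) break
  }
}
```

Computing the size in SQL (octet_length of the text cast) avoids streaming object data through Node, at the cost of a slightly different definition of "size" than the upload-time approximation.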

@iainsproat (Contributor, Author) commented Jan 7, 2025

Would we be better off just doing SELECT pg_column_size("data") FROM "objects";, with no need to memoize the data size? The sizes aren't exactly the same, but I'm not sure we care about exact values, only the order of magnitude.

Screenshot 2025-01-07 at 10 13 43

Or, if we care about more closely matching the size calculated by Node, the uncompressed size would be better: SELECT octet_length("data"::text) AS "derivedSize", "sizeBytes" FROM "objects" ORDER BY "derivedSize" DESC;

Screenshot 2025-01-07 at 11 11 16
