Fix `_legalize_path` types #5224

snejus · 2024-05-04T12:35:01Z

Description

Part 2 of the work fixing types in beets.util.__init__ #5215.

Mypy was not happy here because _legalize_stage function implementation concatenates path and extension parameters, implying that their types need to match.

You can see that initially path parameter was defined as a str while extension was bytes.

In reality, depending on the fragment parameter value, extension was sometimes provided as a str and sometimes as bytes. The same parameter decided whether path gets converted into bytes within _legalize_stage implementation. No surprise that mypy was confused here.

_legalize_stage is only used within Item.destination method implementation which is where fragment is defined. I determined that the fragment parameter controls the form of the output path:

fragment=False returned absolute path as bytes (default)
fragment=True returned path relative to the library directory as str.

Given the above, the change

Renames fragment parameter to relative_to_libdir for clarity
Makes Item.destination to return the same type in both cases. I picked bytes since that's the type that majority of the code using this method expects.

I converted the output path to str for the code that has been expecting a string there.
Decouples _legalize_stage and _legalize_path implementations from the relative_to_libdir. The logic now uses str type only.

snejus · 2024-12-02T01:01:31Z

@wisp3rwind this is ready for a review - trying to revive this work

wisp3rwind

Nice seeing some progress here!

I think it's a great observation that there's only really one caller to legalize_path (i.e. Item.destination), which makes it straightforward to adapt the signature of legalizat_path to get rid of all its weirdness. This seems like the right thing to do!

It seems correct that the input to legalize_path should always be Unicode: The source it always a template evaluation, which is performed using Unicode strings, rather than something filesystem-specific or bytes. The output is most naturally bytes, since it's our canonical representation for things supposed to be passed to the filesystem.

At a relatively superficial glance, this PR looks reasonable to me (the one inline comment that I left is probably out-of-scope).

One thing I'm unsure about is the use truncate (but I'm not sure it made sense before, either): If the purpose of truncation is to deal with filesystem limitations, this should model as closely as possible what matters to the underlying filesystem. Previously, truncation could be performed on str or bytes depending on the fragment flag (notably, not depending on platform or filesystem). Now, it's always based on Unicode code points: In one sense, this is an improvement, since the result will always be valid UTF-8. On the other hand, the byte-length of the result is not bounded by the specified length. Ideally, I feel like this should attempt to truncate at Unicode graphemes boundaries, but measure length in terms of encoded bytes or Unicode code points depending on platform/filesystem.

It's also unexpected that the previous implementation used displayable_path in the legalization procedure: That is meant to remove non-printable characters, but does that align with what makes sense to be part of a filename? Maybe it would be interesting to check the git blame whether it has some rationale for its use here?

In any case, the previous solution doesn't seem to be ideal either (or, at least, it lacks sufficient documentation to make sense of it for me), so perfecting the truncation might be worth opening a tracking issue for, but out-of-scope in this PR. Do we have any bug reports that might be related to path truncation and legalization?

beets/library.py

snejus · 2024-12-10T13:48:33Z

It's also unexpected that the previous implementation used displayable_path in the legalization procedure: That is meant to remove non-printable characters, but does that align with what makes sense to be part of a filename? Maybe it would be interesting to check the git blame whether it has some rationale for its use here?

I followed your suggestion and had a look at the history of this functionality: this PR and this issue may be the most informative. Adrian nicely summarised the point of having to use bytes in this comment:

Thank you! This is related to #496, where truncation was also creating invalid filenames by bypassing the replace rules.

To resolve this, you could imagine doing one of two things:

Do the truncation on pre-encoded, Unicode character strings.
Somehow ensure, after truncation on the byte string, that we trim partial encoded characters.
The trouble with option 1 is that, on Unix, paths are byte strings—so a filesystem's maximum length is specified in bytes, not characters. (Not sure about Windows; it's probably characters there.) And the difficulty with 2 is that finding the character boundaries amounts to re-decoding back to Unicode.

See the rest of the thread around this comment. It seems like this may be less relevant in Python 3 ;)

snejus · 2024-12-10T14:00:19Z

All of this functionality was written back in the day when some useful libraries did not exist yet. Take pathvalidate, for example, which could be helpful for us here.

Background The `_legalize_stage` function was causing issues with Mypy due to inconsistent type usage between the `path` and `extension` parameters. This inconsistency stemmed from the `fragment` parameter influencing the types of these variables. Key issues 1. `path` was defined as `str`, while `extension` was `bytes`. 2. Depending on `fragment`, `extension` could be either `str` or `bytes`. 3. `path` was sometimes converted to `bytes` within `_legalize_stage`. Item.destination` method - The `fragment` parameter determined the output format: - `False`: Returned absolute path as bytes (default) - `True`: Returned path relative to library directory as str Thus - Rename `fragment` parameter to `relative_to_libdir` for clarity - Ensure `Item.destination` returns `bytes` in all cases - Code expecting strings now converts the output to `str` - Use only `str` type in `_legalize_stage` and `_legalize_path` functions - These functions are no longer dependent on `relative_to_libdir`

snejus self-assigned this May 4, 2024

snejus requested a review from wisp3rwind May 4, 2024 12:35

This comment was marked as resolved.

Sign in to view

snejus force-pushed the fix-most-types-in-util-init branch 4 times, most recently from 79cf3fd to 6a1f684 Compare May 6, 2024 08:30

wisp3rwind mentioned this pull request May 7, 2024

Fix types in beets/util/__init__.py #5215

Closed

snejus force-pushed the fix-most-types-in-util-init branch 4 times, most recently from 80c3daa to 860daa9 Compare September 13, 2024 22:06

Base automatically changed from fix-most-types-in-util-init to master September 18, 2024 12:40

snejus force-pushed the fix-legalize-path branch 2 times, most recently from b53b089 to 48f6d71 Compare December 2, 2024 01:00

wisp3rwind reviewed Dec 10, 2024

View reviewed changes

beets/library.py Outdated Show resolved Hide resolved

snejus force-pushed the fix-legalize-path branch from 48f6d71 to bba0fdc Compare December 10, 2024 12:20

snejus added 3 commits December 19, 2024 16:47

Remove never used replacements paramater from item.destination

d6d736a

Remove unused platform parameter in item.destination

96e011e

snejus force-pushed the fix-legalize-path branch from fe0083b to 96e011e Compare December 19, 2024 16:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `_legalize_path` types #5224

Fix `_legalize_path` types #5224

snejus commented May 4, 2024

This comment was marked as resolved.

snejus commented Dec 2, 2024

wisp3rwind left a comment

snejus commented Dec 10, 2024

snejus commented Dec 10, 2024

Fix _legalize_path types #5224

Are you sure you want to change the base?

Fix _legalize_path types #5224

Conversation

snejus commented May 4, 2024

Description

This comment was marked as resolved.

snejus commented Dec 2, 2024

wisp3rwind left a comment

Choose a reason for hiding this comment

snejus commented Dec 10, 2024

snejus commented Dec 10, 2024

Fix `_legalize_path` types #5224

Fix `_legalize_path` types #5224