More issues with escaped unicode #85

blegat · 2024-11-15T08:35:11Z

Follow up from #78

With the script

using DocumenterCitations
bib = CitationBibliography("bug.bib")
DocumenterCitations.format_bibliography_reference(bib.style, bib.entries["key"])

For

@misc{key,
  author = {{\"U}nl{\"u}, {\c C}a{\u g}lar},
}

I get

ERROR: LoadError: ArgumentError: Premature end of tex string: BoundsError("{\\c", 4)
Stacktrace:
  [1] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:135

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which indeed seems weird because \"u is not replaced by the Unicode character.

With

@inproceedings{key,
  author = {Mikolov, Tom{\'a}{\v s}},
}

I get

caused by: BoundsError: attempt to access 11-codeunit SubString{String} at index [12]
Stacktrace:
  [1] checkbounds
    @ ./strings/basic.jl:216 [inlined]
  [2] getindex
    @ ./strings/substring.jl:100 [inlined]
  [3] _collect_group(tex_str::SubString{String}, i::Int64)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:407
  [4] _process_tex(tex_str::SubString{…}; transform_case::DocumenterCitations.var"#73#75", debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:196
  [5] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:131
  [6] tex_to_markdown
    @ ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:125 [inlined]
  [7] _initial(name::String)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/formatting.jl:25

but when I do DocumenterCitations.tex_to_markdown(raw"Mikolov, Tom{\'a}{\v s}") I get correctly "Mikolov, Tomáš".


goerz · 2024-11-15T12:11:04Z

I’ll look into this at some point.

It seems like the zotero-better-bibtex plugin has an option to keep Unicode.

You should absolutely 100% enable that. I’m actually really confused about the statement in their README

Unfortunately, for those shackled to BibTeX and who cannot (yet) move to BibLaTeX, unicode is a major PITA.

I have all my .bib files in Unicode, and I’m using plain BibTeX, not BibLaTeX. It has “just worked” for the last 15 years (maybe since pdflatex started to exist?). As far as I can tell, it’s just not a problem anymore, and nobody should use these tex escapes anymore.

blegat · 2024-11-15T14:32:50Z

Good point, if I untick the checkbox "Export unicode as plain-text..." then I get rid of the errors.
If I also select "in the 'url' field" (further down in the screenshot), I also get rid of the warnings about an "urldate" without a "url"; those appeared because "Add URLs to BibTeX export" was set to "No" by default. So I think you can recommend these settings to Zotero users.

I also tried BibLaTeX export but I got an error, see Humans-of-Julia/BibInternal.jl#33

trontrytel · 2024-11-15T19:42:14Z

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: CliMA/CloudMicrophysics.jl#483

Thank you!

goerz · 2024-11-15T20:18:32Z

The reason things that worked in v1.3.4 might have stopped working in v1.3.5 is that the solution to #78 was to try to convert LaTeX to Unicode before obtaining the initials for first names. That means first names are now processed, whereas they weren't before, and if anything in a first name trips up the conversion, it breaks. I actually ran into that myself.

Ultimately, the bottom line is that DocumenterCitations requires Unicode. Any handling of LaTeX commands will always be an incomplete and heuristic fallback, and not officially supported.

goerz · 2024-11-16T14:52:36Z

For […] author = {{\"U}nl{\"u}, {\c C}a{\u g}lar} […] I get […] Premature end of tex string: BoundsError("{\\c", 4)

This particular case seems to be a bug in Bibliography.jl: Humans-of-Julia/BibParser.jl#39

I also think that zotero-better-bibtex isn't really using the "correct" escape sequences here. They should probably stick to the ones officially supported by BibTeX. For this example, that would be

@misc{Unlu2024,
  title = {More issues with escaped unicode},
  author = {\"{U}nl\"{u}, \c{C}a\u{g}lar},
  year = {2024},
  note = {Bug Report #85},
}

which works fine.

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which seems indeed weird because \"u is not replaced by the unicode character.

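No, that's actually an issue with the raw string. Raw strings in Julia aren't quite as raw as one might think: quotes still have to be escaped with a backslash, and a literal backslash that sits right before a quote then has to be escaped as well. So in raw"{\"U}" the backslash is consumed and never reaches tex_to_markdown. You'd have to write that as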

@test tex_to_markdown(raw"{\\\"U}nl{\\\"u}, {\c C}a{\u g}lar") == "Ünlü, Çağlar"

which works.
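To make the raw-string behavior concrete, here is a small sketch that uses only Base Julia (no DocumenterCitations involved). The variable names are just for illustration.

```julia
# The same TeX fragment written three ways.
naive   = raw"{\"U}"     # \" is read as an escaped quote, so the backslash is dropped
escaped = raw"{\\\"U}"   # \\ before a quote yields one backslash, \" yields the quote
regular = "{\\\"U}"      # ordinary string literal with the intended content

println(naive)    # prints {"U}
println(escaped)  # prints {\"U}

# Backslashes that are not directly in front of a quote stay literal in raw strings.
println(raw"{\c C}" == "{\\c C}")  # prints true
```

In other words, the `{\c C}` and `{\u g}` parts of the author string survive a raw string unchanged, while every `\"` needs the triple backslash.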

goerz · 2024-11-16T15:27:25Z

@trontrytel

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: CliMA/CloudMicrophysics.jl#483

The only entry I can reproduce as failing is Lehtinen2007, and that's failing due to the same bug in Bibliography: Humans-of-Julia/BibParser.jl#39 (comment)

Unfortunately, your "fix" of removing the braces is actually not correct: it changes the last name "Dal Maso" to "Maso" with "Dal" as a middle name. The correct way to handle this is to use the "Last, First" format.

@article{Lehtinen2007,
  title = {Estimating nucleation rates from apparent particle formation rates and vice versa: Revised formulation of the Kerminen–Kulmala equation},
  author = {Lehtinen, Kari E.J. and Dal Maso, Miikka and Kulmala, Markku and Kerminen, Veli-Matti},
  journal = {Journal of Aerosol Science},
  volume = {38},
  number = {9},
  pages = {988-994},
  year = {2007},
  doi = {10.1016/j.jaerosci.2007.06.009}
}

I strongly recommend always using that format (and making sure that any automatic exporter uses it).
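As a rough illustration of why the format matters, the following is a deliberately simplified sketch of how a BibTeX-style parser might split a single name. The function split_name is hypothetical and is not the code used by BibParser.jl or DocumenterCitations; it ignores "von" particles, "Jr" parts, and brace-protected groups.

```julia
# Hypothetical, simplified name splitter (not the actual BibParser.jl logic).
function split_name(name::AbstractString)
    if occursin(',', name)
        # "Last, First": everything before the first comma is the family name.
        family, given = strip.(split(name, ','; limit=2))
    else
        # "First Last": only the final word is taken as the family name.
        words = split(name)
        family = words[end]
        given = join(words[1:end-1], " ")
    end
    return (given=String(given), family=String(family))
end

println(split_name("Miikka Dal Maso"))   # family name comes out as "Maso"
println(split_name("Dal Maso, Miikka"))  # family name comes out as "Dal Maso"
```

Under these assumptions, only the comma form keeps "Dal Maso" together as the family name without relying on braces.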

goerz · 2024-11-16T16:38:53Z

So this doesn't really seem actionable on my side, but I'll keep this issue open until Humans-of-Julia/BibParser.jl#39 is resolved.

Meanwhile, there's some additional testing in b8c5de3.

blegat mentioned this issue on Nov 15, 2024: Use DocumenterCitations for bibliography blegat/LINMA2472#1 (merged)

blegat changed the title from "More issues with unicode:" to "More issues with escaped unicode" on Nov 15, 2024

goerz mentioned this issue on Nov 16, 2024: Incorrect parsing of names with spaced escape codes Humans-of-Julia/BibParser.jl#39 (open)

goerz added the "invalid" (This doesn't seem right) label on Nov 16, 2024
