More issues with escaped unicode #85

Open
blegat opened this issue Nov 15, 2024 · 7 comments

@blegat
Contributor

blegat commented Nov 15, 2024

Follow-up from #78

With the script

using DocumenterCitations
bib = CitationBibliography("bug.bib")
DocumenterCitations.format_bibliography_reference(bib.style, bib.entries["key"])

For

@misc{key,
  author = {{\"U}nl{\"u}, {\c C}a{\u g}lar},
}

I get

ERROR: LoadError: ArgumentError: Premature end of tex string: BoundsError("{\\c", 4)
Stacktrace:
  [1] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:135

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which does seem weird because \"u is not replaced by the unicode character.

With

@inproceedings{key,
  author = {Mikolov, Tom{\'a}{\v s}},
}

I get

caused by: BoundsError: attempt to access 11-codeunit SubString{String} at index [12]
Stacktrace:
  [1] checkbounds
    @ ./strings/basic.jl:216 [inlined]
  [2] getindex
    @ ./strings/substring.jl:100 [inlined]
  [3] _collect_group(tex_str::SubString{String}, i::Int64)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:407
  [4] _process_tex(tex_str::SubString{…}; transform_case::DocumenterCitations.var"#73#75", debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:196
  [5] tex_to_markdown(tex_str::SubString{String}; transform_case::Function, debug::Base.CoreLogging.LogLevel)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:131
  [6] tex_to_markdown
    @ ~/.julia/dev/DocumenterCitations/src/tex_to_markdown.jl:125 [inlined]
  [7] _initial(name::String)
    @ DocumenterCitations ~/.julia/dev/DocumenterCitations/src/formatting.jl:25

but when I do DocumenterCitations.tex_to_markdown(raw"Mikolov, Tom{\'a}{\v s}"), I correctly get "Mikolov, Tomáš".

@goerz
Member

goerz commented Nov 15, 2024

I’ll look into this at some point.

It seems like the zotero-better-bibtex plugin has an option to keep Unicode.

You should absolutely 100% enable that. I’m actually really confused about the statement in their README

Unfortunately, for those shackled to BibTeX and who cannot (yet) move to BibLaTeX, unicode is a major PITA.

I have all my .bib files in Unicode, and I’m using plain BibTeX, not BibLaTeX. It has “just worked” for the last 15 years (maybe since pdflatex started to exist?). As far as I can tell, it’s just not a problem anymore, and nobody should use these tex escapes anymore.

@blegat changed the title from "More issues with unicode:" to "More issues with escaped unicode" on Nov 15, 2024
@blegat
Contributor Author

blegat commented Nov 15, 2024

Good point, if I untick the checkbox "Export unicode as plain-text..." then I get rid of the errors.
If I also set "Add URLs to BibTeX export" to "in the 'url' field" (see the screenshot below), I also get rid of the warnings complaining that there is a "urldate" without a "url"; by default, that option was "No". So I think you can recommend these settings to Zotero users.
[Screenshot of the Zotero export settings]

I also tried BibLaTeX export but I got an error, see Humans-of-Julia/BibInternal.jl#33

@trontrytel

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: CliMA/CloudMicrophysics.jl#483

Thank you!

@goerz
Member

goerz commented Nov 15, 2024

The reason things might have worked in v1.3.4 but stopped working in v1.3.5 is that the solution to #78 was to try to convert LaTeX to Unicode before obtaining the initials for first names. That means first names are now processed, while they weren't before, and if anything in a first name trips up the conversion, it breaks. I actually ran into that myself.

Ultimately, the bottom line is that DocumenterCitations requires Unicode. Any handling of LaTeX commands will always be an incomplete and heuristic fallback, and not officially supported.
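
Since Unicode is the supported input, it can help to scan a .bib file for leftover TeX accent escapes before building the docs. The following is a minimal sketch under stated assumptions: `find_tex_escapes` is a hypothetical helper written for this comment, not part of DocumenterCitations, and the regular expression is a heuristic that only covers the common accent macros.

```julia
# Hypothetical helper (not part of DocumenterCitations): report lines that
# still contain TeX accent escapes such as \"u, \'a, \c C, or \v{s}.
function find_tex_escapes(text::AbstractString)
    # A backslash followed by either a symbol accent (" ' ` ^ ~ = .) or a
    # single-letter accent macro that is not the start of a longer command
    # name (so \url or \cite are not reported).
    pattern = r"\\(?:[\"'`^~=.]|[cuvHkrdbt](?![A-Za-z]))"
    lines = split(text, '\n')
    return [(i, line) for (i, line) in enumerate(lines) if occursin(pattern, line)]
end

# The content is passed as a string here; for a real file one would call
# find_tex_escapes(read("bug.bib", String)).
bibtext = """
@inproceedings{key,
  author = {Mikolov, Tom{\\'a}{\\v s}},
}
"""

for (lineno, line) in find_tex_escapes(bibtext)
    println("line ", lineno, ": ", line)
end
```

Any line this reports is a candidate for replacing the escape with the corresponding Unicode character; an empty result does not guarantee that the file is free of other LaTeX commands.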

@goerz
Member

goerz commented Nov 16, 2024

For […] author = {{\"U}nl{\"u}, {\c C}a{\u g}lar} […] I get […] Premature end of tex string: BoundsError("{\\c", 4)

This particular case seems to be a bug in Bibliography.jl: Humans-of-Julia/BibParser.jl#39

I also think that zotero-better-bibtex isn't really using the "correct" escape sequences here. They should probably stick to the ones officially supported by BibTeX. For this example, that would be

@misc{Unlu2024,
  title = {More issues with escaped unicode},
  author = {\"{U}nl\"{u}, \c{C}a\u{g}lar},
  year = {2024},
  note = {Bug Report #85},
}

which works fine.

When I try DocumenterCitations.tex_to_markdown(raw"{\"U}nl{\"u}, {\c C}a{\u g}lar"), I get "\"Unl\"u, Çağlar", which seems indeed weird because "u" is not replaced by the unicode character.

No, that's actually an issue with the raw string. Raw strings in Julia aren't quite as raw as one might think: quotes still have to be escaped with a backslash, and a backslash that directly precedes a quote then has to be escaped as well. You'd have to write that as

@test tex_to_markdown(raw"{\\\"U}nl{\\\"u}, {\c C}a{\u g}lar") == "Ünlü, Çağlar"

which works.
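
The raw-string rule can be checked in isolation, without DocumenterCitations. A minimal sketch using only Base Julia (the variable names are just for illustration):

```julia
# In a raw string, a backslash is literal unless it directly precedes a
# double quote. Before a quote, 2n+1 backslashes encode n backslashes
# followed by a literal quote.

unescaped = raw"{\"U}"     # \" collapses to a bare quote; the backslash is lost
escaped   = raw"{\\\"U}"   # \\\" yields one backslash followed by a quote

@assert unescaped == "{\"U}"       # four characters: { " U }
@assert escaped == "{\\\"U}"       # five characters: { \ " U }
@assert raw"{\c C}" == "{\\c C}"   # no quote involved, so the backslash survives

println(unescaped)  # {"U}
println(escaped)    # {\"U}
```

So with the single-backslash version, the function never sees the `\"` accent command at all; it only sees a bare quote character.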

@goerz
Member

goerz commented Nov 16, 2024

@trontrytel

I got similar errors with DocumenterCitations v1.3.5 yesterday (things were fine with older versions). I was able to fix them by removing TeX syntax: CliMA/CloudMicrophysics.jl#483

The only entry I can reproduce as failing is Lehtinen2007, and that's failing due to the same bug in Bibliography: Humans-of-Julia/BibParser.jl#39 (comment)

Unfortunately, your "fix" of removing the braces is actually not correct: it changes the last name "Dal Maso" to "Maso" with "Dal" as a middle name. The correct way to handle this is to use the "Last, First" format.

@article{Lehtinen2007,
  title = {Estimating nucleation rates from apparent particle formation rates and vice versa: Revised formulation of the Kerminen–Kulmala equation},
  author = {Lehtinen, Kari E.J. and Dal Maso, Miikka and Kulmala, Markku and Kerminen, Veli-Matti},
  journal = {Journal of Aerosol Science},
  volume = {38},
  number = {9},
  pages = {988-994},
  year = {2007},
  doi = {10.1016/j.jaerosci.2007.06.009}
}

I strongly recommend always using that format, and making sure that any automatic exporter uses it.
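
To illustrate why the unbraced "First Last" form loses the two-word last name, here is a simplified sketch of BibTeX-style name splitting. `split_name` is a hypothetical helper written for illustration only: it is not the parser used by BibParser.jl or DocumenterCitations, and it ignores lowercase "von" particles, "Jr" parts, and brace grouping.

```julia
# Hypothetical, simplified name splitter (illustration only).
# "Last, First": everything before the comma is the family name.
# "First Last":  only the final word is taken as the family name.
function split_name(name::AbstractString)
    if occursin(',', name)
        family, given = strip.(split(name, ','; limit=2))
        return (given=String(given), family=String(family))
    else
        words = split(name)
        return (given=join(words[1:end-1], " "), family=String(words[end]))
    end
end

println(split_name("Miikka Dal Maso"))   # family name is only "Maso"
println(split_name("Dal Maso, Miikka"))  # family name is "Dal Maso"
```

The comma makes the boundary between family and given names explicit, so no braces are needed to protect a multi-word last name.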

@goerz
Member

goerz commented Nov 16, 2024

So this doesn't really seem actionable on my side, but I'll keep this issue open until Humans-of-Julia/BibParser.jl#39 is resolved.

Meanwhile, there's some additional testing in b8c5de3.

@goerz added the "invalid" label (This doesn't seem right) on Nov 16, 2024