LaTeX hyphenation patterns encoded in iso-8859-1 #688

Open
lopippo opened this issue Oct 15, 2022 · 3 comments

@lopippo

lopippo commented Oct 15, 2022

Problem description

Greetings,
in my quest to have a proper package for daps in official Debian, I have stumbled upon this series of lintian warnings:

W: daps: national-encoding [etc/daps/xep/hyphen/dehyph_rx.tex]
W: daps: national-encoding [etc/daps/xep/hyphen/huhyph_rx.tex]
W: daps: national-encoding [etc/daps/xep/hyphen/ithyph_rx.tex]
W: daps: national-encoding [etc/daps/xep/hyphen/ruhyphal.tex]

The reason for the warning is the following:

A file is not valid UTF-8.
Debian has used UTF-8 for many years. Support for national encodings is being phased out. This file probably appears to users in mangled characters (also called mojibake).
Packaging control files must be encoded in valid UTF-8.
Please convert the file to UTF-8 using iconv or a similar tool.
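
The check that lintian describes can be approximated locally before and after any conversion. The following is a minimal Python sketch, not lintian's actual implementation; the helper names `is_valid_utf8` and `non_utf8_files` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(root: str, pattern: str = "*.tex") -> list[Path]:
    """List files below root matching pattern that are not valid UTF-8."""
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and not is_valid_utf8(path.read_bytes())
    )
```

Calling `non_utf8_files("etc/daps/xep/hyphen")` on a source checkout should then list the same kind of files that lintian flags, assuming that directory layout.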

I can see that this makes perfect sense with respect to the hyphenation files: they are indeed for languages that require diacritical signs not present in us-ascii, for example.

So my question is the following: would it be correct to convert these files to UTF-8 with a command line like the following?

iconv --from-code=ISO_8859-1 --to-code=UTF-8// -o file-new file
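
For comparison, the same recoding can be sketched in Python. This is only an illustration under assumptions: `recode_to_utf8` is a hypothetical helper name, and the source encoding has to be supplied per file, because a successful decode does not prove that the chosen encoding is the right one.

```python
from pathlib import Path


def recode_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Decode src using source_encoding and write the result to dst as UTF-8."""
    # Work on bytes so that line endings are preserved exactly.
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as `recode_to_utf8("file", "file-new", "iso-8859-1")` mirrors the `iconv` invocation above.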

Note: the file huhyph_rx.tex has the following first line:
% ISO8859-2
but when checked using file -i huhyph_rx.tex, the encoding seems to actually be charset=iso-8859-1 (which is confirmed by vim which says the encoding is latin-1, a synonym of iso-8859-1, I think).

If that detection is right, the first line of the file is not binding and we can freely recode these files to UTF-8.
If this idea is neither silly nor erroneous, would you do this upstream so that it is there in the next version? For the time being, unless you object, I'll do that reencoding myself.
What's your take on this?
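
One caveat on the detection result above: ISO-8859-1 and ISO-8859-2 are both single-byte encodings in which every byte decodes to some character, so a byte-level guess may not be able to tell them apart, and the `% ISO8859-2` header could still be accurate. The small Python sketch below shows two byte values that decode to different letters depending on which encoding is assumed. Whether huhyph_rx.tex actually contains these bytes is an assumption that has not been checked here.

```python
# Two byte values that are valid in both encodings but mean different letters.
sample = bytes([0xF5, 0xFB])

as_latin1 = sample.decode("iso-8859-1")  # o with tilde, u with circumflex
as_latin2 = sample.decode("iso-8859-2")  # o and u with double acute (Hungarian)

print(as_latin1)
print(as_latin2)
```

If the converted Hungarian patterns show letters like the first line instead of the second, the file was most likely decoded with the wrong source encoding.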

Sincerely,
Filippo

@tomschr
Collaborator

tomschr commented Oct 17, 2022

Hi @lopippo, thanks for your ideas. 👍

Well, I think you can ignore these files, especially for Debian. Probably nobody there uses XEP as a formatter (it's a commercial product). These hyphenation files date from a time when we still used XEP. At that time, FOP was not yet as advanced as it is now.

Nevertheless, I've tried to convert them to UTF-8. I used the following XEP config file (I removed the other parts and only show the structure that changed):

XEP configuration file
<config xmlns="http://www.renderx.com/XEP/config" xml:base="/usr/share/xep/">
  <options>
    <!-- The following two options are moved into the /usr/bin/xep
         script:
    -->
    <option name="LICENSE" value="file:///etc/xep/license.xml"/>
    <option name="BROKENIMAGE" value="file:///usr/share/xep/images/images/404.gif"/>
    <option name="TMPDIR" value="none"/>
    <option name="LOGO" value="file:///usr/share/xep/images/logo-renderx.svg"/>
    <option name="STAMP_PNG" value="file:///usr/share/xep/images/stamp-renderx.png"/>
    <!-- ... -->
  </options>

  <fonts xmlns="http://www.renderx.com/XEP/config"
         xml:base="fonts/"
         default-family="Helvetica">
    <!-- ... -->
    <font-group label="SUSE" embed="true">
      <font-family name="OpenSans">
        <font><font-data ttf="/usr/share/fonts/truetype/OpenSans-Regular.ttf"/></font>
        <font style="italic"><font-data ttf="/usr/share/fonts/truetype/OpenSans-Italic.ttf"/></font>
        <font weight="bold"><font-data ttf="/usr/share/fonts/truetype/OpenSans-Bold.ttf"/></font>
        <font weight="bold" style="italic"><font-data ttf="/usr/share/fonts/truetype/OpenSans-BoldItalic.ttf"/></font>
      </font-family>

      <font-family name="DejaVuSansMono">
        <font><font-data ttf="/usr/share/fonts/truetype/DejaVuSansMono.ttf"/></font>
        <font style="italic"><font-data ttf="/usr/share/fonts/truetype/DejaVuSansMono-Oblique.ttf"/></font>
        <font weight="bold"><font-data ttf="/usr/share/fonts/truetype/DejaVuSansMono-Bold.ttf"/></font>
        <font weight="bold" style="italic"><font-data ttf="/usr/share/fonts/truetype/DejaVuSansMono-BoldOblique.ttf"/></font>
      </font-family>
    </font-group>
  </fonts>

  <languages default-language="en-US" xml:base="file:///home/tom/.config/daps/xep-hyphen/">
    <language name="German" codes="de deu ger">
      <!-- old <hyphenation pattern="dehyph_rx.tex"/> -->
      <hyphenation encoding="UTF-8" pattern="dehyph_rx-utf8.tex"/>
    </language>
  </languages>
</config>

With that config, I built a German guide and got the following message:

[warning] (918,9): shamelessly skipping \pattern

I don't get this warning when I use the original file (dehyph_rx.tex). When I open the PDF file, I get some words with incorrect hyphenation. For example:

physisc-hen  vs. physischen
bereitste-llen vs. bereitstellen
geog-rafischen vs. geogra-fischen

You can add an encoding attribute in the config file. However, that doesn't change the output. Perhaps these are old files (the XEP tool is quite old).

@lopippo
Author

lopippo commented Oct 17, 2022

Greetings Tom,

thank you for your quick response. I also suspect on Debian nobody is going to use XEP, so I think I'll scrap the /etc/daps/xep directory altogether. We'll see if we get bug reports.

Is removing the whole /etc/daps/xep directory acceptable?

Sincerely,
Filippo

@tomschr
Collaborator

tomschr commented Oct 17, 2022

Greetings Filippo,

> Is removing the whole /etc/daps/xep directory acceptable?

I would, but it seems we still need it internally for the Security Guide. If we could fix that part, we might be able to remove them altogether.

@fsundermeyer could we circumvent the issue with the Security Guide?
