Wrong diacritics for Devanagari -> ISO/IAST/ITRANS #43

Open
bwasty opened this issue Nov 19, 2021 · 6 comments

Comments

@bwasty

bwasty commented Nov 19, 2021

I found several issues with transliterating diacritics from Devanagari (Hindi):

  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)
  • फ़ोन -> pha़ōna (iso/iast)
  • सड़क -> saḍa़ka (iso/iast)
  • ज़्यादा -> ja़yAdA (itrans; other way correct: zyaada)
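
A quick way to see what these inputs have in common is to print their code points. The sketch below uses only Python's standard `unicodedata` module; the names `SAMPLES` and `describe` are made up for illustration. Every failing word contains either U+093C (nukta) after a base consonant or U+0945 (candra e), which matches the stray ़ and ॅ left behind in the outputs above.

```python
import unicodedata

NUKTA = "\u093C"      # DEVANAGARI SIGN NUKTA
CANDRA_E = "\u0945"   # DEVANAGARI VOWEL SIGN CANDRA E

# The four inputs from the list above, written as escapes so the
# exact code point sequence is unambiguous.
SAMPLES = [
    "\u0915\u0945",                                # ka + candra e
    "\u092B\u093C\u094B\u0928",                    # pha + nukta, o, na
    "\u0938\u0921\u093C\u0915",                    # sa, dda + nukta, ka
    "\u091C\u093C\u094D\u092F\u093E\u0926\u093E",  # ja + nukta, virama, ya, aa, da, aa
]

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for word in SAMPLES:
    print(word)
    for line in describe(word):
        print("   ", line)
```

The precomposed nukta letters (U+0958 to U+095F) are excluded from Unicode composition, so even NFC normalization yields the two-code-point form. Handling the base + nukta sequence therefore covers both spellings.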

By the way, great project, wrote 2 small tools with it already:

@bwasty bwasty changed the title Wrong diacritics for Devanagari <-> ISO/IAST/ITRANS Wrong diacritics for Devanagari -> ISO/IAST/ITRANS Nov 19, 2021
@vvasuki
Member

vvasuki commented Nov 19, 2021

I found several issues with transliterating diacritics from Devanagari (Hindi):

  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

  • सड़क -> saḍa़ka (iso/iast)

What should this be in ISO?

  • फ़ोन -> pha़ōna (iso/iast)

f is expected I suppose. Contribute a fix?

  • ज़्यादा -> ja़yAdA (itrans; other way correct: zyaada)

Contribute a fix?

By the way, great project, wrote 2 small tools with it already:

@bwasty
Author

bwasty commented Nov 19, 2021

  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

m̐k. Same in IAST (according to this. Here ˜ is shown, though the discussion page suggests is correct)

  • सड़क -> saḍa़ka (iso/iast)

What should this be in ISO?

saṛaka in ISO. For IAST it's not specified - so remove the dangling dot maybe? or use the same? For ITRANS it should be .Da or .Ra.

Related: ढ़ should become ṛha in ISO and .Dha/Rha in ITRANS.

  • फ़ोन -> pha़ōna (iso/iast)

f is expected I suppose. Contribute a fix?

Yes, for ISO and ITRANS. For IAST it's not specified - maybe do the same anyway?

  • ज़्यादा -> ja़yAdA (itrans; other way correct: zyaada)

Contribute a fix?

I'm not sure I understand Devanagari well enough yet (literally started learning a week ago), but I might try :)

@vvasuki
Member

vvasuki commented Nov 19, 2021

  • कॅ -> kaॅ (iso/iast, fine in itrans: ka.c)

What should this be in ISO?

m̐k. Same in IAST (according to this. Here ˜ is shown, though the discussion page suggests is correct)

No - you seem to be confusing कँ with कॅ.
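
The two signs look alike but are different code points with different roles. A small check with Python's standard `unicodedata` module shows this (the variable names are illustrative only):

```python
import unicodedata

candrabindu = "\u0915\u0901"  # KA + U+0901, the nasalization sign
candra_e = "\u0915\u0945"     # KA + U+0945, a vowel sign

for label, text in [("candrabindu", candrabindu), ("candra e", candra_e)]:
    # Print the official Unicode name of every code point in the sequence
    print(label, [unicodedata.name(ch) for ch in text])
```

The first sequence ends in DEVANAGARI SIGN CANDRABINDU and the second in DEVANAGARI VOWEL SIGN CANDRA E, so a transliteration for one does not carry over to the other.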

@bwasty
Author

bwasty commented Nov 19, 2021

Ah, right, damn. Wikipedia shows ê for and .
The Unicode block shows a few more characters with a 'candra', but I guess they have no transliteration?

@vvasuki
Member

vvasuki commented Nov 20, 2021

Basically, the problem is that transliterateBrahmic assumes it's OK to transliterate character by character. It does not consider the maximum token length (unlike https://github.com/indic-transliteration/indic_transliteration_py/blob/99fe6b2fd5b220794d1709e3297c919d58c4cfcc/indic_transliteration/sanscript/brahmic_mapper.py ). Porting the Python code might work.
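
To make the difference concrete, here is a minimal Python sketch of longest-match tokenization. It is not the code from sanscript.js or from `brahmic_mapper.py`: the table `TOY_ISO` and both function names are hypothetical, the table covers only the code points needed for two words from this issue, and the targets for the nukta consonants are the ones proposed earlier in this thread. Inherent-vowel and virama handling are deliberately left out, so the outputs are consonant skeletons rather than finished transliterations.

```python
# Hypothetical toy table: Devanagari sequence -> Latin (no inherent "a").
TOY_ISO = {
    "\u092B": "ph",        # pha
    "\u092B\u093C": "f",   # pha + nukta
    "\u0921": "ḍ",         # dda
    "\u0921\u093C": "ṛ",   # dda + nukta
    "\u0922": "ḍh",        # ddha
    "\u0922\u093C": "ṛh",  # ddha + nukta
    "\u0938": "s",
    "\u0915": "k",
    "\u0928": "n",
    "\u094B": "ō",         # vowel sign o
}

def char_by_char(text, table):
    """Map one code point at a time; unknown code points pass through."""
    return "".join(table.get(ch, ch) for ch in text)

def longest_match(text, table):
    """Greedy tokenizer: at each position, try the longest key first."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No key matches here: copy the code point unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

sadak = "\u0938\u0921\u093C\u0915"  # sa, dda + nukta, ka
phone = "\u092B\u093C\u094B\u0928"  # pha + nukta, o, na

print(char_by_char(sadak, TOY_ISO))   # the nukta is left dangling
print(longest_match(sadak, TOY_ISO))  # dda + nukta is consumed as one token
print(longest_match(phone, TOY_ISO))
```

With the per-character mapper the nukta has no entry of its own and passes through untouched, which is the symptom reported above. The greedy version consumes the consonant and the nukta together because the two-code-point key is tried before the single-code-point one.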

@bwasty
Author

bwasty commented Nov 20, 2021

Ok, I'll look into that after having a stab at #42 (since that 'annoys' me more and I found this interesting paper)
