Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move schemes into separate JSON files #13

Merged
merged 1 commit into from
Aug 5, 2020

Conversation

psvenk
Copy link
Collaborator

@psvenk psvenk commented Aug 4, 2020

This has the benefits of separating data from code and of having a language-agnostic specification of schemes (for implementations of Sanscript in other programming languages).

This also adds a build step integrating the scheme data into the JS file — the file that now contains only the code has been moved to a subdirectory and the build step (npm run dist) outputs to sanscript.js (so this causes no breaking changes). There is also an additional development dependency of @iarna/toml.

I chose TOML over JSON and other formats because it supports comments and it can represent the scheme data in a simple and readable format without any indentation. I generated the TOML file using a script taking the initial declaration of Sanscript.schemes and serializing it using the TOML library, followed by adjusting the formatting and adding in the comments from the JS file. Having a separate TOML file actually improves JSON compatibility because the following Node.js script converts the TOML file to valid JSON (automatically stripping comments):

Code
const TOML = require("@iarna/toml");
const fs = require("fs");
const path = require("path");

const schemes = TOML.parse(fs.readFileSync(
    path.join(__dirname, "src", "schemes.toml")
));
fs.writeFileSync(
    path.join(__dirname, "src", "schemes.json"), JSON.stringify(schemes)
);

I have marked this pull request as "draft" because there is some refactoring to be done before the TOML file is truly portable between programming languages (e.g. some encodings, such as Kolkata, are defined in code), but I wanted to get your thoughts on the idea before proceeding with that.

All tests (35 tests; 780 assertions) are passing when I open tests/index.html in my browser.

@rramphal
Copy link
Collaborator

rramphal commented Aug 5, 2020

Hey @psvenk, this is an interesting idea!

Personally, I don't think we gain much by using the TOML format over the JSON format in this specific case, but of course, it's basically the same. We do avoid adding a dependency by using JSON, but, if we're already adding a build step, adding such a well-tested dependency is arguably a non-issue.

If you're going to introduce a build step, why not go all the way and split each language out into its own file? Regardless of whether we stick with JSON or TOML, we could define a Sanscript Scheme schema using something like ajv, which is something that even the current implementation could benefit from. Given the fact that the schema is unlikely to change often, it would just need to really be done once while maintaining consistency between the different language files. If we do use TOML, it would obviously be better to write the schema for TOML, however TOLS is still in the works.

So basically:

assamese.toml ... zanbazar_square.toml > concatenate into schemes.toml > transcode into schemes.json > validate in toto > inject into sanscript.js (npm run dist) > publish (npm run publish)

-- OR --

assamese.toml ... zanbazar_square.toml > transcode into assamese.json ... zanbazar_square.json > validate separately > concatenate into schemes.json > inject into sanscript.js (npm run dist) > publish (npm run publish)

We could even have a GitHub action handle this for us.

@psvenk
Copy link
Collaborator Author

psvenk commented Aug 5, 2020

@rramphal Thanks for the speedy response!

Personally, I don't think we gain much by using the TOML format over the JSON format in this specific case, but of course, it's basically the same. We do avoid adding a dependency by using JSON, but, if we're already adding a build step, adding such a well-tested dependency is arguably a non-issue.

I agree that there is not much to gain in this case, especially with splitting each scheme into its own file (in which case the files consist of only keys mapping to arrays). JSON does have the benefit of being more widely used in the JS world too; maybe we could introduce a description field containing what is currently in comments (possibly along with a way for client code to display the description of a scheme).

If you're going to introduce a build step, why not go all the way and split each language out into its own file?

Thanks for pointing this out; I'll do this in a commit soon (in JSON).

Regardless of whether we stick with JSON or TOML, we could define a Sanscript Scheme schema using something like ajv, which is something that even the current implementation could benefit from. Given the fact that the schema is unlikely to change often, it would just need to really be done once while maintaining consistency between the different language files. If we do use TOML, it would obviously be better to write the schema for TOML, however TOLS is still in the works.

This is a great idea. We could also add to the schema some properties which are currently implemented as special cases in the code. The fact that a somewhat mature schema representation language is available for JSON but not for TOML leads me to believe that we should use JSON for storing data on the scripts/romanizations for consistency. I also like the idea of verifying this with CI.

@vvasuki
Copy link
Member

vvasuki commented Aug 5, 2020

Good - I like the general idea as well as the idea of splitting into separate json/toml files. If we're ditching toml because schema specification and validation is simpler elsewhere, might as well pick https://json5.org/ .

Action item for later:

  • We could separate the schema definitions out into a separate repository which can then ben included as a submodule here and in other language specific projects.

@psvenk psvenk force-pushed the separate-schemes branch from 111ef31 to 08a42b2 Compare August 5, 2020 14:23
This has the benefits of separating data from code and of having a
language-agnostic specification of schemes (for implementations of
Sanscript in other programming languages).

This also adds a build step integrating the scheme data into the JS file
-- the file that now contains only the code has been moved to a
subdirectory and the build step (npm run dist) outputs to sanscript.js
(so this causes no breaking changes).
@psvenk psvenk force-pushed the separate-schemes branch from 08a42b2 to 49fa16b Compare August 5, 2020 14:25
@psvenk psvenk changed the title Move schemes into separate TOML file Move schemes into separate JSON files Aug 5, 2020
@psvenk
Copy link
Collaborator Author

psvenk commented Aug 5, 2020

I have switched to JSON, split schemes.json into one file per scheme, and squashed and rebased on master. As expected, two tests are failing due to #4 (non-Sanskrit letters from Devanagari to ITRANS, and Devanagari to Gurmukhi due to a change in tests.js splitting off the combining nuqta character from dotted Gurmukhi letters).

@vvasuki Thank you very much for your response.

If we're ditching toml because schema specification and validation is simpler elsewhere, might as well pick https://json5.org/ .

I did a little bit of looking around and it seems that JSON5 is quite JavaScript-specific and that implementations of JSON Schema in other languages generally do not support JSON5, so this would mean introducing a Node.js dependency to other implementations of Sanscript in order to convert the JSON5 to JSON. So, if I understand correctly, JSON5 would be simpler to use than TOML in the case of JavaScript (because ajv supports JSON5 natively) but more difficult in other languages (because JSON5 needs to be transcoded to JSON using Node.js before validation or the validation needs to happen in Node.js).

Action item for later:

  • We could separate the schema definitions out into a separate repository which can then ben included as a submodule here and in other language specific projects.

Sounds good!

@vvasuki
Copy link
Member

vvasuki commented Aug 5, 2020

Ok - sent you an invite to be a maintainer. Once ready and @rramphal takes a look, merge into master and let me know so that I can update npm.

@rramphal
Copy link
Collaborator

rramphal commented Aug 5, 2020

Solid work, @psvenk!

I'd like to suggest some improvements, but they are all pretty much outside the scope of this PR. I can get to them in a couple of hours. After @psvenk merges this into master, @vvasuki, if you can hold off on publishing to npm, I'd be glad to submit a PR with some ideas for improvements if that sounds good.

@rramphal
Copy link
Collaborator

rramphal commented Aug 5, 2020

Also, I went ahead and captured some of the ideas mentioned here in #15 and #16.

@psvenk
Copy link
Collaborator Author

psvenk commented Aug 5, 2020

I'd like to suggest some improvements, but they are all pretty much outside the scope of this PR. I can get to them in a couple of hours.

Great. I already have a schema done, so I can open a separate PR after merging this to gather feedback about the schema. The schema was written so that the current schemes pass validation, but some changes to the scheme format and the logic would be necessary to make the schema simpler (e.g. replace singleton arrays with simple strings). Additionally, some hard-coded rules (e.g. Brahmic vs. roman and the definition of the Kolkata scheme) should be moved out of the logic and made into parameters in the schema or moved into dist.js, but that too can be done in a separate PR.

@psvenk psvenk marked this pull request as ready for review August 5, 2020 22:05
@psvenk psvenk merged commit 04c968d into indic-transliteration:master Aug 5, 2020
@psvenk psvenk deleted the separate-schemes branch August 5, 2020 22:06
@rramphal
Copy link
Collaborator

rramphal commented Aug 5, 2020

@psvenk Awesome. I think getting a validator in to represent the current schema for now, and then following up with a PR later to move some of the logic over sounds like a good plan.

@psvenk
Copy link
Collaborator Author

psvenk commented Aug 5, 2020

@psvenk Awesome. I think getting a validator in to represent the current schema for now, and then following up with a PR later to move some of the logic over sounds like a good plan.

@rramphal In that case, I'll open that PR up for review and open a separate issue for moving the logic over.

vvasuki pushed a commit that referenced this pull request Jul 18, 2021
This has the benefits of separating data from code and of having a
language-agnostic specification of schemes (for implementations of
Sanscript in other programming languages).

This also adds a build step integrating the scheme data into the JS file
-- the file that now contains only the code has been moved to a
subdirectory and the build step (npm run dist) outputs to sanscript.js
(so this causes no breaking changes).
@rramphal rramphal mentioned this pull request Jul 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants