Investigate implementing an ECMA 402-based internationalization library #858
Comments
I actually attempted to start work on this as part of my learning Rust. I wasn't successful at that time but would like to get back to this idea soonish. |
@steveklabnik @alexcrichton any thoughts on why this ECMAScript proposal is the preferred model for an i18n API in Rust instead of, for example, ICU, Java, or C#? The idea of implementing an ECMAScript API in Rust never occurred to me. |
I'd like to take a stab at this by converting the locale crate (it is basically useless at the moment, and that is the most suitable name). @listochkin, do you have the remains of your attempt anywhere for inspiration? @mrhota, the adding of However, there will have to be some differences. I have not read the ECMA-402 standard in much detail, but:
I am also not sure about the And with an |
@jan-hudec I'll help with this effort. I'm excited someone else has an interest in this! I gather that the desire here is to have a library inspired by this spec, since a Rust "implementation" would be ... strange and unidiomatic. |
Thank you, @mrhota. The first thing will be to come up with a reasonable design and name things. My key requirement is that formatting for new types can be easily defined. So we will have a trait, similar to Rust already uses separate types for times, so time formatting will be triggered by using them and I want a similar approach to money, i.e. instead of passing a number and format string indicating monetary format you will pass something like I also often work with other kinds of dimensional quantities. CLDR contains unit names and abbreviations for various locales, but I haven't seen it integrated in any localization library. And I don't want to integrate it either. I just want to make it easy for an application that wants it to extract the necessary bits from CLDR, store them in its translation catalogue or somewhere, and define appropriate formatting for its dimensional quantity type, so it can format |
ECMA-402 is a good place to start, then. It's suitably abstract for a
starting point, and I think we can probably arrive at a reasonable
first-pass design for i18n components (i.e. locale objects) by "rustifying"
(oxidizing?) the basic string stuff from the spec.
By the way, someone else "owns" the i18n crate name on crates.io. I was
thinking either get that person involved so we can use the name, or go with
"Intl", like the spec.
|
Look at the referencing issue (rust-locale/rust-locale#7). I've put some outline (without much names yet) already. Regarding i18n, I think |
The more I think about this, the less I see ECMA-402 as particularly reasonable inspiration for this. I have done internationalization support in work projects, have a pretty good idea of features it should have and I don't see them in the JavaScript version. The C++ |
I saw your post on the user forums about forward compatible locale design. I think github is probably the better place to continue that topic and to combine it with the above discussion. What do you think? |
Ah, never mind. For future visitors to this issue, it looks like we'll just continue discussions over at rust-locale/rust-locale#7, like @jan-hudec already started doing. 😄 |
Have there been any efforts toward this end lately? |
@alexreg, I am slowly making some progress on https://github.com/rust-locale/rust-locale/tree/next, but there is still a lot of work left, including the main ECMA-402 bit, trait adding |
@jan-hudec Oh, cool. It looks like you're using the C++ stdlib approach of facets + locales then, eh? Let me know if I can help with development! |
(Silly me thought that project was dead just because the master branch was stale, by the way...) |
@alexreg, well, it's a mix of the C++, Java, and JavaScript approaches, really. I use the term facet from C++ (which I am most familiar with). But instead of storing them in the locale object as C++ does, I have them in a static map keyed on the Locale and construct them on demand with a factory, which is closest to how Java works. The advantage is that it can be more easily extended by overriding some of the factories (which is not yet implemented, but shouldn't be hard) and by creating new types of facets, because I have an idea that there will be domain-specific extensions (I can think of units and transliterations as possible ones, but there are probably some more).
It's not dead, but it's been progressing really slowly. I first started and got stuck halfway through the exponential numbers. Then it slept for some time and then got a redesign that introduced the inverted objects and RFC5646 language tags, which then got reduced to the more appropriate RFC4647 ones. But then I got stuck on the CLDR data and also still had some glitches in the numbers. So around New Year I at least published the helper crate locale_config for obtaining the system configuration, and a couple of weeks ago I finally managed to finish the numbers. To make a release, I would like to add date and time and add a facet (Localised) with the ECMA-402-like methods for formatting via the appropriate facet, so the functions currently used in https://github.com/ogham/exa can be replaced with the appropriate new version. I would definitely welcome some help; we can discuss what in an issue on the https://github.com/rust-locale/rust-locale/ project. |
@jan-hudec Okay, sounds fair. I already created an issue on there, if you can see it... we can also speak on IRC, if you're on there. |
Incidentally, I'm not a big fan of this static map/factory design. It sounds like over-engineering. But maybe we could discuss this. |
What I want here is to have a base And the static map/factory design makes it trivial to add new facets and quite simple to replace existing ones with a more advanced version (the latter needs explicit initialisation for now). |
Oh right, I see your motivation better now. Fair enough then. Just to clarify: when a Locale wants to create its facets, it will look to the factory, and get it to generate the facets. This may just be the default facets (for numbers, currency, date-time, etc.), or they may be "advanced" versions of the facets provided by alternative crates. These alternative crates would override the factory, perhaps by specialisation? |
@alexreg, yes. Overriding factories for existing facets will have to be done by registering function pointers, because there are no new types involved (well, there are the concrete types, but most of the code does not know about them). For new types of facets, providing suitable impl is enough. |
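Editor's sketch of the static map/factory design discussed above, under stated assumptions: the names `Facet`, `facet`, `override_factory` and `DecimalSeparator` are hypothetical and are not taken from rust-locale, locales are plain strings for brevity, and the separator data is a toy stand-in for CLDR. It only illustrates the two extension points mentioned: a new facet type needs just an impl, while an existing facet is replaced by registering a function pointer.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical facet trait: implementing it is all a new facet type needs.
pub trait Facet: Any + Send + Sync + Sized {
    /// Default factory: build the facet for the given locale tag.
    fn create(locale: &str) -> Self;
}

type Shared = Arc<dyn Any + Send + Sync>;
type Factory = Box<dyn Fn(&str) -> Shared + Send>;

#[derive(Default)]
struct Registry {
    /// Facets built so far, keyed on (locale tag, facet type).
    cache: HashMap<(String, TypeId), Shared>,
    /// Factories registered to replace a facet type's default one.
    overrides: HashMap<TypeId, Factory>,
}

fn registry() -> &'static Mutex<Registry> {
    static REGISTRY: OnceLock<Mutex<Registry>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(Registry::default()))
}

/// Replace the factory for facet type `F` by registering a function pointer.
/// Cached facets of that type are dropped so later lookups see the override.
pub fn override_factory<F: Facet>(factory: fn(&str) -> F) {
    let id = TypeId::of::<F>();
    let mut reg = registry().lock().unwrap();
    reg.overrides
        .insert(id, Box::new(move |l: &str| -> Shared { Arc::new(factory(l)) }));
    reg.cache.retain(|key, _| key.1 != id);
}

/// Fetch the facet of type `F` for `locale`, constructing it on demand.
/// Note: the lock is held while the factory runs, so a factory must not
/// call `facet` itself in this simplified version.
pub fn facet<F: Facet>(locale: &str) -> Arc<F> {
    let key = (locale.to_owned(), TypeId::of::<F>());
    let mut reg = registry().lock().unwrap();
    if let Some(hit) = reg.cache.get(&key) {
        return hit.clone().downcast::<F>().unwrap_or_else(|_| unreachable!());
    }
    let made: Shared = match reg.overrides.get(&key.1) {
        Some(factory) => factory(locale),
        None => Arc::new(F::create(locale)),
    };
    reg.cache.insert(key, made.clone());
    made.downcast::<F>().unwrap_or_else(|_| unreachable!())
}

/// Example facet: the decimal separator of a locale (toy data, not CLDR).
pub struct DecimalSeparator(pub char);

impl Facet for DecimalSeparator {
    fn create(locale: &str) -> Self {
        match locale.split('-').next() {
            Some("cs") | Some("de") => DecimalSeparator(','),
            _ => DecimalSeparator('.'),
        }
    }
}
```

With this shape, `facet::<DecimalSeparator>("de-DE")` builds and caches the default facet, and a crate shipping better data could call `override_factory::<DecimalSeparator>(...)` once at start-up, which matches the "explicit initialisation" caveat mentioned earlier in the thread.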
Sounds fair enough. I'm sure you've thought this out anyway, and will continue to tweak it where appropriate, to make the developer interface as simple as possible. :) |
Thanks @kud1ing. Looks interesting. Does the library support numeric, datetime, everything? |
Hi all! We just released https://crates.io/crates/intl_pluralrules which brings one of the foundations for any decent intl/l10n API. You can read about it here: https://blog.mozilla.org/l10n/2018/08/03/intl_pluralrules-a-rust-crate-for-handling-plural-forms-with-cldr-plural-rules/ As for l10n - I just released fluent-rs 0.3, which gets closer to being a complete l10n solution. It's still raw and doesn't have any syntactic sugar like macros, but it gets the job done the right way. As for intl - with plural rules ready, we can start talking about CLDR-based date/time formatting, number formatting and units. Maybe collation? |
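Editor's sketch of what "plural rules" means here, for readers new to the topic. This is not the intl_pluralrules API: `PluralCategory` and `cardinal` are hypothetical names, and only the English and Polish cardinal rules for non-negative integers are transcribed by hand, whereas a real crate would generate every language from CLDR data.

```rust
/// The six CLDR plural categories; most languages use only a subset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PluralCategory {
    Zero,
    One,
    Two,
    Few,
    Many,
    Other,
}

/// Cardinal plural category of a non-negative integer (hypothetical helper).
/// Unknown languages fall back to `Other`.
pub fn cardinal(lang: &str, n: u64) -> PluralCategory {
    use PluralCategory::*;
    match lang {
        // English: "1 file" versus "0 files", "2 files", ...
        "en" => {
            if n == 1 {
                One
            } else {
                Other
            }
        }
        // Polish: 1 plik, 2 pliki, 5 plików, 12 plików, 22 pliki.
        "pl" => {
            let (last_digit, last_two) = (n % 10, n % 100);
            if n == 1 {
                One
            } else if (2..=4).contains(&last_digit) && !(12..=14).contains(&last_two) {
                Few
            } else {
                // For integers every remaining case is `Many`; Polish uses
                // `Other` only for fractional numbers, not handled here.
                Many
            }
        }
        _ => Other,
    }
}
```

A localization layer such as Fluent then uses the returned category to pick the message variant, so translators never write the arithmetic themselves.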
It's ready to support it (which is non-trivial - gettext for example will never be on the syntax level). Once Rust gains any intl date/time/number formatting library, we can plug it into Fluent. |
@zbraniecki Sounds good. Does your project have an impedance mismatch of some sort with @jan-hudec's library, or would you consider using it? I believe his might be further along on things like numerics and datetimes. |
Not sure yet. I think the ultimate goal is aligned - get ECMA402-like basic internationalization features:
and with time advanced ones:
But based on my experience working on CLDR, ECMA402 and Rust, I'm more prone to follow the
I'll investigate upstreaming the crate to make it part of As for |
Also, in ECMA402 we're currently standardizing core I'm interested in upstreaming at least the core manipulation portion of it to unic since it is part of Unicode. [0] https://github.com/tc39/proposal-intl-locale |
I submitted a request to update arewewebyet - rust-lang/arewewebyet#120 |
@zbraniecki, I basically got into analysis paralysis and got nowhere with the locale crate, exactly because putting it all in one crate is ultimately the wrong way: there are too many options, and most programs won't need much of that anyway. The plan with locale changed from using the system API to doing it myself (except that is too much work) and eventually to binding ICU. I started on that, but didn't get around to compiling ICU when there is no system one available yet (their system for bundling the data is Insane™ and cross-compilation-unfriendly). There is one thing in it that I think is in a useful state, locale_config, which reads the user locale from system configuration (supporting POSIX, Windows (old API) and CGI). It returns basically a BCP47 tag, but extended with category locales to be able to represent the POSIX locale categories and the separate UI language setting. There are also already two libraries for manipulating BCP47 tags: language-tag and language-tags; it would perhaps be nice to avoid duplication here. Though I see you did get further; I don't think either of them actually does matching. |
@zbraniecki Sounds like a fair approach. I'm no expert on internationalisation presently, but if you want to get things moving in Rust, I could potentially contribute to this effort (especially *intl_number
Hahah, every software engineer has their own skeleton closet, right? :)
Yeah, for that reason, both in ECMA402 and in Rust, I'm leaning toward centralizing around CLDR which is a very well maintained database.
That's cool! I'm wondering if it should be two smaller crates then: On the other hand,
Yeah, I'm not sure what's the difference between those two. I evaluated the I would hope that with some sort of
Internationalized list formatting, think of cases like
Sure! I'm mainly focused now on finalizing Fluent 1.0 and its rust port, so not much time for intl, but I think the next step can be one of:
I would recommend to wait with I'm unlikely to kick off anything until Fluent 1.0, but I'll be happy to advise and contribute code :) |
@zbraniecki, @alexreg, actually I now remember why I didn't select a set of separate crates for each facet (numbers, datetime etc.): to make the formatting extensible, there should be one trait for formatting things, defining the Of course if the |
Well, ICU is CLDR-based and already implements most of the things, so binding it should be a faster route to something working. And while their data bundling is pretty weird, I actually think when statically linking with Rust, we can simply I have three reasons why I see it more as a temporary solution than a final one:
It is more than BCP47 unicode extensions and variants. In POSIX, you can set e.g. It might make sense to extract the little bit that reads the POSIX environment variables, but it's just that—a bunch of environment variables. And they are not really well defined, so you either convert them to BCP47, or you let |
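Editor's sketch of the "convert them to BCP47" option mentioned above. It is not how locale_config works: `posix_locale_for` and `posix_to_bcp47` are hypothetical names, the environment is passed in as a lookup function so the logic is testable, and the conversion is deliberately lossy (codeset and `@modifier` are dropped, and `C`/`POSIX` map to nothing). The precedence used is the standard POSIX one: `LC_ALL`, then the category variable, then `LANG`, with empty values treated as unset.

```rust
/// Pick the POSIX locale name that applies to one category (e.g. "LC_TIME").
/// `lookup` stands in for reading an environment variable.
pub fn posix_locale_for<'a>(
    category: &str,
    lookup: impl Fn(&str) -> Option<&'a str>,
) -> Option<&'a str> {
    for name in ["LC_ALL", category, "LANG"].iter() {
        if let Some(value) = lookup(name) {
            // An empty value counts as "not set".
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}

/// Convert a POSIX locale name such as `cs_CZ.UTF-8@euro` to a BCP 47-style
/// tag (`cs-CZ`). Returns `None` for the `C`/`POSIX` locale or empty input.
pub fn posix_to_bcp47(posix: &str) -> Option<String> {
    // Strip "@modifier" first, then ".codeset".
    let base = posix.split('@').next()?;
    let base = base.split('.').next()?;
    if base.is_empty() || base == "C" || base == "POSIX" {
        return None;
    }
    let mut parts = base.splitn(2, '_');
    let language = parts.next()?.to_ascii_lowercase();
    match parts.next() {
        Some(region) => Some(format!("{}-{}", language, region.to_ascii_uppercase())),
        None => Some(language),
    }
}
```

A fuller version would map modifiers such as `@latin` to a script subtag and keep one tag per category, which is the part that goes beyond a single BCP47 tag.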
Sure, if you're good with "something working" :) My point is that ICU is a massive codebase with quite significant technical debt. I'd argue that CLDR has very little of it. For that reason, I'd prefer to invest resources (hah, I mean, suggest for us to focus on) in implementing similar APIs in Rust, rather than wrapping C. The added benefit is that slicing CLDR is relatively easy and we can make our crates handle data selection both by locale selection and table selection, much easier. |
@jan-hudec Sounds fair to me with regards to keeping it in one crate. Not sure what @zbraniecki thinks though... |
Fair enough. Let me know when you're done with that though.
As a new crate you mean? Something like
Can you elaborate on this maybe?
Yep, this is the first big task, once the smaller things are out of the way! |
I was wondering, if the plans you are making will allow for compiling to WASM and using in the context of a single page web application. This requires a) a pure Rust implementation, that does not wrap a pre-compiled library (say C or C++) b) a manageable download size of the internationalization libraries and c) no reliance on the std library. |
intl_pluralrules 1.0 has been released - https://crates.io/crates/intl_pluralrules
It would be great to have a low-level BCP47 language tag manipulation library that can parse, operate on, and serialize language tags as locale objects. ECMA402 Intl.Locale is just that, so it's likely a good start. |
@zbraniecki Okay, I'd be happy to work on this, if you think it's easier than one of the above tasks you suggested? Maybe get on Discord/IRC? |
I think it is definitely easier than date time formatting, and lays foundation for it. You could start by taking the I'm not sure how to fit it into unic crate system, but I hope @behnam can help with that. |
Sorry for the delays on UNIC-related work. I'm trying to fit the work in my schedule. I'm starting by putting the core data into standalone repos, to make it easier to define a maintainable Locale implementation in UNIC. I'll give more updates on that as soon as I can. Regarding an ECMA-402-based intl library, I think it would be a great thing to have, especially to be able to backport for older JS environments. But what we're trying to do in UNIC (and other implementations that are not open-source yet) is to create a more modern API. Again, more on that soon. |
There are two or three on crates.io already, in various states of completeness. |
I only saw crates handling accepted-headers and nothing related to manipulation and serialization of language tag objects. In other words, if you mean We're looking for something similar to ECMA402 Example of usage:
let mut loc1 = Locale::from_string("en-us-u-hc-h24");
assert_eq!(loc1.get_region(), "US");
assert_eq!(loc1.get_extension_value("unicode", "hourCycle"), "h24");
loc1.set_script("latn");
loc1.set_extension_value("unicode", "hourCycle", "h11");
loc1.set_extension_value("unicode", "calendar", "buddhist");
assert_eq!(loc1.to_string(), "en-Latn-US-u-ca-buddhist-hc-h11");
assert_eq!(loc1.matches_language("en-GB"), true);
assert_eq!(loc1.matches_locale("en-US-u-ca-buddhist-hc-12"), true);
and so on. |
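Editor's sketch of the parse/serialize core such an API would need, under stated assumptions. The `Locale` type here is hypothetical and much smaller than the wished-for API: it uses the standard `FromStr` and `Display` traits instead of `from_string`, handles only language, script and region with canonical casing, and keeps any remaining subtags (variants, `-u-` extensions) lowercased and in their original order, with no keyword get/set, sorting, or matching.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Locale {
    language: String,
    script: Option<String>,
    region: Option<String>,
    /// Variants and extensions, kept verbatim (lowercased) in this sketch.
    rest: Vec<String>,
}

fn is_alpha(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphabetic())
}

fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// "latn" -> "Latn". Callers must pass a non-empty ASCII string.
fn title_case(s: &str) -> String {
    let lower = s.to_ascii_lowercase();
    let (head, tail) = lower.split_at(1);
    format!("{}{}", head.to_ascii_uppercase(), tail)
}

impl FromStr for Locale {
    type Err = String;

    fn from_str(tag: &str) -> Result<Self, Self::Err> {
        // Accept both '-' and '_' as separators.
        let mut subtags = tag.split(|c: char| c == '-' || c == '_').peekable();
        let language = match subtags.next() {
            Some(s) if is_alpha(s) && (2..=3).contains(&s.len()) => s.to_ascii_lowercase(),
            _ => return Err(format!("bad language subtag in {:?}", tag)),
        };
        // Script: four letters. Region: two letters or three digits.
        let script = subtags
            .next_if(|s| s.len() == 4 && is_alpha(s))
            .map(title_case);
        let region = subtags
            .next_if(|s| (s.len() == 2 && is_alpha(s)) || (s.len() == 3 && is_digits(s)))
            .map(|s| s.to_ascii_uppercase());
        let rest: Vec<String> = subtags.map(|s| s.to_ascii_lowercase()).collect();
        let valid = |s: &String| {
            (1..=8).contains(&s.len()) && s.bytes().all(|b| b.is_ascii_alphanumeric())
        };
        if !rest.iter().all(valid) {
            return Err(format!("bad subtag in {:?}", tag));
        }
        Ok(Locale { language, script, region, rest })
    }
}

impl fmt::Display for Locale {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.language)?;
        let optional = self.script.iter().chain(self.region.iter());
        for part in optional.chain(self.rest.iter()) {
            write!(f, "-{}", part)?;
        }
        Ok(())
    }
}

impl Locale {
    pub fn region(&self) -> Option<&str> {
        self.region.as_deref()
    }

    pub fn set_script(&mut self, script: &str) -> Result<(), String> {
        if script.len() == 4 && is_alpha(script) {
            self.script = Some(title_case(script));
            Ok(())
        } else {
            Err(format!("bad script subtag {:?}", script))
        }
    }
}
```

Parsing `"en-us-u-hc-h24"` with this sketch gives a region of `US`, and after `set_script("latn")` it serializes with the script inserted before the region. Typed access to extension keywords and the `matches_*` methods from the wish list above are left out.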
@zbraniecki, no, I mean primarily |
Oh, yeah, this one looks similar. So yep, some merge of this and |
So, what we need to do now is the following, regarding a
Correct me if I'm wrong on any of the above. I have a bit of time to work on this now, but I do want to make sure I'm not going to waste effort. |
This list looks good! For the Ecma402 Happy to provide feedback and/or review, but you may want to also check with @behnam - he seems to be interested in this effort as well and has some experience from his recent work in Python. |
Okay, sure. Let's see what @behnam has to say, then I can get going hopefully. :-) |
Okay, some updates are coming soon to UNIC, but it will take longer to build the Having the Territory model, then the next focus will be the The general idea is to not use &str/String for representing i18n data. The top reason is to not hide all the problems under the rug. For example, if a country code goes out of use, the code having conditions against that country code should get compile errors. IMHO, this is necessary for getting a solid i18n library that doesn't require playing whack a mole with bugs down the road. As we have mentioned over the past year or so, UNIC is still experimenting with how to improve i18n architecture. For any application that doesn't want to get into this experiment, I would recommend using ICU-based solutions, which would work with all of these as strings. What do you think? |
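Editor's sketch of the "no &str/String for i18n data" idea, under stated assumptions: this is not UNIC's Territory model, and the `Territory` enum, `UnknownTerritory` and `default_currency` are hypothetical names covering a tiny hand-picked subset of codes. It shows the compile-time property being argued for: matches have no wildcard arm, so removing or adding a variant forces every dependent condition to be revisited.

```rust
use std::str::FromStr;

/// A tiny, hypothetical subset of territory codes as a closed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Territory {
    CZ,
    DE,
    US,
}

/// Error for codes outside the known set.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UnknownTerritory(pub String);

impl Territory {
    /// The two-letter code, for serialization at the edges of the program.
    pub fn code(self) -> &'static str {
        match self {
            Territory::CZ => "CZ",
            Territory::DE => "DE",
            Territory::US => "US",
        }
    }
}

impl FromStr for Territory {
    type Err = UnknownTerritory;

    /// Strings are accepted only here, at the boundary, and validated once.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_uppercase().as_str() {
            "CZ" => Ok(Territory::CZ),
            "DE" => Ok(Territory::DE),
            "US" => Ok(Territory::US),
            _ => Err(UnknownTerritory(s.to_owned())),
        }
    }
}

/// Application logic keyed on the enum. There is deliberately no `_` arm:
/// if a variant is removed, the arm naming it stops compiling, and if one
/// is added, the match becomes non-exhaustive until it is handled.
pub fn default_currency(territory: Territory) -> &'static str {
    match territory {
        Territory::CZ => "CZK",
        Territory::DE => "EUR",
        Territory::US => "USD",
    }
}
```

The trade-off is that the set of codes is fixed at compile time, so a data update means a new crate release, which is part of what makes this an experiment compared with string-based ICU bindings.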
Just to update the i18n discussion on Rust: a new library is being developed at https://github.com/unicode-org/icu4x. It's based on the Ecma402 API and implemented in Rust. We are currently at the early stages of development. |
An update. ICU4X is now in a stable version 1.1 - https://github.com/unicode-org/icu4x - ready for adoption, and it supplies basic ECMA-402 functionality. |
Issue by alexcrichton
Wednesday May 28, 2014 at 18:11 GMT
For earlier discussion, see rust-lang/rust#14494
This issue was labelled with: A-libs, P-high in the Rust repository
We have been told that internationalization is pretty standardized at this point on the web and many languages are starting to follow along. The authoritative spec for this is located here, and it sounds like we shouldn't deviate from that spec much (as it's what everyone is expecting).
Nominating, but I do not believe this is a 1.0 issue.