Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should TimeZoneIdMapper support efficient storage and retrieval of non-canonical IANA IDs? #5610

Open
sffc opened this issue Sep 27, 2024 · 4 comments
Labels
C-time-zone Component: Time Zones needs-approval One or more stakeholders need to approve proposal U-temporal User: Temporal

Comments

@sffc
Copy link
Member

sffc commented Sep 27, 2024

See #5533 (comment)

Currently, TimeZoneIdMapper has two data payloads:

  1. Trie: IANA string to BCP-47 ID (does NOT support random access: only key-to-value lookup)
  2. Map: BCP-47 ID to canonical IANA string (random access supported)

These data structures allow efficient implementation of the following operations:

  • IANA to BCP-47
  • BCP-47 to IANA
  • Normalize IANA
  • Canonicalize IANA

However, it does not support efficient storage or retrieval of a non-canonical IANA name. For example, it supports mapping from "europe/kiev" to any of the following:

  • "Europe/Kiev" returned as an owned String
  • "uaiev" returned as a TinyStr
  • "Europe/Kyiv" returned as &str

But we don't have a representation that allows the user to return "Europe/Kiev" later: only "Europe/Kyiv". The client would need to store the string on their own.

A few ways to enable this behavior with data:

  1. Change the map from BCP-47 ID to IANA string to also include non-canonical aliases, and return an opaque integer of some sort to select the correct alias
  2. Switch from using a Trie to using a Map so that random access is supported, and then let the client save the index of their non-canonical time zone in the payload
  3. Add a third payload where this is its only job

One concern I have is that any of these approaches involves returning an index of some sort into a structure. However, that structure is ephemeral and could change across ICU4X versions, so I don't want people storing these indices. We currently always return BCP-47 IDs because those have a stability policy.

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.

Thoughts @justingrant?

@sffc sffc added needs-approval One or more stakeholders need to approve proposal C-time-zone Component: Time Zones U-temporal User: Temporal labels Sep 27, 2024
@justingrant
Copy link

Thanks @sffc for starting this conversation. I don't yet have an opinion about the best solution here, but I've got some initial questions/comments below.

The main guidance I'd have (based on my experience with time zone canonicalization in ECMAScript) is that a time zone API should enable an efficient way to answer the following questions/usecases:

  1. Is this string a valid IANA ID?
  2. Do these two IANA IDs represent the same time zone, meaning they are aliases of each other in IANA? Note that this comparison should happen via a library method to dissuade users from using a naive string comparison, and because the data underlying this comparison will change across ICU4X versions.
  3. What is the normalized IANA ID for this IANA ID? (IANA strings should always be normalized before use, because they're case-insensitive)
  4. What is the list of time zones that I should use to build a UI picker? The list must include only one entry for each canonical IANA ID.
  5. How should I efficiently store a large number of time zones in memory, assuming that canonicalization *is* OK? Canonicalization is OK when the zone never needs to be converted back into an IANA ID.
  6. How should I efficiently store a large number of time zones in memory when canonicalization is *not* OK, like when a user's choice of IANA ID needs to be persisted and we want to respect the user's initial choice, despite future changes to IANA like Kiev=>Kyiv.

Does this list sound correct?

Does that list of use cases influence the proposed solutions in your OP?

Other than the last one in the list, are there others that current ICU4X doesn't meet?

that structure is ephemeral and could change across ICU4X versions

Well, we could choose to make the indexes stable over time if we wanted to (with newly-added zones being given new, larger indexes than existing zones), at the cost of being unable to use the indexes to provide a lexicographic sort order.

In other words from the caller's perspective they'd be IDs not indexes and would need to be documented as such. I'm not suggesting that we should make them stable, but we could.

If we did do this, then it'd be smart to randomize the order to avoid clients fooling themselves into thinking that the indexes were sorted.

If we choose to make an unstable ID visible to callers, then maybe naming it something like unstableId would be smart.

  • "uaiev" returned as a TinyStr

I assume you mean TinyAsciiStr?

An alternative solution might be to intern the strings, which could be done externally to ICU4X. If you need to save the non-canonical ID, just stuff it into your own static string interning structure that deduplicates. The total size of the interned string collection is bounded. An example utility for this is elsa::FrozenIndexSet.

One variant on this could be for ICU4X to provide some method that dynamically creates a data structure that callers will want, instead of requiring every caller to figure this out on their own.

@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

1-5 are all supported with current data, although we don't have an API yet for 4. But we already have a list of all canonical BCP-47 IDs in the data.

I'm definitely not in favor of returning any stable integer, because that just invites people to store them, and then we've created a new taxonomy. I think if we decided to support this, returning a string reference is about at far as I'd want to go.

I'm open to the idea of us using a string interning data structure internally.

@justingrant
Copy link

I'm definitely not in favor of returning any stable integer

I don't know enough Rust to know if this idea is practical, but could the memory savings be achieved without letting callers access the actual index value by wrapping the index as a private field in a public struct?

For example, a NamedTimeZoneId struct with an index: u16 private field, and with some public methods in the impl like these?

  • from_IANA factory method that accepts a string. Could also have from_BCP47.
  • to_IANA instance method that returns the IANA name as an (interned?) string. Could also have to_BCP47.
  • get_all_IANA static method to return the full list of normalized IANA IDs. Could also have get_all_BCP47.
  • Maybe an instance method that accepts a NamedTimeZoneId and returns true if both the argument and Self resolve to the same canonical ID, false otherwise. Could also be a static method.
  • For serialization/deserialization, use the IANA string.

Apologies if my Rust inexperience is showing here, if this is not practical.

@sffc
Copy link
Member Author

sffc commented Oct 3, 2024

Yeah, we could return a wrapped thing, or a string reference, if we have the data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-time-zone Component: Time Zones needs-approval One or more stakeholders need to approve proposal U-temporal User: Temporal
Projects
None yet
Development

No branches or pull requests

2 participants