idna to_unicode() API has degraded in 1.0 #938
Comments
@hsivonen did some benchmarking in Firefox and the new IDNA crate should be slightly faster than the old one. |
Did the benchmarking involve doing a bunch of |
Do I understand correctly that you want an API entry point that takes a Is the issue that The remark about "most applications" comes from the observation that for protocol needs, apps generally need to run ToASCII (as a matter of observation, the common case for using this crate directly is to call the crate top-level The interface complexity of I'm curious: What's your use case for pure (without policy for dealing with confusables) ToUnicode? |
I work on InstantDomainSearch, which processes several GBs worth of input files (mostly CSV) to build up state (of which domains are free vs taken vs available on the aftermarket) and then processes user queries against this state. I mainly use idna to normalize domains using UTF-8 which seems like a sensibly compact canonical form. Whereas the |
This makes sense, but I don't think it's unreasonable for a "most applications" statement in the docs not to apply to your application. Most applications need a UTS 46 implementation for the purpose of resolving names (and most of those won't call into the Code that's about registering domains is quite far from "most applications". (And registration needs policy checks on the ToUnicode output that are different from the display policy integration point that the API caters to.)
Letting the parameters change from call to call isn't at all what the API design is about. On the contrary, the API design more obviously allows for constant propagation than an API where the parameters don't look like constants at the call site.
The closure signature not naming the parameters indeed isn't nice. In retrospect, at least the two Anyway, back to the core question: Is this about having API surface that takes |
Yes, when I benchmarked this a few years ago, IIRC avoiding one allocation per |
That answers the first question. Thanks. It doesn't answer the second one though. Anyway, to avoid API surface marked deprecated, it seems that this function needs to go somewhere: either in your app or in the
pub fn to_unicode(domain: &str, out: &mut String) -> Result<(), Errors> {
    match Uts46::new().process(
        domain.as_bytes(),
        FIRST_APP_SPECIFIC_CONSTANT,
        SECOND_APP_SPECIFIC_CONSTANT,
        THIRD_APP_SPECIFIC_CONSTANT,
        |_, _, _| true,
        out,
        None,
    ) {
        Ok(ProcessingSuccess::Passthrough) => {
            out.push_str(domain);
            Ok(())
        }
        Ok(ProcessingSuccess::WroteToSink) => Ok(()),
        Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
        Err(ProcessingError::SinkError) => unreachable!(),
    }
}
That the three constants are app-specific hints at the direction that the function should go into your app. Is there an obvious-enough combination for the constants that this could go into the crate without exposing three arguments instead of making the three values constant? Also, if this went into the crate, instead of |
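For illustration, here is a minimal std-only sketch of how a caller might drive a sink-style function like the one above while reusing one output String across many inputs. Everything in it is an assumption made for the example: `process_stub` is a hypothetical stand-in for the real UTS 46 processing (it only lowercases ASCII), the `Success` enum merely mirrors the passthrough / wrote-to-sink split, and `count_normalized_eq` is an invented helper. None of these names come from the idna crate.

```rust
/// Hypothetical mirror of the passthrough / wrote-to-sink distinction.
#[derive(Debug, PartialEq)]
enum Success {
    /// The input was already normalized; nothing was written to the sink.
    Passthrough,
    /// The normalized form was written to the sink.
    WroteToSink,
}

/// Stand-in for the real processing step (assumption: ASCII lowercasing only).
/// Writes to `out` only when the input actually needs to change.
fn process_stub(domain: &str, out: &mut String) -> Success {
    if !domain.bytes().any(|b| b.is_ascii_uppercase()) {
        return Success::Passthrough;
    }
    out.extend(domain.chars().map(|c| c.to_ascii_lowercase()));
    Success::WroteToSink
}

/// Normalizes `domain` into a caller-provided buffer.
/// `clear()` drops the old contents but keeps the allocated capacity.
fn normalize_into(domain: &str, out: &mut String) {
    out.clear();
    if process_stub(domain, out) == Success::Passthrough {
        out.push_str(domain);
    }
}

/// Counts inputs whose normalized form equals `target`,
/// using a single buffer for the whole batch.
fn count_normalized_eq(domains: &[&str], target: &str) -> usize {
    let mut buf = String::new();
    let mut hits = 0;
    for domain in domains {
        normalize_into(domain, &mut buf);
        if buf == target {
            hits += 1;
        }
    }
    hits
}
```

With this shape the buffer only reallocates when an input is longer than any seen before, which is the kind of per-conversion saving discussed in this thread.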
It feels to me like a failure of the current crate API that it doesn't propose reasonable defaults for all these constants -- another reason that IMO the previous |
Indeed, However, since The current design assumes that the progression is that if the main entry point without parameters isn't suitable, the API user would configure the ASCII deny list, the hyphen policy, or the confusable policy before configuring the allocation policy. Since you come from the angle of customizing allocation but only a bit (reusing the buffer of |
I work on a domain search engine that deals with many domains. As part of performance efforts to optimize this path, I had previously carefully optimized Idna::to_unicode() to avoid allocations where possible (for example, in #653). However, the 1.0 release (apart from bringing in 25 transitive new dependencies which IMO is not great by itself for such a low-level crate) proposes I use domain_to_unicode() (which is a little simpler but definitely doesn't enable me to avoid per-conversion allocations), but then says:
In turn, Uts46::to_unicode() documents itself as:
Meanwhile the interface for to_user_interface() is:
Which IMO is pretty unreasonable to suggest as an interface "most applications" should use. If there is a need for a more complex API it seems clear that the previous approach of building a Config with a builder pattern and building an instance with an internal buffer that could be reused allowed for more idiomatic and more performant operations.
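As a rough, std-only sketch of the general pattern described here (a builder-style config that produces a converter owning a reusable internal buffer), the following may help. It is not the old idna API: `Config`, `Converter`, and both options are hypothetical names invented for the example, and the "conversion" is just ASCII lowercasing plus trailing-dot stripping.

```rust
/// Hypothetical configuration, set up with chained builder methods.
#[derive(Clone, Copy, Default)]
struct Config {
    lowercase: bool,
    strip_trailing_dot: bool,
}

impl Config {
    fn lowercase(mut self, value: bool) -> Self {
        self.lowercase = value;
        self
    }

    fn strip_trailing_dot(mut self, value: bool) -> Self {
        self.strip_trailing_dot = value;
        self
    }

    /// Builds a converter that owns its own output buffer.
    fn build(self) -> Converter {
        Converter { config: self, buf: String::new() }
    }
}

/// Holds the fixed configuration plus a buffer reused across calls.
struct Converter {
    config: Config,
    buf: String,
}

impl Converter {
    /// Converts `domain` into the internal buffer and returns a view of it.
    /// The returned slice is only valid until the next call.
    fn convert(&mut self, domain: &str) -> &str {
        self.buf.clear(); // keeps capacity from earlier conversions
        let input = if self.config.strip_trailing_dot {
            domain.strip_suffix('.').unwrap_or(domain)
        } else {
            domain
        };
        if self.config.lowercase {
            self.buf.extend(input.chars().map(|c| c.to_ascii_lowercase()));
        } else {
            self.buf.push_str(input);
        }
        &self.buf
    }
}
```

Because the options are fixed at build time and the result borrows from the converter, the hot loop does not allocate per conversion once the buffer has grown to fit the longest input.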