Update link to Sacremoses project repo to point to authoritative source, not a fork #2276

Open
mbrukman opened this issue Nov 24, 2024 · 1 comment

Comments

@mbrukman

📚 Documentation

Summary

This repo's README.md points to the https://github.com/alvations/sacremoses repo for the Sacremoses project:

Alternatively, you might want to use the `Moses <http://www.statmt.org/moses/>`_ tokenizer port in `SacreMoses <https://github.com/alvations/sacremoses>`_ (split from `NLTK <http://nltk.org/>`_). You have to install SacreMoses::

Similarly, in torchtext/data/utils.py:

"See the docs at https://github.com/alvations/sacremoses "

However, the authoritative home of this project appears to be https://github.com/hplt-project/sacremoses, so the repo links should be updated accordingly.
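As a rough illustration of the proposed change, the replacement could be sketched as below. This is only a sketch: the helper name `update_sacremoses_links` is hypothetical and not part of this repo, and it assumes a plain string substitution is sufficient because both repos use the same path layout.

```python
# Old and new repo locations, both taken from the issue text above.
OLD_URL = "https://github.com/alvations/sacremoses"
NEW_URL = "https://github.com/hplt-project/sacremoses"


def update_sacremoses_links(text: str) -> str:
    """Hypothetical helper: replace every link to the old repo with the new one."""
    return text.replace(OLD_URL, NEW_URL)


if __name__ == "__main__":
    # The two snippets quoted in this issue (README.md and torchtext/data/utils.py).
    readme_line = "`SacreMoses <https://github.com/alvations/sacremoses>`_"
    utils_line = "See the docs at https://github.com/alvations/sacremoses "
    print(update_sacremoses_links(readme_line))
    print(update_sacremoses_links(utils_line))
```

Text that does not contain the old URL is returned unchanged, so the helper could be run over any file without side effects.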

cc: @alvations (author of the above repo) to confirm or correct if this is a misunderstanding on my part (apologies in advance if that's the case).

Rationale and background research

https://github.com/alvations/sacremoses may have been the correct repo when Sacremoses was first extracted from the NLTK project (see issue #306 and PR #361). Today, however, it is a fork of https://github.com/hplt-project/sacremoses, and it appears to be simply behind that authoritative project by a number of commits, with no unique commits of its own:

This branch is 43 commits behind hplt-project/sacremoses:master.

We can also see that https://pypi.org/project/sacremoses/ has the "homepage" link pointing to https://github.com/hplt-project/sacremoses, further supporting that this is the authoritative source of the project.

@alvations

Thanks for the PR! Pointing to HPLT is correct.

P.S. Although sacremoses and some NLTK tokenizers are written in the same style, especially the Penn Treebank tokenizer part, sacremoses wasn't extracted from NLTK; there are only so many ways to write regexes in Python 😄
