Explore data augmentation for NER robustness #336

Open
vskmd opened this issue Mar 13, 2021 · 1 comment

@vskmd
vskmd commented Mar 13, 2021

Hi, I am working on a COVID-19 antiviral and was spot-checking antivirals in scispacy. I was surprised that remdesivir is not tagged as a chemical in any of the 1,338 PubMed abstracts containing it. I'm using en_ner_bc5cdr_md to extract CHEMICAL and DISEASE entities; spacy: '3.0.4', scispacy: '0.4.0'.

As you see below, remdesivir is not tagged as a CHEMICAL when I run en_ner_bc5cdr_md in Jupyter Lab.

image

However, when I put the same text into your demo, I was surprised that remdesivir is found.

image

Questions

  • Is the version running on the demo the same one that I used in my notebook (spacy: '3.0.4', scispacy: '0.4.0')?
  • Is remdesivir perhaps missed because it wasn't present in earlier training sets?
  • Can we expect new chemicals to be recognized (e.g., ones published for the first time)?
  • It's especially surprising that remdesivir wasn't detected as a CHEMICAL even in the following line from the text used in my example, where it's explicitly called a 'drug':

Though the drug remdesivir (RDV) is not approved by the FDA, still the "Emergency Use Authorization" (EUA) for compassionate use in severe cases is endorsed.

  • In the demo, remdesivir is detected only once even though it is mentioned several times in that passage. Is that expected?
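The last point (a term tagged in some mentions but not others) can be quantified with a small helper. The following is only a sketch: `load_bc5cdr` and `mention_coverage` are hypothetical names introduced here, not part of scispacy, and the helper assumes a mention counts as "tagged" when it lies fully inside an entity span with the given label.

```python
import re
from typing import Callable, Tuple


def load_bc5cdr():
    """Load the model named in this issue (needs spaCy, scispacy and the model package)."""
    import spacy  # imported lazily so mention_coverage works without spaCy installed
    return spacy.load("en_ner_bc5cdr_md")


def mention_coverage(nlp: Callable, text: str, term: str,
                     label: str = "CHEMICAL") -> Tuple[int, int]:
    """Return (tagged, total) counts for case-insensitive mentions of `term` in `text`."""
    doc = nlp(text)
    # Character spans of the entities that carry the label of interest.
    spans = [(ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ == label]
    tagged = total = 0
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        total += 1
        # Assumption: a mention is "tagged" if it sits fully inside one entity span.
        if any(start <= match.start() and match.end() <= end for start, end in spans):
            tagged += 1
    return tagged, total
```

With the model installed, `mention_coverage(load_bc5cdr(), abstract_text, "remdesivir")` would return how many of the mentions in one abstract were tagged; summing over all abstracts gives a corpus-level figure.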

Thanks,
vikram

@dakinggg
Collaborator

1) The version on the demo is probably not the latest release version. I should check and update that.

2/3/4) First, this is a learned model, so inconsistent and surprising output is likely, and so is some memorization (@DeNeutoy, it looks like data augmentation could help a lot here). Second, the BC5CDR corpus was annotated following specific guidelines (https://biocreative.bioinformatics.udel.edu/media/store/files/2015/bc5_CDR_data_guidelines.pdf), which you may want to read to see whether they align with your expectations of what would be annotated as a chemical.

Here is some output for a mix of real and made-up chemical names. I don't really conclude anything from it, other than that the model is definitely using some combination of the form of the name itself and the context:

In [29]: for drug_name in ["mesna", "remdesivir", "mebane", "relidate", "novila", "aspirin", "coloxal", "inovivir", "scopolamine", "entamine", "valimine", "henirin", "noonirin", "halirin"]:
    ...:     text = f"The drug {drug_name} is used to treat the virus"
    ...:     doc = nlp(text)
    ...:     print(doc.ents)
    ...: 
(mesna,)
()
(mebane,)
()
()
(aspirin,)
()
()
(scopolamine,)
(entamine,)
(valimine,)
(henirin,)
()
()

It looks like the model is also sensitive to capitalization:

In [56]: doc = nlp("Remdesivir is a chemical")
In [57]: doc.ents
Out[57]: (Remdesivir,)

In [58]: doc = nlp("remdesivir is a chemical")

In [59]: doc.ents
Out[59]: ()
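
The two probes above (made-up names and casing) can be combined into one repeatable check. This is a minimal sketch rather than anything shipped with scispacy: `probe_casing` is a hypothetical helper, the default template is the sentence from the earlier loop, and `nlp` is assumed to be any callable returning an object with `.ents` whose items expose `.text` (as a loaded spaCy pipeline does).

```python
from typing import Callable, Dict, Iterable


def probe_casing(nlp: Callable, names: Iterable[str],
                 template: str = "The drug {name} is used to treat the virus"
                 ) -> Dict[str, Dict[str, bool]]:
    """For each name, record whether each casing variant is returned as an entity."""
    results: Dict[str, Dict[str, bool]] = {}
    for name in names:
        variants = {
            "lower": name.lower(),
            "capitalized": name.capitalize(),
            "upper": name.upper(),
        }
        results[name] = {}
        for kind, variant in variants.items():
            doc = nlp(template.format(name=variant))
            # Only count a hit if the entity text is exactly the probed variant.
            results[name][kind] = any(ent.text == variant for ent in doc.ents)
    return results
```

Running it over the name list from the loop above would show, per name, which casings the model picks up in an otherwise identical context.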

5) The model definitely takes into account the context a word occurs in, so it is not wholly surprising to me that the same word could be classified differently in different contexts.

I don't have much else to add at the moment. We were thinking about running some data augmentation experiments to try to improve the NER, but haven't done it yet (I'd be thrilled to have a contribution along those lines).
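
As one possible starting point for the data augmentation experiments mentioned in this comment, the sketch below swaps each CHEMICAL span in a training example for another name in random casing and recomputes the character offsets. It is an illustration under stated assumptions, not the maintainers' method: `augment_example` is a hypothetical helper, entities are assumed to be non-overlapping `(start_char, end_char, label)` tuples, and the replacement list is supplied by the caller.

```python
import random
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)


def augment_example(text: str, entities: List[Entity], replacements: List[str],
                    rng: random.Random, label: str = "CHEMICAL"
                    ) -> Tuple[str, List[Entity]]:
    """Swap each `label` span for a random replacement in random casing, fixing offsets."""
    pieces: List[str] = []
    new_entities: List[Entity] = []
    cursor = 0  # position in the original text
    length = 0  # length of the new text built so far
    # Assumption: entity spans do not overlap, so a left-to-right pass is enough.
    for start, end, ent_label in sorted(entities):
        pieces.append(text[cursor:start])
        length += start - cursor
        surface = text[start:end]
        if ent_label == label:
            surface = rng.choice(replacements)
            surface = rng.choice([surface.lower(), surface.capitalize()])
        pieces.append(surface)
        new_entities.append((length, length + len(surface), ent_label))
        length += len(surface)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_entities
```

For example, `augment_example("The drug aspirin treats fever.", [(9, 16, "CHEMICAL")], ["remdesivir"], random.Random(0))` yields a sentence with remdesivir in place of aspirin, in lower or capitalized form, plus the updated span. Whether training on such examples actually improves robustness is exactly what the proposed experiments would need to measure.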

@dakinggg changed the title from "Expectations on CHEMICAL NER?" to "Explore data augmentation for NER robustness" on Mar 24, 2021