CU-346mpwz Improving memory usage of MedCAT models (CogStack#323)
* CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing

* CU-863gntc58 Only use ISA relationships

* Make sure parents do not have themselves as children

* CU-863gntc58 Only keep preferred names

* CU-346mpwz Add memory optimiser for CDB

* CU-346mpwz Add name2<stuff> to memory optimiser for CDB

* CU-346mpwz Add keys/items/values views to memory optimiser fake dicts

* CU-346mpwz Fix keys/items/values views in memory optimiser fake dicts

* CU-346mpwz Add option to optimise or not cui and/or name based dicts in memory optimiser

* CU-346mpwz Make default memory optimiser omit name2... optimising; add comment regarding this in docstring

* CU-346mpwz Remove unused/legacy code from memory optimiser

* CU-346mpwz Add tests for memory optimiser

* CU-346mpwz Add tests memory optimised CDB

* CU-346mpwz Make dict names available within memory optimiser

* CU-346mpwz Add separate tests for memory optimised CDB

* CU-346mpwz Remove unused imports in memory optimiser

* CU-346mpwz Move some encoding and decoding stuff within serialisation to their own module

* CU-346mpwz Add tests for encoding/decoding stuff

* CU-346mpwz Add encoding/decoding for delegating dict as well as postprocessing for delegation linking with json serialisation

* CU-346mpwz Fix decision upon JSON deserialisation of CDB when loading model pack

* CU-346mpwz Adapt serialisation tests to the potential one2many mappings

* CU-346mpwz Add tests for memory optimisation, including JSON serialisation ones

* CU-346mpwz Remove debug print statements

* CU-346mpwz Remove debug methods from tests

* CU-346mpwz Fix method signatures in encoding/decoding methods

* CU-346mpwz Fix typing issue in serialiser when passing encoder

* CU-346mpwz Relax typing restrictions for umls preprocessing / parent2child mapping

* CU-346mpwz Remove some debug variables

* CU-346mpwz Fix remnant merge conflict

* CU-346mpwz Add item removal and popping to delegating dict

* CU-346mpwz Add item removal and popping tests to delegating dict

* CU-346mpwz Add item adding/setting tests to delegating dict

* CU-346mpwz Fix typing issue (List vs list)

* CU-346mpwz Add possibility of memory-optimising for snames as well

* CU-346mpwz Add comment regarding memory-optimising for filtering by CUI to CDB

* CU-346mpwz Add sname based memory optimisation tests

* CU-346mpwz Add json serialisation capabilities to snames delegation

* CU-346mpwz Make sname optimisation default for memory optimisation

* CU-346mpwz Fix typo in serialisation tests

* CU-346mpwz Add variable to keep track of current memory optimisation info to CDB

* CU-346mpwz Add default cui2snames to sname optimisations; make sure sname optimisation dirties the CDB

* CU-346mpwz Add method to undo CDB memory optimisation

* CU-346mpwz Add tests for undoing CDB memory optimisation

* CU-346mpwz Clear memory optimised parts if/when undoing optimisations

* CU-346mpwz Remove accidentally added file/module

* CU-346mpwz Add more straightforward optimisation part names; Fix memory optimisation part clearing

* CU-346mpwz Add further tests for memory optimisation (dirty state, checking optimised parts)
mart-r authored Jul 6, 2023
1 parent c1455e2 commit 8631ae3
Showing 9 changed files with 1,026 additions and 43 deletions.
5 changes: 3 additions & 2 deletions medcat/cat.py
@@ -40,7 +40,7 @@
 from medcat.vocab import Vocab
 from medcat.utils.decorators import deprecated
 from medcat.ner.transformers_ner import TransformersNER
-from medcat.utils.saving.serializer import SPECIALITY_NAMES
+from medcat.utils.saving.serializer import SPECIALITY_NAMES, ONE2MANY


 logger = logging.getLogger(__name__) # separate logger from the package-level one
@@ -356,7 +356,8 @@ def load_model_pack(cls,

         # Load the CDB
         cdb_path = os.path.join(model_pack_path, "cdb.dat")
-        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= len(SPECIALITY_NAMES)
+        nr_of_jsons_expected = len(SPECIALITY_NAMES) - len(ONE2MANY)
+        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= nr_of_jsons_expected
         json_path = model_pack_path if has_jsons else None
         logger.info('Loading model pack with %s', 'JSON format' if json_path else 'dill format')
         cdb = CDB.load(cdb_path, json_path)
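The format decision in this hunk can be tried in isolation. The following is a minimal sketch, not MedCAT code: `has_json_parts` is a hypothetical helper, and the two tuples are made-up stand-ins for `SPECIALITY_NAMES` and `ONE2MANY`, whose real contents are not shown in this diff. It only illustrates that lowering the expected file count lets a pack with fewer JSON files still be recognised as JSON format.

```python
import glob
import os
import tempfile
from typing import Sequence

# Hypothetical stand-ins: the real SPECIALITY_NAMES and ONE2MANY are defined in
# medcat.utils.saving.serializer and their contents are not shown in this diff.
SPECIALITY_NAMES: Sequence[str] = ("cui2names", "cui2snames", "name2cuis", "cui2many")
ONE2MANY: Sequence[str] = ("cui2many",)


def has_json_parts(model_pack_path: str,
                   speciality_names: Sequence[str] = SPECIALITY_NAMES,
                   one2many: Sequence[str] = ONE2MANY) -> bool:
    """Mirror the check in the hunk: enough *.json files means JSON format."""
    nr_of_jsons_expected = len(speciality_names) - len(one2many)
    nr_of_jsons_found = len(glob.glob(os.path.join(model_pack_path, "*.json")))
    return nr_of_jsons_found >= nr_of_jsons_expected


# Usage: a temporary "model pack" folder holding three JSON files.
with tempfile.TemporaryDirectory() as pack_dir:
    for name in ("cui2names", "cui2snames", "name2cuis"):
        with open(os.path.join(pack_dir, name + ".json"), "w") as f:
            f.write("{}")
    # 3 files found, 4 - 1 = 3 expected, so this prints True
    print(has_json_parts(pack_dir))
```

With the old comparison against `len(SPECIALITY_NAMES)` alone, the same folder would have been treated as dill format, since three files is fewer than four names.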
17 changes: 14 additions & 3 deletions medcat/cdb.py
@@ -95,6 +95,7 @@ def __init__(self, config: Union[Config, None] = None) -> None:
         self._optim_params = None
         self.is_dirty = False
         self._hash: Optional[str] = None
+        self._memory_optimised_parts: Set[str] = set()

     def get_name(self, cui: str) -> str:
         """Returns preferred name if it exists, otherwise it will return
@@ -180,9 +181,13 @@ def remove_cui(self, cui: str) -> None:
         for name, cuis2status in self.name2cuis2status.items():
             if cui in cuis2status:
                 del cuis2status[cui]
-        self.snames = set()
-        for cuis in self.cui2snames.values():
-            self.snames |= cuis
+        if isinstance(self.snames, set):
+            # if this is a memory optimised CDB, this won't be a set
+            # but it also won't need to be changed since it
+            # relies directly on cui2snames
+            self.snames = set()
+            for cuis in self.cui2snames.values():
+                self.snames |= cuis
         self.name2count_train = {name: len(cuis) for name, cuis in self.name2cuis.items()}
         self.is_dirty = True
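The comment in this hunk says a memory-optimised `snames` "relies directly on cui2snames" and so needs no rebuild. A minimal sketch of that idea follows. `DelegatingValueSet` and `rebuild_snames` are hypothetical names for illustration only; the memory optimiser module itself is not part of the visible diff, so its actual class may look different.

```python
from typing import Dict, Iterator, Set, Union


class DelegatingValueSet:
    """Hypothetical set-like view over the union of all values of a dict of sets.

    It stores no names of its own, so it always reflects the current
    state of the dict it delegates to.
    """

    def __init__(self, delegate: Dict[str, Set[str]]) -> None:
        self._delegate = delegate

    def __contains__(self, item: object) -> bool:
        return any(item in values for values in self._delegate.values())

    def __iter__(self) -> Iterator[str]:
        seen: Set[str] = set()
        for values in self._delegate.values():
            for value in values:
                if value not in seen:
                    seen.add(value)
                    yield value

    def __len__(self) -> int:
        return sum(1 for _ in self)


def rebuild_snames(snames: Union[Set[str], DelegatingValueSet],
                   cui2snames: Dict[str, Set[str]]):
    """Same guard as in the hunk: only a plain set has to be recomputed."""
    if isinstance(snames, set):
        snames = set()
        for names in cui2snames.values():
            snames |= names
    return snames


cui2snames = {"C1": {"heart", "heart attack"}, "C2": {"heart", "heart failure"}}
view = DelegatingValueSet(cui2snames)
del cui2snames["C1"]  # removing a CUI is immediately visible through the view
print("heart attack" in view, "heart failure" in view)
```

The trade-off in this sketch is that membership checks on the view scan every value set, whereas a plain `set` answers in constant time but duplicates every name in memory.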

@@ -561,6 +566,10 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         This also will not remove any data from cdb.addl_info - as this field can contain data of
         unknown structure.

+        As a side note, if the CDB has been memory-optimised, filtering will undo this memory optimisation.
+        This is because the dicts being involved will be rewritten.
+        However, the memory optimisation can be performed again afterwards.
+
         Args:
             cuis_to_keep (List[str]):
                 CUIs that will be kept, the rest will be removed (not completely, look above).
@@ -624,6 +633,8 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         self.cui2type_ids = new_cui2type_ids
         self.cui2preferred_name = new_cui2preferred_name
         self.is_dirty = True
+        # reset memory optimisation state
+        self._memory_optimised_parts.clear()

     def make_stats(self):
         stats = {}
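The bookkeeping added to `filter_by_cui` can be illustrated with a toy class. This is a sketch under assumptions, not the real `CDB`: `ToyCDB` and the part name `"CUIS"` are made up, and only the attributes visible in this diff (`is_dirty`, `_memory_optimised_parts`) are modelled. It shows why the state is reset: filtering replaces the dict with a freshly built plain one, so any record of an earlier optimisation is stale.

```python
from typing import Dict, Iterable, Set


class ToyCDB:
    """Hypothetical stand-in for a CDB that tracks its optimised parts."""

    def __init__(self) -> None:
        self.cui2names: Dict[str, Set[str]] = {}
        self.is_dirty = False
        self._memory_optimised_parts: Set[str] = set()

    def filter_by_cui(self, cuis_to_keep: Iterable[str]) -> None:
        keep = set(cuis_to_keep)
        # The dict is rewritten from scratch, so whatever optimised
        # structure was in place before is replaced by a plain dict.
        self.cui2names = {cui: names for cui, names in self.cui2names.items()
                          if cui in keep}
        self.is_dirty = True
        # reset memory optimisation state, as in the hunk
        self._memory_optimised_parts.clear()


cdb = ToyCDB()
cdb.cui2names = {"C1": {"heart"}, "C2": {"lung"}}
cdb._memory_optimised_parts.add("CUIS")  # hypothetical part name
cdb.filter_by_cui(["C2"])
print(cdb.cui2names, cdb._memory_optimised_parts, cdb.is_dirty)
```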