CU-346mpwz Improving memory usage of MedCAT models (CogStack#323)

* CU-863gntc58 Add parent to child relationship getter to UMLS preprocessing
* CU-863gntc58 Only use ISA relationships
* Make sure parents do not have themselves as children
* CU-863gntc58 Only keep preferred names
* CU-346mpwz Add memory optimiser for CDB
* CU-346mpwz Add name2<stuff> to memory optimiser for CDB
* CU-346mpwz Add keys/items/values views to memory optimiser fake dicts
* CU-346mpwz Fix keys/items/values views in memory optimiser fake dicts
* CU-346mpwz Add option to optimise or not cui and/or name based dicts in memory optimiser
* CU-346mpwz Make default memory optimiser omit name2... optimising; add comment regarding this in docstring
* CU-346mpwz Remove unused/legacy code from memory optimiser
* CU-346mpwz Add tests for memory optimiser
* CU-346mpwz Add tests for memory optimised CDB
* CU-346mpwz Make dict names available within memory optimiser
* CU-346mpwz Add separate tests for memory optimised CDB
* CU-346mpwz Remove unused imports in memory optimiser
* CU-346mpwz Move some encoding and decoding stuff within serialisation to their own module
* CU-346mpwz Add tests for encoding/decoding stuff
* CU-346mpwz Add encoding/decoding for delegating dict as well as postprocessing for delegation linking with json serialisation
* CU-346mpwz Fix decision upon JSON deserialisation of CDB when loading model pack
* CU-346mpwz Adapt serialisation tests to the potential one2many mappings
* CU-346mpwz Add tests for memory optimisation, including JSON serialisation ones
* CU-346mpwz Remove debug print statements
* CU-346mpwz Remove debug methods from tests
* CU-346mpwz Fix method signatures in encoding/decoding methods
* CU-346mpwz Fix typing issue in serialiser when passing encoder
* CU-346mpwz Relax typing restrictions for umls preprocessing / parent2child mapping
* CU-346mpwz Remove some debug variables
* CU-346mpwz Fix remnant merge conflict
* CU-346mpwz Add item removal and popping to delegating dict
* CU-346mpwz Add item removal and popping tests to delegating dict
* CU-346mpwz Add item adding/setting tests to delegating dict
* CU-346mpwz Fix typing issue (List vs list)
* CU-346mpwz Add possibility of memory-optimising for snames as well
* CU-346mpwz Add comment regarding memory-optimising for filtering by CUI to CDB
* CU-346mpwz Add sname based memory optimisation tests
* CU-346mpwz Add json serialisation capabilities to snames delegation
* CU-346mpwz Make sname optimisation default for memory optimisation
* CU-346mpwz Fix typo in serialisation tests
* CU-346mpwz Add variable to keep track of current memory optimisation info to CDB
* CU-346mpwz Add default cui2snames to sname optimisations; make sure sname optimisation dirties the CDB
* CU-346mpwz Add method to undo CDB memory optimisation
* CU-346mpwz Add tests for undoing CDB memory optimisation
* CU-346mpwz Clear memory optimised parts if/when undoing optimisations
* CU-346mpwz Remove accidentally added file/module
* CU-346mpwz Add more straightforward optimisation part names; Fix memory optimisation part clearing
* CU-346mpwz Add further tests for memory optimisation (dirty state, checking optimised parts)
bramiozo · Jul 6, 2023 · 8631ae3
1 parent c1455e2
Showing 9 changed files with 1,026 additions and 43 deletions.
diff --git a/medcat/cat.py b/medcat/cat.py
@@ -40,7 +40,7 @@
 from medcat.vocab import Vocab
 from medcat.utils.decorators import deprecated
 from medcat.ner.transformers_ner import TransformersNER
-from medcat.utils.saving.serializer import SPECIALITY_NAMES
+from medcat.utils.saving.serializer import SPECIALITY_NAMES, ONE2MANY
 
 
 logger = logging.getLogger(__name__) # separate logger from the package-level one
@@ -356,7 +356,8 @@ def load_model_pack(cls,
 
         # Load the CDB
         cdb_path = os.path.join(model_pack_path, "cdb.dat")
-        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= len(SPECIALITY_NAMES)
+        nr_of_jsons_expected = len(SPECIALITY_NAMES) - len(ONE2MANY)
+        has_jsons = len(glob.glob(os.path.join(model_pack_path, '*.json'))) >= nr_of_jsons_expected
         json_path = model_pack_path if has_jsons else None
         logger.info('Loading model pack with %s', 'JSON format' if json_path else 'dill format')
         cdb = CDB.load(cdb_path, json_path)
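
The format decision in the hunk above can be sketched in isolation. This is a minimal illustration, not MedCAT's code: the values of `SPECIALITY_NAMES` and `ONE2MANY` below are hypothetical stand-ins (the diff does not show the real ones), `ONE2MANY` is assumed to be a collection of parts that get no JSON file of their own, and `detect_json_path` and `make_pack` are helper names invented for this example.

```python
import glob
import os
import tempfile

# Hypothetical stand-ins: the real constants live in
# medcat.utils.saving.serializer and are not shown in this diff.
SPECIALITY_NAMES = ("part_a", "part_b", "part_c", "part_d")
ONE2MANY = ("part_d",)  # assumption: parts without a JSON file of their own


def detect_json_path(model_pack_path):
    """Return the folder if it holds enough JSON files, else None (dill format)."""
    nr_of_jsons_expected = len(SPECIALITY_NAMES) - len(ONE2MANY)
    json_files = glob.glob(os.path.join(model_pack_path, "*.json"))
    return model_pack_path if len(json_files) >= nr_of_jsons_expected else None


def make_pack(folder, nr_of_jsons):
    """Create empty JSON files to imitate an unpacked model pack folder."""
    for i in range(nr_of_jsons):
        with open(os.path.join(folder, f"part_{i}.json"), "w") as f:
            f.write("{}")


with tempfile.TemporaryDirectory() as three_jsons, \
        tempfile.TemporaryDirectory() as two_jsons:
    make_pack(three_jsons, 3)
    make_pack(two_jsons, 2)
    # With 4 names and 1 one2many entry, 3 JSON files are expected.
    found_with_three = detect_json_path(three_jsons) is not None
    found_with_two = detect_json_path(two_jsons) is not None
```

Under these assumptions, a folder with three JSON files is treated as JSON format. The pre-change check (`>= len(SPECIALITY_NAMES)`) would have required four and fallen back to dill.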

diff --git a/medcat/cdb.py b/medcat/cdb.py
@@ -95,6 +95,7 @@ def __init__(self, config: Union[Config, None] = None) -> None:
         self._optim_params = None
         self.is_dirty = False
         self._hash: Optional[str] = None
+        self._memory_optimised_parts: Set[str] = set()
 
     def get_name(self, cui: str) -> str:
         """Returns preferred name if it exists, otherwise it will return
@@ -180,9 +181,13 @@ def remove_cui(self, cui: str) -> None:
         for name, cuis2status in self.name2cuis2status.items():
             if cui in cuis2status:
                 del cuis2status[cui]
-        self.snames = set()
-        for cuis in self.cui2snames.values():
-            self.snames |= cuis
+        if isinstance(self.snames, set):
+            # if this is a memory optimised CDB, this won't be a set
+            # but it also won't need to be changed since it
+            # relies directly on cui2snames
+            self.snames = set()
+            for cuis in self.cui2snames.values():
+                self.snames |= cuis
         self.name2count_train = {name: len(cuis) for name, cuis in self.name2cuis.items()}
         self.is_dirty = True
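
The `isinstance(self.snames, set)` guard in the hunk above makes sense if the memory-optimised `snames` is a view over `cui2snames` instead of a separate set. A minimal sketch of that idea follows, assuming a membership-only view; the class name `SnamesView` is hypothetical and is not MedCAT's actual class.

```python
from typing import Dict, Set


class SnamesView:
    """Hypothetical set-like view over all values of a cui2snames dict."""

    def __init__(self, cui2snames: Dict[str, Set[str]]) -> None:
        # Keep a reference to the backing dict, not a copy of its contents.
        self._cui2snames = cui2snames

    def __contains__(self, sname: object) -> bool:
        # Membership is answered by scanning the per-CUI sets on demand.
        return any(sname in snames for snames in self._cui2snames.values())


cui2snames = {
    "C1": {"heart", "heart~attack"},
    "C2": {"heart", "heart~failure"},
}
view = SnamesView(cui2snames)

# Removing a CUI from the backing dict is immediately visible through the
# view, so there is nothing to rebuild (unlike a plain set).
del cui2snames["C2"]
still_has_heart = "heart" in view
has_failure = "heart~failure" in view
is_plain_set = isinstance(view, set)
```

The trade-off in this sketch is that each membership test scans every per-CUI set, where a plain `set` answers with a single hash lookup. In return, no second copy of the subnames is held in memory.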
 
@@ -561,6 +566,10 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         This also will not remove any data from cdb.addl_info - as this field can contain data of
         unknown structure.
 
+        As a side note, if the CDB has been memory-optimised, filtering will undo this memory optimisation.
+        This is because the dicts being involved will be rewritten.
+        However, the memory optimisation can be performed again afterwards.
+
         Args:
             cuis_to_keep (List[str]):
                 CUIs that will be kept, the rest will be removed (not completely, look above).
@@ -624,6 +633,8 @@ def filter_by_cui(self, cuis_to_keep: Union[List[str], Set[str]]) -> None:
         self.cui2type_ids = new_cui2type_ids
         self.cui2preferred_name = new_cui2preferred_name
         self.is_dirty = True
+        # reset memory optimisation state
+        self._memory_optimised_parts.clear()
 
     def make_stats(self):
         stats = {}