medcat.cat

Classes:

CAT –

This is a collection of serialisable model parts.
OutOfDataException –

Attributes:

AddonType –
logger –

AddonType `module-attribute`

AddonType = TypeVar('AddonType', bound='AddonComponent')

logger `module-attribute`

logger = getLogger(__name__)

CAT

CAT(cdb: CDB, vocab: Union[Vocab, None] = None, config: Optional[Config] = None, model_load_path: Optional[str] = None, config_dict: Optional[dict] = None, addon_config_dict: Optional[dict[str, dict]] = None)

Bases: AbstractSerialisable

This is a collection of serialisable model parts.

Methods:

add_addon –

Add the addon to the model pack and pipe.
attempt_unpack –

Attempt to unpack the zip to a folder and get the model pack path.
describe_pipeline –
get_addon_output –

Get the addon output for the entity.
get_addons –

Get the list of all addons in this model pack.
get_addons_of_type –

Get a list of addons of a specific type.
get_entities –

Get the entities recognised and linked within the provided text.
get_entities_multi_texts –

Get entities from multiple texts (potentially in parallel).
get_init_attrs –
get_model_card –

Get the model card as either a (nested) dict or a json string.
get_required_plugins –
ignore_attrs –
load_addons –

Load addons based on a model pack path.
load_cdb –

Loads the concept database from the provided model pack path.
load_model_card_off_disk –

Load the model card off disk as a (nested) dict or a json string.
load_model_pack –

Load the model pack from file.
save_entities_multi_texts –

Saves the resulting entities on disk and allows multiprocessing.
save_model_card –
save_model_pack –

Save model pack.

Attributes:

FORCE_SPAWN_MP –
cdb –
config –
pipe (Pipeline) –
trainer –

The trainer object.
usage_monitor –
vocab –

Source code in medcat-v2/medcat/cat.py

def __init__(self,
             cdb: CDB,
             vocab: Union[Vocab, None] = None,
             config: Optional[Config] = None,
             model_load_path: Optional[str] = None,
             config_dict: Optional[dict] = None,
             addon_config_dict: Optional[dict[str, dict]] = None,
             ) -> None:
    self.cdb = cdb
    self.vocab = vocab
    # ensure a config is available
    if config is None and self.cdb.config is None:
        raise ValueError("Need to specify a config for either CDB or CAT")
    elif config is None:
        config = cdb.config
    elif config is not None:
        self.cdb.config = config
    self.config = config
    if config_dict:
        self.config.merge_config(config_dict)

    self._trainer: Optional[Trainer] = None
    self._pipeline = self._recreate_pipe(model_load_path, addon_config_dict)
    self.usage_monitor = UsageMonitor(
        self._get_hash, self.config.general.usage_monitor)
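
The config precedence in the constructor above can be summarised with a minimal sketch. `StubCDB` and `resolve_config` are hypothetical stand-ins introduced here for illustration, and the shallow `dict.update` only approximates whatever `merge_config` does on a real `Config`.

```python
from typing import Optional


class StubCDB:
    """Hypothetical stand-in for a concept database that carries a config."""

    def __init__(self, config: Optional[dict] = None) -> None:
        self.config = config


def resolve_config(cdb: StubCDB,
                   config: Optional[dict] = None,
                   config_dict: Optional[dict] = None) -> dict:
    """Mirror the precedence shown above: explicit config wins over the CDB's."""
    if config is None and cdb.config is None:
        raise ValueError("Need to specify a config for either CDB or CAT")
    if config is None:
        # Fall back to the config stored on the CDB
        config = cdb.config
    else:
        # An explicit config also replaces the one on the CDB
        cdb.config = config
    if config_dict:
        # Assumption: a shallow update stands in for merge_config
        config.update(config_dict)
    return config
```

Under these assumptions, passing an explicit config leaves the CDB and the model sharing the same config object, and any `config_dict` overrides are applied on top of it.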

FORCE_SPAWN_MP `class-attribute` `instance-attribute`

FORCE_SPAWN_MP = True

cdb `instance-attribute`

cdb = cdb

config `instance-attribute`

config = config

pipe `property`

pipe: Pipeline

trainer `property`

trainer

The trainer object.

usage_monitor `instance-attribute`

usage_monitor = UsageMonitor(_get_hash, usage_monitor)

vocab `instance-attribute`

vocab = vocab

add_addon

add_addon(addon: AddonComponent) -> None

Add the addon to the model pack and pipe.

Parameters:

addon
(AddonComponent) –

The addon to add.

Source code in medcat-v2/medcat/cat.py

def add_addon(self, addon: AddonComponent) -> None:
    """Add the addon to the model pack and pipe.

    Args:
        addon (AddonComponent): The addon to add.
    """
    self.config.components.addons.append(addon.config)
    self._pipeline.add_addon(addon)

attempt_unpack `classmethod`

attempt_unpack(zip_path: str) -> str

Attempt to unpack the zip to a folder and get the model pack path.

If the folder already exists, no unpacking is done.

Parameters:

zip_path
(str) –

The ZIP path

Returns:

str ( str ) –

The model pack path

Source code in medcat-v2/medcat/cat.py

@classmethod
def attempt_unpack(cls, zip_path: str) -> str:
    """Attempt to unpack the zip to a folder and get the model pack path.

    If the folder already exists, no unpacking is done.

    Args:
        zip_path (str): The ZIP path

    Returns:
        str: The model pack path
    """
    base_dir = os.path.dirname(zip_path)
    filename = os.path.basename(zip_path)

    foldername = filename.replace(".zip", '')

    model_pack_path = os.path.join(base_dir, foldername)
    if os.path.exists(model_pack_path):
        logger.info(
            "Found an existing unzipped model pack at: %s, "
            "the provided zip will not be touched.", model_pack_path)
    else:
        logger.info("Unzipping the model pack and loading models.")
        shutil.unpack_archive(zip_path, extract_dir=model_pack_path)
    return model_pack_path
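
The "unpack only if the folder is missing" behaviour can be tried out with the standard library alone. The sketch below is not the library code: `unpack_once` and `demo_unpack` are hypothetical helpers, and the archive name `example_pack` is made up for the demonstration.

```python
import os
import shutil
import tempfile


def unpack_once(zip_path: str) -> str:
    """Unpack zip_path into a sibling folder unless that folder exists."""
    base_dir = os.path.dirname(zip_path)
    folder_name = os.path.basename(zip_path).replace(".zip", "")
    target = os.path.join(base_dir, folder_name)
    if not os.path.exists(target):
        shutil.unpack_archive(zip_path, extract_dir=target)
    return target


def demo_unpack() -> tuple[str, bool]:
    """Unpack a throwaway archive twice and report whether it was re-extracted."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.makedirs(src)
        with open(os.path.join(src, "model_card.json"), "w") as f:
            f.write("{}")
        # make_archive appends the ".zip" suffix to the base name itself
        zip_path = shutil.make_archive(
            os.path.join(tmp, "example_pack"), "zip", src)
        first = unpack_once(zip_path)
        # Delete an extracted file; a second extraction would restore it
        extracted = os.path.join(first, "model_card.json")
        os.remove(extracted)
        second = unpack_once(zip_path)
        return os.path.basename(second), not os.path.exists(extracted)
```

One consequence worth noting: because an existing folder is reused as-is, a stale or partially extracted folder is not refreshed from the zip.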

describe_pipeline

describe_pipeline() -> PipelineDescription

Source code in medcat-v2/medcat/cat.py

def describe_pipeline(self) -> PipelineDescription:
    pipeline_description: PipelineDescription = {"core": {}, "addons": []}

    for component in self._pipeline.iter_all_components():
        provider = find_provider(component)

        if component.is_core():
            core_comp = cast(CoreComponent, component)
            pipeline_description["core"][core_comp.get_type().name] = {
                "name": component.name,
                "provider": provider,
            }
        else:
            pipeline_description["addons"].append({
                "name": component.name,
                "provider": provider,
            })
    return pipeline_description

get_addon_output

get_addon_output(ent: MutableEntity) -> dict[str, dict]

Get the addon output for the entity.

This includes a key-value pair for each addon that provides output. Sometimes same-type addons may combine their output under the same key.

Parameters:

ent
(MutableEntity) –

The entity in question.

Raises:

ValueError –

If unable to merge multiple addon output.

Returns:

dict[str, dict] –

dict[str, dict]: All the addon output.

Source code in medcat-v2/medcat/cat.py

def get_addon_output(self, ent: MutableEntity) -> dict[str, dict]:
    """Get the addon output for the entity.

    This includes a key-value pair for each addon that provides output.
    Sometimes same-type addons may combine their output under the same key.

    Args:
        ent (MutableEntity): The entity in question.

    Raises:
        ValueError: If unable to merge multiple addon output.

    Returns:
        dict[str, dict]: All the addon output.
    """
    out_dict: dict[str, dict] = {}
    for addon in self._pipeline._addons:
        if not addon.include_in_output:
            continue
        key, val = addon.get_output_key_val(ent)
        if key in out_dict:
            # e.g multiple meta_anns types
            # NOTE: type-ignore due to the strict TypedDict implementation
            cur_val = out_dict[key]  # type: ignore
            if not isinstance(cur_val, dict):
                raise ValueError(
                    "Unable to merge multiple addon output for the same "
                    f"key. Tried to update '{key}'. Previously had "
                    f"{cur_val}, got {val} from addon {addon.full_name}")
            cur_val.update(val)
        else:
            # NOTE: type-ignore due to the strict TypedDict implementation
            out_dict[key] = val  # type: ignore
    return out_dict
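
The merging rule (same key, dict values combined, anything else rejected) can be illustrated without a pipeline. In this sketch the `(key, value)` pairs stand in for what each addon would return; the key `meta_anns` is taken from the comment in the source above, while the values and `other_addon` are invented for the example.

```python
from typing import Any


def merge_addon_outputs(pairs: list[tuple[str, Any]]) -> dict[str, dict]:
    """Combine (key, value) pairs, merging dict values that share a key."""
    out: dict[str, dict] = {}
    for key, val in pairs:
        if key in out:
            cur = out[key]
            if not isinstance(cur, dict):
                raise ValueError(
                    f"Unable to merge multiple outputs for key '{key}'")
            # Later output for the same key is merged into the earlier dict
            cur.update(val)
        else:
            out[key] = val
    return out


merged = merge_addon_outputs([
    ("meta_anns", {"Subject": {"value": "Patient"}}),
    ("meta_anns", {"Presence": {"value": "True"}}),
    ("other_addon", {"score": 0.5}),
])
```

Here `merged["meta_anns"]` ends up holding both the `Subject` and the `Presence` entries, while `other_addon` keeps its own key.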

get_addons

get_addons() -> list[AddonComponent]

Get the list of all addons in this model pack.

Returns:

list[AddonComponent] –

list[AddonComponent]: The list of addons present.

Source code in medcat-v2/medcat/cat.py

def get_addons(self) -> list[AddonComponent]:
    """Get the list of all addons in this model pack.

    Returns:
        list[AddonComponent]: The list of addons present.
    """
    return list(self._pipeline.iter_addons())

get_addons_of_type

get_addons_of_type(addon_type: Type[AddonType]) -> list[AddonType]

Get a list of addons of a specific type.

Parameters:

addon_type
(Type[AddonType]) –

The type of addons to look for.

Returns:

list[AddonType] –

list[AddonType]: The list of addons of this specific type.

Source code in medcat-v2/medcat/cat.py

def get_addons_of_type(self, addon_type: Type[AddonType]) -> list[AddonType]:
    """Get a list of addons of a specific type.

    Args:
        addon_type (Type[AddonType]): The type of addons to look for.

    Returns:
        list[AddonType]: The list of addons of this specific type.
    """
    return [
        addon for addon in self.get_addons()
        if isinstance(addon, addon_type)
    ]
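
Since the filter relies on `isinstance`, subclasses of the requested type are returned as well. A small sketch with hypothetical addon classes (none of these names come from the library) shows the pattern, including the bound `TypeVar` that lets a type checker infer the element type of the result.

```python
from typing import Type, TypeVar


class BaseAddon:
    """Hypothetical base class standing in for AddonComponent."""


class MetaAddon(BaseAddon):
    """Hypothetical addon type."""


class SpecialMetaAddon(MetaAddon):
    """Hypothetical subclass of MetaAddon."""


class RelAddon(BaseAddon):
    """Hypothetical unrelated addon type."""


T = TypeVar("T", bound=BaseAddon)


def addons_of_type(addons: list[BaseAddon], addon_type: Type[T]) -> list[T]:
    """Keep only the addons that are instances of addon_type."""
    return [addon for addon in addons if isinstance(addon, addon_type)]


all_addons: list[BaseAddon] = [MetaAddon(), RelAddon(), SpecialMetaAddon()]
meta_addons = addons_of_type(all_addons, MetaAddon)
```

In this example `meta_addons` contains the `MetaAddon` and the `SpecialMetaAddon` instance, in their original order.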

get_entities

get_entities(text: str, only_cui: Literal[False] = False) -> Entities

get_entities(text: str, only_cui: Literal[True] = True) -> OnlyCUIEntities

get_entities(text: str, only_cui: bool = False) -> Union[dict, Entities, OnlyCUIEntities]

Get the entities recognised and linked within the provided text.

This will run the text through the pipeline and annotate the recognised and linked entities.

Parameters:

text
(str) –

The text to use.
only_cui
(bool, default: False ) –

Whether to only output the CUIs rather than the entire context. Defaults to False.

Returns:

Union[dict, Entities, OnlyCUIEntities] –

Union[dict, Entities, OnlyCUIEntities]: The entities found and linked within the text.

Source code in medcat-v2/medcat/cat.py

def get_entities(self,
                 text: str,
                 only_cui: bool = False,
                 # TODO : addl_info
                 ) -> Union[dict, Entities, OnlyCUIEntities]:
    """Get the entities recognised and linked within the provided text.

    This will run the text through the pipeline and annotate the
    recognised and linked entities.

    Args:
        text (str): The text to use.
        only_cui (bool, optional): Whether to only output the CUIs
            rather than the entire context. Defaults to False.

    Returns:
        Union[dict, Entities, OnlyCUIEntities]: The entities found and
            linked within the text.
    """
    self._ensure_not_training()
    doc = self(text)
    if not doc:
        return {}
    return self._doc_to_out(doc, only_cui=only_cui)

get_entities_multi_texts

get_entities_multi_texts(texts: Union[Iterable[str], Iterable[tuple[str, str]]], only_cui: bool = False, n_process: int = 1, batch_size: int = -1, batch_size_chars: int = 1000000, save_dir_path: Optional[str] = None, batches_per_save: int = 20) -> Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]

Get entities from multiple texts (potentially in parallel).

If n_process > 1, n_process - 1 new processes will be created and data will be processed on those as well as the main process in parallel.

Parameters:

texts
(Union[Iterable[str], Iterable[tuple[str, str]]]) –

The input text. Either an iterable of raw text or one in the format of (text_index, text).
only_cui
(bool, default: False ) –

Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
n_process
(int, default: 1 ) –

Number of processes to use. Defaults to 1.
batch_size
(int, default: -1 ) –

The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
batch_size_chars
(int, default: 1000000 ) –

The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
save_dir_path
(Optional[str], default: None ) –

The path to where (if specified) the results are saved. The directory will have an annotated_ids.pickle file containing the tuple[list[str], int] with a list of indices already saved and the number of parts already saved. In addition there will be (usually multiple) files in the part_<num>.pickle format with the partial outputs.
batches_per_save
(int, default: 20 ) –

The number of batches to save (if save_dir_path is specified) at once. Defaults to 20.

Yields:

tuple[str, Union[dict, Entities, OnlyCUIEntities]] –

Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]: The results in the format of (text_index, entities).

Source code in medcat-v2/medcat/cat.py

def get_entities_multi_texts(
        self,
        texts: Union[Iterable[str], Iterable[tuple[str, str]]],
        only_cui: bool = False,
        n_process: int = 1,
        batch_size: int = -1,
        batch_size_chars: int = 1_000_000,
        save_dir_path: Optional[str] = None,
        batches_per_save: int = 20,
        ) -> Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]:
    """Get entities from multiple texts (potentially in parallel).

    If `n_process` > 1, `n_process - 1` new processes will be created
    and data will be processed on those as well as the main process in
    parallel.

    Args:
        texts (Union[Iterable[str], Iterable[tuple[str, str]]]):
            The input text. Either an iterable of raw text or one
            in the format of `(text_index, text)`.
        only_cui (bool):
            Whether to only return CUIs rather than other information
            like start/end and annotated value. Defaults to False.
        n_process (int):
            Number of processes to use. Defaults to 1.
        batch_size (int):
            The number of texts to batch at a time. A batch of the
            specified size will be given to each worker process.
            Defaults to -1 and in this case the character count will
            be used instead.
        batch_size_chars (int):
            The maximum number of characters to process in a batch.
            Each process will be given a batch of texts with a total
            number of characters not exceeding this value. Defaults
            to 1,000,000 characters. Set to -1 to disable.
        save_dir_path (Optional[str]):
            The path to where (if specified) the results are saved.
            The directory will have an `annotated_ids.pickle` file
            containing the tuple[list[str], int] with a list of
            indices already saved and the number of parts already saved.
            In addition there will be (usually multiple) files in the
            `part_<num>.pickle` format with the partial outputs.
        batches_per_save (int):
            The number of batches to save (if `save_dir_path` is specified)
            at once. Defaults to 20.

    Yields:
        Iterator[tuple[str, Union[dict, Entities, OnlyCUIEntities]]]:
            The results in the format of (text_index, entities).
    """
    text_iter = cast(
        Union[Iterator[str], Iterator[tuple[str, str]]], iter(texts))
    batch_iter = self._generate_batches(
        text_iter, batch_size, batch_size_chars, only_cui)
    if save_dir_path:
        saver = BatchAnnotationSaver(save_dir_path, batches_per_save)
    else:
        saver = None
    yield from self._get_entities_multi_texts(
        n_process=n_process, batch_iter=batch_iter, saver=saver)
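
The character-budget batching described for `batch_size_chars` can be pictured with a short generator. This is only a sketch of the idea, not the library's `_generate_batches`, whose details are not shown here; `batch_by_chars` is a hypothetical helper, and letting an oversized text form its own batch is an assumption of this sketch.

```python
from typing import Iterable, Iterator


def batch_by_chars(texts: Iterable[str], max_chars: int) -> Iterator[list[str]]:
    """Group texts so each batch stays within max_chars where possible."""
    batch: list[str] = []
    size = 0
    for text in texts:
        # Close the current batch if this text would push it over the budget
        if batch and size + len(text) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        # Emit whatever is left once the input is exhausted
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccc", "d"], max_chars=6))
```

Because the helper is a generator, it works on arbitrarily long (or lazily produced) inputs without holding more than one batch in memory.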

get_init_attrs `classmethod`

get_init_attrs() -> list[str]

Source code in medcat-v2/medcat/cat.py

@classmethod
def get_init_attrs(cls) -> list[str]:
    return ['cdb', 'vocab']

get_model_card

get_model_card(as_dict: Literal[True]) -> ModelCard

get_model_card(as_dict: Literal[False]) -> str

get_model_card(as_dict: bool = False) -> Union[str, ModelCard]

Get the model card as either a (nested) dict or a json string.

Parameters:

as_dict
(bool, default: False ) –

Whether to return as dict. Defaults to False.

Returns:

Union[str, ModelCard] –

Union[str, ModelCard]: The model card.

Source code in medcat-v2/medcat/cat.py

def get_model_card(self, as_dict: bool = False) -> Union[str, ModelCard]:
    """Get the model card as either a (nested) `dict` or a json string.

    Args:
        as_dict (bool): Whether to return as dict. Defaults to False.

    Returns:
        Union[str, ModelCard]: The model card.
    """
    has_meta_cat = True
    try:
        from medcat.components.addons.meta_cat import MetaCATAddon
    except MissingDependenciesError:
        has_meta_cat = False
    met_cat_model_cards: list[dict]
    if has_meta_cat:
        met_cat_model_cards = [
            mc.mc.get_model_card(True) for mc in
            self.get_addons_of_type(MetaCATAddon)
        ]
    else:
        met_cat_model_cards = []
    cdb_info = self.cdb.get_basic_info()

    # Pipeline Description
    pipeline_description = self.describe_pipeline()

    # Required Plugins
    required_plugins = self.get_required_plugins()

    model_card: ModelCard = {
        'Model ID': self.config.meta.hash,
        'Last Modified On': self.config.meta.last_saved.isoformat(),
        'History (from least to most recent)': self.config.meta.history,
        'Description': self.config.meta.description,
        'Source Ontology': self.config.meta.ontology,
        'Location': self.config.meta.location,
        'Pipeline Description': pipeline_description,
        'Required Plugins': required_plugins,
        'MetaCAT models': met_cat_model_cards,
        'Basic CDB Stats': cdb_info,
        'Performance': {},  # TODO
        'Important Parameters (Partial view, '
        'all available in cat.config)': get_important_config_parameters(
            self.config),
        'MedCAT Version': self.config.meta.medcat_version,
    }
    if as_dict:
        return model_card
    return json.dumps(model_card, indent=2, sort_keys=False)

get_required_plugins

get_required_plugins() -> list[RequiredPluginDescription]

Source code in medcat-v2/medcat/cat.py

def get_required_plugins(self) -> list[RequiredPluginDescription]:
    # get plugins based on pipe
    req_plugins: dict[str, list[tuple[str, str]]] = {}
    pipe_descr = self.describe_pipeline()
    core_comps = list(pipe_descr["core"].items())
    addons = [("addon", addon) for addon in pipe_descr["addons"]]
    for comp_type, comp in core_comps + addons:
        provider = comp["provider"]
        if provider == "medcat":
            continue
        if provider not in req_plugins:
            req_plugins[provider] = []
        req_plugins[provider].append((comp_type, comp["name"]))
    # map to plugin info
    out_plugins: list[RequiredPluginDescription] = []
    for plugin_name, comp_names in req_plugins.items():
        plugin_info = plugin_registry.get_plugin_info(plugin_name)
        if plugin_info is None:
            continue
        out_plugins.append(
            {
                "name": plugin_name,
                "provides": comp_names,
                "author": plugin_info.author,
                "url": plugin_info.url,
            }
        )
    return out_plugins
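
The first half of the method, grouping non-`medcat` components by their provider, can be followed on a plain dict shaped like the pipeline description. The component types, names, and the provider `example_plugin` below are all hypothetical; the plugin registry lookup is left out because its behaviour is not shown here.

```python
def group_by_provider(description: dict) -> dict[str, list[tuple[str, str]]]:
    """Map each non-medcat provider to the (type, name) pairs it supplies."""
    grouped: dict[str, list[tuple[str, str]]] = {}
    core = list(description["core"].items())
    addons = [("addon", addon) for addon in description["addons"]]
    for comp_type, comp in core + addons:
        provider = comp["provider"]
        if provider == "medcat":
            # Built-in components do not require a plugin
            continue
        grouped.setdefault(provider, []).append((comp_type, comp["name"]))
    return grouped


example_description = {
    "core": {
        "ner": {"name": "default_ner", "provider": "medcat"},
        "linking": {"name": "custom_linker", "provider": "example_plugin"},
    },
    "addons": [
        {"name": "example_addon", "provider": "example_plugin"},
    ],
}
required = group_by_provider(example_description)
```

With this input only `example_plugin` is reported, providing both the custom linker and the addon.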

ignore_attrs `classmethod`

ignore_attrs() -> list[str]

Source code in medcat-v2/medcat/cat.py

@classmethod
def ignore_attrs(cls) -> list[str]:
    return [
        '_trainer',  # recreate if needed
        '_pipeline',  # need to recreate regardless
        'config',  # will be loaded along with CDB
        'usage_monitor',  # will be created at startup
    ]

load_addons `classmethod`

load_addons(model_pack_path: str, addon_config_dict: Optional[dict[str, dict]] = None) -> list[tuple[str, AddonComponent]]

Load addons based on a model pack path.

Parameters:

model_pack_path
(str) –

path to model pack, zip or dir.
addon_config_dict
(Optional[dict], default: None ) –

The Addon-specific config dict to merge in before pipe initialisation. If specified, it needs to have an addon dict per name. For instance, {"meta_cat.Subject": {'general': {'device': 'cpu'}}} would apply to the specific MetaCAT.

Returns:

list[tuple[str, AddonComponent]] –

list[tuple[str, AddonComponent]]: List of (addon name, addon) pairs.

Source code in medcat-v2/medcat/cat.py

@classmethod
def load_addons(
        cls, model_pack_path: str,
        addon_config_dict: Optional[dict[str, dict]] = None
        ) -> list[tuple[str, AddonComponent]]:
    """Load addons based on a model pack path.

    Args:
        model_pack_path (str): path to model pack, zip or dir.
        addon_config_dict (Optional[dict]): The Addon-specific
            config dict to merge in before pipe initialisation.
            If specified, it needs to have an addon dict per name.
            For instance,
            `{"meta_cat.Subject": {'general': {'device': 'cpu'}}}`
            would apply to the specific MetaCAT.

    Returns:
        list[tuple[str, AddonComponent]]: List of (addon name, addon) pairs.
    """
    components_folder = os.path.join(model_pack_path, COMPONENTS_FOLDER)
    if not os.path.exists(components_folder):
        return []
    addon_paths_and_names = [
        (folder_path, folder_name.removeprefix(AddonComponent.NAME_PREFIX))
        for folder_name in os.listdir(components_folder)
        if os.path.isdir(folder_path := os.path.join(
            components_folder, folder_name))
        and folder_name.startswith(AddonComponent.NAME_PREFIX)
    ]
    loaded_addons = [
        addon for addon_path, addon_name in addon_paths_and_names
        if isinstance(addon := (
            deserialise(addon_path, model_config=addon_config_dict.get(addon_name))
            if addon_config_dict else
            deserialise(addon_path)
            ), AddonComponent)
    ]
    return [(addon.full_name, addon) for addon in loaded_addons]
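
The folder discovery step (keep sub-directories whose name starts with the addon prefix, then strip the prefix) can be reproduced on a temporary directory. The prefix value `addon_` and the folder names are assumptions made for this sketch; the real values of `AddonComponent.NAME_PREFIX` and `COMPONENTS_FOLDER` are not shown in this page.

```python
import os
import tempfile

NAME_PREFIX = "addon_"  # hypothetical value, for illustration only


def find_addon_folders(components_folder: str) -> list[tuple[str, str]]:
    """Return (path, name without prefix) for each prefixed sub-directory."""
    if not os.path.exists(components_folder):
        return []
    found = []
    # Sorting makes the result deterministic; os.listdir order is arbitrary
    for folder_name in sorted(os.listdir(components_folder)):
        folder_path = os.path.join(components_folder, folder_name)
        if os.path.isdir(folder_path) and folder_name.startswith(NAME_PREFIX):
            found.append((folder_path, folder_name.removeprefix(NAME_PREFIX)))
    return found


def demo_discovery() -> list[str]:
    """Create a few folders and list the addon names that are discovered."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("addon_meta_cat.Subject", "core_ner", "addon_rel_cat"):
            os.makedirs(os.path.join(tmp, name))
        # A plain file with the prefix is skipped: only directories count
        with open(os.path.join(tmp, "addon_notes.txt"), "w") as f:
            f.write("not a folder")
        return [name for _, name in find_addon_folders(tmp)]
```

The stripped names are what `addon_config_dict` is looked up with in the source above, which is why its keys take the form of names such as `meta_cat.Subject`.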

load_cdb `classmethod`

load_cdb(model_pack_path: str) -> CDB

Loads the concept database from the provided model pack path.

Parameters:

model_pack_path
(str) –

path to model pack, zip or dir.

Returns:

CDB ( CDB ) –

The loaded concept database

Source code in medcat-v2/medcat/cat.py

@classmethod
def load_cdb(cls, model_pack_path: str) -> CDB:
    """
    Loads the concept database from the provided model pack path.

    Args:
        model_pack_path (str): path to model pack, zip or dir.

    Returns:
        CDB: The loaded concept database
    """
    zip_path = (model_pack_path if model_pack_path.endswith(".zip")
                else model_pack_path + ".zip")
    model_pack_path = cls.attempt_unpack(zip_path)
    cdb_path = os.path.join(model_pack_path, "cdb")
    cdb = CDB.load(cdb_path)
    return cdb
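
The path handling here accepts either form because the folder name is always derived from the zip name. The sketch below traces that derivation with a hypothetical helper, `zip_and_folder`; it uses `posixpath` so the result is the same on every platform, and the `/models/example_pack` path is invented.

```python
import posixpath


def zip_and_folder(model_pack_path: str) -> tuple[str, str]:
    """Return the zip path and the folder path derived from it."""
    zip_path = (model_pack_path if model_pack_path.endswith(".zip")
                else model_pack_path + ".zip")
    folder = posixpath.join(
        posixpath.dirname(zip_path),
        posixpath.basename(zip_path).replace(".zip", ""))
    return zip_path, folder


from_folder = zip_and_folder("/models/example_pack")
from_zip = zip_and_folder("/models/example_pack.zip")
```

Both calls resolve to the same pair, so an already unpacked folder can be passed directly: `attempt_unpack` leaves the zip untouched when the folder exists.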

load_model_card_off_disk `classmethod`

load_model_card_off_disk(model_pack_path: str, as_dict: Literal[True], avoid_unpack: bool = False) -> ModelCard

load_model_card_off_disk(model_pack_path: str, as_dict: Literal[False], avoid_unpack: bool = False) -> str

load_model_card_off_disk(model_pack_path: str, as_dict: bool = False, avoid_unpack: bool = False) -> Union[str, ModelCard]

Load the model card off disk as a (nested) dict or a json string.

Parameters:

model_pack_path
(str) –

The path to the model pack (zip or folder).
as_dict
(bool, default: False ) –

Whether to return as dict. Defaults to False.
avoid_unpack
(bool, default: False ) –

Whether to avoid unpacking the model pack if no previous unpacked path exists. Defaults to False.

Returns:

Union[str, ModelCard] –

Union[str, ModelCard]: The model card.

Source code in medcat-v2/medcat/cat.py

@classmethod
def load_model_card_off_disk(cls, model_pack_path: str,
                             as_dict: bool = False,
                             avoid_unpack: bool = False,
                             ) -> Union[str, ModelCard]:
    """Load the model card off disk as a (nested) `dict` or a json string.

    Args:
        model_pack_path (str): The path to the model pack (zip or folder).
        as_dict (bool): Whether to return as dict. Defaults to False.
        avoid_unpack (bool): Whether to avoid unpacking the model pack if
            no previous unpacked path exists. Defaults to False.

    Returns:
        Union[str, ModelCard]: The model card.
    """
    model_card: Optional[ModelCard] = None
    # unpack if needed
    if model_pack_path.endswith(".zip"):
        if (avoid_unpack and
                not os.path.exists(model_pack_path.removesuffix(".zip"))):
            # stream the model card directly from the zip
            with zipfile.ZipFile(model_pack_path) as zf:
                with zf.open("model_card.json") as src:
                    model_card = json.load(src)
        else:
            # if allowed to unpack or already unpacked anyway
            model_pack_path = cls.attempt_unpack(model_pack_path)
    if model_card is None:
        # i.e not loaded directly off disk
        # load model card
        model_card_path = os.path.join(model_pack_path, "model_card.json")
        with open(model_card_path) as f:
            model_card = json.load(f)
    # return as dict or json
    if as_dict:
        return model_card
    return json.dumps(model_card, indent=2, sort_keys=False)
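
Reading `model_card.json` straight from the archive, as the `avoid_unpack` branch does, needs only `zipfile` and `json`. The following sketch builds a tiny in-memory archive to demonstrate it; the helper names and the card content are made up, with `Model ID` borrowed from the model card keys listed earlier.

```python
import io
import json
import zipfile


def read_card_from_zip(zip_source) -> dict:
    """Load model_card.json from a zip without extracting it to disk."""
    with zipfile.ZipFile(zip_source) as zf:
        with zf.open("model_card.json") as src:
            # json.load accepts the binary file object returned by zf.open
            return json.load(src)


def build_example_zip() -> io.BytesIO:
    """Create an in-memory zip holding a minimal, made-up model card."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("model_card.json", json.dumps({"Model ID": "abc123"}))
    buf.seek(0)
    return buf


card = read_card_from_zip(build_example_zip())
```

`zipfile.ZipFile` accepts a path as well as a file-like object, so the same helper works on a model pack zip on disk.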

load_model_pack `classmethod`

load_model_pack(model_pack_path: str, config_dict: Optional[dict] = None, addon_config_dict: Optional[dict[str, dict]] = None) -> CAT

Load the model pack from file.

Parameters:

model_pack_path
(str) –

The model pack path.
config_dict
(Optional[dict], default: None ) –

The model config to merge in before initialising the pipe. Defaults to None.
addon_config_dict
(Optional[dict], default: None ) –

The Addon-specific config dict to merge in before pipe initialisation. If specified, it needs to have an addon dict per name. For instance, {"meta_cat.Subject": {}} would apply to the specific MetaCAT.

Raises:

ValueError –

If the saved data does not represent a model pack.
MissingPluginError –

If required plugins are missing for this model pack.

Returns:

CAT ( CAT ) –

The loaded model pack.

Source code in medcat-v2/medcat/cat.py

@classmethod
def load_model_pack(cls, model_pack_path: str,
                    config_dict: Optional[dict] = None,
                    addon_config_dict: Optional[dict[str, dict]] = None
                    ) -> 'CAT':
    """Load the model pack from file.

    Args:
        model_pack_path (str): The model pack path.
        config_dict (Optional[dict]): The model config to
            merge in before initialising the pipe. Defaults to None.
        addon_config_dict (Optional[dict]): The Addon-specific
            config dict to merge in before pipe initialisation.
            If specified, it needs to have an addon dict per name.
            For instance, `{"meta_cat.Subject": {}}` would apply
            to the specific MetaCAT.

    Raises:
        ValueError: If the saved data does not represent a model pack.
        MissingPluginError: If required plugins are missing for this model pack.

    Returns:
        CAT: The loaded model pack.
    """
    if model_pack_path.endswith(".zip"):
        model_pack_path = cls.attempt_unpack(model_pack_path)
    logger.info("Attempting to load model from file: %s",
                model_pack_path)
    is_legacy = is_legacy_model_pack(model_pack_path)
    avoid_legacy = avoid_legacy_conversion()
    if is_legacy and not avoid_legacy:
        from medcat.utils.legacy.conversion_all import Converter
        doing_legacy_conversion_message(logger, 'CAT', model_pack_path)
        return Converter(model_pack_path, None).convert()
    elif is_legacy and avoid_legacy:
        raise LegacyConversionDisabledError("CAT")

    # Load model card to check for required plugins
    missing_plugins = cls._get_missing_plugins(model_pack_path)

    try:
        # NOTE: ignoring addons since they will be loaded later / separately
        cat = deserialise(model_pack_path, model_load_path=model_pack_path,
                          ignore_folders_prefix={
                            AddonComponent.NAME_PREFIX,
                            # NOTE: will be loaded manually
                            AbstractCoreComponent.NAME_PREFIX,
                            # tokenizer stuff internals are loaded separately
                            # if appropriate
                            TOKENIZER_PREFIX,
                            # components will be loaded semi-manually
                            # within the creation of pipe
                            COMPONENTS_FOLDER,
                            # ignore hidden files/folders
                            '.'},
                          config_dict=config_dict,
                          addon_config_dict=addon_config_dict)
    except ImportError as e:
        if missing_plugins:
            raise MissingPluginError(missing_plugins) from e
        raise

    # NOTE: deserialising of components that need serialised
    #       will be dealt with upon pipeline creation automatically
    if not isinstance(cat, CAT):
        raise ValueError(f"Unable to load CAT. Got: {cat}")
    # reset mapped ontologies at load time but after CDB load
    cat._set_and_get_mapped_ontologies()
    return cat

save_entities_multi_texts

save_entities_multi_texts(texts: Union[Iterable[str], Iterable[tuple[str, str]]], save_dir_path: str, only_cui: bool = False, n_process: int = 1, batch_size: int = -1, batch_size_chars: int = 1000000, batches_per_save: int = 20) -> None

Saves the resulting entities on disk and allows multiprocessing.

This uses get_entities_multi_texts under the hood. But it is designed to save the data on disk as it comes through.

Parameters:

texts
(Union[Iterable[str], Iterable[tuple[str, str]]]) –

The input text. Either an iterable of raw text or one in the format of (text_index, text).
save_dir_path
(str) –

The path where the results are saved. The directory will have an annotated_ids.pickle file containing the tuple[list[str], int] with a list of indices already saved and the number of parts already saved. In addition there will be (usually multiple) files in the part_<num>.pickle format with the partial outputs.
only_cui
(bool, default: False ) –

Whether to only return CUIs rather than other information like start/end and annotated value. Defaults to False.
n_process
(int, default: 1 ) –

Number of processes to use. Defaults to 1.
batch_size
(int, default: -1 ) –

The number of texts to batch at a time. A batch of the specified size will be given to each worker process. Defaults to -1 and in this case the character count will be used instead.
batch_size_chars
(int, default: 1000000 ) –

The maximum number of characters to process in a batch. Each process will be given a batch of texts with a total number of characters not exceeding this value. Defaults to 1,000,000 characters. Set to -1 to disable.
batches_per_save
(int, default: 20 ) –

The number of batches to save at once. Defaults to 20.

Source code in medcat-v2/medcat/cat.py

def save_entities_multi_texts(
        self,
        texts: Union[Iterable[str], Iterable[tuple[str, str]]],
        save_dir_path: str,
        only_cui: bool = False,
        n_process: int = 1,
        batch_size: int = -1,
        batch_size_chars: int = 1_000_000,
        batches_per_save: int = 20,
) -> None:
    """Saves the resulting entities on disk and allows multiprocessing.

    This uses `get_entities_multi_texts` under the hood. But it is designed
    to save the data on disk as it comes through.

    Args:
        texts (Union[Iterable[str], Iterable[tuple[str, str]]]):
            The input text. Either an iterable of raw text or one
            in the format of `(text_index, text)`.
        save_dir_path (str):
            The path where the results are saved. The directory will have
            an `annotated_ids.pickle` file containing the
            `tuple[list[str], int]` with a list of indices already saved
            and the number of parts already saved. In addition there will
            be (usually multiple) files in the `part_<num>.pickle` format
            with the partial outputs.
        only_cui (bool):
            Whether to only return CUIs rather than other information
            like start/end and annotated value. Defaults to False.
        n_process (int):
            Number of processes to use. Defaults to 1.
        batch_size (int):
            The number of texts to batch at a time. A batch of the
            specified size will be given to each worker process.
            Defaults to -1 and in this case the character count will
            be used instead.
        batch_size_chars (int):
            The maximum number of characters to process in a batch.
            Each process will be given a batch of texts with a total
            number of characters not exceeding this value. Defaults
            to 1,000,000 characters. Set to -1 to disable.
        batches_per_save (int):
            The number of batches to save at once. Defaults to 20.
    """
    if save_dir_path is None:
        raise ValueError("Need to specify a save path (`save_dir_path`), "
                         f"got {save_dir_path}")
    out_iter = self.get_entities_multi_texts(
        texts, only_cui=only_cui, n_process=n_process,
        batch_size=batch_size, batch_size_chars=batch_size_chars,
        save_dir_path=save_dir_path, batches_per_save=batches_per_save)
    # NOTE: not keeping anything since it'll be saved on disk
    deque(out_iter, maxlen=0)
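The final `deque(out_iter, maxlen=0)` call drains the iterator purely for its side effects. A minimal, self-contained sketch of that idiom follows; `process_and_record` is a hypothetical stand-in for a generator that saves results as it goes, not part of MedCAT.

```python
from collections import deque
from typing import Iterable, Iterator


def process_and_record(items: Iterable[str],
                       sink: list[str]) -> Iterator[str]:
    """Hypothetical generator whose useful work is a side effect."""
    for item in items:
        result = item.upper()
        sink.append(result)  # stands in for "save to disk"
        yield result


saved: list[str] = []
out_iter = process_and_record(["a", "b", "c"], saved)
# Generators are lazy, so nothing has been processed yet.
assert saved == []
# maxlen=0 makes the deque discard every item it receives, so the
# iterator is consumed fully without holding results in memory.
leftover = deque(out_iter, maxlen=0)
```

After the drain, `saved` holds every processed item while `leftover` stays empty, which is why the method can return `None` without accumulating output.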

save_model_card

save_model_card(model_card_path: str) -> None

Source code in medcat-v2/medcat/cat.py

def save_model_card(self, model_card_path: str) -> None:
    model_card: str = self.get_model_card(as_dict=False)
    with open(model_card_path, 'w') as f:
        f.write(model_card)

save_model_pack

save_model_pack(target_folder: str, pack_name: str = DEFAULT_PACK_NAME, serialiser_type: Union[str, AvailableSerialisers] = 'dill', make_archive: bool = True, only_archive: bool = False, add_hash_to_pack_name: bool = True, change_description: Optional[str] = None) -> str

Save model pack.

The resulting model pack name will include the hash of the model pack if the default model pack name is used or if add_hash_to_pack_name is True.

Parameters:

target_folder
(str) –

The folder to save the pack in.
pack_name
(str, default: DEFAULT_PACK_NAME ) –

The model pack name. Defaults to DEFAULT_PACK_NAME.
serialiser_type
(Union[str, AvailableSerialisers], default: 'dill' ) –

The serialiser type. Defaults to 'dill'.
make_archive
(bool, default: True ) –

Whether to make the archive (.zip file). Defaults to True.
only_archive
(bool, default: False ) –

Whether to remove the non-compressed folder after archiving. Only takes effect if make_archive is True. Defaults to False.
add_hash_to_pack_name
(bool, default: True ) –

Whether to add the hash to the pack name. This is only relevant if pack_name is specified. Defaults to True.
change_description
(Optional[str], default: None ) –

If provided, this description will be added to the model description. Defaults to None.

Returns:

(str) –

The final model pack path.

Source code in medcat-v2/medcat/cat.py

def save_model_pack(
        self, target_folder: str, pack_name: str = DEFAULT_PACK_NAME,
        serialiser_type: Union[str, AvailableSerialisers] = 'dill',
        make_archive: bool = True,
        only_archive: bool = False,
        add_hash_to_pack_name: bool = True,
        change_description: Optional[str] = None,
        ) -> str:
    """Save model pack.

    The resulting model pack name will include the hash of the model
    pack if the default model pack name is used or if
    `add_hash_to_pack_name` is True.

    Args:
        target_folder (str):
            The folder to save the pack in.
        pack_name (str, optional): The model pack name.
            Defaults to DEFAULT_PACK_NAME.
        serialiser_type (Union[str, AvailableSerialisers], optional):
            The serialiser type. Defaults to 'dill'.
        make_archive (bool):
            Whether to make the archive (.zip file). Defaults to True.
        only_archive (bool):
            Whether to remove the non-compressed folder after archiving.
            Only takes effect if make_archive is True. Defaults to False.
        add_hash_to_pack_name (bool):
            Whether to add the hash to the pack name. This is only relevant
            if pack_name is specified. Defaults to True.
        change_description (Optional[str]):
            If provided, this description will be added to the
            model description. Defaults to None.

    Returns:
        str: The final model pack path.
    """
    self.config.meta.mark_saved_now()
    # figure out the location/folder of the saved files
    hex_hash = self._versioning(change_description)
    if pack_name == DEFAULT_PACK_NAME or add_hash_to_pack_name:
        pack_name = f"{pack_name}_{hex_hash}"
    model_pack_path = os.path.join(target_folder, pack_name)
    # ensure target folder and model pack folder exist
    ensure_folder_if_parent(model_pack_path)
    # tokenizer (e.g. spacy model) - needs to be saved first since
    #     saving it changes the config slightly
    if isinstance(self._pipeline.tokenizer, SaveableTokenizer):
        internals_path = self._pipeline.tokenizer.save_internals_to(
            model_pack_path)
        self.config.general.nlp.modelname = internals_path
    # serialise
    serialise(serialiser_type, self, model_pack_path)
    self.save_model_card(os.path.join(model_pack_path, "model_card.json"))
    # components
    components_folder = os.path.join(
        model_pack_path, COMPONENTS_FOLDER)
    self._pipeline.save_components(serialiser_type, components_folder)
    # zip everything
    if make_archive:
        shutil.make_archive(model_pack_path, 'zip',
                            root_dir=model_pack_path)
        if only_archive:
            logger.info("Removing the non-archived model pack folder: %s",
                        model_pack_path)
            shutil.rmtree(model_pack_path, ignore_errors=True)
            # change the model pack path to the zip file so that we
            # refer to an existing file
            model_pack_path += ".zip"
    return model_pack_path
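The naming and archiving steps in the source above can be tried in isolation. The sketch below mirrors only those two steps using the standard library; `DEFAULT_NAME`, `resolve_pack_name` and `archive_folder` are hypothetical names introduced for illustration, and the hash value is made up.

```python
import os
import shutil
import tempfile

DEFAULT_NAME = "model_pack"  # hypothetical stand-in for DEFAULT_PACK_NAME


def resolve_pack_name(pack_name: str, hex_hash: str,
                      add_hash: bool = True) -> str:
    # The hash is appended for the default name, or whenever requested.
    if pack_name == DEFAULT_NAME or add_hash:
        return f"{pack_name}_{hex_hash}"
    return pack_name


def archive_folder(folder: str, only_archive: bool = False) -> str:
    # Writes "<folder>.zip" next to the folder, containing its contents.
    shutil.make_archive(folder, "zip", root_dir=folder)
    if only_archive:
        shutil.rmtree(folder, ignore_errors=True)
        # Point at the zip file, since the folder no longer exists.
        return folder + ".zip"
    return folder


with tempfile.TemporaryDirectory() as tmp:
    name = resolve_pack_name("my_pack", "abc123", add_hash=False)
    pack_dir = os.path.join(tmp, name)
    os.makedirs(pack_dir)
    with open(os.path.join(pack_dir, "model_card.json"), "w") as f:
        f.write("{}")
    final_path = archive_folder(pack_dir, only_archive=True)
    zip_exists = os.path.isfile(final_path)
    folder_removed = not os.path.exists(pack_dir)
```

Returning the `.zip` path when the folder is removed means the caller always receives a path that exists on disk.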

OutOfDataException

Bases: ValueError
