
medcat.preprocessors.cleaners

Classes:

  • LCDBMaker

  • LGeneral

  • LPreprocessing

  • NameDescriptor

  • UnknownTokenVersion

Functions:

  • prepare_name

    Generates different forms of a name. Will edit the provided names dictionary and add information generated from the name.

LCDBMaker

Bases: Protocol

Attributes:

min_letters_required instance-attribute

min_letters_required: int

name_versions instance-attribute

name_versions: list[str]

LGeneral

Bases: Protocol

Attributes:

separator instance-attribute

separator: str

LPreprocessing

Bases: Protocol

Attributes:

do_not_normalize instance-attribute

do_not_normalize: set[str]

min_len_normalize instance-attribute

min_len_normalize: int
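
Because LGeneral, LPreprocessing and LCDBMaker are Protocols, any object exposing the listed attributes should satisfy them structurally. The sketch below is a minimal illustration under stated assumptions: the protocols are redeclared locally to mirror the attributes documented here (nothing is imported from MedCAT), and the classes `MyGeneral`, `MyPreprocessing`, `MyCDBMaker`, the helper `bundle`, and all attribute values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol


# Local mirrors of the documented protocols (not imported from medcat).
class LGeneral(Protocol):
    separator: str


class LPreprocessing(Protocol):
    do_not_normalize: set[str]
    min_len_normalize: int


class LCDBMaker(Protocol):
    min_letters_required: int
    name_versions: list[str]


# Hypothetical config objects; every value below is an illustrative assumption.
@dataclass
class MyGeneral:
    separator: str = "~"


@dataclass
class MyPreprocessing:
    do_not_normalize: set[str] = field(default_factory=lambda: {"VBD"})
    min_len_normalize: int = 5


@dataclass
class MyCDBMaker:
    min_letters_required: int = 2
    name_versions: list[str] = field(
        default_factory=lambda: ["LOWER", "CLEAN"])


def bundle(general: LGeneral, preprocessing: LPreprocessing,
           cdb_maker: LCDBMaker
           ) -> tuple[LGeneral, LPreprocessing, LCDBMaker]:
    """Pack the configs in the (general, preprocessing, cdb_maker) order
    used by the `configs` parameter of prepare_name."""
    return (general, preprocessing, cdb_maker)


configs = bundle(MyGeneral(), MyPreprocessing(), MyCDBMaker())
```

A static type checker would accept the dataclasses wherever the protocols are expected, since only the attribute names and types need to match.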

NameDescriptor dataclass

NameDescriptor(tokens: list[str], snames: set[str], raw_name: str, is_upper: bool)

Attributes:

is_upper instance-attribute

is_upper: bool

raw_name instance-attribute

raw_name: str

snames instance-attribute

snames: set[str]

tokens instance-attribute

tokens: list[str]
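
As a rough illustration of how a `dict[str, NameDescriptor]` might be populated, the sketch below uses a local stand-in dataclass with the same four fields. The dictionary key format, the separator `~`, and the contents of `snames` are assumptions made for the example, not behaviour confirmed by this page.

```python
from dataclasses import dataclass


# Local stand-in with the same fields as the documented dataclass.
@dataclass
class NameDescriptor:
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


names: dict[str, NameDescriptor] = {}

# The key format and the snames contents are assumptions for illustration.
names["kidney~failure"] = NameDescriptor(
    tokens=["kidney", "failure"],
    snames={"kidney", "kidney~failure"},
    raw_name="Kidney Failure",
    is_upper=False,
)
```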

UnknownTokenVersion

UnknownTokenVersion(version: str)

Bases: ValueError

Source code in medcat-v2/medcat/preprocessors/cleaners.py
def __init__(self, version: str) -> None:
    super().__init__(f"Unknown token version: '{version}'")
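
Since UnknownTokenVersion derives from ValueError, callers can catch it with a plain `except ValueError`. The sketch below redeclares the exception locally to mirror the constructor shown above; the `pick_version` helper and the accepted version names are hypothetical.

```python
class UnknownTokenVersion(ValueError):
    # Mirrors the constructor shown above.
    def __init__(self, version: str) -> None:
        super().__init__(f"Unknown token version: '{version}'")


def pick_version(version: str) -> str:
    """Hypothetical dispatcher; the accepted names are assumptions."""
    if version in ("LOWER", "CLEAN"):
        return version
    raise UnknownTokenVersion(version)


try:
    pick_version("TITLE")
except ValueError as err:  # UnknownTokenVersion is a ValueError subclass
    message = str(err)
```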

prepare_name

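Generates different forms of a name, one per entry in the name_versions config. The provided names dictionary is edited in place: the information generated from the raw name is added to it, and the same dictionary is returned.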

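Parameters:

  • raw_name ( str ) –

    The raw name to generate versions of.

  • nlp ( BaseTokenizer ) –

    The tokenizer.

  • names ( dict[str, NameDescriptor] ) –

    Dictionary of existing names for this concept in this row of a CSV. The new generated name versions and other required information will be added here.

  • configs ( tuple[LGeneral, LPreprocessing, LCDBMaker] ) –

    Applicable configs for medcat.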

Returns:

  • names ( dict ) –

    The updated dictionary of prepared names.

Source code in medcat-v2/medcat/preprocessors/cleaners.py
def prepare_name(raw_name: str, nlp: BaseTokenizer,
                 names: dict[str, NameDescriptor],
                 configs: tuple[LGeneral, LPreprocessing, LCDBMaker],
                 ) -> dict[str, NameDescriptor]:
    """Generates different forms of a name. Will edit the provided `names`
    dictionary and add information generated from the `raw_name`.

    Args:
        raw_name (str): The raw name to generate versions of.
        nlp (BaseTokenizer): The tokenizer.
        names (dict[str, NameDescriptor]):
            Dictionary of existing names for this concept in this row of a CSV.
            The new generated name versions and other required information will
            be added here.
        configs (tuple[LGeneral, LPreprocessing, LCDBMaker]):
            Applicable configs for medcat.

    Returns:
        names (dict):
            The updated dictionary of prepared names.
    """
    # Tokenize the raw name once; every version reuses the same document.
    sc_name = nlp(raw_name)
    _, preprocessing, cdb_maker = configs

    for version in cdb_maker.name_versions:
        tokens = _get_tokens(preprocessing, sc_name, version)

        # Skip versions that produce no tokens (None or an empty list).
        if tokens:
            _update_dict(configs, raw_name, names, tokens,
                         sc_name.base.isupper())

    return names
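
The control flow above (tokenize once, loop over the configured versions, skip empty results, update the dictionary in place) can be sketched without MedCAT. Everything in this toy version is an assumption: `toy_tokens` and `toy_prepare_name` are hypothetical stand-ins for the private helpers, whitespace splitting replaces the real tokenizer, the version names `LOWER` and `CLEAN` are illustrative, and plain dicts replace NameDescriptor.

```python
def toy_tokens(raw_name: str, version: str) -> list[str]:
    """Hypothetical stand-in for token generation per version."""
    words = raw_name.split()  # whitespace split instead of a real tokenizer
    if version == "LOWER":
        return [w.lower() for w in words]
    if version == "CLEAN":
        return words
    raise ValueError(f"Unknown token version: '{version}'")


def toy_prepare_name(raw_name: str, names: dict[str, dict],
                     name_versions: list[str],
                     separator: str = "~") -> dict[str, dict]:
    """Toy analogue of the loop above: mutates and returns `names`."""
    for version in name_versions:
        tokens = toy_tokens(raw_name, version)
        if tokens:  # skip versions that yield no tokens
            key = separator.join(tokens)
            # Keep the first entry if the same key is generated twice.
            names.setdefault(key, {
                "tokens": tokens,
                "raw_name": raw_name,
                "is_upper": raw_name.isupper(),
            })
    return names


result = toy_prepare_name("Kidney Failure", {}, ["LOWER", "CLEAN"])
```

Here `result` ends up with one entry per distinct version of the name, and an empty raw name leaves the dictionary unchanged.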