medcat.preprocessors.cleaners

Classes:

LCDBMaker –
LGeneral –
LPreprocessing –
NameDescriptor –
UnknownTokenVersion –

Functions:

prepare_name –

Generates different forms of a name. Edits the provided names dictionary in place.

LCDBMaker

Bases: Protocol

Attributes:

min_letters_required (int) –
name_versions (list[str]) –

min_letters_required `instance-attribute`

min_letters_required: int

name_versions `instance-attribute`

name_versions: list[str]

LGeneral

Bases: Protocol

Attributes:

separator (str) –

separator `instance-attribute`

separator: str

LPreprocessing

Bases: Protocol

Attributes:

do_not_normalize (set[str]) –
min_len_normalize (int) –

do_not_normalize `instance-attribute`

do_not_normalize: set[str]

min_len_normalize `instance-attribute`

min_len_normalize: int
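LCDBMaker, LGeneral and LPreprocessing are `Protocol` classes, so any object exposing the listed attributes satisfies them structurally; no inheritance is required. A minimal sketch, assuming local mirrors of the three protocols (redefined here so the block runs without MedCAT installed). The `Demo*` classes and every value in them, including the `name_versions` entries, are hypothetical and not MedCAT defaults:

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


# Local mirrors of the protocols documented above. `runtime_checkable` is
# added only so this demo can use isinstance(); it is an assumption, not
# something this page states about the real classes.
@runtime_checkable
class LGeneral(Protocol):
    separator: str


@runtime_checkable
class LPreprocessing(Protocol):
    do_not_normalize: set[str]
    min_len_normalize: int


@runtime_checkable
class LCDBMaker(Protocol):
    min_letters_required: int
    name_versions: list[str]


# Hypothetical config objects; all values are illustrative placeholders.
@dataclass
class DemoGeneral:
    separator: str = "~"


@dataclass
class DemoPreprocessing:
    do_not_normalize: set[str] = field(default_factory=set)
    min_len_normalize: int = 5


@dataclass
class DemoCDBMaker:
    min_letters_required: int = 2
    name_versions: list[str] = field(default_factory=lambda: ["LOWER", "CLEAN"])


# The tuple order matches the `configs` parameter of prepare_name.
configs = (DemoGeneral(), DemoPreprocessing(), DemoCDBMaker())
```

Because the match is structural, a full config object and a small test double can be used interchangeably as long as both carry these attributes.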

NameDescriptor `dataclass`

NameDescriptor(tokens: list[str], snames: set[str], raw_name: str, is_upper: bool)

Attributes:

is_upper (bool) –
raw_name (str) –
snames (set[str]) –
tokens (list[str]) –

is_upper `instance-attribute`

is_upper: bool

raw_name `instance-attribute`

raw_name: str

snames `instance-attribute`

snames: set[str]

tokens `instance-attribute`

tokens: list[str]
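As an illustration of the dataclass signature above, the following sketch builds one descriptor and stores it in a names dictionary. The class is a local mirror so the block runs without MedCAT; the field values and the dictionary key are hypothetical, since this section does not specify how MedCAT populates `tokens` and `snames`:

```python
from dataclasses import dataclass


# Local mirror of the dataclass signature shown above.
@dataclass
class NameDescriptor:
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


# Hypothetical values chosen only to show the field types.
desc = NameDescriptor(
    tokens=["heart", "failure"],
    snames={"heart", "heart~failure"},
    raw_name="Heart failure",
    is_upper=False,
)

# prepare_name works on a dict[str, NameDescriptor]; the key here is assumed.
names = {"heart~failure": desc}
```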

UnknownTokenVersion

UnknownTokenVersion(version: str)

Bases: ValueError

Source code in medcat-v2/medcat/preprocessors/cleaners.py

def __init__(self, version: str) -> None:
    super().__init__(f"Unknown token version: '{version}'")
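Since the exception derives from ValueError, callers can catch it under either type. A small sketch using a local mirror of the class above; the version string "SHOUTY" is made up, and this section does not show where MedCAT raises the exception:

```python
# Local mirror of the class shown above, so the sketch runs without MedCAT.
class UnknownTokenVersion(ValueError):

    def __init__(self, version: str) -> None:
        super().__init__(f"Unknown token version: '{version}'")


try:
    raise UnknownTokenVersion("SHOUTY")  # "SHOUTY" is a made-up version name
except ValueError as err:  # caught via the ValueError base class
    message = str(err)
```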

prepare_name

prepare_name(raw_name: str, nlp: BaseTokenizer, names: dict[str, NameDescriptor], configs: tuple[LGeneral, LPreprocessing, LCDBMaker]) -> dict[str, NameDescriptor]

Generates different forms of a name. Edits the provided `names` dictionary in place, adding the information generated from `raw_name`.

Parameters:

raw_name (str) – The raw name to generate versions from.

nlp (BaseTokenizer) – The tokenizer.

names (dict[str, NameDescriptor]) – Dictionary of existing names for this concept in the current row of a CSV. The newly generated name versions and other required information are added here.

configs (tuple[LGeneral, LPreprocessing, LCDBMaker]) – The applicable MedCAT configs, in this order.

Returns:

names (dict[str, NameDescriptor]) –

The updated dictionary of prepared names (the same object that was passed in).

Source code in medcat-v2/medcat/preprocessors/cleaners.py

def prepare_name(raw_name: str, nlp: BaseTokenizer,
                 names: dict[str, NameDescriptor],
                 configs: tuple[LGeneral, LPreprocessing, LCDBMaker],
                 ) -> dict[str, NameDescriptor]:
    """Generates different forms of a name. Will edit the provided `names`
    dictionary and add information generated from `raw_name`.

    Args:
        raw_name (str): The raw name to generate versions from.
        nlp (BaseTokenizer): The tokenizer.
        names (dict[str, NameDescriptor]):
            Dictionary of existing names for this concept in this row of a CSV.
            The new generated name versions and other required information will
            be added here.
        configs (tuple[LGeneral, LPreprocessing, LCDBMaker]):
            Applicable configs for medcat.

    Returns:
        names (dict):
            The updated dictionary of prepared names.
    """
    sc_name = nlp(raw_name)
    _, preprocessing, cdb_maker = configs

    for version in cdb_maker.name_versions:
        # Build the token list for this name version
        tokens = _get_tokens(preprocessing, sc_name, version)

        # Skip versions that produce no tokens (None or empty)
        if tokens:
            _update_dict(configs, raw_name, names, tokens,
                         sc_name.base.isupper())

    return names
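The control flow above (one pass per name version, skipping empty token lists, mutating and returning the same dictionary) can be sketched without MedCAT. Everything prefixed `toy_` is a hypothetical stand-in: `_get_tokens` and `_update_dict` are not shown in this section, so their real behavior, the version names "LOWER" and "AS_IS", and the key format are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class NameDescriptor:  # local mirror of the documented dataclass
    tokens: list[str]
    snames: set[str]
    raw_name: str
    is_upper: bool


def toy_get_tokens(raw_name: str, version: str) -> list[str]:
    """Hypothetical stand-in for `_get_tokens`: one token list per version."""
    if version == "LOWER":
        return raw_name.lower().split()
    if version == "AS_IS":
        return raw_name.split()
    raise ValueError(f"Unknown token version: '{version}'")


def toy_prepare_name(raw_name, names, name_versions, separator="~"):
    """Mirrors only the loop-and-mutate shape of `prepare_name`."""
    for version in name_versions:
        tokens = toy_get_tokens(raw_name, version)
        if tokens:  # empty token lists are skipped, as in the source
            key = separator.join(tokens)
            # Assumed update rule: add a descriptor unless the key exists.
            names.setdefault(key, NameDescriptor(
                tokens=tokens, snames={key}, raw_name=raw_name,
                is_upper=raw_name.isupper()))
    return names


names: dict[str, NameDescriptor] = {}
result = toy_prepare_name("Heart Failure", names, ["LOWER", "AS_IS"])
```

The returned dictionary is the same object that was passed in, so callers can rely on either the return value or the mutated argument.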
