
medcat.config

Modules:

Classes:

AnnotationOutput

Bases: SerialisableBaseModel

The annotation output part of the config

Attributes:

context_left class-attribute instance-attribute

context_left: int = -1

context_right class-attribute instance-attribute

context_right: int = -1

include_text_in_output class-attribute instance-attribute

include_text_in_output: bool = False

lowercase_context class-attribute instance-attribute

lowercase_context: bool = True
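Together these fields control how much surrounding text is attached to each annotation in the output. A minimal sketch of the windowing semantics (illustrative only: the function name is made up, and the counts are simplified to characters here), assuming -1 disables that side of the context:

```python
def extract_context(text: str, start: int, end: int,
                    context_left: int, context_right: int,
                    lowercase_context: bool = True) -> tuple:
    """Illustrative sketch: take up to `context_left`/`context_right`
    characters around a detected span; -1 disables that side."""
    left = text[max(0, start - context_left):start] if context_left >= 0 else ""
    right = text[end:end + context_right] if context_right >= 0 else ""
    if lowercase_context:
        left, right = left.lower(), right.lower()
    return left, right

# "HEADACHE" occupies characters 7..14 of this text:
left, right = extract_context("Severe HEADACHE reported", 7, 15, 7, 9)
```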

CDBMaker

Bases: SerialisableBaseModel

The Concept Database (CDB) making part of the config

Attributes:

min_letters_required class-attribute instance-attribute

min_letters_required: int = 2

Minimum number of letters required in a name to be accepted for a concept

multi_separator class-attribute instance-attribute

multi_separator: str = '|'

If multiple names or type_ids for a concept are present in one row of a CSV, they are separated by the specified character.
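For example, a CSV cell holding several names can be split on the separator (the row content below is made up):

```python
multi_separator = '|'  # the configured separator

# One CSV cell may carry several names (or type_ids) for a concept:
row_names = "headache|cephalalgia|head pain"
names = row_names.split(multi_separator)
```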

name_versions class-attribute instance-attribute

name_versions: list = ['LOWER', 'CLEAN']

Name versions to be generated.

remove_parenthesis class-attribute instance-attribute

remove_parenthesis: int = 5

Should preferred names containing a parenthetical be cleaned? 0 means no; any other value means clean if the name's length is greater than or equal to it, e.g. Head (Body part) -> Head
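A sketch of what such cleaning could look like (the regular expression and the exact length rule are assumptions, not the MedCAT implementation):

```python
import re

def clean_parenthesis(name: str, remove_parenthesis: int = 5) -> str:
    """Illustrative sketch: strip a trailing '(...)' qualifier from a
    preferred name, but only when the name is long enough (0 disables)."""
    if remove_parenthesis == 0 or len(name) < remove_parenthesis:
        return name
    return re.sub(r'\s*\([^)]*\)\s*$', '', name)

cleaned = clean_parenthesis("Head (Body part)")
unchanged = clean_parenthesis("Head (Body part)", remove_parenthesis=0)
```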

Components

Bases: SerialisableBaseModel

Attributes:

addons class-attribute instance-attribute

addons: list[ComponentConfig] = []

comp_order class-attribute instance-attribute

comp_order: list[str] = ['tagging', 'token_normalizing', 'ner', 'linking']

linking class-attribute instance-attribute

linking: Linking = Linking()

ner class-attribute instance-attribute

ner: Ner = Ner()

tagging class-attribute instance-attribute

tagging: ComponentConfig = ComponentConfig()

token_normalizing class-attribute instance-attribute

token_normalizing: ComponentConfig = ComponentConfig()

Config

Bases: SerialisableBaseModel

Attributes:

annotation_output class-attribute instance-attribute

annotation_output: AnnotationOutput = AnnotationOutput()

cdb_maker class-attribute instance-attribute

cdb_maker: CDBMaker = CDBMaker()

components class-attribute instance-attribute

components: Components = Components()

general class-attribute instance-attribute

general: General = General()

meta class-attribute instance-attribute

meta: ModelMeta = Field(default_factory=ModelMeta)

preprocessing class-attribute instance-attribute

preprocessing: Preprocessing = Preprocessing()

Linking

Bases: ComponentConfig

The linking part of the config

Attributes:

additional class-attribute instance-attribute

additional: Optional[Any] = None

Some additional config for non-default linkers. E.g. the 2-step linker uses this for alpha calculations and the learning rate for type contexts.

always_calculate_similarity class-attribute instance-attribute

always_calculate_similarity: bool = False

Whether to calculate context similarity even for concepts that are not ambiguous.

calculate_dynamic_threshold class-attribute instance-attribute

calculate_dynamic_threshold: bool = False

Concepts below this similarity will be ignored. The type can be static or dynamic: if dynamic, each CUI has a different threshold, calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic only works if the CDB was trained with calculate_dynamic_threshold = True.
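The dynamic variant reduces to a simple per-CUI calculation (a sketch of the arithmetic described above; `avg_cui_confidence` is a hypothetical input):

```python
similarity_threshold = 0.25  # the default value above

def dynamic_threshold(avg_cui_confidence: float) -> float:
    # With similarity_threshold_type = 'dynamic', each CUI gets its own
    # cutoff: the average training confidence for that CUI multiplied
    # by similarity_threshold.
    return avg_cui_confidence * similarity_threshold

th = dynamic_threshold(0.8)  # a hypothetical average confidence
```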

context_ignore_center_tokens class-attribute instance-attribute

context_ignore_center_tokens: bool = False

If true, when the context of a concept is calculated (embedding), the words making up that concept are not taken into account

context_vector_sizes class-attribute instance-attribute

context_vector_sizes: dict = {'xlong': 27, 'long': 18, 'medium': 9, 'short': 3}

Context vector sizes that will be calculated and used for linking

context_vector_weights class-attribute instance-attribute

context_vector_weights: dict = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}

Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
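A sketch of the weighted combination described above (the per-size similarities are made-up numbers; the real implementation may combine the vectors differently):

```python
context_vector_sizes = {'xlong': 27, 'long': 18, 'medium': 9, 'short': 3}
context_vector_weights = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}

# The weights should add up to 1 so the combined score stays in the
# same range as the individual similarities.
assert abs(sum(context_vector_weights.values()) - 1.0) < 1e-9

# Hypothetical per-context-size similarities for one candidate concept:
sims = {'xlong': 0.2, 'long': 0.5, 'medium': 0.6, 'short': 0.9}
combined = sum(context_vector_weights[size] * sims[size] for size in sims)
```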

devalue_linked_concepts class-attribute instance-attribute

devalue_linked_concepts: bool = False

When adding a positive example, should it also be treated as negative for concepts that link to the positive one via names (ambiguous names).

disamb_length_limit class-attribute instance-attribute

disamb_length_limit: int = 3

All concepts with names below this length will always be disambiguated

filter_before_disamb class-attribute instance-attribute

filter_before_disamb: bool = False

If True, filtering is applied before disambiguation. Useful for the trainer.

filters class-attribute instance-attribute

filters: LinkingFilters = LinkingFilters()

Filters

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow')

negative_ignore_punct_and_num class-attribute instance-attribute

negative_ignore_punct_and_num: bool = True

Whether to ignore punctuation/numbers when negative sampling

negative_probability class-attribute instance-attribute

negative_probability: float = 0.5

Probability for the negative context to be added for each positive addition

optim class-attribute instance-attribute

optim: dict = {'type': 'linear', 'base_lr': 1, 'min_lr': 5e-05}

Linear anneal

prefer_frequent_concepts class-attribute instance-attribute

prefer_frequent_concepts: float = 0.35

If >0, concepts that are more frequent will be preferred by a multiple of this amount

prefer_primary_name class-attribute instance-attribute

prefer_primary_name: float = 0.35

If >0, concepts for which the detected name is their primary name will be preferred by that amount (0 to 1)

random_replacement_unsupervised class-attribute instance-attribute

random_replacement_unsupervised: float = 0.8

If <1, during unsupervised training the detected term will be randomly replaced with a probability of 1 - random_replacement_unsupervised. It is replaced with a synonym used for that term.
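A sketch of that replacement rule (the helper below is illustrative, not MedCAT's implementation):

```python
import random
from typing import List, Optional

def maybe_replace(term: str, synonyms: List[str],
                  random_replacement_unsupervised: float = 0.8,
                  rng: Optional[random.Random] = None) -> str:
    # With probability 1 - random_replacement_unsupervised (0.2 here),
    # swap the detected term for one of its synonyms.
    rng = rng or random.Random()
    if synonyms and rng.random() >= random_replacement_unsupervised:
        return rng.choice(synonyms)
    return term

# Deterministic RNG so the sketch is reproducible:
rng = random.Random(0)
results = [maybe_replace("MI", ["myocardial infarction"], rng=rng)
           for _ in range(1000)]
replaced_frac = results.count("myocardial infarction") / 1000
```

Over many draws the replacement fraction approaches 1 - 0.8 = 0.2.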

similarity_threshold class-attribute instance-attribute

similarity_threshold: float = 0.25

similarity_threshold_type class-attribute instance-attribute

similarity_threshold_type: str = 'static'

subsample_after class-attribute instance-attribute

subsample_after: int = 30000

DISABLED in code permanently: subsample during unsupervised training if a concept has received more than this many training examples

train class-attribute instance-attribute

train: bool = True

Should it train or not. This is set automatically; ignore in 99% of cases and do not set manually.

train_count_threshold class-attribute instance-attribute

train_count_threshold: int = 1

Concepts that have seen less training examples than this will not be used for similarity calculation and will have a similarity of -1.

LinkingFilters

LinkingFilters(**data)

Bases: SerialisableBaseModel

These describe the linking filters used alongside the model.

When neither CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified, then only those will be accepted. If there are excluded CUIs specified, they are excluded.

In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to obey: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter

While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.
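The acceptance rules above can be condensed into a standalone function mirroring the documented semantics of check_filters:

```python
def check_filters(cui: str, cuis: set, cuis_exclude: set) -> bool:
    # An empty `cuis` set accepts everything; otherwise only listed
    # CUIs pass. Exclusions always apply.
    if cui in cuis or not cuis:
        return cui not in cuis_exclude
    return False

# Empty filters: everything is allowed.
empty_ok = check_filters("C0018681", set(), set())
# With an allow-list, only listed CUIs pass:
ok = check_filters("C0018681", {"C0018681"}, set())
blocked = check_filters("C0027051", {"C0018681"}, set())
# Exclusion wins even over the allow-list:
excluded = check_filters("C0018681", {"C0018681"}, {"C0018681"})
```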

Methods:

Attributes:

Source code in medcat-v2/medcat/config/config.py
def __init__(self, **data):
    if 'cuis' in data:
        cuis = data['cuis']
        if isinstance(cuis, dict) and len(cuis) == 0:
            logger.warning("Loading an old model where "
                           "config.linking.filters.cuis has been "
                           "dict to an empty dict instead of an empty "
                           "set. Converting the dict to a set in memory "
                           "as that is what is expected. Please consider "
                           "saving the model again.")
            data['cuis'] = set(cuis.keys())
    super().__init__(**data)

cuis class-attribute instance-attribute

cuis: set[str] = set()

cuis_exclude class-attribute instance-attribute

cuis_exclude: set[str] = set()

check_filters

check_filters(cui: str) -> bool

Checks whether a CUI is in the filters

Parameters:

  • cui

    (str) –

    The CUI in question

Returns:

  • bool ( bool ) –

    True if the CUI is allowed

Source code in medcat-v2/medcat/config/config.py
def check_filters(self, cui: str) -> bool:
    """Checks is a CUI in the filters

    Args:
        cui (str): The CUI in question

    Returns:
        bool: True if the CUI is allowed
    """
    if cui in self.cuis or not self.cuis:
        return cui not in self.cuis_exclude
    else:
        return False

ModelMeta

Bases: SerialisableBaseModel

Methods:

Attributes:

description class-attribute instance-attribute

description: str = 'N/A'

hash class-attribute instance-attribute

hash: str = ''

history class-attribute instance-attribute

history: list[str] = Field(default_factory=list)

last_saved class-attribute instance-attribute

last_saved: datetime = Field(default_factory=now)

location class-attribute instance-attribute

location: str = 'N/A'

medcat_version class-attribute instance-attribute

medcat_version: str = ''

ontology class-attribute instance-attribute

ontology: list[str] = []

saved_environ class-attribute instance-attribute

saved_environ: Environment = Field(default_factory=get_environment_info)

sup_trained class-attribute instance-attribute

sup_trained: list[TrainingDescriptor] = []

unsup_trained class-attribute instance-attribute

unsup_trained: list[TrainingDescriptor] = []

add_sup_training

add_sup_training(start_time: datetime, num_docs: int, project_name: str) -> None

Add supervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

This will be called for every project being trained separately. So if there's an MCT export being trained with multiple projects, multiple different training instances will be recorded.

Parameters:

  • start_time

    (datetime) –

    The time at which the training was started.

  • num_docs

    (int) –

    The number of documents that were trained.

  • project_name

    (str) –

    The project name.

Source code in medcat-v2/medcat/config/config.py
def add_sup_training(self, start_time: datetime, num_docs: int,
                     project_name: str) -> None:
    """Add supervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    NOTE: This will be called for every project being trained separately.
          So if there's a MCT export being trained with multiple projects,
          multiple different training instances will be recorded.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents that were trained.
        project_name (str): The project name.
    """
    self.sup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs, num_epochs=1
    ))

add_unsup_training

add_unsup_training(start_time: datetime, num_docs: int, num_epochs: int = 1, project_name: str = 'N/A')

Add unsupervised training information based on data.

This will mark down the time taken for training by comparing the start time to the current time.

Parameters:

  • start_time

    (datetime) –

    The time at which the training was started.

  • num_docs

    (int) –

    The number of documents trained.

  • num_epochs

    (int, default: 1 ) –

    The number of epochs. Defaults to 1.

  • project_name

    (str, default: 'N/A' ) –

    The project name. Defaults to 'N/A'.

Source code in medcat-v2/medcat/config/config.py
def add_unsup_training(self, start_time: datetime, num_docs: int,
                       num_epochs: int = 1, project_name: str = 'N/A'):
    """Add unsupervised training information based on data.

    NOTE: This will mark down the time taken for training by comparing
          the start time to the current time.

    Args:
        start_time (datetime): The time at which the training was started.
        num_docs (int): The number of documents trained.
        num_epochs (int, optional): The number of epochs. Defaults to 1.
        project_name (str, optional): The project name. Defaults to 'N/A'.
    """
    self.unsup_trained.append(TrainingDescriptor(
        train_time_start=start_time, train_time_end=datetime.now(),
        project_name=project_name, num_docs=num_docs,
        num_epochs=num_epochs))

mark_saved_now

mark_saved_now()
Source code in medcat-v2/medcat/config/config.py
def mark_saved_now(self):
    self.last_saved = datetime.now()
    self.saved_environ = get_environment_info()
    self.medcat_version = medcat_version

prepare_and_report_training

prepare_and_report_training(data_iterator: C, num_epochs: int, supervised: bool = False, project_name: str = 'N/A') -> Iterator[C]

Context manager for preparing training.

This is used so that we can get the number of items in the data during training.

Parameters:

  • data_iterator

    (C) –

    The data to be trained.

  • num_epochs

    (int) –

    The number of epochs to be used.

  • supervised

    (bool, default: False ) –

    Whether training is supervised. Defaults to False.

  • project_name

    (str, default: 'N/A' ) –

    The project name. Defaults to 'N/A'.

Yields:

  • C

    Iterator[C]: The same data that was input.

Source code in medcat-v2/medcat/config/config.py
@contextmanager
def prepare_and_report_training(self,
                                data_iterator: C,
                                num_epochs: int,
                                supervised: bool = False,
                                project_name: str = 'N/A'
                                ) -> Iterator[C]:
    """Context manager for preparing training.

    This is used so that we can get the number of items in the data
    during training.

    Args:
        data_iterator (C): The data to be trained.
        num_epochs (int): The number of epochs to be used.
        supervised (bool, optional): Whether training is supervised.
            Defaults to False.
        project_name (str, optional): The project name. Defaults to 'N/A'.

    Yields:
        Iterator[C]: The same data that was input.
    """
    _names, _counts = [], [0]  # NOTE: 0 count for fallback

    def callback(name: str, count: int) -> None:
        _names.append(name)
        _counts.append(count)
    wrapped = callback_iterator(f"TRAIN-{id(data_iterator)}",
                                data_iterator, callback)
    start_time = datetime.now()
    try:
        yield cast(C, wrapped)
    finally:
        # even if something fails, log the count
        num_docs = _counts[1]
        if supervised:
            self.add_sup_training(start_time=start_time,
                                  num_docs=num_docs,
                                  project_name=project_name)
        else:
            self.add_unsup_training(start_time=start_time,
                                    num_docs=num_docs,
                                    num_epochs=num_epochs,
                                    project_name=project_name)
        if len(_names) != 1:
            logger.warning(
                "Something went wrong during %ssupervised training. "
                "The number of documents trained was unable to be "
                "clearly obtained. Counted %d names (%s) at %s",
                'un' if not supervised else '', len(_names), _names,
                _counts)

NLPConfig

Bases: SerialisableBaseModel

Attributes:

disabled_components class-attribute instance-attribute

disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens']

The list of components that will be disabled for the NLP.

NB! For these changes to take effect, the pipe would need to be recreated.

faster_spacy_tokenization class-attribute instance-attribute

faster_spacy_tokenization: bool = False

Allow skipping the spacy pipeline.

If True, uses basic tokenization only (spacy.make_doc) for ~3-4x overall speedup. If False, uses full linguistic pipeline including POS tagging, lemmatization, and stopword detection.

Impact of faster_spacy_tokenization=True:

  • No part-of-speech tags: all tokens are treated uniformly during normalization
  • No lemmatization: words are used in surface form (e.g., "running" vs "run")
  • No stopword detection: all tokens in multi-token spans are considered; all tokens are used in context vector calculation
  • Real-world performance (in terms of precision and recall) is likely to be lower

When to use fast mode:

  • Processing very large datasets where speed is critical
  • Text is already clean/normalized
  • Minor drops in precision/recall (typically 1-3%) are acceptable

When to use full mode (default):

  • Maximum accuracy is required
  • Working with noisy or varied text
  • Proper linguistic analysis improves your specific use case

Benchmark on your data to determine if the speedup justifies the accuracy tradeoff.

PS: Only applicable for spacy based tokenizer.

NB! For these changes to take effect, the pipe would need to be recreated.

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow', validate_assignment=True)

modelname class-attribute instance-attribute

modelname: str = 'en_core_web_md'

What model will be used for tokenization.

NB! For these changes to take effect, the pipe would need to be recreated.

provider class-attribute instance-attribute

provider: str = 'regex'

The NLP provider.

Currently only regex and spacy are natively supported.

NB! For these changes to take effect, the pipe would need to be recreated.

Ner

Bases: ComponentConfig

The NER part of the config

Attributes:

check_upper_case_names class-attribute instance-attribute

check_upper_case_names: bool = False

Check uppercase to distinguish uppercase and lowercase words that have a different meaning.

custom_cnf class-attribute instance-attribute

custom_cnf: Optional[Any] = None

The custom config for the component.

max_skip_tokens class-attribute instance-attribute

max_skip_tokens: int = 2

When checking tokens for concepts, you can have skipped tokens between used ones (usually spaces, new lines, etc.). This number tells you how many skipped tokens you can have.

min_name_len class-attribute instance-attribute

min_name_len: int = 3

Do not detect names below this limit, skip them

model_config class-attribute instance-attribute

model_config = ConfigDict(extra='allow')

try_reverse_word_order class-attribute instance-attribute

try_reverse_word_order: bool = False

Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart

upper_case_limit_len class-attribute instance-attribute

upper_case_limit_len: int = 4

Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.

Preprocessing

Bases: SerialisableBaseModel

The preprocessing part of the config

Attributes:

do_not_normalize class-attribute instance-attribute

do_not_normalize: set[str] = {'VBD', 'VBG', 'VBN', 'VBP', 'JJS', 'JJR'}

Should specific word types be normalized, e.g. running -> run. Values are detailed part-of-speech tags. See:

  • https://spacy.io/usage/linguistic-features#pos-tagging
  • the Label scheme section per model at https://spacy.io/models/en

keep_punct class-attribute instance-attribute

keep_punct: set = {'.', ':'}

All punctuation will be skipped by default; here you can set what will be kept

max_document_length class-attribute instance-attribute

max_document_length: int = 1000000

Documents longer than this will be trimmed.

NB! For these changes to take effect, the pipe would need to be recreated.

min_len_normalize class-attribute instance-attribute

min_len_normalize: int = 5

Nothing below this length will ever be normalized (input tokens or concept names); normalized means lemmatized in this case

skip_stopwords class-attribute instance-attribute

skip_stopwords: bool = False

Should stopwords be skipped/ignored when processing input

stopwords class-attribute instance-attribute

stopwords: Optional[set] = None

If None, the default set of stopwords from spacy will be used. This must be a set.

NB! For these changes to take effect, the pipe would need to be recreated.

words_to_skip class-attribute instance-attribute

words_to_skip: set = {'nos'}

These words will be completely ignored in concepts and in the text (must be a set)

TrainingDescriptor

Bases: SerialisableBaseModel

Attributes:

num_docs instance-attribute

num_docs: int

num_epochs class-attribute instance-attribute

num_epochs: int = 1

project_name instance-attribute

project_name: Optional[str]

train_time_end instance-attribute

train_time_end: datetime

train_time_start instance-attribute

train_time_start: datetime

UsageMonitor

Bases: SerialisableBaseModel

Attributes:

  • batch_size (int) –

    Number of logged events to write at once.

  • enabled (Literal[True, False, 'auto']) –

    Whether usage monitoring is enabled (True), disabled (False),

  • file_prefix (str) –

    The prefix for logged files. The suffix will be the model hash.

  • log_folder (str) –

    The folder which contains the usage logs. In certain situations,

batch_size class-attribute instance-attribute

batch_size: int = 100

Number of logged events to write at once.

enabled class-attribute instance-attribute

enabled: Literal[True, False, 'auto'] = False

Whether usage monitoring is enabled (True), disabled (False), or automatic ('auto'). If set to False, no logging is performed. If set to True, logs are saved in the location specified by log_folder. If set to 'auto', logging is automatically enabled or disabled based on an environment variable (MEDCAT_LOGS - setting it to False or 0 disables logging), and logs are written to the OS-preferred logs location (MEDCAT_LOGS_LOCATION). The defaults for the location are:

  • For Linux: ~/.local/share/medcat/logs/
  • For Windows: C:\Users\%USERNAME%\.cache\medcat\logs\
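A sketch of how the 'auto' resolution described above could work (the exact environment-variable parsing MedCAT uses is an assumption here):

```python
import os

def logging_enabled(enabled, environ=None) -> bool:
    # Sketch of the 'auto' rule: MEDCAT_LOGS set to 'False' or '0'
    # disables logging; anything else (or unset) enables it.
    environ = os.environ if environ is None else environ
    if enabled is True or enabled is False:
        return enabled
    # enabled == 'auto'
    return environ.get("MEDCAT_LOGS", "1") not in ("False", "0")

auto_on = logging_enabled("auto", {})
auto_off = logging_enabled("auto", {"MEDCAT_LOGS": "0"})
```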

file_prefix class-attribute instance-attribute

file_prefix: str = 'usage_'

The prefix for logged files. The suffix will be the model hash.

log_folder class-attribute instance-attribute

log_folder: str = '.'

The folder which contains the usage logs. In certain situations, it may make sense to keep this separate from the overall logs. NOTE: Does not take effect if enabled is set to 'auto'