medcat.config
Classes:
-
AnnotationOutput–The annotation output part of the config
-
CDBMaker–The Context Database (CDB) making part of the config
-
Components– -
Config– -
Linking–The linking part of the config
-
LinkingFilters–These describe the linking filters used alongside the model.
-
ModelMeta– -
NLPConfig– -
Ner–The NER part of the config
-
Preprocessing–The preprocessing part of the config
-
TrainingDescriptor– -
UsageMonitor–
AnnotationOutput
Bases: SerialisableBaseModel
The annotation output part of the config
Attributes:
-
context_left(int) – -
context_right(int) – -
include_text_in_output(bool) – -
lowercase_context(bool) –
CDBMaker
Bases: SerialisableBaseModel
The Context Database (CDB) making part of the config
Attributes:
-
min_letters_required(int) –Minimum number of letters required in a name to be accepted
-
multi_separator(str) –If multiple names or type_ids for a concept present in one row of a CSV,
-
name_versions(list) –Name versions to be generated.
-
remove_parenthesis(int) –Should preferred names with parenthesis be cleaned 0 means no,
min_letters_required
class-attribute
instance-attribute
min_letters_required: int = 2
Minimum number of letters required in a name to be accepted for a concept
multi_separator
class-attribute
instance-attribute
multi_separator: str = '|'
If multiple names or type_ids for a concept are present in one row of a CSV, they are separated by the specified character.
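For illustration, the splitting behaviour this setting describes can be sketched as follows (the split_cell helper is hypothetical and not part of MedCAT's API):

```python
# Hypothetical sketch of the multi_separator behaviour: one CSV cell can
# carry several names for a concept, separated by the configured character.
multi_separator = '|'  # the CDBMaker default

def split_cell(cell: str, sep: str = multi_separator) -> list[str]:
    """Split a raw CSV cell into individual names, dropping empty entries."""
    return [part.strip() for part in cell.split(sep) if part.strip()]

names = split_cell('heart attack|myocardial infarction|MI')
print(names)  # three separate names for the same concept
```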
name_versions
class-attribute
instance-attribute
name_versions: list = ['LOWER', 'CLEAN']
Name versions to be generated.
remove_parenthesis
class-attribute
instance-attribute
remove_parenthesis: int = 5
Should preferred names with parentheses be cleaned? 0 means no; any other value means names longer than or equal to that value have the parenthetical part removed, e.g. Head (Body part) -> Head
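A minimal sketch of this cleaning rule (the clean_preferred_name helper and the regex are illustrative assumptions, not MedCAT's actual implementation):

```python
import re

# Hypothetical sketch of the remove_parenthesis rule: preferred names at
# least `threshold` characters long have a parenthetical qualifier stripped;
# a threshold of 0 disables cleaning entirely.
remove_parenthesis = 5  # the default threshold

def clean_preferred_name(name: str, threshold: int = remove_parenthesis) -> str:
    if threshold == 0 or len(name) < threshold:
        return name  # 0 disables cleaning; short names are left untouched
    return re.sub(r'\s*\([^)]*\)', '', name).strip()

print(clean_preferred_name('Head (Body part)'))  # -> 'Head'
```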
Components
Bases: SerialisableBaseModel
Attributes:
-
addons(list[ComponentConfig]) – -
comp_order(list[str]) – -
linking(Linking) – -
ner(Ner) – -
tagging(ComponentConfig) – -
token_normalizing(ComponentConfig) –
comp_order
class-attribute
instance-attribute
token_normalizing
class-attribute
instance-attribute
token_normalizing: ComponentConfig = ComponentConfig()
Config
Bases: SerialisableBaseModel
Attributes:
-
annotation_output(AnnotationOutput) – -
cdb_maker(CDBMaker) – -
components(Components) – -
general(General) – -
meta(ModelMeta) – -
preprocessing(Preprocessing) –
annotation_output
class-attribute
instance-attribute
annotation_output: AnnotationOutput = AnnotationOutput()
Linking
Bases: ComponentConfig
The linking part of the config
Attributes:
-
additional(Optional[Any]) –Some additional config for non-default linkers.
-
always_calculate_similarity(bool) –Do we want to calculate context similarity even for concepts that are
-
calculate_dynamic_threshold(bool) –Concepts below this similarity will be ignored. Type can be
-
context_ignore_center_tokens(bool) –If true when the context of a concept is calculated (embedding)
-
context_vector_sizes(dict) –Context vector sizes that will be calculated and used for linking
-
context_vector_weights(dict) –Weight of each vector in the similarity score - make trainable at
-
devalue_linked_concepts(bool) –When adding a positive example, should it also be treated as Negative
-
disamb_length_limit(int) –All concepts below this will always be disambiguated
-
filter_before_disamb(bool) –If True it will filter before doing disamb. Useful for the trainer.
-
filters(LinkingFilters) –Filters
-
model_config– -
negative_ignore_punct_and_num(bool) –Do we ignore punct/num when negative sampling
-
negative_probability(float) –Probability for the negative context to be added for each
-
optim(dict) –Linear anneal
-
prefer_frequent_concepts(float) –If >0 concepts that are more frequent will be preferred
-
prefer_primary_name(float) –If >0 concepts for which a detection is its primary name
-
random_replacement_unsupervised(float) –If <1 during unsupervised training the detected term will be randomly
-
similarity_threshold(float) – -
similarity_threshold_type(str) – -
subsample_after(int) –DISABLED in code permanently: Subsample during unsupervised
-
train(bool) –Should it train or not, this is set automatically ignore in 99% of
-
train_count_threshold(int) –Concepts that have seen less training examples than this will not be
additional
class-attribute
instance-attribute
Some additional config for non-default linkers. E.g. the 2-step linker uses this for alpha calculations and learning rate for type contexts.
always_calculate_similarity
class-attribute
instance-attribute
always_calculate_similarity: bool = False
Do we want to calculate context similarity even for concepts that are not ambiguous.
calculate_dynamic_threshold
class-attribute
instance-attribute
calculate_dynamic_threshold: bool = False
Concepts below this similarity will be ignored. Type can be static/dynamic - if dynamic each CUI has a different TH and it is calculated as the average confidence for that CUI * similarity_threshold. Take care that dynamic works only if the cdb was trained with calculate_dynamic_threshold = True.
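The per-CUI cut-off described above can be sketched as follows (the dynamic_threshold helper and the numbers are illustrative; the exact computation in MedCAT may differ):

```python
# Hedged sketch of the dynamic threshold: with
# similarity_threshold_type == 'dynamic', each CUI gets its own cut-off,
# computed as the average training confidence seen for that CUI multiplied
# by the global similarity_threshold.
similarity_threshold = 0.25  # assumed global threshold for this example

def dynamic_threshold(avg_confidence_for_cui: float,
                      base_threshold: float = similarity_threshold) -> float:
    return avg_confidence_for_cui * base_threshold

# A CUI whose contexts averaged 0.8 confidence gets a stricter cut-off
# than one averaging 0.4:
print(dynamic_threshold(0.8), dynamic_threshold(0.4))
```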
context_ignore_center_tokens
class-attribute
instance-attribute
context_ignore_center_tokens: bool = False
If true when the context of a concept is calculated (embedding) the words making that concept are not taken into account
context_vector_sizes
class-attribute
instance-attribute
context_vector_sizes: dict = {'xlong': 27, 'long': 18, 'medium': 9, 'short': 3}
Context vector sizes that will be calculated and used for linking
context_vector_weights
class-attribute
instance-attribute
context_vector_weights: dict = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}
Weight of each vector in the similarity score - make trainable at some point. Should add up to 1.
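How the per-size similarities might be combined under these weights can be sketched as follows (the combined_similarity helper and the similarity values are made up for illustration):

```python
# Hedged sketch: each context size contributes its similarity to the final
# score, weighted by the configured weight (weights should add up to 1).
context_vector_weights = {'xlong': 0.1, 'long': 0.4, 'medium': 0.4, 'short': 0.1}

def combined_similarity(sims: dict[str, float],
                        weights: dict[str, float] = context_vector_weights) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[size] * sims[size] for size in weights)

sims = {'xlong': 0.2, 'long': 0.5, 'medium': 0.6, 'short': 0.9}
print(round(combined_similarity(sims), 3))
```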
devalue_linked_concepts
class-attribute
instance-attribute
devalue_linked_concepts: bool = False
When adding a positive example, should it also be treated as Negative for concepts which link to the positive one via names (ambiguous names).
disamb_length_limit
class-attribute
instance-attribute
disamb_length_limit: int = 3
All concepts below this will always be disambiguated
filter_before_disamb
class-attribute
instance-attribute
filter_before_disamb: bool = False
If True it will filter before doing disamb. Useful for the trainer.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow')
negative_ignore_punct_and_num
class-attribute
instance-attribute
negative_ignore_punct_and_num: bool = True
Do we ignore punct/num when negative sampling
negative_probability
class-attribute
instance-attribute
negative_probability: float = 0.5
Probability for the negative context to be added for each positive addition
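The sampling decision can be sketched as follows (the should_add_negative helper is hypothetical; it only illustrates a coin flip at this probability):

```python
import random

# Hedged sketch: for each positive context addition, a negative context is
# also sampled with probability negative_probability.
negative_probability = 0.5

def should_add_negative(rng: random.Random,
                        prob: float = negative_probability) -> bool:
    return rng.random() < prob

rng = random.Random(0)  # seeded for reproducibility
draws = [should_add_negative(rng) for _ in range(1000)]
print(sum(draws))  # roughly half of the positive additions get a negative one
```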
optim
class-attribute
instance-attribute
optim: dict = {'type': 'linear', 'base_lr': 1, 'min_lr': 5e-05}
Linear anneal
prefer_frequent_concepts
class-attribute
instance-attribute
prefer_frequent_concepts: float = 0.35
If >0 concepts that are more frequent will be preferred by a multiply of this amount
prefer_primary_name
class-attribute
instance-attribute
prefer_primary_name: float = 0.35
If >0 concepts for which a detection is its primary name will be preferred by that amount (0 to 1)
random_replacement_unsupervised
class-attribute
instance-attribute
random_replacement_unsupervised: float = 0.8
If <1, during unsupervised training the detected term will be randomly replaced with a probability of 1 - random_replacement_unsupervised. It is replaced with a synonym used for that term.
similarity_threshold_type
class-attribute
instance-attribute
similarity_threshold_type: str = 'static'
subsample_after
class-attribute
instance-attribute
subsample_after: int = 30000
DISABLED in code permanently: Subsample during unsupervised training if a concept has received more than this many training examples.
train
class-attribute
instance-attribute
train: bool = True
Should it train or not. This is set automatically; ignore it in 99% of cases and do not set it manually.
train_count_threshold
class-attribute
instance-attribute
train_count_threshold: int = 1
Concepts that have seen less training examples than this will not be used for similarity calculation and will have a similarity of -1.
LinkingFilters
LinkingFilters(**data)
Bases: SerialisableBaseModel
These describe the linking filters used alongside the model.
When no CUIs nor excluded CUIs are specified (the sets are empty), all CUIs are accepted. If there are CUIs specified then only those will be accepted. If there are excluded CUIs specified, they are excluded.
In some cases, there are extra filters as well as MedCATtrainer (MCT) export filters. These are expected to follow the following: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter
While any other CUIs can be included in the extra CUI filter or the MCT filter, they would not have any real effect.
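The acceptance semantics described above can be sketched as follows (the cui_allowed helper and the attribute names cuis / cuis_exclude are illustrative assumptions, not necessarily the model's real field names):

```python
# Minimal sketch of the filtering semantics: an empty include-set accepts
# everything; a non-empty one accepts only its members; exclusions win.
def cui_allowed(cui: str, cuis: set[str], cuis_exclude: set[str]) -> bool:
    if cui in cuis_exclude:
        return False          # explicit exclusions always win
    if not cuis:
        return True           # empty include-set means everything is accepted
    return cui in cuis        # otherwise only listed CUIs pass

print(cui_allowed('C001', set(), set()))        # True: no filters at all
print(cui_allowed('C001', {'C002'}, set()))     # False: not in include set
print(cui_allowed('C001', {'C001'}, {'C001'}))  # False: excluded
```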
Methods:
-
check_filters–Checks if a CUI is in the filters
Source code in medcat-v2/medcat/config/config.py
check_filters
Checks if a CUI is in the filters
Parameters:
-
cui(str) –The CUI in question
Returns:
-
bool(bool) –True if the CUI is allowed
Source code in medcat-v2/medcat/config/config.py
ModelMeta
Bases: SerialisableBaseModel
Methods:
-
add_sup_training–Add supervised training information based on data.
-
add_unsup_training–Add unsupervised training information based on data.
-
mark_saved_now– -
prepare_and_report_training–Context manager for preparing training.
Attributes:
-
description(str) – -
hash(str) – -
history(list[str]) – -
last_saved(datetime) – -
location(str) – -
medcat_version(str) – -
ontology(list[str]) – -
saved_environ(Environment) – -
sup_trained(list[TrainingDescriptor]) – -
unsup_trained(list[TrainingDescriptor]) –
saved_environ
class-attribute
instance-attribute
saved_environ: Environment = Field(default_factory=get_environment_info)
add_sup_training
add_sup_training(start_time: datetime, num_docs: int, project_name: str) -> None
Add supervised training information based on data.
This will mark down the time taken for training by comparing
the start time to the current time.
This will be called for every project being trained separately.
So if an MCT export with multiple projects is being trained, multiple different training instances will be recorded.
Parameters:
-
start_time(datetime) –The time at which the training was started.
-
num_docs(int) –The number of documents that were trained.
-
project_name(str) –The project name.
Source code in medcat-v2/medcat/config/config.py
add_unsup_training
add_unsup_training(start_time: datetime, num_docs: int, num_epochs: int = 1, project_name: str = 'N/A')
Add unsupervised training information based on data.
This will mark down the time taken for training by comparing
the start time to the current time.
Parameters:
-
start_time(datetime) –The time at which the training was started.
-
num_docs(int) –The number of documents trained.
-
num_epochs(int, default: 1) –The number of epochs. Defaults to 1.
-
project_name(str, default: 'N/A') –The project name. Defaults to 'N/A'.
Source code in medcat-v2/medcat/config/config.py
mark_saved_now
mark_saved_now()
Source code in medcat-v2/medcat/config/config.py
prepare_and_report_training
prepare_and_report_training(data_iterator: C, num_epochs: int, supervised: bool = False, project_name: str = 'N/A') -> Iterator[C]
Context manager for preparing training.
This is used so that we can get the number of items in the data during training.
Parameters:
-
data_iterator(C) –The data to be trained.
-
num_epochs(int) –The number of epochs to be used.
-
supervised(bool, default: False) –Whether training is supervised. Defaults to False.
-
project_name(str, default: 'N/A') –The project name. Defaults to 'N/A'.
Yields:
-
C–Iterator[C]: The same data that was input.
Source code in medcat-v2/medcat/config/config.py
NLPConfig
Bases: SerialisableBaseModel
Attributes:
-
disabled_components(list) –The list of components that will be disabled for the NLP.
-
faster_spacy_tokenization(bool) –Allow skipping the spacy pipeline.
-
model_config– -
modelname(str) –What model will be used for tokenization.
-
provider(str) –The NLP provider.
disabled_components
class-attribute
instance-attribute
disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens']
The list of components that will be disabled for the NLP.
NB! For these changes to take effect, the pipe would need to be recreated.
faster_spacy_tokenization
class-attribute
instance-attribute
faster_spacy_tokenization: bool = False
Allow skipping the spacy pipeline.
If True, uses basic tokenization only (spacy.make_doc) for ~3-4x overall speedup. If False, uses full linguistic pipeline including POS tagging, lemmatization, and stopword detection.
Impact of fast_tokenization=True:
- No part-of-speech tags: All tokens treated uniformly during normalization
- No lemmatization: Words used in surface form (e.g., "running" vs "run")
- No stopword detection: All tokens in multi-token spans considered; all tokens used in context vector calculation
- Real world performance (in terms of precision and recall) is likely to be lower
When to use fast mode:
- Processing very large datasets where speed is critical
- Text is already clean/normalized
- Minor drops in precision/recall (typically 1-3%) are acceptable
When to use full mode (default):
- Maximum accuracy is required
- Working with noisy or varied text
- Proper linguistic analysis improves your specific use case
Benchmark on your data to determine if the speedup justifies the accuracy tradeoff.
PS: Only applicable for spacy based tokenizer.
NB! For these changes to take effect, the pipe would need to be recreated.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
modelname
class-attribute
instance-attribute
modelname: str = 'en_core_web_md'
What model will be used for tokenization.
NB! For these changes to take effect, the pipe would need to be recreated.
provider
class-attribute
instance-attribute
provider: str = 'regex'
The NLP provider.
Currently only regex and spacy are natively supported.
NB! For these changes to take effect, the pipe would need to be recreated.
Ner
Bases: ComponentConfig
The NER part of the config
Attributes:
-
check_upper_case_names(bool) –Check uppercase to distinguish uppercase and lowercase words that have
-
custom_cnf(Optional[Any]) –The custom config for the component.
-
max_skip_tokens(int) –When checking tokens for concepts you can have skipped tokens between
-
min_name_len(int) –Do not detect names below this limit, skip them
-
model_config– -
try_reverse_word_order(bool) –Try reverse word order for short concepts (2 words max),
-
upper_case_limit_len(int) –Any name shorter than this must be uppercase in the text to be
check_upper_case_names
class-attribute
instance-attribute
check_upper_case_names: bool = False
Check uppercase to distinguish uppercase and lowercase words that have a different meaning.
custom_cnf
class-attribute
instance-attribute
The custom config for the component.
max_skip_tokens
class-attribute
instance-attribute
max_skip_tokens: int = 2
When checking tokens for concepts you can have skipped tokens between used ones (usually spaces, new lines etc). This number tells you how many skipped tokens you can have.
min_name_len
class-attribute
instance-attribute
min_name_len: int = 3
Do not detect names below this limit, skip them
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow')
try_reverse_word_order
class-attribute
instance-attribute
try_reverse_word_order: bool = False
Try reverse word order for short concepts (2 words max), e.g. heart disease -> disease heart
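The reversed-order check can be sketched as follows (the candidate_orders helper is illustrative; it just shows which orderings would be tried):

```python
# Hedged sketch: for two-word concepts, the reversed word order is also
# tried as a detection candidate when try_reverse_word_order is enabled.
def candidate_orders(tokens: list[str], try_reverse: bool = True) -> list[str]:
    candidates = [' '.join(tokens)]
    if try_reverse and len(tokens) == 2:
        candidates.append(' '.join(reversed(tokens)))
    return candidates

print(candidate_orders(['heart', 'disease']))
# ['heart disease', 'disease heart'] - both are checked
```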
upper_case_limit_len
class-attribute
instance-attribute
upper_case_limit_len: int = 4
Any name shorter than this must be uppercase in the text to be considered. If it is not uppercase it will be skipped.
Preprocessing
Bases: SerialisableBaseModel
The preprocessing part of the config
Attributes:
-
do_not_normalize(set[str]) –Should specific word types be normalized: e.g. running -> run
-
keep_punct(set) –All punct will be skipped by default, here you can set what
-
max_document_length(int) –Documents longer than this will be trimmed.
-
min_len_normalize(int) –Nothing below this length will ever be normalized (input tokens or
-
skip_stopwords(bool) –Should stopwords be skipped/ignored when processing input
-
stopwords(Optional[set]) –If None the default set of stopwords from spacy will be used.
-
words_to_skip(set) –These words will be completely ignored in concepts and in the text
do_not_normalize
class-attribute
instance-attribute
Should specific word types be normalized: e.g. running -> run. Values are detailed part-of-speech tags. See:
- https://spacy.io/usage/linguistic-features#pos-tagging
- Label scheme section per model at https://spacy.io/models/en
keep_punct
class-attribute
instance-attribute
keep_punct: set = {'.', ':'}
All punct will be skipped by default, here you can set what will be kept
max_document_length
class-attribute
instance-attribute
max_document_length: int = 1000000
Documents longer than this will be trimmed.
NB! For these changes to take effect, the pipe would need to be recreated.
min_len_normalize
class-attribute
instance-attribute
min_len_normalize: int = 5
Nothing below this length will ever be normalized (input tokens or concept names), normalized means lemmatized in this case
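The length gate this setting implies can be sketched as follows (the maybe_normalize helper is hypothetical; in MedCAT the lemma would come from the NLP pipeline):

```python
# Hedged sketch: tokens (or concept-name parts) shorter than
# min_len_normalize keep their surface form; longer ones use the lemma.
min_len_normalize = 5  # the default

def maybe_normalize(token: str, lemma: str, min_len: int = min_len_normalize) -> str:
    if len(token) < min_len:
        return token.lower()   # too short: never normalized
    return lemma.lower()       # long enough: lemmatized form is used

print(maybe_normalize('runs', 'run'))     # 'runs' (4 < 5, kept as-is)
print(maybe_normalize('running', 'run'))  # 'run'
```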
skip_stopwords
class-attribute
instance-attribute
skip_stopwords: bool = False
Should stopwords be skipped/ignored when processing input
stopwords
class-attribute
instance-attribute
If None the default set of stowords from spacy will be used. This must be a Set.
NB! For these changes to take effect, the pipe would need to be recreated.
words_to_skip
class-attribute
instance-attribute
words_to_skip: set = {'nos'}
These words will be completely ignored in concepts and in the text (must be a Set)
TrainingDescriptor
Bases: SerialisableBaseModel
Attributes:
-
num_docs(int) – -
num_epochs(int) – -
project_name(Optional[str]) – -
train_time_end(datetime) – -
train_time_start(datetime) –
UsageMonitor
Bases: SerialisableBaseModel
Attributes:
-
batch_size(int) –Number of logged events to write at once.
-
enabled(Literal[True, False, 'auto']) –Whether usage monitoring is enabled (True), disabled (False),
-
file_prefix(str) –The prefix for logged files. The suffix will be the model hash.
-
log_folder(str) –The folder which contains the usage logs. In certain situations,
batch_size
class-attribute
instance-attribute
batch_size: int = 100
Number of logged events to write at once.
enabled
class-attribute
instance-attribute
enabled: Literal[True, False, 'auto'] = False
Whether usage monitoring is enabled (True), disabled (False),
or automatic ('auto').
If set to False, no logging is performed.
If set to True, logs are saved in the location specified by log_folder.
If set to 'auto', logging is enabled or disabled based on an
environment variable (MEDCAT_LOGS - setting it to False or 0
disables logging) and the logs are written to the OS-preferred logs
location (MEDCAT_LOGS_LOCATION).
The defaults for the location are:
- For Linux: ~/.local/share/medcat/logs/
- For Windows: C:\Users\%USERNAME%\.cache\medcat\logs\
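The 'auto' resolution described above can be sketched as follows (the resolve_enabled helper and its parsing of the variable are assumptions; MedCAT's actual handling may differ):

```python
# Hedged sketch of resolving the tri-state `enabled` flag: True/False are
# taken as-is; 'auto' defers to the MEDCAT_LOGS environment variable,
# where 'False' or '0' disables logging and anything else enables it.
def resolve_enabled(enabled, env: dict[str, str]) -> bool:
    if enabled is True or enabled is False:
        return enabled
    # 'auto': the environment variable decides
    return env.get('MEDCAT_LOGS', '1') not in ('False', '0')

print(resolve_enabled('auto', {'MEDCAT_LOGS': '0'}))  # False: disabled by env
print(resolve_enabled('auto', {}))                    # True: enabled by default
```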
file_prefix
class-attribute
instance-attribute
file_prefix: str = 'usage_'
The prefix for logged files. The suffix will be the model hash.
log_folder
class-attribute
instance-attribute
log_folder: str = '.'
The folder which contains the usage logs. In certain situations,
it may make sense to keep this separate from the overall logs.
NOTE: Does not take effect if enabled is set to 'auto'