medcat.config.config_meta_cat
Classes:
- ConfigMetaCAT – The MetaCAT part of the config
- General – The General part of the MetaCAT config
- Model – The Model part of the MetaCAT config
- Train – The Train part of the MetaCAT config

Attributes:
- logger
ConfigMetaCAT
Bases: ComponentConfig
The MetaCAT part of the config
Attributes:
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
General
Bases: DirtiableBaseModel
The General part of the MetaCAT config
Methods:
- get_applicable_category_name

Attributes:
- alternative_category_names (list[str]) – List that stores the variations of possible category names
- alternative_class_names (list[list[str]]) – List of lists that stores the variations of possible class names
- batch_size_eval (int) – Number of annotations to be meta-annotated at once in eval
- category_name (Optional[str]) – What category this meta_cat model is predicting/training
- category_value2id (dict[str, int]) – Map from category values to IDs; if empty, it will be autocalculated
- cntx_left (int) – Number of tokens to take from the left of the concept
- cntx_right (int) – Number of tokens to take from the right of the concept
- description (str) – A basic description of this MetaCAT model
- device (str) – Device to be used by the module when predicting/training
- disable_component_lock (bool) – Whether to use the MetaCAT component lock
- lowercase (bool) – If true, all input text will be lowercased
- model_config
- pipe_batch_size_in_chars (int) – How many characters are piped at once into the meta_cat class
- replace_center (Optional[Any]) – If set, the center (concept) will be replaced with this string
- save_and_reuse_tokens (bool) – A dangerous option; if not sure, ALWAYS set to False
- seed (int) – The seed for random number generation
- serialiser (AvailableSerialisers) – The serialiser to use when saving
- span_group (Optional[str]) – If set, the spaCy span group to which the MetaCAT model will assign annotations
- tokenizer_name (str) – Tokenizer name used with MetaCAT
- vocab_size (int) – Will be set automatically if the tokenizer is provided during init
alternative_category_names
class-attribute
instance-attribute
List that stores the variations of possible category names
Example: For Experiencer, the alternate name is Subject:
alternative_category_names: ['Experiencer', 'Subject']
If the name specified in the category_name parameter does not match
the data, this ensures no error is raised and the name is automatically mapped.
alternative_class_names
class-attribute
instance-attribute
List of lists that stores the variations of possible class names for each class mentioned in self.general.category_value2id.
Example: For the Presence task, the class names vary across NHS sites. To accommodate this, alternative_class_names is populated as:
[["Hypothetical (N/A)", "Hypothetical"], ["Not present (False)", "False"], ["Present (True)", "True"]]
Each sub-list contains the possible variations of the given class.
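Taken together, category_value2id and the alternative-name lists suggest how a site-specific raw label can be normalized to its canonical class name. The sketch below is illustrative only; normalize_label is a hypothetical helper, not part of the MetaCAT API:

```python
# Illustrative sketch (not MetaCAT internals): map a raw class label from
# the training data to its canonical name via the alternative-name lists.
category_value2id = {"Hypothetical": 0, "False": 1, "True": 2}
alternative_class_names = [
    ["Hypothetical (N/A)", "Hypothetical"],
    ["Not present (False)", "False"],
    ["Present (True)", "True"],
]

def normalize_label(raw: str) -> str:
    """Return the canonical class name for a site-specific label variant."""
    for variants in alternative_class_names:
        if raw in variants:
            # The canonical name is whichever variant appears in the map.
            for name in variants:
                if name in category_value2id:
                    return name
    raise ValueError(f"Unknown class label: {raw}")
```

With this in place, a label such as "Not present (False)" resolves to "False" and then to ID 1, so no error is raised for unseen variants of a known class.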
batch_size_eval
class-attribute
instance-attribute
batch_size_eval: int = 5000
Number of annotations to be meta-annotated at once in eval
category_name
class-attribute
instance-attribute
What category this meta_cat model is predicting/training.
NB! For these changes to take effect, the pipe would need to be recreated.
category_value2id
class-attribute
instance-attribute
Map from category values to ID, if empty it will be autocalculated during training
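A plausible way such a map could be autocalculated from the labels seen in the training data (an illustrative sketch; not necessarily MetaCAT's exact logic):

```python
# Build a value-to-ID map from training labels when category_value2id is
# left empty. IDs are assigned over the sorted set of distinct values.
labels = ["True", "False", "True", "Hypothetical", "False"]
category_value2id = {value: idx for idx, value in enumerate(sorted(set(labels)))}
# category_value2id == {'False': 0, 'Hypothetical': 1, 'True': 2}
```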
cntx_left
class-attribute
instance-attribute
cntx_left: int = 15
Number of tokens to take from the left of the concept
cntx_right
class-attribute
instance-attribute
cntx_right: int = 10
Number of tokens to take from the right of the concept
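The effect of cntx_left and cntx_right can be sketched as follows (illustrative only; context_window is a hypothetical helper, not MetaCAT's internal code):

```python
# Take up to cntx_left tokens before and cntx_right tokens after the
# concept span, clipping at document boundaries.
cntx_left, cntx_right = 15, 10

def context_window(tokens, start, end):
    """Return (left context, concept tokens, right context) for a concept
    spanning tokens[start:end]."""
    left = tokens[max(0, start - cntx_left):start]
    right = tokens[end:end + cntx_right]
    return left, tokens[start:end], right

tokens = "the patient denies any chest pain at rest".split()
left, center, right = context_window(tokens, 4, 6)  # concept: "chest pain"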
description
class-attribute
instance-attribute
description: str = 'No description'
Should provide a basic description of this MetaCAT model
device
class-attribute
instance-attribute
device: str = 'cpu'
Device to be used by the module when predicting/training.
Reference
https://pytorch.org/docs/stable/tensor_attributes.html#torch.device
disable_component_lock
class-attribute
instance-attribute
disable_component_lock: bool = False
Whether to use the MetaCAT component lock.
If set to False (the default), a component lock is used that forces usage only on one thread at a time.
If set to True, the component lock is not used.
lowercase
class-attribute
instance-attribute
lowercase: bool = True
If true, all input text will be lowercased.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
pipe_batch_size_in_chars
class-attribute
instance-attribute
pipe_batch_size_in_chars: int = 20000000
How many characters are piped at once into the meta_cat class
replace_center
class-attribute
instance-attribute
If set, the center (concept) will be replaced with this string.
save_and_reuse_tokens
class-attribute
instance-attribute
save_and_reuse_tokens: bool = False
This is a dangerous option; if not sure, ALWAYS set to False. If set, it will try to share the pre-calculated context tokens between MetaCAT models when serving. It ignores differences in tokenizer and context size, so you need to be sure that every model for which this is turned on uses the same tokenizer and context size during a deployment.
seed
class-attribute
instance-attribute
seed: int = 13
The seed for random number generation.
NOTE: If used along with RelCAT or additional NER, only one of the seeds will take effect.
NB! For these changes to take effect, the pipe would need to be recreated.
serialiser
class-attribute
instance-attribute
serialiser: AvailableSerialisers = dill
The serialiser to use when saving.
span_group
class-attribute
instance-attribute
If set, the spaCy span group to which the MetaCAT model will assign annotations. Otherwise, defaults to doc._.ents or doc.ents per the annotate_overlapping settings.
tokenizer_name
class-attribute
instance-attribute
tokenizer_name: str = 'bbpe'
Tokenizer name used with MetaCAT.
Choose from
- 'bbpe': Byte Pair Encoding Tokenizer
- 'bert-tokenizer': BERT Tokenizer
NB! For these changes to take effect, the pipe would need to be recreated.
vocab_size
class-attribute
instance-attribute
vocab_size: int = -1
Will be set automatically if the tokenizer is provided during meta_cat init
get_applicable_category_name
Source code in medcat-v2/medcat/config/config_meta_cat.py
Model
Bases: DirtiableBaseModel
The model part of the metaCAT config
Attributes:
- category_undersample (str) – When using 2 phase learning, this category is used to undersample the data
- dropout (float) – The dropout for the model
- emb_grad (bool) – Applicable only for LSTM: if True, the embeddings will also be trained
- hidden_size (int) – Number of neurons in the hidden layer
- ignore_cpos (bool) – If set to True, center positions will be ignored when calculating the representation
- input_size (int) – Specifies the size of the embedding layer
- model_architecture_config (dict[str, bool]) – Specifies the architecture for the BERT model
- model_config
- model_freeze_layers (bool) – Applicable only when using BERT: determines the training approach
- model_name (str) – Model to be used for training or predicting
- model_variant (str) – Applicable only when using BERT: the model variant to use
- nclasses (int) – Number of classes that this model will output
- num_directions (int) – Applicable only for LSTM: 2 for bidirectional, 1 for unidirectional
- num_layers (int) – Number of layers in the model (both LSTM and BERT)
- padding_idx (int) – The padding index
- phase_number (int) – Indicates whether two phase learning is to be used for training
category_undersample
class-attribute
instance-attribute
category_undersample: str = ''
When using 2 phase learning, this category is used to undersample the data
dropout
class-attribute
instance-attribute
dropout: float = 0.5
The dropout for the model.
NB! For these changes to take effect, the pipe would need to be recreated.
emb_grad
class-attribute
instance-attribute
emb_grad: bool = True
Applicable only for LSTM:
If True, the embeddings will also be trained.
NB! For these changes to take effect, the pipe would need to be recreated.
hidden_size
class-attribute
instance-attribute
hidden_size: int = 300
Number of neurons in the hidden layer.
NB! For these changes to take effect, the pipe would need to be recreated.
ignore_cpos
class-attribute
instance-attribute
ignore_cpos: bool = False
If set to True, center positions will be ignored when calculating the representation.
input_size
class-attribute
instance-attribute
input_size: int = 300
Specifies the size of the embedding layer.
Applicable only for LSTM model and ignored for BERT as BERT's embedding size is predefined.
NB! For these changes to take effect, the pipe would need to be recreated.
model_architecture_config
class-attribute
instance-attribute
Specifies the architecture for the BERT model.
If fc2 is set to True, the 2nd fully connected layer is used.
If fc2 and fc3 are both set to True, the 3rd fully connected layer is used.
If lr_scheduler is set to True, the learning rate scheduler is used with the optimizer.
NB! For these changes to take effect, the pipe would need to be recreated.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True, protected_namespaces=())
model_freeze_layers
class-attribute
instance-attribute
model_freeze_layers: bool = True
Applicable only when using BERT:
Determines the training approach for BERT.
- If True: BERT layers are frozen and only the fully connected (FC) layer(s) on top are trained.
- If False: Parameter-efficient fine-tuning will be applied using Low-Rank Adaptation (LoRA).
NB! For these changes to take effect, the pipe would need to be recreated.
model_name
class-attribute
instance-attribute
model_name: str = 'lstm'
Model to be used for training or predicting.
Choose from
- 'bert'
- 'lstm'
Note
When changing the model, make sure to change the tokenizer accordingly.
NB! For these changes to take effect, the pipe would need to be recreated.
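Since the model and tokenizer must be changed together, a small consistency check can help. The pairing below is inferred from the choices listed on this page ('lstm' with 'bbpe', 'bert' with 'bert-tokenizer') and is not part of the MetaCAT API:

```python
# Hypothetical guard pairing model_name with the matching tokenizer_name.
EXPECTED_TOKENIZER = {"lstm": "bbpe", "bert": "bert-tokenizer"}

def check_pairing(model_name: str, tokenizer_name: str) -> bool:
    """True when the configured tokenizer matches the configured model."""
    return EXPECTED_TOKENIZER.get(model_name) == tokenizer_name

assert check_pairing("lstm", "bbpe")
assert not check_pairing("bert", "bbpe")  # mismatched pair is rejected
```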
model_variant
class-attribute
instance-attribute
model_variant: str = 'bert-base-uncased'
Applicable only when using BERT:
Specifies the model variant to be used.
NB! For these changes to take effect, the pipe would need to be recreated.
nclasses
class-attribute
instance-attribute
nclasses: int = 2
Number of classes that this model will output.
NB! For these changes to take effect, the pipe would need to be recreated.
num_directions
class-attribute
instance-attribute
num_directions: int = 2
Applicable only for LSTM:
2 - bidirectional model, 1 - unidirectional
NB! For these changes to take effect, the pipe would need to be recreated.
num_layers
class-attribute
instance-attribute
num_layers: int = 2
Number of layers in the model (both LSTM and BERT)
NB! For these changes to take effect, the pipe would need to be recreated.
padding_idx
class-attribute
instance-attribute
padding_idx: int = -1
The padding index.
NB! For these changes to take effect, the pipe would need to be recreated.
phase_number
class-attribute
instance-attribute
phase_number: int = 0
Indicates whether two phase learning is to be used for training.
1: Phase 1 - Train model on undersampled data
2: Phase 2 - Continue training on full data
0: None - 2 phase learning is not performed
Paper reference - https://ieeexplore.ieee.org/document/7533053
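The two phases can be sketched as follows. This is illustrative only: undersample and the (text, label) data format are assumptions, not MetaCAT internals.

```python
import random

random.seed(13)

def undersample(samples, category_undersample):
    """Phase 1 data: cap every class at the size of the target class."""
    target_n = sum(1 for _, label in samples if label == category_undersample)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    out = []
    for group in by_label.values():
        out.extend(random.sample(group, min(target_n, len(group))))
    return out

data = [("t1", "True")] * 8 + [("t2", "False")] * 2
phase1_data = undersample(data, "False")  # phase_number = 1: balanced subset
phase2_data = data                        # phase_number = 2: full data
```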
Train
Bases: DirtiableBaseModel
The train part of the metaCAT config
Attributes:
- auto_save_model (bool) – Whether the model should be saved during training when it achieves the best results
- batch_size (int)
- class_weights (Optional[Any])
- compute_class_weights (bool) – If true and class weights are not provided, they will be calculated from the data
- cui_filter (Optional[Any]) – If set, only these CUIs will be used for training
- gamma (int) – Focal Loss hyperparameter: determines the importance the loss gives to hard-to-classify examples
- last_train_on (Optional[float]) – When the last training was run
- loss_funct (str) – Loss function for the model
- lr (float)
- metric (dict[str, str]) – What metric should be used for choosing the best model
- model_config
- nepochs (int)
- prerequisites (dict)
- score_average (str) – What to use for averaging F1/P/R across labels
- shuffle_data (bool) – If set, the dataset will be shuffled before the train/test split (training only)
- test_size (float)
auto_save_model
class-attribute
instance-attribute
auto_save_model: bool = True
Whether the model should be saved during training when it achieves the best results.
compute_class_weights
class-attribute
instance-attribute
compute_class_weights: bool = False
If true and class weights are not provided, they will be calculated based on the data.
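One common way such weights are derived is inverse class frequency; the sketch below is illustrative and may differ from MetaCAT's exact formula:

```python
from collections import Counter

# Inverse-frequency weights: n_samples / (n_classes * count_of_class).
# Minority classes get larger weights, majority classes smaller ones.
labels = [0, 0, 0, 1]
counts = Counter(labels)
n, k = len(labels), len(counts)
class_weights = {cls: n / (k * counts[cls]) for cls in counts}
```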
cui_filter
class-attribute
instance-attribute
If set, only these CUIs will be used for training.
gamma
class-attribute
instance-attribute
gamma: int = 2
Focal Loss hyperparameter: determines the importance the loss gives to hard-to-classify examples.
last_train_on
class-attribute
instance-attribute
When the last training was run.
loss_funct
class-attribute
instance-attribute
loss_funct: str = 'cross_entropy'
Loss function for the model.
Choose from
- 'cross_entropy'
- 'focal_loss'
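The two choices differ mainly in how hard examples are weighted. A minimal single-example sketch of focal loss, using the gamma hyperparameter documented in this config (illustrative; not MetaCAT's training code):

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one example, where p_true is the predicted
    probability of the correct class. With gamma = 0 this reduces to
    cross-entropy; larger gamma down-weights easy (high-p) examples."""
    return -((1 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.95, 0.3
# The easy example's loss is suppressed far more than the hard one's.
assert focal_loss(easy) < -math.log(easy)
```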
metric
class-attribute
instance-attribute
What metric should be used for choosing the best model
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
score_average
class-attribute
instance-attribute
score_average: str = 'weighted'
What to use for averaging F1/P/R across labels
shuffle_data
class-attribute
instance-attribute
shuffle_data: bool = True
Used only during training; if set, the dataset will be shuffled before the train/test split.