medcat.config.config_meta_cat
Classes:
- ConfigMetaCAT – The MetaCAT part of the config
- General – The General part of the MetaCAT config
- Model – The Model part of the MetaCAT config
- Train – The Train part of the MetaCAT config

Attributes:
- logger
ConfigMetaCAT
Bases: ComponentConfig
The MetaCAT part of the config
Attributes:
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
General
Bases: DirtiableBaseModel
The General part of the MetaCAT config
Methods:
- get_applicable_category_name

Attributes:
- alternative_category_names (list[str]) – List that stores the variations of possible category names
- alternative_class_names (list[list[str]]) – List of lists that stores the variations of possible class names
- batch_size_eval (int) – Number of annotations to be meta-annotated at once in eval
- category_name (Optional[str]) – What category this meta_cat model is predicting/training
- category_value2id (dict[str, int]) – Map from category values to IDs; if empty, it will be autocalculated
- cntx_left (int) – Number of tokens to take from the left of the concept
- cntx_right (int) – Number of tokens to take from the right of the concept
- description (str) – A basic description of this MetaCAT model
- device (str) – Device to be used by the module when predicting/training
- disable_component_lock (bool) – Whether to use the MetaCAT component lock
- lowercase (bool) – If true, all input text will be lowercased
- model_config
- pipe_batch_size_in_chars (int) – How many characters are piped at once into the meta_cat class
- replace_center (Optional[Any]) – If set, the center (concept) will be replaced with this string
- save_and_reuse_tokens (bool) – A dangerous option; if not sure, ALWAYS set to False
- seed (int) – The seed for random number generation
- serialiser (AvailableSerialisers) – The serialiser to use when saving
- span_group (Optional[str]) – If set, the spaCy span group to which the MetaCAT model will assign annotations
- tokenizer_name (str) – Tokenizer name used with MetaCAT
- vocab_size (int) – Will be set automatically if the tokenizer is provided during init
alternative_category_names
class-attribute
instance-attribute
List that stores the variations of possible category names
Example: For Experiencer, the alternate name is Subject:
alternative_category_names: ['Experiencer', 'Subject']
If the name specified in the category_name parameter does not match
the data, this ensures no error is raised and the name is automatically mapped.
alternative_class_names
class-attribute
instance-attribute
List of lists that stores the variations of possible class names for each class mentioned in self.general.category_value2id.
Example: For the Presence task, the class names vary across NHS sites. To accommodate this, alternative_class_names is populated as:
[["Hypothetical (N/A)", "Hypothetical"], ["Not present (False)", "False"], ["Present (True)", "True"]]
Each sub-list contains the possible variations of the given class.
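Taken together, category_value2id and the alternative-name lists suggest how a site-specific raw label can be normalized to its canonical class name. The sketch below is illustrative only; normalize_label is a hypothetical helper, not part of the MetaCAT API:

```python
# Illustrative sketch (not MetaCAT internals): map a raw class label from
# the training data to its canonical name via the alternative-name lists.
category_value2id = {"Hypothetical": 0, "False": 1, "True": 2}
alternative_class_names = [
    ["Hypothetical (N/A)", "Hypothetical"],
    ["Not present (False)", "False"],
    ["Present (True)", "True"],
]

def normalize_label(raw: str) -> str:
    """Return the canonical class name for a site-specific label variant."""
    for variants in alternative_class_names:
        if raw in variants:
            # The canonical name is whichever variant appears in the map.
            for name in variants:
                if name in category_value2id:
                    return name
    raise ValueError(f"Unknown class label: {raw}")
```

With this in place, a label such as "Not present (False)" resolves to "False" and then to ID 1, so no error is raised for unseen variants of a known class.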
batch_size_eval
class-attribute
instance-attribute
batch_size_eval: int = 5000
Number of annotations to be meta-annotated at once in eval
category_name
class-attribute
instance-attribute
What category this meta_cat model is predicting/training.
NB! For these changes to take effect, the pipe would need to be recreated.
category_value2id
class-attribute
instance-attribute
Map from category values to ID, if empty it will be autocalculated during training
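A plausible way such a map could be autocalculated from the labels seen in the training data (an illustrative sketch; not necessarily MetaCAT's exact logic):

```python
# Build a value-to-ID map from training labels when category_value2id is
# left empty. IDs are assigned over the sorted set of distinct values.
labels = ["True", "False", "True", "Hypothetical", "False"]
category_value2id = {value: idx for idx, value in enumerate(sorted(set(labels)))}
# category_value2id == {'False': 0, 'Hypothetical': 1, 'True': 2}
```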
cntx_left
class-attribute
instance-attribute
cntx_left: int = 15
Number of tokens to take from the left of the concept
cntx_right
class-attribute
instance-attribute
cntx_right: int = 10
Number of tokens to take from the right of the concept
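The effect of cntx_left and cntx_right can be sketched as follows (illustrative only; context_window is a hypothetical helper, not MetaCAT's internal code):

```python
# Take up to cntx_left tokens before and cntx_right tokens after the
# concept span, clipping at document boundaries.
cntx_left, cntx_right = 15, 10

def context_window(tokens, start, end):
    """Return (left context, concept tokens, right context) for a concept
    spanning tokens[start:end]."""
    left = tokens[max(0, start - cntx_left):start]
    right = tokens[end:end + cntx_right]
    return left, tokens[start:end], right

tokens = "the patient denies any chest pain at rest".split()
left, center, right = context_window(tokens, 4, 6)  # concept: "chest pain"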
description
class-attribute
instance-attribute
description: str = 'No description'
Should provide a basic description of this MetaCAT model
device
class-attribute
instance-attribute
device: str = 'cpu'
Device to be used by the module when predicting/training.
Reference
https://pytorch.org/docs/stable/tensor_attributes.html#torch.device
disable_component_lock
class-attribute
instance-attribute
disable_component_lock: bool = False
Whether to use the MetaCAT component lock.
If set to False (the default), a component lock is used that forces usage only on one thread at a time.
If set to True, the component lock is not used.
lowercase
class-attribute
instance-attribute
lowercase: bool = True
If true, all input text will be lowercased.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
pipe_batch_size_in_chars
class-attribute
instance-attribute
pipe_batch_size_in_chars: int = 20000000
How many characters are piped at once into the meta_cat class
replace_center
class-attribute
instance-attribute
If set, the center (concept) will be replaced with this string.
save_and_reuse_tokens
class-attribute
instance-attribute
save_and_reuse_tokens: bool = False
This is a dangerous option; if not sure, ALWAYS set to False. If set, it will try to share the pre-calculated context tokens between MetaCAT models when serving. It ignores differences in tokenizer and context size, so you need to be sure that every model for which this is turned on uses the same tokenizer and context size during a deployment.
seed
class-attribute
instance-attribute
seed: int = 13
The seed for random number generation.
NOTE: If used along with RelCAT or additional NER, only one of the seeds will take effect.
NB! For these changes to take effect, the pipe would need to be recreated.
serialiser
class-attribute
instance-attribute
serialiser: AvailableSerialisers = dill
The serialiser to use when saving.
span_group
class-attribute
instance-attribute
If set, the spaCy span group to which the MetaCAT model will assign annotations. Otherwise, defaults to doc._.ents or doc.ents per the annotate_overlapping settings.
tokenizer_name
class-attribute
instance-attribute
tokenizer_name: str = 'bbpe'
Tokenizer name used with MetaCAT.
Choose from
- 'bbpe': Byte Pair Encoding Tokenizer
- 'bert-tokenizer': BERT Tokenizer
NB! For these changes to take effect, the pipe would need to be recreated.
vocab_size
class-attribute
instance-attribute
vocab_size: int = -1
Will be set automatically if the tokenizer is provided during meta_cat init
get_applicable_category_name
Source code in medcat-v2/medcat/config/config_meta_cat.py
Model
Bases: DirtiableBaseModel
The model part of the metaCAT config
Attributes:
- category_undersample (str) – When using 2 phase learning, this category is used to undersample the data
- dropout (float) – The dropout for the model
- emb_grad (bool) – Applicable only for LSTM: if True, the embeddings will also be trained
- hidden_size (int) – Number of neurons in the hidden layer
- ignore_cpos (bool) – If set to True, center positions will be ignored when calculating the representation
- input_size (int) – Specifies the size of the embedding layer
- model_architecture_config (dict[str, bool]) – Specifies the architecture for the BERT model
- model_config
- model_freeze_layers (bool) – Applicable only when using BERT: determines the training approach
- model_name (str) – Model to be used for training or predicting
- model_variant (str) – Applicable only when using BERT: the model variant to use
- nclasses (int) – Number of classes that this model will output
- num_directions (int) – Applicable only for LSTM: 2 for bidirectional, 1 for unidirectional
- num_layers (int) – Number of layers in the model (both LSTM and BERT)
- padding_idx (int) – The padding index
- phase_number (int) – Indicates whether two phase learning is to be used for training
category_undersample
class-attribute
instance-attribute
category_undersample: str = ''
When using 2 phase learning, this category is used to undersample the data
dropout
class-attribute
instance-attribute
dropout: float = 0.5
The dropout for the model.
NB! For these changes to take effect, the pipe would need to be recreated.
emb_grad
class-attribute
instance-attribute
emb_grad: bool = True
Applicable only for LSTM:
If True, the embeddings will also be trained.
NB! For these changes to take effect, the pipe would need to be recreated.
hidden_size
class-attribute
instance-attribute
hidden_size: int = 300
Number of neurons in the hidden layer.
NB! For these changes to take effect, the pipe would need to be recreated.
ignore_cpos
class-attribute
instance-attribute
ignore_cpos: bool = False
If set to True, center positions will be ignored when calculating the representation.
input_size
class-attribute
instance-attribute
input_size: int = 300
Specifies the size of the embedding layer.
Applicable only for LSTM model and ignored for BERT as BERT's embedding size is predefined.
NB! For these changes to take effect, the pipe would need to be recreated.
model_architecture_config
class-attribute
instance-attribute
Specifies the architecture for the BERT model.
If fc2 is set to True, the 2nd fully connected layer is used.
If fc2 and fc3 are both set to True, the 3rd fully connected layer is used.
If lr_scheduler is set to True, the learning rate scheduler is used with the optimizer.
NB! For these changes to take effect, the pipe would need to be recreated.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True, protected_namespaces=())
model_freeze_layers
class-attribute
instance-attribute
model_freeze_layers: bool = True
Applicable only when using BERT:
Determines the training approach for BERT.
- If True: BERT layers are frozen and only the fully connected (FC) layer(s) on top are trained.
- If False: Parameter-efficient fine-tuning will be applied using Low-Rank Adaptation (LoRA).
NB! For these changes to take effect, the pipe would need to be recreated.
model_name
class-attribute
instance-attribute
model_name: str = 'lstm'
Model to be used for training or predicting.
Choose from
- 'bert'
- 'lstm'
Note
When changing the model, make sure to change the tokenizer accordingly.
NB! For these changes to take effect, the pipe would need to be recreated.
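Since the model and tokenizer must be changed together, a small consistency check can help. The pairing below is inferred from the choices listed on this page ('lstm' with 'bbpe', 'bert' with 'bert-tokenizer') and is not part of the MetaCAT API:

```python
# Hypothetical guard pairing model_name with the matching tokenizer_name.
EXPECTED_TOKENIZER = {"lstm": "bbpe", "bert": "bert-tokenizer"}

def check_pairing(model_name: str, tokenizer_name: str) -> bool:
    """True when the configured tokenizer matches the configured model."""
    return EXPECTED_TOKENIZER.get(model_name) == tokenizer_name

assert check_pairing("lstm", "bbpe")
assert not check_pairing("bert", "bbpe")  # mismatched pair is rejected
```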
model_variant
class-attribute
instance-attribute
model_variant: str = 'bert-base-uncased'
Applicable only when using BERT:
Specifies the model variant to be used.
NB! For these changes to take effect, the pipe would need to be recreated.
nclasses
class-attribute
instance-attribute
nclasses: int = 2
Number of classes that this model will output.
NB! For these changes to take effect, the pipe would need to be recreated.
num_directions
class-attribute
instance-attribute
num_directions: int = 2
Applicable only for LSTM:
2 - bidirectional model, 1 - unidirectional
NB! For these changes to take effect, the pipe would need to be recreated.
num_layers
class-attribute
instance-attribute
num_layers: int = 2
Number of layers in the model (both LSTM and BERT)
NB! For these changes to take effect, the pipe would need to be recreated.
padding_idx
class-attribute
instance-attribute
padding_idx: int = -1
The padding index.
NB! For these changes to take effect, the pipe would need to be recreated.
phase_number
class-attribute
instance-attribute
phase_number: int = 0
Indicates whether two phase learning is to be used for training.
1: Phase 1 - Train model on undersampled data
2: Phase 2 - Continue training on full data
0: None - 2 phase learning is not performed
Paper reference - https://ieeexplore.ieee.org/document/7533053
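The two phases can be sketched as follows. This is illustrative only: undersample and the (text, label) data format are assumptions, not MetaCAT internals.

```python
import random

random.seed(13)

def undersample(samples, category_undersample):
    """Phase 1 data: cap every class at the size of the target class."""
    target_n = sum(1 for _, label in samples if label == category_undersample)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    out = []
    for group in by_label.values():
        out.extend(random.sample(group, min(target_n, len(group))))
    return out

data = [("t1", "True")] * 8 + [("t2", "False")] * 2
phase1_data = undersample(data, "False")  # phase_number = 1: balanced subset
phase2_data = data                        # phase_number = 2: full data
```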
Train
Bases: DirtiableBaseModel
The train part of the metaCAT config
Attributes:
- auto_save_model (bool) – Whether the model should be saved during training when it achieves the best results
- batch_size (int)
- class_weights (Optional[Any])
- compute_class_weights (bool) – If true and class weights are not provided, they will be calculated from the data
- cui_filter (Optional[Any]) – If set, only these CUIs will be used for training
- gamma (int) – Focal Loss hyperparameter: determines the importance the loss gives to hard-to-classify examples
- last_train_on (Optional[float]) – When the last training was run
- loss_funct (str) – Loss function for the model
- lr (float)
- metric (dict[str, str]) – What metric should be used for choosing the best model
- model_config
- nepochs (int)
- prerequisites (dict)
- score_average (str) – What to use for averaging F1/P/R across labels
- shuffle_data (bool) – If set, the dataset will be shuffled before the train/test split (training only)
- test_size (float)
auto_save_model
class-attribute
instance-attribute
auto_save_model: bool = True
Whether the model should be saved during training when it achieves the best results.
compute_class_weights
class-attribute
instance-attribute
compute_class_weights: bool = False
If true and class weights are not provided, they will be calculated based on the data.
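One common way such weights are derived is inverse class frequency; the sketch below is illustrative and may differ from MetaCAT's exact formula:

```python
from collections import Counter

# Inverse-frequency weights: n_samples / (n_classes * count_of_class).
# Minority classes get larger weights, majority classes smaller ones.
labels = [0, 0, 0, 1]
counts = Counter(labels)
n, k = len(labels), len(counts)
class_weights = {cls: n / (k * counts[cls]) for cls in counts}
```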
cui_filter
class-attribute
instance-attribute
If set, only these CUIs will be used for training.
gamma
class-attribute
instance-attribute
gamma: int = 2
Focal Loss hyperparameter: determines the importance the loss gives to hard-to-classify examples.
last_train_on
class-attribute
instance-attribute
When the last training was run.
loss_funct
class-attribute
instance-attribute
loss_funct: str = 'cross_entropy'
Loss function for the model.
Choose from
- 'cross_entropy'
- 'focal_loss'
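The two choices differ mainly in how hard examples are weighted. A minimal single-example sketch of focal loss, using the gamma hyperparameter documented in this config (illustrative; not MetaCAT's training code):

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one example, where p_true is the predicted
    probability of the correct class. With gamma = 0 this reduces to
    cross-entropy; larger gamma down-weights easy (high-p) examples."""
    return -((1 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.95, 0.3
# The easy example's loss is suppressed far more than the hard one's.
assert focal_loss(easy) < -math.log(easy)
```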
metric
class-attribute
instance-attribute
What metric should be used for choosing the best model
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
score_average
class-attribute
instance-attribute
score_average: str = 'weighted'
What to use for averaging F1/P/R across labels
shuffle_data
class-attribute
instance-attribute
shuffle_data: bool = True
Used only during training; if set, the dataset will be shuffled before the train/test split.