medcat.config.config_rel_cat
Classes:
- ConfigRelCAT – The RelCAT part of the config
- General – The General part of the RelCAT config
- Model – The model part of the RelCAT config
- Train – The train part of the RelCAT config
ConfigRelCAT
Bases: ComponentConfig
The RelCAT part of the config
Methods:
- load – Load the config from a file.
Attributes:
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
load
classmethod
load(load_path: str = './') -> ConfigRelCAT
Load the config from a file.
Parameters:
- load_path (str, default: './') – Path to RelCAT config. Defaults to "./".
Returns:
- ConfigRelCAT (ConfigRelCAT) – The loaded config.
Source code in medcat-v2/medcat/config/config_rel_cat.py
General
Bases: SerialisableBaseModel
The General part of the RelCAT config
Methods:
- convert_keys_to_int –
Attributes:
- addl_rels_max_sample_size (int) – Limit the number of 'Other' samples selected for training/test.
- annotation_schema_tag_ids (list[int]) – Token IDs of custom relation entity delimiters for non-MedCATtrainer datasets.
- cntx_left (int) – Number of tokens to take from the left of the concept.
- cntx_right (int) – Number of tokens to take from the right of the concept.
- create_addl_rels (bool) – Whether 'Other' relations are created from all available annotation pairs when processing a MedCAT export/docs.
- create_addl_rels_by_type (bool) – Whether to split the 'Other' relation class into subclasses based on concept types.
- device (str) – The device to use (CPU or GPU).
- idx2labels (dict[int, str]) –
- labels2idx (dict[str, int]) –
- language (str) – Used for the spaCy language setting.
- limit_samples_per_class (int) – Number of samples per class, applied to train samples.
- log_level (int) – The log level for RelCAT.
- lowercase (bool) – If True, all input text will be lowercased.
- max_seq_length (int) – The maximum sequence length.
- model_config –
- model_name (str) – The name of the model used.
- pin_memory (bool) – If True, the data loader will copy the tensors to GPU pinned memory.
- relation_type_filter_pairs (list) – Map from category values to IDs; if empty, it is auto-calculated during training.
- seed (int) – The seed for random number generation.
- task (str) – The task for RelCAT.
- tokenizer_name (str) – The name of the tokenizer used.
- tokenizer_other_special_tokens (dict[str, str]) – The special tokens used by the tokenizer.
- tokenizer_relation_annotation_special_tokens_tags (list[str]) –
- tokenizer_special_tokens (bool) – Whether to add special tokens to the tokenizer.
- vocab_size (Optional[int]) –
- window_size (int) – Maximum acceptable distance between entities (in characters).
addl_rels_max_sample_size
class-attribute
instance-attribute
addl_rels_max_sample_size: int = 200
Limit the number of 'Other' samples selected for training/test. This limit is applied per encountered MedCAT project, i.e. sample_size/num_projects.
annotation_schema_tag_ids
class-attribute
instance-attribute
If a foreign (non-MedCATtrainer) dataset is used, you can insert your own relation entity token delimiters into the tokenizer and copy those token IDs here. You must also resize your tokenizer embeddings and adjust the hidden_size of the model; this depends on the number of tokens you introduce, for example: 30522 - [s1], 30523 - [e1], 30524 - [s2], 30525 - [e2], 30526 - [BLANK], 30527 - [ENT1], 30528 - [ENT2], 30529 - [/ENT1], 30530 - [/ENT2]. Please note that the tokenizer special tokens are supposed to come in pairs, for example [s1] and [e1], [s2] and [e2]; the [BLANK] is just an example placeholder token. If you have more than four tokens here, you need to make sure they are present in the text, otherwise the pipeline will throw an error in the get_annotation_schema_tag() function.
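The paired-delimiter convention described above can be illustrated with a small sketch. The helper and tag names below are hypothetical, not part of MedCAT's API; they only show how open/close tags wrap the two entity spans:

```python
def wrap_entities(tokens, ent1_span, ent2_span):
    """Insert paired delimiter tags around two entity spans.

    Tags come in open/close pairs ([s1]/[e1], [s2]/[e2]), mirroring the
    annotation schema tokens described above. Spans are (start, end)
    token indices, end exclusive; ent1 is assumed to come before ent2.
    """
    s1, e1 = ent1_span
    s2, e2 = ent2_span
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("[s1]")
        if i == s2:
            out.append("[s2]")
        out.append(tok)
        if i == e1 - 1:
            out.append("[e1]")
        if i == e2 - 1:
            out.append("[e2]")
    return out

tokens = ["aspirin", "reduces", "fever"]
print(wrap_entities(tokens, (0, 1), (2, 3)))
# ['[s1]', 'aspirin', '[e1]', 'reduces', '[s2]', 'fever', '[e2]']
```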
cntx_left
class-attribute
instance-attribute
cntx_left: int = 15
Number of tokens to take from the left of the concept
cntx_right
class-attribute
instance-attribute
cntx_right: int = 15
Number of tokens to take from the right of the concept
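A minimal sketch of how cntx_left and cntx_right bound the context taken around a concept, clamped to the document edges. This is illustrative only; MedCAT's actual windowing code may differ:

```python
def context_window(tokens, concept_start, concept_end,
                   cntx_left=15, cntx_right=15):
    """Return tokens from cntx_left before the concept span to
    cntx_right after it, clamped to the document bounds."""
    start = max(0, concept_start - cntx_left)
    end = min(len(tokens), concept_end + cntx_right)
    return tokens[start:end]

tokens = [f"t{i}" for i in range(100)]
window = context_window(tokens, 50, 52, cntx_left=15, cntx_right=15)
print(len(window))  # 15 left + 2 concept tokens + 15 right = 32
```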
create_addl_rels
class-attribute
instance-attribute
create_addl_rels: bool = False
When processing relations from a MedCAT export/docs, relations labeled as 'Other' are created from all the available annotation pairs
create_addl_rels_by_type
class-attribute
instance-attribute
create_addl_rels_by_type: bool = False
When creating the 'Other' relation class, actually split this class into subclasses based on concept types
device
class-attribute
instance-attribute
device: str = 'cpu'
The device to use (CPU or GPU).
NB! For these changes to take effect, the pipe would need to be recreated.
limit_samples_per_class
class-attribute
instance-attribute
limit_samples_per_class: int = -1
Number of samples per class; this limit is applied to train samples, so if train samples are 100 then test would be 20.
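The per-class cap can be sketched as follows (a hypothetical helper, not MedCAT's implementation); note that with the default of -1 no limit is applied:

```python
from collections import defaultdict

def limit_per_class(samples, limit=-1):
    """Keep at most `limit` samples per class label; -1 means no limit.

    `samples` is a list of (features, label) pairs.
    """
    if limit < 0:
        return list(samples)
    counts = defaultdict(int)
    kept = []
    for feats, label in samples:
        if counts[label] < limit:
            counts[label] += 1
            kept.append((feats, label))
    return kept

data = [("x", "A")] * 100 + [("y", "B")] * 30
print(len(limit_per_class(data, limit=20)))  # 20 A + 20 B = 40
```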
log_level
class-attribute
instance-attribute
The log level for RelCAT.
NB! For these changes to take effect, the pipe would need to be recreated.
lowercase
class-attribute
instance-attribute
lowercase: bool = True
If True, all input text will be lowercased
max_seq_length
class-attribute
instance-attribute
max_seq_length: int = 512
The maximum sequence length.
NB! For these changes to take effect, the pipe would need to be recreated.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(protected_namespaces=())
model_name
class-attribute
instance-attribute
model_name: str = 'bert-base-uncased'
The name of the model used.
NB! For these changes to take effect, the pipe would need to be recreated.
pin_memory
class-attribute
instance-attribute
pin_memory: bool = True
If True the data loader will copy the tensors to the GPU pinned memory.
relation_type_filter_pairs
class-attribute
instance-attribute
relation_type_filter_pairs: list = []
Map from category values to IDs; if empty, it will be auto-calculated during training
seed
class-attribute
instance-attribute
seed: int = 13
The seed for random number generation.
NB! For these changes to take effect, the pipe would need to be recreated.
tokenizer_name
class-attribute
instance-attribute
tokenizer_name: str = 'bert'
The name of the tokenizer used.
NB! For these changes to take effect, the pipe would need to be recreated.
tokenizer_other_special_tokens
class-attribute
instance-attribute
The special tokens used by the tokenizer. The {PAD} is for the Llama tokenizer.
tokenizer_relation_annotation_special_tokens_tags
class-attribute
instance-attribute
tokenizer_special_tokens
class-attribute
instance-attribute
tokenizer_special_tokens: bool = False
Whether to add special tokens to the tokenizer.
NB! For these changes to take effect, the pipe would need to be recreated.
window_size
class-attribute
instance-attribute
window_size: int = 300
Maximum acceptable distance between entities (in characters). Take care when using this, as it can produce sentences that are over 512 tokens (the limit given by the tokenizer).
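A sketch of the character-distance filter that window_size describes (illustrative names, not MedCAT's API):

```python
def within_window(ent1_char_end, ent2_char_start, window_size=300):
    """Return True if the gap between two entities (in characters)
    fits inside the configured window."""
    return abs(ent2_char_start - ent1_char_end) <= window_size

print(within_window(120, 350, window_size=300))  # True (gap of 230)
print(within_window(120, 500, window_size=300))  # False (gap of 380)
```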
convert_keys_to_int
classmethod
convert_keys_to_int(value)
Source code in medcat-v2/medcat/config/config_rel_cat.py
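This validator presumably normalises dictionary keys that arrive as strings (e.g. after JSON round-tripping of idx2labels) back to int. A minimal sketch of the idea, not the actual implementation:

```python
def convert_keys_to_int(value):
    # JSON serialisation turns int dict keys into strings;
    # convert them back so idx2labels keeps dict[int, str] keys.
    if isinstance(value, dict):
        return {int(k): v for k, v in value.items()}
    return value

print(convert_keys_to_int({"0": "Other", "1": "causes"}))
# {0: 'Other', 1: 'causes'}
```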
Model
Bases: SerialisableBaseModel
The model part of the RelCAT config
Attributes:
- dropout (float) –
- emb_grad (bool) – If True, the embeddings will also be trained.
- freeze_layers (bool) – If we update the weights during training.
- hidden_layers (int) – hidden_size * 5, 5 being the number of tokens, default (s1,s2,e1,e2+CLS).
- hidden_size (int) – The hidden size.
- ignore_cpos (bool) – If set to True, center positions will be ignored when calculating the representation.
- input_size (int) –
- llama_use_pooled_output (bool) – If set to True (used only in the Llama model), it will add the extra tensor formed by selecting the max of the last hidden layer.
- model_config –
- model_size (int) – The size of the model.
- num_directions (int) – 2 - bidirectional model, 1 - unidirectional.
- padding_idx (int) –
emb_grad
class-attribute
instance-attribute
emb_grad: bool = True
If True the embeddings will also be trained
freeze_layers
class-attribute
instance-attribute
freeze_layers: bool = True
If we update the weights during training
hidden_layers
class-attribute
instance-attribute
hidden_layers: int = 3
hidden_size * 5, 5 being the number of tokens, default (s1,s2,e1,e2+CLS).
NB! For these changes to take effect, the pipe would need to be recreated.
hidden_size
class-attribute
instance-attribute
hidden_size: int = 768
The hidden size.
NB! For these changes to take effect, the pipe would need to be recreated.
ignore_cpos
class-attribute
instance-attribute
ignore_cpos: bool = False
If set to True, center positions will be ignored when calculating the representation
llama_use_pooled_output
class-attribute
instance-attribute
llama_use_pooled_output: bool = False
If set to True (used only in the Llama model), it will add the extra tensor formed by selecting the max of the last hidden layer
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True, protected_namespaces=())
model_size
class-attribute
instance-attribute
model_size: int = 5120
The size of the model.
NB! For these changes to take effect, the pipe would need to be recreated.
num_directions
class-attribute
instance-attribute
num_directions: int = 2
2 - bidirectional model, 1 - unidirectional
Train
Bases: SerialisableBaseModel
The train part of the RelCAT config
Attributes:
- adam_betas (tuple[float, float]) –
- adam_epsilon (float) –
- adam_weight_decay (float) –
- auto_save_model (bool) – Whether the model should be saved during training for best results.
- batch_size (int) – Batch size.
- batching_minority_limit (Union[list[int], int]) – Maximum number of samples the minority class can have.
- batching_samples_per_class (list) – Number of samples per class in each batch.
- class_weights (Union[list[float], None]) –
- enable_class_weights (bool) –
- gradient_acc_steps (int) –
- lr (float) – Learning rate.
- max_grad_norm (float) –
- model_config –
- multistep_lr_gamma (float) –
- multistep_milestones (list[int]) –
- nclasses (int) – Number of classes that this model will output.
- nepochs (int) – Epochs.
- score_average (str) – What to use for averaging F1/P/R across labels.
- shuffle_data (bool) – Used only during training; if set, the dataset will be shuffled before the train/test split.
- stratified_batching (bool) – Train the model with stratified batching.
- test_size (float) –
auto_save_model
class-attribute
instance-attribute
auto_save_model: bool = True
Should the model be saved during training for best results
batching_minority_limit
class-attribute
instance-attribute
Maximum number of samples the minority class can have. Since the minority class elements need to be repeated, this is used to facilitate that. Example: batching_samples_per_class = [6,6,6,8,8,8,6,8,8], batching_minority_limit = 6.
batching_samples_per_class
class-attribute
instance-attribute
batching_samples_per_class: list = []
Number of samples per class in each batch. Example for batch size 64: [6,6,6,8,8,8,6,8,8]
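The interplay of batching_samples_per_class and batching_minority_limit can be sketched like this (a hypothetical helper; RelCAT's actual stratified batching may differ). Each class contributes a fixed quota per batch, and a minority class whose quota exceeds its available samples is repeated by cycling:

```python
import itertools

def build_batch(samples_by_class, samples_per_class):
    """Draw a fixed number of samples from each class for one batch,
    repeating (cycling) a class's samples when its quota exceeds what
    is available - as happens for minority classes."""
    batch = []
    for cls, quota in zip(sorted(samples_by_class), samples_per_class):
        pool = itertools.cycle(samples_by_class[cls])
        batch.extend(next(pool) for _ in range(quota))
    return batch

samples_by_class = {0: ["a1", "a2"], 1: ["b1", "b2", "b3"]}
batch = build_batch(samples_by_class, samples_per_class=[4, 3])
print(batch)  # ['a1', 'a2', 'a1', 'a2', 'b1', 'b2', 'b3']
```

Note that the per-class quotas sum to the batch size, as in the [6,6,6,8,8,8,6,8,8] example above (which sums to 64).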
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
multistep_milestones
class-attribute
instance-attribute
nclasses
class-attribute
instance-attribute
nclasses: int = 2
Number of classes that this model will output
score_average
class-attribute
instance-attribute
score_average: str = 'weighted'
What to use for averaging F1/P/R across labels
shuffle_data
class-attribute
instance-attribute
shuffle_data: bool = True
Used only during training; if set, the dataset will be shuffled before the train/test split
stratified_batching
class-attribute
instance-attribute
stratified_batching: bool = False
Train the model with stratified batching