medcat.config.config_rel_cat
Classes:
- ConfigRelCAT – The RelCAT part of the config
- General – The General part of the RelCAT config
- Model – The model part of the RelCAT config
- Train – The train part of the RelCAT config
ConfigRelCAT
Bases: ComponentConfig
The RelCAT part of the config
Methods:
- load – Load the config from a file.
Attributes:
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
load
classmethod
load(load_path: str = './') -> ConfigRelCAT
Load the config from a file.
Parameters:
- load_path (str, default: './') – Path to RelCAT config. Defaults to "./".
Returns:
- ConfigRelCAT (ConfigRelCAT) – The loaded config.
Source code in medcat-v2/medcat/config/config_rel_cat.py
General
Bases: SerialisableBaseModel
The General part of the RelCAT config
Methods:
- convert_keys_to_int –
Attributes:
- addl_rels_max_sample_size (int) – Limit the number of 'Other' samples selected for training/test.
- annotation_schema_tag_ids (list[int]) – Token IDs of custom relation entity delimiters for non-MedCATtrainer datasets.
- cntx_left (int) – Number of tokens to take from the left of the concept.
- cntx_right (int) – Number of tokens to take from the right of the concept.
- create_addl_rels (bool) – Whether 'Other' relations are created from all available annotation pairs when processing a MedCAT export/docs.
- create_addl_rels_by_type (bool) – Whether to split the 'Other' relation class into subclasses based on concept types.
- device (str) – The device to use (CPU or GPU).
- idx2labels (dict[int, str]) –
- labels2idx (dict[str, int]) –
- language (str) – Used for the spaCy language setting.
- limit_samples_per_class (int) – Number of samples per class, applied to train samples.
- log_level (int) – The log level for RelCAT.
- lowercase (bool) – If True, all input text will be lowercased.
- max_seq_length (int) – The maximum sequence length.
- model_config –
- model_name (str) – The name of the model used.
- pin_memory (bool) – If True, the data loader will copy the tensors to GPU pinned memory.
- relation_type_filter_pairs (list) – Map from category values to IDs; if empty, it is auto-calculated during training.
- seed (int) – The seed for random number generation.
- task (str) – The task for RelCAT.
- tokenizer_name (str) – The name of the tokenizer used.
- tokenizer_other_special_tokens (dict[str, str]) – The special tokens used by the tokenizer.
- tokenizer_relation_annotation_special_tokens_tags (list[str]) –
- tokenizer_special_tokens (bool) – Whether to add special tokens to the tokenizer.
- vocab_size (Optional[int]) –
- window_size (int) – Maximum acceptable distance between entities (in characters).
addl_rels_max_sample_size
class-attribute
instance-attribute
addl_rels_max_sample_size: int = 200
Limit the number of 'Other' samples selected for training/test. This limit is applied per encountered MedCAT project, i.e. sample_size/num_projects.
annotation_schema_tag_ids
class-attribute
instance-attribute
If a foreign (non-MedCATtrainer) dataset is used, you can insert your own relation entity token delimiters into the tokenizer and copy those token IDs here. You must also resize your tokenizer embeddings and adjust the hidden_size of the model; this depends on the number of tokens you introduce, for example: 30522 - [s1], 30523 - [e1], 30524 - [s2], 30525 - [e2], 30526 - [BLANK], 30527 - [ENT1], 30528 - [ENT2], 30529 - [/ENT1], 30530 - [/ENT2]. Please note that the tokenizer special tokens are supposed to come in pairs, for example [s1] and [e1], [s2] and [e2]; the [BLANK] is just an example placeholder token. If you have more than four tokens here, you need to make sure they are present in the text, otherwise the pipeline will throw an error in the get_annotation_schema_tag() function.
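The paired-delimiter convention described above can be illustrated with a small sketch. The helper and tag names below are hypothetical, not part of MedCAT's API; they only show how open/close tags wrap the two entity spans:

```python
def wrap_entities(tokens, ent1_span, ent2_span):
    """Insert paired delimiter tags around two entity spans.

    Tags come in open/close pairs ([s1]/[e1], [s2]/[e2]), mirroring the
    annotation schema tokens described above. Spans are (start, end)
    token indices, end exclusive; ent1 is assumed to come before ent2.
    """
    s1, e1 = ent1_span
    s2, e2 = ent2_span
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("[s1]")
        if i == s2:
            out.append("[s2]")
        out.append(tok)
        if i == e1 - 1:
            out.append("[e1]")
        if i == e2 - 1:
            out.append("[e2]")
    return out

tokens = ["aspirin", "reduces", "fever"]
print(wrap_entities(tokens, (0, 1), (2, 3)))
# ['[s1]', 'aspirin', '[e1]', 'reduces', '[s2]', 'fever', '[e2]']
```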
cntx_left
class-attribute
instance-attribute
cntx_left: int = 15
Number of tokens to take from the left of the concept
cntx_right
class-attribute
instance-attribute
cntx_right: int = 15
Number of tokens to take from the right of the concept
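A minimal sketch of how cntx_left and cntx_right bound the context taken around a concept, clamped to the document edges. This is illustrative only; MedCAT's actual windowing code may differ:

```python
def context_window(tokens, concept_start, concept_end,
                   cntx_left=15, cntx_right=15):
    """Return tokens from cntx_left before the concept span to
    cntx_right after it, clamped to the document bounds."""
    start = max(0, concept_start - cntx_left)
    end = min(len(tokens), concept_end + cntx_right)
    return tokens[start:end]

tokens = [f"t{i}" for i in range(100)]
window = context_window(tokens, 50, 52, cntx_left=15, cntx_right=15)
print(len(window))  # 15 left + 2 concept tokens + 15 right = 32
```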
create_addl_rels
class-attribute
instance-attribute
create_addl_rels: bool = False
When processing relations from a MedCAT export/docs, relations labeled as 'Other' are created from all the available annotation pairs
create_addl_rels_by_type
class-attribute
instance-attribute
create_addl_rels_by_type: bool = False
When creating the 'Other' relation class, actually split this class into subclasses based on concept types
device
class-attribute
instance-attribute
device: str = 'cpu'
The device to use (CPU or GPU).
NB! For these changes to take effect, the pipe would need to be recreated.
limit_samples_per_class
class-attribute
instance-attribute
limit_samples_per_class: int = -1
Number of samples per class; this limit is applied to train samples, so if train samples are 100 then test would be 20.
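The per-class cap can be sketched as follows (a hypothetical helper, not MedCAT's implementation); note that with the default of -1 no limit is applied:

```python
from collections import defaultdict

def limit_per_class(samples, limit=-1):
    """Keep at most `limit` samples per class label; -1 means no limit.

    `samples` is a list of (features, label) pairs.
    """
    if limit < 0:
        return list(samples)
    counts = defaultdict(int)
    kept = []
    for feats, label in samples:
        if counts[label] < limit:
            counts[label] += 1
            kept.append((feats, label))
    return kept

data = [("x", "A")] * 100 + [("y", "B")] * 30
print(len(limit_per_class(data, limit=20)))  # 20 A + 20 B = 40
```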
log_level
class-attribute
instance-attribute
The log level for RelCAT.
NB! For these changes to take effect, the pipe would need to be recreated.
lowercase
class-attribute
instance-attribute
lowercase: bool = True
If True, all input text will be lowercased
max_seq_length
class-attribute
instance-attribute
max_seq_length: int = 512
The maximum sequence length.
NB! For these changes to take effect, the pipe would need to be recreated.
model_config
class-attribute
instance-attribute
model_config = ConfigDict(protected_namespaces=())
model_name
class-attribute
instance-attribute
model_name: str = 'bert-base-uncased'
The name of the model used.
NB! For these changes to take effect, the pipe would need to be recreated.
pin_memory
class-attribute
instance-attribute
pin_memory: bool = True
If True the data loader will copy the tensors to the GPU pinned memory.
relation_type_filter_pairs
class-attribute
instance-attribute
relation_type_filter_pairs: list = []
Map from category values to IDs; if empty, it will be auto-calculated during training
seed
class-attribute
instance-attribute
seed: int = 13
The seed for random number generation.
NB! For these changes to take effect, the pipe would need to be recreated.
tokenizer_name
class-attribute
instance-attribute
tokenizer_name: str = 'bert'
The name of the tokenizer used.
NB! For these changes to take effect, the pipe would need to be recreated.
tokenizer_other_special_tokens
class-attribute
instance-attribute
The special tokens used by the tokenizer. The {PAD} is for the Llama tokenizer.
tokenizer_relation_annotation_special_tokens_tags
class-attribute
instance-attribute
tokenizer_special_tokens
class-attribute
instance-attribute
tokenizer_special_tokens: bool = False
Whether to add special tokens to the tokenizer.
NB! For these changes to take effect, the pipe would need to be recreated.
window_size
class-attribute
instance-attribute
window_size: int = 300
Maximum acceptable distance between entities (in characters). Take care when using this, as it can produce sentences that are over 512 tokens (the limit given by the tokenizer).
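A sketch of the character-distance filter that window_size describes (illustrative names, not MedCAT's API):

```python
def within_window(ent1_char_end, ent2_char_start, window_size=300):
    """Return True if the gap between two entities (in characters)
    fits inside the configured window."""
    return abs(ent2_char_start - ent1_char_end) <= window_size

print(within_window(120, 350, window_size=300))  # True (gap of 230)
print(within_window(120, 500, window_size=300))  # False (gap of 380)
```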
convert_keys_to_int
classmethod
convert_keys_to_int(value)
Source code in medcat-v2/medcat/config/config_rel_cat.py
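This validator presumably normalises dictionary keys that arrive as strings (e.g. after JSON round-tripping of idx2labels) back to int. A minimal sketch of the idea, not the actual implementation:

```python
def convert_keys_to_int(value):
    # JSON serialisation turns int dict keys into strings;
    # convert them back so idx2labels keeps dict[int, str] keys.
    if isinstance(value, dict):
        return {int(k): v for k, v in value.items()}
    return value

print(convert_keys_to_int({"0": "Other", "1": "causes"}))
# {0: 'Other', 1: 'causes'}
```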
Model
Bases: SerialisableBaseModel
The model part of the RelCAT config
Attributes:
- dropout (float) –
- emb_grad (bool) – If True, the embeddings will also be trained.
- freeze_layers (bool) – If we update the weights during training.
- hidden_layers (int) – hidden_size * 5, 5 being the number of tokens, default (s1,s2,e1,e2+CLS).
- hidden_size (int) – The hidden size.
- ignore_cpos (bool) – If set to True, center positions will be ignored when calculating the representation.
- input_size (int) –
- llama_use_pooled_output (bool) – If set to True (used only in the Llama model), it will add the extra tensor formed by selecting the max of the last hidden layer.
- model_config –
- model_size (int) – The size of the model.
- num_directions (int) – 2 - bidirectional model, 1 - unidirectional.
- padding_idx (int) –
emb_grad
class-attribute
instance-attribute
emb_grad: bool = True
If True the embeddings will also be trained
freeze_layers
class-attribute
instance-attribute
freeze_layers: bool = True
If we update the weights during training
hidden_layers
class-attribute
instance-attribute
hidden_layers: int = 3
hidden_size * 5, 5 being the number of tokens, default (s1,s2,e1,e2+CLS).
NB! For these changes to take effect, the pipe would need to be recreated.
hidden_size
class-attribute
instance-attribute
hidden_size: int = 768
The hidden size.
NB! For these changes to take effect, the pipe would need to be recreated.
ignore_cpos
class-attribute
instance-attribute
ignore_cpos: bool = False
If set to True, center positions will be ignored when calculating the representation
llama_use_pooled_output
class-attribute
instance-attribute
llama_use_pooled_output: bool = False
If set to True (used only in the Llama model), it will add the extra tensor formed by selecting the max of the last hidden layer
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True, protected_namespaces=())
model_size
class-attribute
instance-attribute
model_size: int = 5120
The size of the model.
NB! For these changes to take effect, the pipe would need to be recreated.
num_directions
class-attribute
instance-attribute
num_directions: int = 2
2 - bidirectional model, 1 - unidirectional
Train
Bases: SerialisableBaseModel
The train part of the RelCAT config
Attributes:
- adam_betas (tuple[float, float]) –
- adam_epsilon (float) –
- adam_weight_decay (float) –
- auto_save_model (bool) – Whether the model should be saved during training for best results.
- batch_size (int) – Batch size.
- batching_minority_limit (Union[list[int], int]) – Maximum number of samples the minority class can have.
- batching_samples_per_class (list) – Number of samples per class in each batch.
- class_weights (Union[list[float], None]) –
- enable_class_weights (bool) –
- gradient_acc_steps (int) –
- lr (float) – Learning rate.
- max_grad_norm (float) –
- model_config –
- multistep_lr_gamma (float) –
- multistep_milestones (list[int]) –
- nclasses (int) – Number of classes that this model will output.
- nepochs (int) – Epochs.
- score_average (str) – What to use for averaging F1/P/R across labels.
- shuffle_data (bool) – Used only during training; if set, the dataset will be shuffled before the train/test split.
- stratified_batching (bool) – Train the model with stratified batching.
- test_size (float) –
auto_save_model
class-attribute
instance-attribute
auto_save_model: bool = True
Should the model be saved during training for best results
batching_minority_limit
class-attribute
instance-attribute
Maximum number of samples the minority class can have. Since the minority class elements need to be repeated, this is used to facilitate that. Example: batching_samples_per_class = [6,6,6,8,8,8,6,8,8], batching_minority_limit = 6.
batching_samples_per_class
class-attribute
instance-attribute
batching_samples_per_class: list = []
Number of samples per class in each batch. Example for batch size 64: [6,6,6,8,8,8,6,8,8]
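The interplay of batching_samples_per_class and batching_minority_limit can be sketched like this (a hypothetical helper; RelCAT's actual stratified batching may differ). Each class contributes a fixed quota per batch, and a minority class whose quota exceeds its available samples is repeated by cycling:

```python
import itertools

def build_batch(samples_by_class, samples_per_class):
    """Draw a fixed number of samples from each class for one batch,
    repeating (cycling) a class's samples when its quota exceeds what
    is available - as happens for minority classes."""
    batch = []
    for cls, quota in zip(sorted(samples_by_class), samples_per_class):
        pool = itertools.cycle(samples_by_class[cls])
        batch.extend(next(pool) for _ in range(quota))
    return batch

samples_by_class = {0: ["a1", "a2"], 1: ["b1", "b2", "b3"]}
batch = build_batch(samples_by_class, samples_per_class=[4, 3])
print(batch)  # ['a1', 'a2', 'a1', 'a2', 'b1', 'b2', 'b3']
```

Note that the per-class quotas sum to the batch size, as in the [6,6,6,8,8,8,6,8,8] example above (which sums to 64).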
model_config
class-attribute
instance-attribute
model_config = ConfigDict(extra='allow', validate_assignment=True)
multistep_milestones
class-attribute
instance-attribute
nclasses
class-attribute
instance-attribute
nclasses: int = 2
Number of classes that this model will output
score_average
class-attribute
instance-attribute
score_average: str = 'weighted'
What to use for averaging F1/P/R across labels
shuffle_data
class-attribute
instance-attribute
shuffle_data: bool = True
Used only during training; if set, the dataset will be shuffled before the train/test split
stratified_batching
class-attribute
instance-attribute
stratified_batching: bool = False
Train the model with stratified batching