medcat.trainer
Classes:
- Trainer
Attributes:
- logger
Trainer
Trainer(cdb: CDB, caller: Callable[[str], MutableDocument], pipeline: Pipeline)
Methods:
- add_and_train_concept – Add a name to an existing concept, or add a new concept, or do nothing if the name or concept already exists.
- train_supervised_raw – Train supervised based on the raw data provided.
- train_unsupervised – Runs training on the data; note that the maximum length of a line or document is 1M characters.
- unlink_concept_name – Unlink a concept name from the CUI (or all CUIs if full_unlink), removing the link from the Concept Database (CDB).
Attributes:
- caller
- cdb
- config
- strict_train (bool)
Source code in medcat-v2/medcat/trainer.py
caller
instance-attribute
caller = caller
cdb
instance-attribute
cdb = cdb
config
instance-attribute
config = config
add_and_train_concept
add_and_train_concept(cui: str, name: str, mut_doc: Optional[MutableDocument] = None, mut_entity: Optional[Union[list[MutableToken], MutableEntity]] = None, ontologies: set[str] = set(), name_status: str = 'A', type_ids: set[str] = set(), description: str = '', full_build: bool = True, negative: bool = False, devalue_others: bool = False, do_add_concept: bool = True) -> None
Add a name to an existing concept, or add a new concept, or do nothing if the name or concept already exists. Training is performed if mut_entity and mut_doc are set.
Parameters:
- cui (str) – CUI of the concept.
- name (str) – Name to be linked to the concept (in the case of MedCATtrainer this is simply the selected value in text; no preprocessing is needed).
- mut_doc (Optional[MutableDocument], default: None) – Spacy representation of the document that was manually annotated.
- mut_entity (Optional[Union[list[MutableToken], MutableEntity]], default: None) – Given the spacy document, this is the annotated span of text: the list of annotated tokens that are marked with this CUI.
- ontologies (set[str], default: set()) – Ontologies in which the concept exists (e.g. SNOMEDCT, HPO).
- name_status (str, default: 'A') – One of P, N, A.
- type_ids (set[str], default: set()) – Semantic type identifiers (have a look at TUIs in UMLS or SNOMED-CT).
- description (str, default: '') – Description of this concept.
- full_build (bool, default: True) – If True, the dictionary self.addl_info will also be populated. It contains a lot of extra information about concepts but can be very memory consuming, and it is not necessary for normal functioning of MedCAT. Note that the docstring text gives the default as False, while the signature default is True.
- negative (bool, default: False) – Whether this is a negative example (True) or a positive example (False).
- devalue_others (bool, default: False) – If set, CUIs to which this name is assigned and that are not cui will receive negative training, provided that negative=False.
- do_add_concept (bool, default: True) – Whether to add the concept to the CDB.
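The interaction between `negative` and `devalue_others` may be easier to see on a toy model. The sketch below is not MedCAT code: `training_targets` and `name2cuis` are hypothetical names, the CUIs are made up, and the dictionary only stands in for the name-to-CUI links held in the CDB. It lists which CUIs would receive a positive or a negative update for one annotated name, following the parameter descriptions above.

```python
def training_targets(name2cuis, cui, name, negative=False, devalue_others=False):
    """Return (cui, is_negative) pairs for one annotated name (toy model)."""
    # The annotated CUI itself is always trained, positively or negatively.
    targets = [(cui, negative)]
    # Other CUIs sharing the name are only devalued for a positive example.
    if devalue_others and not negative:
        for other in sorted(name2cuis.get(name, set())):
            if other != cui:
                targets.append((other, True))
    return targets


# Hypothetical links: the name "cold" is attached to two made-up CUIs.
name2cuis = {"cold": {"C_TEMP", "C_ILLNESS"}}
print(training_targets(name2cuis, "C_ILLNESS", "cold", devalue_others=True))
# [('C_ILLNESS', False), ('C_TEMP', True)]
```

With `negative=True` the same call returns only the annotated CUI, since devaluing the other CUIs is described as applying only when `negative=False`.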
Source code in medcat-v2/medcat/trainer.py
train_supervised_raw
train_supervised_raw(data: MedCATTrainerExport, reset_cui_count: bool = False, nepochs: int = 1, print_stats: int = 0, use_filters: bool = False, terminate_last: bool = False, use_overlaps: bool = False, use_cui_doc_limit: bool = False, test_size: float = 0, devalue_others: bool = False, use_groups: bool = False, never_terminate: bool = False, train_from_false_positives: bool = False, extra_cui_filter: Optional[set[str]] = None, disable_progress: bool = False, train_addons: bool = False) -> tuple
Train supervised based on the raw data provided.
The raw data is expected in the following format:
{'projects':
[ # list of projects
{ # project 1
'name': '
Please note that this is closer to simulated online training than to supervised training.
When filtering, the filters within the CAT model are used first, then the ones from MedCATtrainer (MCT) export filters, and finally the extra_cui_filter (if set). That is to say, the expectation is: extra_cui_filter ⊆ MCT filter ⊆ Model/config filter.
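The nesting of filters described above can be pictured as successive set intersections. This is a minimal sketch under that reading, not the library's implementation: `effective_filter` is a hypothetical helper, the CUIs are placeholders, and `None` is assumed here to mean "this filter is not set".

```python
from typing import Optional


def effective_filter(*filters: Optional[set[str]]) -> Optional[set[str]]:
    """Intersect every filter that is set; return None if none is set."""
    active = [set(f) for f in filters if f is not None]
    if not active:
        return None  # no filter set, so nothing is restricted
    return set.intersection(*active)


model_filter = {"C1", "C2", "C3"}   # filter carried by the model/config
mct_filter = {"C1", "C2"}           # filter from the MedCATtrainer export
extra_cui_filter = {"C1"}           # narrowest, user-supplied filter

print(sorted(effective_filter(model_filter, mct_filter, extra_cui_filter)))
# ['C1']
print(sorted(effective_filter(None, None, extra_cui_filter)))
# ['C1']
```

When the filters are nested as expected, the result is simply the narrowest one; if they are not nested, intersecting still keeps only CUIs allowed by every filter that is set.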
Parameters:
- data (dict[str, list[dict[str, dict]]]) – The raw data, e.g. from a MedCATtrainer export.
- reset_cui_count (bool, default: False) – Used for training with weight_decay (annealing). Each concept has a count that is there from the beginning of the CDB, and that count is used for annealing. Resetting the count will significantly increase the training impact. The count is reset only for concepts that exist in the training data.
- nepochs (int, default: 1) – Number of epochs for which to run the training.
- print_stats (int, default: 0) – If > 0, stats are printed every print_stats epochs.
- use_filters (bool, default: False) – Each project in MedCATtrainer can have filters; this controls whether those filters are respected when calculating metrics.
- terminate_last (bool, default: False) – If True, concept termination will be done after all training.
- use_overlaps (bool, default: False) – Allow overlapping entities. Nearly always False, as it is very difficult to annotate overlapping entities.
- use_cui_doc_limit (bool, default: False) – If True, the metrics for a CUI will only be calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations when the set of CUIs changed during the annotation process.
- test_size (float, default: 0) – If > 0, the data set will be split into train and test sets based on this ratio. Should be between 0 and 1. Usually 0.1 is fine.
- devalue_others (bool, default: False) – See add_name for more details.
- use_groups (bool, default: False) – If True, concepts that have groups will be combined and stats will be reported on groups.
- never_terminate (bool, default: False) – If True, no termination will be applied.
- train_from_false_positives (bool, default: False) – If True, false positive examples detected by MedCAT will be used for training as negative examples.
- extra_cui_filter (Optional[set], default: None) – This filter will be intersected with all other filters; if none of the others are set, only this one will be used.
- checkpoint (Optional[medcat.utils.checkpoint.Checkpoint]) – The MedCAT Checkpoint object.
- disable_progress (bool, default: False) – Whether to disable the progress output (tqdm).
- train_addons (bool, default: False) – Whether to also train the addons (e.g. MetaCATs).
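As a rough illustration of what the `data` argument might look like, the sketch below builds a tiny export-shaped dictionary and counts annotations per CUI. Only the `'projects'` and `'name'` keys are visible in the format snippet above; the `'documents'`, `'text'`, `'annotations'`, `'start'`, `'end'`, `'cui'` and `'value'` keys are assumptions about the MedCATtrainer export layout, and `count_cuis` is a hypothetical helper rather than part of MedCAT.

```python
from collections import Counter

# Keys other than 'projects' and 'name' are assumed, not taken from this page.
export = {
    "projects": [
        {
            "name": "demo project",
            "documents": [
                {
                    "name": "doc 1",
                    "text": "Patient has HTN.",
                    "annotations": [
                        # Character offsets into 'text'; text[12:15] == "HTN".
                        {"start": 12, "end": 15, "cui": "C0020538", "value": "HTN"},
                    ],
                },
            ],
        },
    ],
}


def count_cuis(data: dict) -> Counter:
    """Count how many annotations each CUI has across all projects."""
    counts: Counter = Counter()
    for project in data["projects"]:
        for doc in project["documents"]:
            for ann in doc["annotations"]:
                counts[ann["cui"]] += 1
    return counts


print(count_cuis(export))
# Counter({'C0020538': 1})
```

A check like this can be a quick way to see which CUIs an export actually covers before deciding on `use_filters` or `extra_cui_filter`.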
Returns:
- tuple (tuple) – Consisting of the following parts:
  - fp (dict): False positives for each CUI.
  - fn (dict): False negatives for each CUI.
  - tp (dict): True positives for each CUI.
  - p (dict): Precision for each CUI.
  - r (dict): Recall for each CUI.
  - f1 (dict): F1 for each CUI.
  - cui_counts (dict): Number of occurrences for each CUI.
  - examples (dict): FP/FN examples of sentences for each CUI.
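The relationship between the count dictionaries and the metric dictionaries in the returned tuple follows the usual definitions of precision, recall and F1. The sketch below recomputes them per CUI from `tp`, `fp` and `fn`; `prf_per_cui` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption that may differ from how MedCAT handles that case.

```python
def prf_per_cui(tp: dict, fp: dict, fn: dict) -> tuple[dict, dict, dict]:
    """Compute precision, recall and F1 per CUI from count dictionaries."""
    p, r, f1 = {}, {}, {}
    for cui in set(tp) | set(fp) | set(fn):
        t, f_pos, f_neg = tp.get(cui, 0), fp.get(cui, 0), fn.get(cui, 0)
        # Precision: share of predicted annotations that were correct.
        p[cui] = t / (t + f_pos) if (t + f_pos) else 0.0
        # Recall: share of gold annotations that were found.
        r[cui] = t / (t + f_neg) if (t + f_neg) else 0.0
        # F1: harmonic mean of precision and recall.
        denom = p[cui] + r[cui]
        f1[cui] = 2 * p[cui] * r[cui] / denom if denom else 0.0
    return p, r, f1


p, r, f1 = prf_per_cui(tp={"C1": 8}, fp={"C1": 2}, fn={"C1": 2})
print(round(p["C1"], 3), round(r["C1"], 3), round(f1["C1"], 3))
# 0.8 0.8 0.8
```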
Source code in medcat-v2/medcat/trainer.py
train_unsupervised
train_unsupervised(data_iterator: Iterable[str], nepochs: int = 1, fine_tune: bool = True, progress_print: int = 1000) -> None
Runs training on the data, note that the maximum length of a line or document is 1M characters. Anything longer will be trimmed.
Parameters:
- data_iterator (Iterable) – Simple iterator over sentences/documents, e.g. an open file, an array, or anything else that can be used in a for loop.
- nepochs (int, default: 1) – Number of epochs for which to run the training.
- fine_tune (bool, default: True) – If False, old training will be removed.
- progress_print (int, default: 1000) – Print progress after N lines.
- checkpoint (Optional[CheckpointUT]) – The MedCAT checkpoint object.
- is_resumed (bool) – If True, resume the previous training; if False, start a fresh new training.
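Any iterable of strings can serve as `data_iterator`. The sketch below shows one way to prepare such an iterable: it skips blank lines and trims each document to the 1M-character limit mentioned above, so that the trimming is explicit rather than implicit. `iter_documents` and `MAX_CHARS` are hypothetical names, not part of MedCAT.

```python
from typing import Iterable, Iterator

MAX_CHARS = 1_000_000  # the 1M-character limit stated in the description


def iter_documents(lines: Iterable[str], max_chars: int = MAX_CHARS) -> Iterator[str]:
    """Yield non-empty documents, each trimmed to at most max_chars characters."""
    for line in lines:
        text = line.strip()
        if text:
            yield text[:max_chars]


raw = ["Patient has HTN.\n", "\n", "No history of diabetes.\n"]
docs = list(iter_documents(raw))
print(docs)
# ['Patient has HTN.', 'No history of diabetes.']
```

A plain generator can only be consumed once, so materialising the documents in a list (as above) may be the safer choice when `nepochs` is greater than 1.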
Source code in medcat-v2/medcat/trainer.py
unlink_concept_name
Unlink a concept name from the CUI (or from all CUIs if full_unlink is set) by removing the link from the Concept Database (CDB). As a consequence, MedCAT will never again link the name to this CUI, meaning the name will not be detected as a concept in the future.
Parameters:
- cui (str) – The CUI from which the name will be removed.
- name (str) – The span of text to be removed from the linking dictionary.
- preprocessed_name (bool, default: False) – Whether the name being used is preprocessed.
Examples:
>>> # To never again link C0020538 to HTN
>>> cat.unlink_concept_name('C0020538', 'htn', False)
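The effect of unlinking can be pictured on a toy name-to-CUI mapping. This is only a sketch of the behaviour described above, not how the CDB stores its links: `unlink` and `name2cuis` are hypothetical, `C_OTHER` is a made-up CUI, and `full_unlink` is modelled as a keyword flag here because the description mentions it even though it is not listed among the parameters.

```python
def unlink(name2cuis: dict, cui: str, name: str, full_unlink: bool = False) -> None:
    """Remove the link between name and cui (or every CUI) in a toy mapping."""
    cuis = name2cuis.get(name)
    if cuis is None:
        return  # the name is not linked to anything
    if full_unlink:
        cuis.clear()       # drop the name from every CUI
    else:
        cuis.discard(cui)  # drop the name from this CUI only
    if not cuis:
        del name2cuis[name]  # a name with no CUIs can no longer be detected


name2cuis = {"htn": {"C0020538", "C_OTHER"}}
unlink(name2cuis, "C0020538", "htn")
print(name2cuis)
# {'htn': {'C_OTHER'}}
```

In this toy model the name only disappears entirely once its last CUI is removed, which a full unlink does in one step.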
Source code in medcat-v2/medcat/trainer.py