medcat.stats
Functions:
- get_k_fold_stats – Get the k-fold stats for the model with the specified data.
- get_stats – Print precision, recall, and F1 metrics on a dataset (upstream TODO: refactor and make nice).
get_k_fold_stats
get_k_fold_stats(cat: CAT, mct_export_data: MedCATTrainerExport, k: int = 3, use_project_filters: bool = False, split_type: SplitType = DOCUMENTS_WEIGHTED, include_std: bool = False, *args, **kwargs) -> tuple
Get the k-fold stats for the model with the specified data.
First, the MCT export is split into k folds, either per document or
per annotation.
For each of the k folds, the function starts from the base model,
trains it with the other k-1 folds, and records the metrics.
The base model state is then restored before the next fold.
Once all the folds are done, the metrics are averaged.
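The procedure above can be sketched in plain Python. This is only an illustration of the fold loop, not MedCAT's implementation: `ToyModel`, `split_into_folds`, and `k_fold_scores` are hypothetical names, the split is a simple round-robin per document, and the sketch assumes (as is usual for k-fold) that each fold's metric is measured on the held-out fold.

```python
import copy


def split_into_folds(items, k):
    """Round-robin split of items into k folds (a simple per-document split)."""
    return [items[i::k] for i in range(k)]


class ToyModel:
    """Hypothetical stand-in for a model: it memorises the labels it has seen."""

    def __init__(self):
        self.known = set()

    def train(self, docs):
        for doc in docs:
            self.known.update(doc["labels"])

    def score(self, docs):
        """Fraction of labels in docs that the model has already seen."""
        labels = [lab for doc in docs for lab in doc["labels"]]
        return sum(lab in self.known for lab in labels) / len(labels)


def k_fold_scores(model, docs, k=3):
    folds = split_into_folds(docs, k)
    base_state = copy.deepcopy(model.known)  # snapshot of the base model
    scores = []
    for i, held_out in enumerate(folds):
        # Train on the other k-1 folds.
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model.train(train_docs)
        # Assumption: the metric is recorded on the held-out fold.
        scores.append(model.score(held_out))
        # Restore the base model state before the next fold.
        model.known = copy.deepcopy(base_state)
    return scores, sum(scores) / len(scores)


docs = [
    {"labels": ["C1"]}, {"labels": ["C2"]}, {"labels": ["C1", "C2"]},
    {"labels": ["C3"]}, {"labels": ["C1"]}, {"labels": ["C2"]},
]
model = ToyModel()
scores, mean_score = k_fold_scores(model, docs, k=3)
```

Because the state is restored after every fold, the model passed in ends up unchanged, which is the property the description above relies on.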
Parameters:
- cat (CAT) – The model pack.
- mct_export_data (MedCATTrainerExport) – The MCT export.
- k (int, default: 3) – The number of folds.
- use_project_filters (bool, default: False) – Whether to use per-project filters.
- split_type (SplitType, default: DOCUMENTS_WEIGHTED) – Whether to split by annotations or by documents.
- include_std (bool, default: False) – Whether to include the standard deviation.
- *args – Arguments passed to the CAT.train_supervised_raw method.
- **kwargs – Keyword arguments passed to the CAT.train_supervised_raw method.
Returns:
- tuple (tuple) – The averaged metrics, potentially with their corresponding standard deviations.
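As a rough illustration of what "averaged metrics, potentially with standard deviations" can look like, the sketch below averages per-CUI values across folds. The helper `average_metrics` is hypothetical, and two details are assumptions rather than documented behaviour: a CUI missing from a fold is simply skipped for that fold, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def average_metrics(per_fold, include_std=False):
    """Average per-CUI metric values over a list of per-fold dicts."""
    keys = set().union(*per_fold)
    averaged = {}
    for key in sorted(keys):
        # Assumption: folds that never saw this CUI do not contribute.
        values = [fold[key] for fold in per_fold if key in fold]
        avg = mean(values)
        # Assumption: population standard deviation across folds.
        averaged[key] = (avg, pstdev(values)) if include_std else avg
    return averaged


fold_f1 = [{"C1": 0.8, "C2": 0.5}, {"C1": 0.6, "C2": 0.7}, {"C1": 0.7}]
plain = average_metrics(fold_f1)
with_std = average_metrics(fold_f1, include_std=True)
```

With `include_std=True` each value becomes a `(mean, std)` pair, which mirrors the idea behind the `include_std` parameter without claiming to match the exact return layout.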
Source code in medcat-v2/medcat/stats/kfold.py
get_stats
get_stats(cat: CAT, data: MedCATTrainerExport, epoch: int = 0, use_project_filters: bool = False, use_overlaps: bool = False, extra_cui_filter: Optional[set[str]] = None, do_print: bool = True) -> tuple[dict[str, int], dict[str, int], dict[str, int], dict[str, float], dict[str, float], dict[str, float], dict[str, int], dict]
Print metrics on a dataset (F1, precision, recall). It also prints the concepts that have the most false positives, false negatives, and true positives. TODO: Refactor and make nice.
Parameters:
- cat (CAT) – The model pack.
- data (MedCATTrainerExport) – The JSON object exported from MedCATtrainer.
- epoch (int, default: 0) – Used during training to indicate the current epoch.
- use_project_filters (bool, default: False) – Each project in MedCATtrainer can have filters; this controls whether those filters are respected when calculating metrics.
- use_overlaps (bool, default: False) – Allow overlapping entities. Nearly always False, as it is very difficult to annotate overlapping entities.
- use_cui_doc_limit (bool) – If True, the metrics for a CUI are only calculated if that CUI appears in a document, in other words if the document was annotated for that CUI. Useful in very specific situations where the set of CUIs changed during the annotation process. Documented here, but absent from the signature shown above.
- use_groups (bool) – If True, concepts that have groups are combined and stats are reported on groups. Documented here, but absent from the signature shown above.
- extra_cui_filter (Optional[set[str]], default: None) – This filter is intersected with all other filters; if no other filters are set, only this one is used.
- do_print (bool, default: True) – Whether to print the stats.
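The `extra_cui_filter` rule described above (intersect with the other filters, or stand alone if none are set) can be sketched as follows. `combine_cui_filters` is a hypothetical helper written for illustration; it is not a MedCAT function, and treating an empty set as "no filter set" is an assumption.

```python
from typing import Optional


def combine_cui_filters(other_filters: set[str],
                        extra_cui_filter: Optional[set[str]]) -> set[str]:
    """Combine existing CUI filters with an optional extra filter."""
    if extra_cui_filter is None:
        # No extra filter: keep whatever was already configured.
        return set(other_filters)
    if not other_filters:
        # Assumption: an empty set means no other filter is set,
        # so only the extra filter applies.
        return set(extra_cui_filter)
    # Both present: only CUIs allowed by both survive.
    return other_filters & extra_cui_filter


both = combine_cui_filters({"C1", "C2", "C3"}, {"C2", "C3", "C4"})
only_extra = combine_cui_filters(set(), {"C4"})
no_extra = combine_cui_filters({"C1"}, None)
```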
Returns:
- fps (dict) – False positives for each CUI.
- fns (dict) – False negatives for each CUI.
- tps (dict) – True positives for each CUI.
- cui_prec (dict) – Precision for each CUI.
- cui_rec (dict) – Recall for each CUI.
- cui_f1 (dict) – F1 for each CUI.
- cui_counts (dict) – Number of occurrences for each CUI.
- examples (dict) – Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][ ].
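To show how the returned count dictionaries relate to the per-CUI scores, here is a minimal sketch using the standard definitions of precision, recall, and F1. The function `prf_per_cui` is hypothetical, and returning 0.0 when a denominator is zero is an assumption, not necessarily what `get_stats` does.

```python
def prf_per_cui(tps, fps, fns):
    """Compute per-CUI precision, recall and F1 from count dictionaries."""
    cui_prec, cui_rec, cui_f1 = {}, {}, {}
    for cui in set(tps) | set(fps) | set(fns):
        tp, fp, fn = tps.get(cui, 0), fps.get(cui, 0), fns.get(cui, 0)
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        cui_prec[cui], cui_rec[cui], cui_f1[cui] = prec, rec, f1
    return cui_prec, cui_rec, cui_f1


# Toy counts for two made-up CUIs.
tps = {"C1": 8, "C2": 1}
fps = {"C1": 2}
fns = {"C1": 2, "C2": 3}
cui_prec, cui_rec, cui_f1 = prf_per_cui(tps, fps, fns)
```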
Source code in medcat-v2/medcat/stats/stats.py