medcat.stats.kfold
Classes:
-
FoldCreator–The FoldCreator based on an MCT export.
-
PerAnnsFoldCreator
-
PerCUIMetrics
-
PerDocsFoldCreator
-
SimpleFoldCreator
-
SplitType–The split type.
-
WeightedDocumentsCreator
Functions:
-
get_fold_creator–Get the appropriate fold creator.
-
get_k_fold_stats–Get the k-fold stats for the model with the specified data.
-
get_metrics_mean–Get the mean of the provided metrics.
-
get_per_fold_metrics–Get per fold metrics for a given set of folds.
Attributes:
FloatValuedMetric
module-attribute
IntValuedMetric
module-attribute
FoldCreator
FoldCreator(mct_export: MedCATTrainerExport, nr_of_folds: int)
Bases: ABC
The FoldCreator based on an MCT export.
Parameters:
-
mct_export (MedCATTrainerExport) – The MCT export dict.
-
nr_of_folds (int) – Number of folds to create.
-
use_annotations (bool) – Whether to fold on number of annotations or documents.
Methods:
-
create_folds–Create folds.
Attributes:
Source code in medcat-v2/medcat/stats/kfold.py
create_folds
abstractmethod
create_folds() -> list[MedCATTrainerExport]
Create folds.
Raises:
-
ValueError–If something went wrong.
Returns:
-
list[MedCATTrainerExport]–The created folds.
Source code in medcat-v2/medcat/stats/kfold.py
PerAnnsFoldCreator
PerAnnsFoldCreator(mct_export: MedCATTrainerExport, nr_of_folds: int)
Bases: SimpleFoldCreator
Source code in medcat-v2/medcat/stats/kfold.py
PerCUIMetrics
Bases: BaseModel
Methods:
Attributes:
add
add(val, weight: int = 1)
Source code in medcat-v2/medcat/stats/kfold.py
get_mean
get_mean()
Source code in medcat-v2/medcat/stats/kfold.py
get_std
get_std()
Source code in medcat-v2/medcat/stats/kfold.py
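The page lists `add(val, weight=1)`, `get_mean()` and `get_std()` without describing them. One plausible reading is a weighted accumulator for a single metric. The sketch below illustrates that reading only: the class name `WeightedAccumulator` and its internals are hypothetical and are not taken from the MedCAT source.

```python
import math
from dataclasses import dataclass, field


@dataclass
class WeightedAccumulator:
    """Hypothetical stand-in for a per-CUI metric accumulator (not MedCAT code)."""
    values: list = field(default_factory=list)
    weights: list = field(default_factory=list)

    def add(self, val: float, weight: int = 1) -> None:
        # Record one observation together with its weight.
        self.values.append(val)
        self.weights.append(weight)

    def get_mean(self) -> float:
        # Weighted arithmetic mean of everything added so far.
        total_weight = sum(self.weights)
        return sum(v * w for v, w in zip(self.values, self.weights)) / total_weight

    def get_std(self) -> float:
        # Weighted population standard deviation around the weighted mean.
        mean = self.get_mean()
        total_weight = sum(self.weights)
        var = sum(w * (v - mean) ** 2
                  for v, w in zip(self.values, self.weights)) / total_weight
        return math.sqrt(var)


acc = WeightedAccumulator()
acc.add(0.5, weight=1)
acc.add(1.0, weight=3)
print(acc.get_mean())  # 0.875
```

Whether the real class weights values this way, or uses a sample rather than population standard deviation, is not stated on this page.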
PerDocsFoldCreator
PerDocsFoldCreator(mct_export: MedCATTrainerExport, nr_of_folds: int)
Bases: FoldCreator
Methods:
Attributes:
Source code in medcat-v2/medcat/stats/kfold.py
per_doc_simple
instance-attribute
per_doc_simple = nr_of_docs // nr_of_folds
create_folds
create_folds() -> list[MedCATTrainerExport]
Source code in medcat-v2/medcat/stats/kfold.py
SimpleFoldCreator
SimpleFoldCreator(mct_export: MedCATTrainerExport, nr_of_folds: int, counter: Callable[[MedCATTrainerExport], int])
Bases: FoldCreator
Methods:
Attributes:
Source code in medcat-v2/medcat/stats/kfold.py
per_fold
instance-attribute
per_fold = _init_per_fold()
total
instance-attribute
total = _counter(mct_export)
create_folds
create_folds() -> list[MedCATTrainerExport]
Source code in medcat-v2/medcat/stats/kfold.py
SplitType
Bases: Enum
The split type.
Attributes:
-
ANNOTATIONS–Split over number of annotations.
-
DOCUMENTS–Split over number of documents.
-
DOCUMENTS_WEIGHTED–Split over number of documents based on the number of annotations.
ANNOTATIONS
class-attribute
instance-attribute
ANNOTATIONS = auto()
Split over number of annotations.
DOCUMENTS_WEIGHTED
class-attribute
instance-attribute
DOCUMENTS_WEIGHTED = auto()
Split over number of documents based on the number of annotations. This ensures that the same document isn't in two folds while trying to distribute documents with different numbers of annotations more evenly.
For example, suppose we have 6 documents that we want to split into 3 folds, with the following number of annotations per document: [40, 40, 20, 10, 5, 5]
If we were to split this trivially over documents, we'd end up with 3 folds whose annotation counts are far from even: [80, 30, 10]
However, if we use the annotations as weights, we can create folds with more evenly distributed annotations, e.g.: [[D1], [D2], [D3, D4, D5, D6]]
where D# denotes the number of the document, with the number of annotations being equal: [40, 40, 20 + 10 + 5 + 5 = 40]
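The worked example can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming a greedy "heaviest document goes to the lightest fold" heuristic. The function names `split_trivially` and `split_weighted` are hypothetical, and the heuristic is not necessarily what `WeightedDocumentsCreator` does internally.

```python
def split_trivially(weights, nr_of_folds):
    """Chunk documents in order; any remainder goes into the last fold."""
    per_fold = len(weights) // nr_of_folds
    folds = [list(range(i * per_fold, (i + 1) * per_fold))
             for i in range(nr_of_folds)]
    folds[-1].extend(range(nr_of_folds * per_fold, len(weights)))
    return folds


def split_weighted(weights, nr_of_folds):
    """Greedy split: assign each document (heaviest first) to the lightest fold."""
    folds = [[] for _ in range(nr_of_folds)]
    totals = [0] * nr_of_folds
    heaviest_first = sorted(range(len(weights)), key=lambda i: -weights[i])
    for doc in heaviest_first:
        lightest = totals.index(min(totals))
        folds[lightest].append(doc)
        totals[lightest] += weights[doc]
    return folds


def fold_totals(weights, folds):
    """Number of annotations that end up in each fold."""
    return [sum(weights[doc] for doc in fold) for fold in folds]


annotations_per_doc = [40, 40, 20, 10, 5, 5]  # D1..D6 from the example

trivial = split_trivially(annotations_per_doc, 3)
weighted = split_weighted(annotations_per_doc, 3)

print(fold_totals(annotations_per_doc, trivial))   # [80, 30, 10]
print(fold_totals(annotations_per_doc, weighted))  # [40, 40, 40]
```

Indices are zero-based here, so document D1 is index 0. In both variants each document lands in exactly one fold, which is the property the weighted split is meant to preserve.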
WeightedDocumentsCreator
WeightedDocumentsCreator(mct_export: MedCATTrainerExport, nr_of_folds: int, weight_calculator: Callable[[MedCATTrainerExportDocument], int])
Bases: FoldCreator
Methods:
Source code in medcat-v2/medcat/stats/kfold.py
create_folds
create_folds() -> list[MedCATTrainerExport]
Source code in medcat-v2/medcat/stats/kfold.py
get_fold_creator
get_fold_creator(mct_export: MedCATTrainerExport, nr_of_folds: int, split_type: SplitType) -> FoldCreator
Get the appropriate fold creator.
Parameters:
-
mct_export (MedCATTrainerExport) – The MCT export.
-
nr_of_folds (int) – Number of folds to use.
-
split_type (SplitType) – The type of split to use.
Raises:
-
ValueError–In case of an unknown split type.
Returns:
-
FoldCreator–The corresponding fold creator.
Source code in medcat-v2/medcat/stats/kfold.py
get_k_fold_stats
get_k_fold_stats(cat: CAT, mct_export_data: MedCATTrainerExport, k: int = 3, use_project_filters: bool = False, split_type: SplitType = DOCUMENTS_WEIGHTED, include_std: bool = False, *args, **kwargs) -> tuple
Get the k-fold stats for the model with the specified data.
First, this splits the MCT export into k folds. The split can be done
per document or per annotation (see split_type).
For each of the k folds, it starts from the base model,
trains it with the other k-1 folds and records the metrics.
After that, the base model state is restored before doing the next fold.
After all the folds have been done, the metrics are averaged.
Parameters:
-
cat (CAT) – The model pack.
-
mct_export_data (MedCATTrainerExport) – The MCT export.
-
k (int, default: 3) – The number of folds. Defaults to 3.
-
use_project_filters (bool, default: False) – Whether to use per project filters. Defaults to False.
-
split_type (SplitType, default: DOCUMENTS_WEIGHTED) – How to split the data: by annotations, by documents, or by documents weighted by their number of annotations. Defaults to DOCUMENTS_WEIGHTED.
-
include_std (bool, default: False) – Whether to include the standard deviation. Defaults to False.
-
*args – Arguments passed to the CAT.train_supervised_raw method.
-
**kwargs – Keyword arguments passed to the CAT.train_supervised_raw method.
Returns:
-
tuple–The averaged metrics, potentially with their corresponding standard deviations (see include_std).
Source code in medcat-v2/medcat/stats/kfold.py
get_metrics_mean
get_metrics_mean(metrics: list[tuple[dict, dict, dict, dict, dict, dict, dict, dict]], include_std: bool) -> tuple[dict, dict, dict, dict, dict, dict, dict, dict]
Get the mean of the provided metrics.
Parameters:
-
metrics (list[tuple[dict, dict, dict, dict, dict, dict, dict, dict]]) – The metrics.
-
include_std (bool) – Whether to include the standard deviation.
Returns:
-
fps (dict) – False positives for each CUI.
-
fns (dict) – False negatives for each CUI.
-
tps (dict) – True positives for each CUI.
-
cui_prec (dict) – Precision for each CUI.
-
cui_rec (dict) – Recall for each CUI.
-
cui_f1 (dict) – F1 for each CUI.
-
cui_counts (dict) – Number of occurrences for each CUI.
-
examples (dict) – Examples for each of the fp, fn, tp. Format will be examples['fp']['cui'][<list_of_examples>].
Source code in medcat-v2/medcat/stats/kfold.py
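To illustrate the idea of averaging per-CUI metrics across folds, here is a minimal sketch for a single metric dict (for example `cui_f1`). The helper `average_per_cui` is hypothetical. It takes an unweighted mean and skips folds in which a CUI does not occur. Both choices are assumptions, since this page does not say how `get_metrics_mean` weights folds or handles missing CUIs.

```python
from statistics import mean, pstdev


def average_per_cui(per_fold, include_std=False):
    """Average one per-CUI metric over folds (hypothetical helper).

    per_fold: list of dicts mapping CUI -> metric value, one dict per fold.
    Returns CUI -> mean, or CUI -> (mean, std) when include_std is True.
    """
    cuis = set().union(*per_fold)
    averaged = {}
    for cui in sorted(cuis):
        # Only folds that actually contain this CUI contribute.
        vals = [fold[cui] for fold in per_fold if cui in fold]
        averaged[cui] = (mean(vals), pstdev(vals)) if include_std else mean(vals)
    return averaged


# Made-up F1 values for two folds, purely for illustration.
f1_per_fold = [{"C1": 0.5, "C2": 1.0}, {"C1": 1.0}]

print(average_per_cui(f1_per_fold))
print(average_per_cui(f1_per_fold, include_std=True))
```

The CUI names and values above are invented for the example and do not come from any real model run.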
get_per_fold_metrics
get_per_fold_metrics(cat: CAT, folds: list[MedCATTrainerExport], use_project_filters: bool, *args, **kwargs) -> list[tuple]
Get per fold metrics for a given set of folds.
This method captures the state of the model before processing each fold. For each fold, it trains on all other folds and runs metrics on the fold itself.
Parameters:
-
cat (CAT) – The model pack.
-
folds (list[MedCATTrainerExport]) – The folds.
-
use_project_filters (bool) – Whether to use project filters.
Returns:
Source code in medcat-v2/medcat/stats/kfold.py
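The capture, train, evaluate, restore cycle described above can be sketched with a toy model. Everything in this block is hypothetical: `ToyModel`, `per_fold_scores` and the dict-based folds stand in for `CAT` and the MCT export folds, and none of it mirrors MedCAT internals.

```python
import copy


class ToyModel:
    """Hypothetical model that memorises term -> label pairs."""

    def __init__(self):
        self.known = {}

    def train(self, examples):
        self.known.update(examples)

    def accuracy(self, examples):
        hits = sum(1 for term, label in examples.items()
                   if self.known.get(term) == label)
        return hits / len(examples)


def per_fold_scores(model, folds):
    """For each fold: train on all other folds, score on the held-out fold."""
    scores = []
    for i, held_out in enumerate(folds):
        saved = copy.deepcopy(model.known)  # capture the base state
        try:
            for j, fold in enumerate(folds):
                if j != i:
                    model.train(fold)
            scores.append(model.accuracy(held_out))
        finally:
            model.known = saved  # restore before the next fold
    return scores


folds = [{"a": 1, "b": 2}, {"a": 1, "c": 3}, {"d": 4, "c": 3}]
model = ToyModel()
scores = per_fold_scores(model, folds)
print(scores)       # [0.5, 1.0, 0.5]
print(model.known)  # {} (base state is untouched afterwards)
```

Restoring in a `finally` block is a design choice of this sketch: it keeps the base model clean even if training or scoring raises partway through a fold.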