medcat.components.addons.meta_cat.data_utils
Functions:
- encode_category_values – Converts the category values in the data output by prepare_from_json into integer values.
- find_alternate_classname – Find and map to alternative class names for the given category.
- prepare_for_oversampled_data – Convert the data from a json format into a CSV-like format for training.
- prepare_from_json – Convert the data from a json format into a CSV-like format for training.
- undersample_data – Undersamples the data for 2 phase learning.
Attributes:
- logger
encode_category_values
encode_category_values(data: list[tuple[list, list, str]], existing_category_value2id: Optional[dict] = None, alternative_class_names: list[list[str]] = [], config: Optional[ConfigMetaCAT] = None) -> tuple[list[tuple[list, list, str]], list, dict]
Converts the category values in the data output by prepare_from_json into integer values.
Parameters:
- data (list[tuple[list, list, str]]) – Output of prepare_from_json.
- existing_category_value2id (Optional[dict], default: None) – Map from category_value to id (old/existing).
- alternative_class_names (list[list[str]], default: []) – A list of lists of strings, where each list contains variations of a class name. Usually read from the config at config.general.alternative_class_names.
- config (Optional[ConfigMetaCAT], default: None) – The MetaCAT Config.
Returns:
- list[tuple[list, list, str]] – New data with integers in place of strings for category values.
- list – New undersampled data (for 2 phase learning) with integers in place of strings for category values.
- dict – Map from category value to ID for all categories in the data.
Raises:
- Exception – If category_value2id is pre-defined and its labels do not match the labels found in the data.
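The label-to-integer step described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `encode_labels_sketch` is a hypothetical helper, it returns only two values (the real function also returns undersampled data), and assigning ids in sorted label order is an assumption made here for determinism.

```python
from typing import Optional

# One sample in the documented shape: (tokens, center indices, label).
Sample = tuple[list, list, str]


def encode_labels_sketch(
    data: list[Sample],
    existing_value2id: Optional[dict[str, int]] = None,
) -> tuple[list[tuple[list, list, int]], dict[str, int]]:
    """Hypothetical helper: replace string labels with integer ids."""
    labels = {sample[2] for sample in data}
    if existing_value2id:
        # Reuse a pre-defined mapping, but only if it covers every label seen.
        missing = labels - set(existing_value2id)
        if missing:
            raise ValueError(f"labels missing from mapping: {sorted(missing)}")
        value2id = dict(existing_value2id)
    else:
        # Assumption: ids are assigned in sorted label order.
        value2id = {label: idx for idx, label in enumerate(sorted(labels))}
    encoded = [(tokens, center, value2id[label]) for tokens, center, label in data]
    return encoded, value2id


# Illustrative data only; the token ids and labels are made up.
samples = [
    ([101, 7, 8], [1], "Present"),
    ([101, 9, 4], [2], "Past"),
    ([101, 3, 5], [1], "Present"),
]
encoded, value2id = encode_labels_sketch(samples)
```

Passing a pre-defined mapping that lacks one of the labels in the data makes the sketch raise, which mirrors the documented error condition in spirit.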
Source code in medcat-v2/medcat/components/addons/meta_cat/data_utils.py
find_alternate_classname
find_alternate_classname(category_value2id: dict, category_values: set[str], alternative_class_names: list[list[str]]) -> dict
Find and map to alternative class names for the given category.
Example
For the Temporality category, 'Recent' is an alternative to 'Present'.
Parameters:
- category_value2id (dict) – The pre-defined category_value2id.
- category_values (set[str]) – Contains the classes (labels) found in the data.
- alternative_class_names (list[list[str]]) – Contains the mapping of alternative class names.
Returns:
- category_value2id (dict) – Updated category_value2id with keys corresponding to alternative class names.
Raises:
- Exception – If no alternatives are found for labels in category_value2id that don't match any of the labels in the data.
- Exception – If the alternatives defined for labels in category_value2id do not match any of the labels in the data.
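The re-keying described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `remap_with_alternatives_sketch` is a hypothetical name, it raises `ValueError` rather than a bare `Exception`, and it picks the first matching alternative when several are present.

```python
def remap_with_alternatives_sketch(
    category_value2id: dict[str, int],
    category_values: set[str],
    alternative_class_names: list[list[str]],
) -> dict[str, int]:
    """Hypothetical helper: re-key the mapping to the names used in the data."""
    updated: dict[str, int] = {}
    for name, idx in category_value2id.items():
        if name in category_values:
            # The configured name already appears in the data; keep it.
            updated[name] = idx
            continue
        # Collect the groups of variations that mention the configured name.
        groups = [group for group in alternative_class_names if name in group]
        # Keep the variations from those groups that occur in the data.
        matches = [alt for group in groups for alt in group if alt in category_values]
        if not matches:
            raise ValueError(f"no usable alternative for class {name!r}")
        # Assumption: the first matching variation wins.
        updated[matches[0]] = idx
    return updated


mapping = remap_with_alternatives_sketch(
    {"Present": 0, "Past": 1},
    {"Recent", "Past"},
    [["Present", "Recent"]],
)
```

Here 'Present' is absent from the data but its variation 'Recent' is present, so the id 0 moves to the key 'Recent'.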
Source code in medcat-v2/medcat/components/addons/meta_cat/data_utils.py
prepare_for_oversampled_data
prepare_for_oversampled_data(data: list, tokenizer: TokenizerWrapperBase) -> list
Convert the data from a json format into a CSV-like format for training. This function is not very efficient (the one working with documents as part of the meta_cat.pipe method is much better). If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.
Parameters:
- data (list) – Oversampled data expected in the following format: [[['text','of','the','document'], [index of medical entity], "label"], [['text','of','the','document'], [index of medical entity], "label"]]
- tokenizer (TokenizerWrapperBase) – Something to split text into tokens for the LSTM/BERT/whatever meta models.
Returns:
- data_sampled (list) – The processed data in a format that can be merged with the output from prepare_from_json: [[<[tokens]>, [index of medical entity], "label"], [<[tokens]>, [index of medical entity], "label"]]
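To make the expected input layout concrete, the following sketch builds a small oversampled list and checks its shape. The validator `check_oversampled_shape` is hypothetical and not part of the library; treating the entity indices as positions within the word list is an assumption, and the words and labels are invented for illustration.

```python
def check_oversampled_shape(data: list) -> bool:
    """Hypothetical validator for the [words, [indices], label] layout."""
    for i, sample in enumerate(data):
        if len(sample) != 3:
            raise ValueError(f"sample {i}: expected 3 fields, got {len(sample)}")
        words, center, label = sample
        if not all(isinstance(word, str) for word in words):
            raise ValueError(f"sample {i}: words must be strings")
        # Assumption: entity indices point into the word list.
        if not all(isinstance(j, int) and 0 <= j < len(words) for j in center):
            raise ValueError(f"sample {i}: entity index out of range")
        if not isinstance(label, str):
            raise ValueError(f"sample {i}: label must be a string")
    return True


# Illustrative data only.
oversampled = [
    [["patient", "denies", "chest", "pain"], [2, 3], "Negated"],
    [["history", "of", "diabetes"], [2], "Past"],
]
shape_ok = check_oversampled_shape(oversampled)
```

A check like this can catch the most common mistake with this format, a missing bracket that flattens two samples into one.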
Source code in medcat-v2/medcat/components/addons/meta_cat/data_utils.py
prepare_from_json
prepare_from_json(data: dict, cntx_left: int, cntx_right: int, tokenizer: TokenizerWrapperBase, cui_filter: Optional[set] = None, replace_center: Optional[str] = None, prerequisites: dict = {}, lowercase: bool = True) -> dict[str, list[tuple[list, list, str]]]
Convert the data from a json format into a CSV-like format for training. This function is not very efficient (the one working with documents as part of the meta_cat.pipe method is much better). If your dataset has more than 1M documents, consider rewriting this function, although it would be unusual to have more than 1M manually annotated documents.
Parameters:
- data (dict) – Loaded output of MedCATtrainer. If we have a my_export.json from MedCATtrainer, then data is the result of calling json.load on that file.
- cntx_left (int) – Size of context to get from the left of the concept.
- cntx_right (int) – Size of context to get from the right of the concept.
- tokenizer (TokenizerWrapperBase) – Something to split text into tokens for the LSTM/BERT/whatever meta models.
- replace_center (Optional[str], default: None) – If not None, the center word (concept) will be replaced with whatever this is.
- prerequisites (dict, default: {}) – A map of prerequisites. For example, suppose the data has two meta-annotations (experiencer, negation) and we want to create a dataset for negation, but only in those cases where experiencer=patient. The prerequisites would then be {'Experiencer': 'Patient'}. Take care that the CASE has to match whatever is in the data. Defaults to {}.
- lowercase (bool, default: True) – Whether the text should be lowercased before tokenization. Defaults to True.
- cui_filter (Optional[set], default: None) – CUI filter if set. Defaults to None.
Returns:
- out_data (dict) – Example: {'category_name': [('...', '<[tokens]>', '...'), ...], ...}
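Two of the parameters above, the context sizes and the prerequisites map, are easier to follow with a small sketch. Both helpers below are hypothetical and operate on plain word lists rather than on the library's tokenizer output; the flat name-to-value dictionary used for meta-annotations is likewise an assumption and not the MedCATtrainer export schema.

```python
from typing import Optional


def context_window_sketch(
    tokens: list[str],
    start: int,
    end: int,
    cntx_left: int,
    cntx_right: int,
    replace_center: Optional[str] = None,
) -> tuple[list[str], list[int]]:
    """Hypothetical helper: cut a window around the concept span [start, end)."""
    left = max(0, start - cntx_left)
    right = min(len(tokens), end + cntx_right)
    # Optionally swap the concept itself for a single placeholder token.
    center = [replace_center] if replace_center is not None else tokens[start:end]
    window = tokens[left:start] + center + tokens[end:right]
    # Positions of the concept tokens inside the returned window.
    center_idx = list(range(start - left, start - left + len(center)))
    return window, center_idx


def meets_prerequisites_sketch(meta_values: dict[str, str], prerequisites: dict[str, str]) -> bool:
    """Hypothetical helper: exact, case-sensitive match on every prerequisite."""
    return all(meta_values.get(name) == value for name, value in prerequisites.items())


tokens = "the patient has no history of chest pain today".split()
window, center_idx = context_window_sketch(tokens, 6, 8, cntx_left=2, cntx_right=1)
masked, masked_idx = context_window_sketch(
    tokens, 6, 8, cntx_left=2, cntx_right=1, replace_center="[concept]"
)
keep = meets_prerequisites_sketch(
    {"Experiencer": "Patient", "Negation": "No"}, {"Experiencer": "Patient"}
)
```

Because the comparison is exact, a prerequisite of {'Experiencer': 'patient'} would not match data annotated as 'Patient', which is the case-sensitivity caveat noted in the parameter description.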
Source code in medcat-v2/medcat/components/addons/meta_cat/data_utils.py
undersample_data
undersample_data(data: list, category_value2id: dict, label_data_, config: ConfigMetaCAT) -> list
Undersamples the data for 2 phase learning.
Parameters:
- data (list) – Output of prepare_from_json.
- category_value2id (dict) – Map from category_value to id.
- label_data_ – Map that stores the number of samples for each label.
- config (ConfigMetaCAT) – MetaCAT config.
Returns:
- data_undersampled (list) – The data created for 2 phase learning, with integers in place of strings for category values.
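As a rough illustration of undersampling, the sketch below caps every class at the size of the smallest class. That rule is an assumption made here; the library's actual selection may depend on the config, and `undersample_sketch` is a hypothetical helper that takes only the encoded data.

```python
from collections import Counter


def undersample_sketch(data: list) -> list:
    """Hypothetical helper: keep at most min-class-count samples per label."""
    if not data:
        return []
    # The label is the last field of each (tokens, center indices, label) sample.
    counts = Counter(sample[-1] for sample in data)
    cap = min(counts.values())  # assumption: cap at the minority class size
    kept: Counter = Counter()
    undersampled = []
    for sample in data:
        label = sample[-1]
        if kept[label] < cap:
            undersampled.append(sample)
            kept[label] += 1
    return undersampled


# Illustrative encoded data: three samples of class 0, one of class 1.
encoded_data = [([1], [0], 0), ([2], [0], 0), ([3], [0], 0), ([4], [0], 1)]
balanced = undersample_sketch(encoded_data)
```

The sketch keeps samples in their original order, so the first samples of each class are the ones retained; shuffling beforehand would be needed for a random subset.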
Source code in medcat-v2/medcat/components/addons/meta_cat/data_utils.py