
medcat.utils.regression.targeting

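Classes:

  • FinalTarget

    The final target.

  • OptionSet

    The targeting option set.

  • PhraseChanger

    The phrase changer.

  • ProblematicOptionSetException

  • TargetPlaceholder

    A class describing the options for a specific placeholder.

  • TargetedPhraseChanger

    The target phrase changer.

  • TranslationLayer

    The translation layer between CUIs, names, type_ids and child CUIs.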

Attributes:

logger module-attribute

logger = getLogger(__name__)

FinalTarget

Bases: BaseModel

The final target.

This involves the final phrase (which may already have other placeholders replaced in it), the placeholder to be replaced, and the CUI and specific name being used.

Attributes:

cui instance-attribute

cui: str

final_phrase instance-attribute

final_phrase: str

name instance-attribute

name: str

placeholder instance-attribute

placeholder: str

OptionSet

Bases: BaseModel

The targeting option set.

This describes all the target placeholders and concepts needed.

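Methods:

  • estimate_num_of_subcases

    Get the number of distinct subcases.

  • from_dict

    Construct an OptionSet instance from a dict.

  • get_preprocessors_and_targets

    Get the targeted phrase changers.

  • to_dict

    Convert the OptionSet to a dict.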

Attributes:

allow_any_combinations class-attribute instance-attribute

allow_any_combinations: bool = False

options instance-attribute

options: list[TargetPlaceholder]

estimate_num_of_subcases

estimate_num_of_subcases() -> int

Get the number of distinct subcases.

This counts only what can be calculated without knowledge of the underlying CDB, i.e. it ignores the number of names per CUI and takes into account only what is described in the option set itself.

If any combination is allowed, the per-target count is the product of the numbers of target concepts across all options. If any combination is not allowed, it is simply the number of target concepts for one option (they should all have the same number). In both cases this count is then multiplied by the number of options, as the source listing shows.

Returns:

  • int ( int ) –

    The number of subcases.

Source code in medcat-v2/medcat/utils/regression/targeting.py
def estimate_num_of_subcases(self) -> int:
    """Get the number of distinct subcases.

    This includes ones that can be calculated without the knowledge of the
    underlying CDB. I.e it doesn't care for the number of names involved
    per CUI but only takes into account what is described in the option
    set itself.

    If any combination is allowed, then the answer is the combination of
    the number of target concepts per option. If any combination is not
    allowed, then the answer is simply the number of target concepts for
    an option (they should all have the same number).

    Returns:
        int: The number of subcases.
    """
    num_of_opts = len(self.options)
    if self.allow_any_combinations:
        total_cases = 1
        for cur_opt in self.options:
            total_cases *= len(cur_opt.target_cuis)
    else:
        total_cases = len(self.options[0].target_cuis)
    return num_of_opts * total_cases
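
The arithmetic above can be checked on plain numbers. The following is a minimal sketch, not part of medcat: `estimate_subcases` is a hypothetical helper that takes only the number of target CUIs per placeholder and mirrors the calculation in the listing.

```python
from math import prod


def estimate_subcases(cui_counts: list[int],
                      allow_any_combinations: bool) -> int:
    """Mirror the arithmetic of estimate_num_of_subcases on plain counts.

    cui_counts[i] is the number of target CUIs of the i-th placeholder.
    Hypothetical helper for illustration only.
    """
    if allow_any_combinations:
        # every CUI of one placeholder can pair with every CUI of the others
        per_target = prod(cui_counts)
    else:
        # CUIs are paired up by position, so one count describes them all
        per_target = cui_counts[0]
    # each placeholder takes a turn as the one being tested
    return len(cui_counts) * per_target


any_combo = estimate_subcases([2, 3], allow_any_combinations=True)
paired = estimate_subcases([3, 3], allow_any_combinations=False)
print(any_combo, paired)  # 12 6
```

With two placeholders holding 2 and 3 CUIs, any-combination mode gives 2 × (2 × 3) = 12. With two placeholders of 3 CUIs each and any-combination off, the result is 2 × 3 = 6.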

from_dict classmethod

from_dict(section: dict[str, Any]) -> OptionSet

Construct an OptionSet instance from a dict.

The assumed structure is:

    {
        'placeholders': [
            {
                'placeholder': <e.g. '{DIAGNOSIS}'>,
                'cuis': <the list of CUIs>,
                'prefname-only': 'true'
            },
            <potentially more>
        ],
        'any-combination': <True or False>
    }

The prefname-only key is optional.

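Parameters:

  • section

    (dict[str, Any]) –

    The dict to parse

Raises:

  • ProblematicOptionSetException

    If the number of CUIs differs between placeholders when any combination is not allowed

  • ProblematicOptionSetException

    If placeholders is not a list

  • ProblematicOptionSetException

    If multiple placeholders share the same placeholder string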

Returns:

  • OptionSet ( OptionSet ) –

    The resulting OptionSet

Source code in medcat-v2/medcat/utils/regression/targeting.py
@classmethod
def from_dict(cls, section: dict[str, Any]) -> 'OptionSet':
    """Construct an OptionSet instance from a dict.

    The assumed structure is:
    {
        'placeholders': [
            {
            'placeholder': <e.g. '{DIAGNOSIS}'>,
            'cuis': <the list of CUIs>,
            'prefname-only': 'true'
            }, <potentially more>],
        'any-combination': <True or False>
    }

    The prefname-only key is optional.

    Args:
        section (dict[str, Any]): The dict to parse

    Raises:
        ProblematicOptionSetException:
            If incorrect number of CUIs when not allowing any combination
        ProblematicOptionSetException:
            If placeholders not a list
        ProblematicOptionSetException:
            If multiple placeholders share the same placeholder string

    Returns:
        OptionSet: The resulting OptionSet
    """
    options: list['TargetPlaceholder'] = []
    allow_any_in = section.get('any-combination', 'false')
    if isinstance(allow_any_in, str):
        allow_any_combinations = allow_any_in.lower() == 'true'
    elif isinstance(allow_any_in, bool):
        allow_any_combinations = allow_any_in
    else:
        raise ProblematicOptionSetException(
            f"Unknown 'any-combination' value: {allow_any_in}")
    if 'placeholders' not in section:
        raise ProblematicOptionSetException(
            "Misconfigured - no placeholders")
    section_placeholders = section['placeholders']
    if not isinstance(section_placeholders, list):
        raise ProblematicOptionSetException(
            "Misconfigured - placeholders not a list "
            f"({section_placeholders})")
    used_ph = set()
    for part in section_placeholders:
        placeholder = part['placeholder']
        if not isinstance(placeholder, str):
            raise ProblematicOptionSetException(
                f"Unknown placeholder of type {type(placeholder)}. "
                "Expected a string. Perhaps you need to surround the "
                "placeholder with single quotes (') in the yaml? "
                f"Received: {placeholder}")
        if placeholder in used_ph:
            raise ProblematicOptionSetException(
                "Misconfigured - multiple identical placeholders")
        used_ph.add(placeholder)
        target_cuis: list[str] = part['cuis']
        if not isinstance(target_cuis, list):
            raise ProblematicOptionSetException(
                f"Target CUIs not a list ({type(target_cuis)}): "
                f"{repr(target_cuis)}")
        if 'prefname-only' in part:
            opn = part['prefname-only']
            if isinstance(opn, bool):
                onlyprefnames = opn
            else:
                onlyprefnames = str(opn).lower() == 'true'
        else:
            onlyprefnames = False
        option = TargetPlaceholder(
            placeholder=placeholder, target_cuis=target_cuis,
            onlyprefnames=onlyprefnames)
        options.append(option)
    if not options:
        raise ProblematicOptionSetException(
            "Misconfigured - 0 placeholders found (empty list)")
    if not allow_any_combinations:
        # NOTE: need to have same number of target_cuis
        #       for each placeholder
        # NOTE: there needs to be at least one option / placeholder anyway
        nr_of_cuis = [len(opt.target_cuis) for opt in options]
        if not all(nr == nr_of_cuis[0] for nr in nr_of_cuis):
            raise ProblematicOptionSetException(
                "Unequal number of cuis when any-combination: false: "
                f"{nr_of_cuis}. When any-combination: false the number of "
                "CUIs for each placeholder should be equal.")
    return OptionSet(options=options,
                     allow_any_combinations=allow_any_combinations)
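
As an illustration of the input that from_dict expects, here is a minimal sketch in plain Python. The CUIs and placeholder names are made up, and `parse_flag` is a hypothetical helper that only mirrors how the listing reads the 'any-combination' value; none of this calls medcat itself.

```python
from typing import Any


def parse_flag(value: Any) -> bool:
    """Hypothetical helper: read a flag the way 'any-combination' is read."""
    if isinstance(value, str):
        return value.lower() == 'true'
    if isinstance(value, bool):
        return value
    raise ValueError(f"Unknown 'any-combination' value: {value}")


# Made-up section in the documented structure
section = {
    'placeholders': [
        {'placeholder': '{DIAGNOSIS}', 'cuis': ['C001', 'C002']},
        {'placeholder': '{SYMPTOM}', 'cuis': ['C101', 'C102'],
         'prefname-only': 'true'},
    ],
    'any-combination': 'False',
}

allow_any = parse_flag(section.get('any-combination', 'false'))
cui_counts = [len(part['cuis']) for part in section['placeholders']]
# With any-combination off, every placeholder needs the same number of CUIs
counts_ok = allow_any or all(n == cui_counts[0] for n in cui_counts)
```

Both string and boolean flags are accepted, which matters when the section comes from YAML where `true` may be parsed as either. Any other type is rejected.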

get_preprocessors_and_targets

get_preprocessors_and_targets(translation: TranslationLayer) -> Iterator[TargetedPhraseChanger]

Get the targeted phrase changers.

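Parameters:

  • translation

    (TranslationLayer) –

    The translation layer.

Yields:

  • TargetedPhraseChanger

    The target phrase changers.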

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_preprocessors_and_targets(self, translation: TranslationLayer
                                  ) -> Iterator[TargetedPhraseChanger]:
    """Get the targeted phrase changers.

    Args:
        translation (TranslationLayer): The translation layer.

    Yields:
        Iterator[TargetedPhraseChanger]: The target phrase changers.
    """
    num_of_opts = len(self.options)
    if num_of_opts == 1:
        # NOTE: when there's only 1 option, the other option doesn't work
        #       since it has nothing to iterate over regarding 'other'
        #       options
        opt = self.options[0]
        for target_cui in opt.target_cuis:
            yield TargetedPhraseChanger(changer=PhraseChanger.empty(),
                                        placeholder=opt.placeholder,
                                        cui=target_cui,
                                        onlyprefnames=opt.onlyprefnames)
        return
    for opt_nr in range(num_of_opts):
        other_opts = list(self.options)
        cur_opt = other_opts.pop(opt_nr)
        for changer, target_cui in self._get_all_combinations(
                cur_opt, other_opts, translation):
            yield TargetedPhraseChanger(
                changer=changer,
                placeholder=cur_opt.placeholder,
                cui=target_cui,
                onlyprefnames=cur_opt.onlyprefnames)
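
The multi-option loop above gives each placeholder a turn as the one under test while the remaining ones become the "other" options. A minimal sketch of that rotation on plain strings (the placeholder names are made up):

```python
placeholders = ['{DIAGNOSIS}', '{SYMPTOM}', '{DRUG}']

rotations = []
for idx in range(len(placeholders)):
    others = list(placeholders)   # copy, so the original list stays intact
    current = others.pop(idx)     # the placeholder under test this round
    rotations.append((current, others))

for current, others in rotations:
    print(current, others)
```

With a single placeholder, `others` would be empty on the only round, which is why the listing handles that case separately with an empty phrase changer.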

to_dict

to_dict() -> dict

Convert the OptionSet to a dict.

Returns:

  • dict ( dict ) –

    The dict representation

Source code in medcat-v2/medcat/utils/regression/targeting.py
def to_dict(self) -> dict:
    """Convert the OptionSet to a dict.

    Returns:
        dict: The dict representation
    """
    placeholders = [
        {
            'placeholder': opt.placeholder,
            'cuis': opt.target_cuis,
            'prefname-only': str(opt.onlyprefnames),
        }
        for opt in self.options
    ]
    return {
        'placeholders': placeholders,
        'any-combination': str(self.allow_any_combinations)
    }
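
Note that to_dict writes both flags as strings ('True' / 'False'), while from_dict compares the lowercased string with 'true'. A small sketch of why that round trip is lossless; `flag_to_text` and `text_to_flag` are hypothetical names for the two conversions:

```python
def flag_to_text(flag: bool) -> str:
    # what to_dict does with each boolean
    return str(flag)


def text_to_flag(text: str) -> bool:
    # what from_dict does with a string value
    return text.lower() == 'true'


round_tripped = [text_to_flag(flag_to_text(flag)) for flag in (True, False)]
print(round_tripped)  # [True, False]
```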

PhraseChanger

Bases: BaseModel

The phrase changer.

This is a class used as a preprocessor for phrases with multiple placeholders. It allows swapping in the rest of the placeholders while leaving in the one that's being tested for.

Methods:

  • empty

    Gets the empty phrase changer.

Attributes:

preprocess_placeholders instance-attribute

preprocess_placeholders: list[tuple[str, str]]

empty classmethod

empty() -> PhraseChanger

Gets the empty phrase changer.

That is a phrase changer that makes no changes to the phrase.

Returns:

  • PhraseChanger ( PhraseChanger ) –

    The empty phrase changer.

Source code in medcat-v2/medcat/utils/regression/targeting.py
@classmethod
def empty(cls) -> 'PhraseChanger':
    """Gets the empty phrase changer.

    That is a phrase changer that makes no changes to the phrase.

    Returns:
        PhraseChanger: The empty phrase changer.
    """
    return cls(preprocess_placeholders=[])
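
How the stored pairs are applied is not shown on this page. The sketch below assumes each entry of preprocess_placeholders is a (placeholder, replacement) pair applied by plain string replacement; `apply_replacements` is a hypothetical stand-in, not the class's actual method.

```python
def apply_replacements(phrase: str, pairs: list[tuple[str, str]]) -> str:
    """Assumed behaviour: swap each placeholder for its replacement."""
    for placeholder, replacement in pairs:
        phrase = phrase.replace(placeholder, replacement)
    return phrase


template = "Patient with {DIAGNOSIS} reports {SYMPTOM}"
# An empty list of pairs corresponds to the "empty" phrase changer
unchanged = apply_replacements(template, [])
# Fill in the other placeholder, leave the one under test in place
prefilled = apply_replacements(template, [("{SYMPTOM}", "headache")])
```

Under that assumption, the empty changer returns the phrase untouched, and a non-empty one leaves only the tested placeholder in the phrase.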

ProblematicOptionSetException

ProblematicOptionSetException(*args: object)

Bases: ValueError

Source code in medcat-v2/medcat/utils/regression/targeting.py
def __init__(self, *args: object) -> None:
    super().__init__(*args)

TargetPlaceholder

Bases: BaseModel

A class describing the options for a specific placeholder.

Attributes:

onlyprefnames class-attribute instance-attribute

onlyprefnames: bool = False

placeholder instance-attribute

placeholder: str

target_cuis instance-attribute

target_cuis: list[str]

TargetedPhraseChanger

Bases: BaseModel

The target phrase changer.

It includes the phrase changer (for preprocessing) along with the relevant concept and the placeholder it will replace.

Attributes:

changer instance-attribute

changer: PhraseChanger

cui instance-attribute

cui: str

onlyprefnames instance-attribute

onlyprefnames: bool

placeholder instance-attribute

placeholder: str

TranslationLayer

TranslationLayer(cui2info: dict[str, CUIInfo], name2info: dict[str, NameInfo], cui2children: dict[str, set[str]], separator: str, whitespace: str = ' ')

The translation layer for translating:

  • CUIs to names
  • names to CUIs
  • type_ids to CUIs
  • CUIs to child CUIs

The idea is to decouple these translations from the CDB instance in case something changes there.

Parameters:

Methods:

Attributes:

Source code in medcat-v2/medcat/utils/regression/targeting.py
def __init__(self, cui2info: dict[str, CUIInfo],
             name2info: dict[str, NameInfo],
             cui2children: dict[str, set[str]],
             separator: str, whitespace: str = ' ') -> None:
    self.cui2info = cui2info
    self.name2info = name2info
    self.separator = separator
    self.whitespace = whitespace
    self.type_id2cuis: dict[str, set[str]] = {}
    for cui, ci in self.cui2info.items():
        type_ids = ci["type_ids"]
        for type_id in type_ids:
            if type_id not in self.type_id2cuis:
                self.type_id2cuis[type_id] = set()
            self.type_id2cuis[type_id].add(cui)
    self.cui2children = cui2children
    for cui in self.cui2info:
        if cui not in cui2children:
            self.cui2children[cui] = set()
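
The constructor does two things beyond storing its arguments: it inverts the CUI to type_ids mapping, and it gives every known CUI an entry in cui2children. A minimal sketch on toy data (the CUIs and type IDs are made up, and the info dicts hold only the key used here):

```python
cui2info = {
    'C001': {'type_ids': {'T1'}},
    'C002': {'type_ids': {'T1', 'T2'}},
}

# Invert CUI -> type_ids into type_id -> CUIs
type_id2cuis: dict[str, set[str]] = {}
for cui, info in cui2info.items():
    for type_id in info['type_ids']:
        type_id2cuis.setdefault(type_id, set()).add(cui)

# Make sure every known CUI has a (possibly empty) set of children
cui2children = {'C001': {'C002'}}
for cui in cui2info:
    cui2children.setdefault(cui, set())
```

Because the listing assigns `self.cui2children = cui2children` and then adds the missing keys, the dict passed in by the caller is modified in place.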

cui2children instance-attribute

cui2children = cui2children

cui2info instance-attribute

cui2info = cui2info

name2info instance-attribute

name2info = name2info

separator instance-attribute

separator = separator

type_id2cuis instance-attribute

type_id2cuis: dict[str, set[str]] = {}

whitespace instance-attribute

whitespace = whitespace

from_CDB classmethod

from_CDB(cdb: CDB) -> TranslationLayer

Construct a TranslationLayer object from a concept database (CDB).

This translation layer will refer to the same dicts that the CDB refers to. While there is no obvious reason these should be modified, it's something to keep in mind.

Parameters:

  • cdb

    (CDB) –

    The CDB

Returns:

  • TranslationLayer ( TranslationLayer ) –

    The resulting TranslationLayer

Source code in medcat-v2/medcat/utils/regression/targeting.py
@classmethod
def from_CDB(cls, cdb: CDB) -> 'TranslationLayer':
    """Construct a TranslationLayer object from a concept database (CDB).

    This translation layer will refer to the same dicts that the CDB
    refers to. While there is no obvious reason these should be modified,
    it's something to keep in mind.

    Args:
        cdb (CDB): The CDB

    Returns:
        TranslationLayer: The resulting TranslationLayer
    """
    if 'pt2ch' not in cdb.addl_info:
        logger.warning(
            "No parent to child information presented so "
            "they cannot be used")
        parent2child = {}
    else:
        parent2child = cdb.addl_info['pt2ch']
    return TranslationLayer(
        cui2info=cdb.cui2info,
        name2info=cdb.name2info,
        cui2children=parent2child,
        separator=cdb.config.general.separator)
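
The note about shared dicts can be made concrete. In this sketch a plain dict stands in for cdb.addl_info (an assumption for illustration), and the `setdefault` call stands in for the constructor's step of adding empty child sets:

```python
# Stand-in for cdb.addl_info with parent -> children information
addl_info = {'pt2ch': {'C001': {'C002'}}}

# Same selection logic as in from_CDB
parent2child = addl_info['pt2ch'] if 'pt2ch' in addl_info else {}

# What the constructor's fill-in step amounts to for a CUI without children
parent2child.setdefault('C002', set())

# The layer and the CDB-side dict are the same object
shared = parent2child is addl_info['pt2ch']
```

After this, the dict held by the CDB stand-in also contains the new 'C002' key. If that is undesirable, passing a copy of the mapping to the TranslationLayer constructor is one way to keep the two apart.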

get_children_of

get_children_of(found_cuis: Iterable[str], cui: str, depth: int = 1) -> list[str]

Get the children of the specified CUI in the listed CUIs (if they exist).

Parameters:

  • found_cuis

    (Iterable[str]) –

    The list of CUIs to look in

  • cui

    (str) –

    The target parent CUI

  • depth

    (int, default: 1 ) –

    The depth to carry out the search for

Returns:

  • list[str]

    list[str]: The list of children found

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_children_of(self, found_cuis: Iterable[str],
                    cui: str, depth: int = 1) -> list[str]:
    """Get the children of the specified CUI in the
    listed CUIs (if they exist).

    Args:
        found_cuis (Iterable[str]): The list of CUIs to look in
        cui (str): The target parent CUI
        depth (int): The depth to carry out the search for

    Returns:
        list[str]: The list of children found
    """
    if cui not in self.cui2children:
        return []  # no children
    children = self.cui2children[cui]
    found_children = []
    for child in children:
        if child in found_cuis:
            found_children.append(child)
    if depth > 1:
        for child in children:
            found_children.extend(self.get_children_of(
                found_cuis, child, depth - 1))
    return found_children
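
A minimal sketch of the depth-limited search on a toy hierarchy. The CUIs are made up, and `children_of` is a free-function copy of the logic above so that it runs without medcat:

```python
from typing import Iterable

# Toy parent -> children mapping
cui2children = {
    'ROOT': {'A', 'B'},
    'A': {'A1'},
    'B': set(),
    'A1': set(),
}


def children_of(found_cuis: Iterable[str], cui: str,
                depth: int = 1) -> list[str]:
    if cui not in cui2children:
        return []  # no children
    children = cui2children[cui]
    found = [child for child in children if child in found_cuis]
    if depth > 1:
        # descend one level at a time until the depth budget runs out
        for child in children:
            found.extend(children_of(found_cuis, child, depth - 1))
    return found


found = {'B', 'A1'}
shallow = children_of(found, 'ROOT', depth=1)  # only direct children
deep = children_of(found, 'ROOT', depth=2)     # also grandchildren
```

Since `found_cuis` is tested for membership repeatedly, passing a set or list is the safer choice. A one-shot iterator would be consumed by the first membership test.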

get_direct_children

get_direct_children(cui: str) -> list[str]

Get the direct children of a concept.

This means only the children, but not grandchildren.

If the underlying CDB doesn't list children for this CUI, an empty list is returned.

Parameters:

  • cui

    (str) –

    The concept in question.

Returns:

  • list[str]

    list[str]: The (potentially empty) list of direct children.

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_direct_children(self, cui: str) -> list[str]:
    """Get the direct children of a concept.

    This means only the children, but not grandchildren.

    If the underlying CDB doesn't list children for this CUI,
    an empty list is returned.

    Args:
        cui (str): The concept in question.

    Returns:
        list[str]: The (potentially empty) list of direct children.
    """
    return list(self.cui2children.get(cui, []))

get_direct_parents cached

get_direct_parents(cui: str) -> list[str]

Get the direct parent(s) of a concept.

PS: This method can be quite CPU-heavy, since it runs through all the parent-children relationships; the child->parent(s) relationship isn't normally kept track of.

Parameters:

  • cui

    (str) –

    The concept in question.

Returns:

  • list[str]

    list[str]: The (potentially empty) list of direct parents.

Source code in medcat-v2/medcat/utils/regression/targeting.py
@lru_cache(maxsize=10_000)
def get_direct_parents(self, cui: str) -> list[str]:
    """Get the direct parent(s) of a concept.

    PS: This method can be quite a CPU heavy one since it relies
        on running through all the parent-children relationships
        since the child->parent(s) relationship isn't normally
        kept track of.

    Args:
        cui (str): The concept in question.

    Returns:
        list[str]: The (potentially empty) list of direct parents.
    """
    parents = []
    for pot_parent, children in self.cui2children.items():
        if cui in children:
            parents.append(pot_parent)
    return parents
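
The reverse lookup and its caching can be sketched on toy data. This is a free-function variant for illustration, not the method itself; returning a sorted tuple instead of a list is a choice made here so that callers cannot mutate the cached value.

```python
from functools import lru_cache

# Toy parent -> children mapping; 'X' has two parents
cui2children = {'P1': {'X', 'Y'}, 'P2': {'X'}, 'X': set(), 'Y': set()}


@lru_cache(maxsize=10_000)
def direct_parents(cui: str) -> tuple[str, ...]:
    # Full scan: no child -> parent index exists, so every entry is checked
    return tuple(sorted(parent
                        for parent, children in cui2children.items()
                        if cui in children))


parents_of_x = direct_parents('X')
direct_parents('X')  # second call is served from the cache
hits = direct_parents.cache_info().hits
```

The cache avoids repeating the full scan for the same CUI, but cached results are not refreshed if the mapping changes later.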

get_first_name

get_first_name(cui: str) -> str

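Get the preprocessed, potentially arbitrary, first name of the given concept.

If the concept has no names, the CUI itself is returned. The lookup itself assumes the CUI is present in cui2info; get_preferred_name checks this before calling.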

PS: The "first" name may not be consistent across runs since it relies on set order.

Parameters:

  • cui

    (str) –

    The concept ID.

Returns:

  • str ( str ) –

    The first name.

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_first_name(self, cui: str) -> str:
    """Get the preprocessed, potentially arbitrary, first name
    of the given concept.

    If the concept has no names, the CUI itself is returned.

    PS: The "first" name may not be consistent across runs since it
    relies on set order.

    Args:
        cui (str): The concept ID.

    Returns:
        str: The first name.
    """
    for name in self.cui2info[cui]["names"]:
        return name.replace(self.separator, self.whitespace)
    return cui

get_names_of

get_names_of(cui: str, only_prefnames: bool) -> list[str]

Get the preprocessed names of a CUI.

This method preprocesses the names by replacing the separator (generally ~) with the appropriate whitespace (a single space by default).

If the concept is not in the underlying CDB, an empty list is returned.

Parameters:

  • cui

    (str) –

    The concept in question.

  • only_prefnames

    (bool) –

    Whether to only return a preferred name.

Returns:

  • list[str]

    list[str]: The list of names.

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_names_of(self, cui: str, only_prefnames: bool) -> list[str]:
    """Get the preprocessed names of a CUI.

    This method preprocesses the names by replacing the separator
    (generally `~`) with the appropriate whitespace (` `).

    If the concept is not in the underlying CDB, an empty list is returned.

    Args:
        cui (str): The concept in question.
        only_prefnames (bool): Whether to only return a preferred name.

    Returns:
        list[str]: The list of names.
    """
    if cui not in self.cui2info:
        logger.warning(
            "CUI %s Is not defined in CDB / translation layer", cui)
        return []
    if only_prefnames:
        return [self.get_preferred_name(cui).replace(
            self.separator, self.whitespace)]
    return [name.replace(self.separator, self.whitespace)
            # NOTE: sorting the order here in case we're using
            #       edits in which case the order of the names
            #       needs to be the same, otherwise different
            #       edits will be used across runs
            for name in sorted(self.cui2info[cui]["names"])]
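
A minimal sketch of the name preprocessing, using made-up names and the separator and whitespace values mentioned in the docstring:

```python
separator, whitespace = '~', ' '

# Names are stored in a set, so their iteration order is not stable
names = {'heart~attack', 'myocardial~infarction', 'mi'}

# Sort first (on the stored form), then make the names readable
readable = [name.replace(separator, whitespace) for name in sorted(names)]
print(readable)  # ['heart attack', 'mi', 'myocardial infarction']
```

Sorting makes the order of the returned names repeatable across runs, which is the concern raised in the NOTE comment of the listing.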

get_preferred_name

get_preferred_name(cui: str) -> str

Get the preferred name of a concept.

If no preferred name is found, an arbitrary 'first' name is selected instead.

Parameters:

  • cui

    (str) –

    The concept ID.

Returns:

  • str ( str ) –

    The preferred name.

Source code in medcat-v2/medcat/utils/regression/targeting.py
def get_preferred_name(self, cui: str) -> str:
    """Get the preferred name of a concept.

    If no preferred name is found, the random 'first' name is selected.

    Args:
        cui (str): The concept ID.

    Returns:
        str: The preferred name.
    """
    if cui not in self.cui2info:
        logger.warning(
            "CUI %s Is not defined in CDB / translation layer", cui)
        return cui
    pref_name = self.cui2info[cui]["preferred_name"]
    if pref_name is None:
        logger.warning("CUI %s does not have a preferred name. "
                       "Using a random 'first' name of all the names", cui)
        return self.get_first_name(cui)
    return pref_name
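
The fallback chain (preferred name, then an arbitrary first name, then the CUI itself) can be sketched on toy data. The CUIs and names are made up, and `preferred_name` is a free-function combination of get_preferred_name and get_first_name without the logging:

```python
cui2info = {
    'C001': {'preferred_name': 'Heart attack', 'names': {'heart~attack'}},
    'C002': {'preferred_name': None, 'names': {'mi'}},
}


def preferred_name(cui: str, separator: str = '~',
                   whitespace: str = ' ') -> str:
    if cui not in cui2info:
        return cui  # unknown concept: fall back to the CUI
    pref = cui2info[cui]['preferred_name']
    if pref is None:
        # take whichever name the set yields first
        for name in cui2info[cui]['names']:
            return name.replace(separator, whitespace)
        return cui  # concept without any names
    return pref


results = [preferred_name(cui) for cui in ('C001', 'C002', 'C999')]
print(results)  # ['Heart attack', 'mi', 'C999']
```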