
medcat.components.ner.trf.deid

De-identification model.

This describes a wrapper on the regular CAT model. The idea is to simplify the use of a DeId-specific model.

It tackles two use cases:

  1) Creation of a deid model
  2) Loading and use of a deid model

For use case 1, instead of:

    cat = CAT(cdb=ner.cdb, addl_ner=ner)

you can use (with cdb a CDB and cnf a ConfigTransformersNER, matching the signature of create):

    deid = DeIdModel.create(cdb, cnf)

And for use case 2:

Instead of:

    cat = CAT.load_model_pack(model_pack_path)
    anon_text = deid_text(cat, text)

You can use:

    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_text = deid.deid_text(text)

Or if/when structured output is desired:

    deid = DeIdModel.load_model_pack(model_pack_path)
    anon_doc = deid(text)  # the spacy document

The wrapper also exposes some CAT parts directly:

  • config
  • cdb

Classes:

  • DeIdModel

    The DeID model.

Functions:

  • match_rules

    Match a set of rules - pat / cui combos as post processing labels.

  • merge_all_preds

    Convenience method to merge rule-based matches and DeID model predictions for several texts at once.

  • merge_preds

    Merge rule-based matches and DeID model predictions for a single text.

Attributes:

logger module-attribute

logger = getLogger(__name__)

DeIdModel

DeIdModel(cat: CAT)

Bases: NerModel

The DeID model.

This wraps a CAT instance and simplifies its use as a de-identification model.

It provides methods for creating one from a TransformersNER as well as loading from a model pack (along with some validation).

It also exposes some useful parts of the CAT it wraps such as the config and the concept database.

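Methods:

  • create

  • deid_multi_text

    Deprecated alias of deid_multi_texts.

  • deid_multi_texts

  • deid_text

    Deidentify text and potentially redact information.

  • deid_text_with_entities

    Deidentify text and potentially redact information.

  • load_model_pack

    Load DeId model from model pack.

  • train

Attributes:

  • cat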

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def __init__(self, cat: CAT) -> None:
    self.cat = cat

cat instance-attribute

cat = cat

create classmethod

create(cdb: CDB, cnf: ConfigTransformersNER)
Source code in medcat-v2/medcat/components/ner/trf/deid.py
@classmethod
def create(cls, cdb: CDB, cnf: ConfigTransformersNER):
    cdb.config.components.ner.comp_name = TransformersNER.name
    cdb.config.components.ner.custom_cnf = cnf
    # no-action linker
    cdb.config.components.linking.comp_name = 'no_action'
    cat = CAT(cdb=cdb, vocab=None, config=cdb.config)
    return cls(cat)

deid_multi_text

deid_multi_text(texts: Iterable[str], redact: bool = False, n_process: Optional[int] = None) -> list[str]
Source code in medcat-v2/medcat/components/ner/trf/deid.py
def deid_multi_text(self, texts: Iterable[str], redact: bool = False,
                    n_process: Optional[int] = None) -> list[str]:
    warnings.warn(
        "deid_multi_text() is deprecated and will be removed in a "
        "future release. Use deid_multi_texts() instead.",
        DeprecationWarning,
        stacklevel=2
    )
    return self.deid_multi_texts(texts, redact, n_process)
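
Callers that still use deid_multi_text() will see a DeprecationWarning attributed to their own call site, because stacklevel=2 points one frame above the wrapper. The sketch below reproduces that pattern on a hypothetical stand-in class (LegacyApi and its methods are illustrative names, not part of MedCAT) and shows one way the warning could be captured, for example in a test.

```python
import warnings


class LegacyApi:
    """Hypothetical stand-in mirroring the deprecation pattern above."""

    def new_name(self, texts):
        return [t.upper() for t in texts]

    def old_name(self, texts):
        warnings.warn(
            "old_name() is deprecated. Use new_name() instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.new_name(texts)


def call_and_capture(texts):
    """Call the deprecated method and return (result, captured warnings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        result = LegacyApi().old_name(texts)
    return result, caught
```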

deid_multi_texts

deid_multi_texts(texts: Iterable[str], redact: bool = False, n_process: Optional[int] = None) -> list[str]
Source code in medcat-v2/medcat/components/ner/trf/deid.py
def deid_multi_texts(self, texts: Iterable[str], redact: bool = False,
                     n_process: Optional[int] = None) -> list[str]:
    if n_process is None:
        n_process = 1

    entities = self.cat.get_entities_multi_texts(
        texts, n_process=n_process)
    out: list[str] = []
    for raw_text, (_, _ents) in zip(texts, entities):
        ents = _ents['entities']
        text: str
        if isinstance(raw_text, tuple):
            text = raw_text[1]
        elif isinstance(raw_text, str):
            text = raw_text
        else:
            raise ValueError("Unknown raw text: "
                             f"{type(raw_text)}: {raw_text}")
        new_text = replace_entities_in_text(
            text, ents, get_cui_name=self.cat.cdb.get_name, redact=redact)
        out.append(new_text)
    return out
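
The loop above accepts each input item either as a plain string or as a tuple whose second element is the text, and it iterates over texts again after the entities have been requested. A one-shot iterator such as a generator may already be exhausted by then, so passing a list looks like the safer choice. The sketch below illustrates both points in plain Python; normalise_text and consume_twice are hypothetical helpers, not MedCAT functions, and the tuple's first element is assumed to be some identifier.

```python
from typing import Iterable, Union

RawText = Union[str, tuple]


def normalise_text(raw_text: RawText) -> str:
    """Hypothetical helper mirroring the isinstance checks above."""
    if isinstance(raw_text, tuple):
        return raw_text[1]  # second element holds the text
    if isinstance(raw_text, str):
        return raw_text
    raise ValueError(f"Unknown raw text: {type(raw_text)}: {raw_text}")


def consume_twice(texts: Iterable[RawText]) -> tuple[list[str], list[str]]:
    """Iterate over the same iterable twice and collect both passes."""
    first = [normalise_text(t) for t in texts]
    second = [normalise_text(t) for t in texts]
    return first, second


notes = ["Seen by Dr Smith", ("note-2", "Call 555 0100")]
as_list = consume_twice(notes)                  # both passes see every item
as_generator = consume_twice(n for n in notes)  # second pass sees nothing
```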

deid_text

deid_text(text: str, redact: bool = False) -> str

Deidentify text and optionally redact information.

If redaction is enabled, identifiable entities are replaced with stars (e.g. *****). Otherwise, each entity is replaced with its CUI, in other words the type of information that was hidden (e.g. [PATIENT]).

Parameters:

  • text

    (str) –

    The text to deidentify.

  • redact

    (bool, default: False ) –

    Whether to redact the information.

Returns:

  • str ( str ) –

    The deidentified text.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def deid_text(self, text: str, redact: bool = False) -> str:
    """Deidentify text and potentially redact information.

    If redaction is enabled, identifiable entities are
    replaced with stars (e.g. `*****`).
    Otherwise, each entity is replaced with its CUI, in other words
    the type of information that was hidden (e.g. [PATIENT]).

    Args:
        text (str): The text to deidentify.
        redact (bool): Whether to redact the information.

    Returns:
        str: The deidentified text.
    """
    new_text, _entities = self.deid_text_with_entities(text, redact=redact)
    return new_text
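
To make the two replacement modes concrete, here is a minimal sketch of span replacement written from the description above. It is not MedCAT's replace_entities_in_text: replace_spans, the sample entity dict and the name lookup are assumptions for illustration, and the real function's exact output format (for instance the number of stars or the bracket style) may differ.

```python
from typing import Callable


def replace_spans(text: str, entities: dict, get_name: Callable[[str], str],
                  redact: bool = False) -> str:
    """Replace each entity span with stars or with a bracketed label."""
    # Work right to left so earlier offsets stay valid after each edit.
    ordered = sorted(entities.values(), key=lambda e: e["start"], reverse=True)
    for ent in ordered:
        if redact:
            replacement = "*" * (ent["end"] - ent["start"])
        else:
            replacement = f"[{get_name(ent['cui'])}]"
        text = text[:ent["start"]] + replacement + text[ent["end"]:]
    return text


names = {"PATIENT": "PATIENT"}  # assumed CUI-to-name mapping
ents = {0: {"cui": "PATIENT", "start": 8, "end": 16}}  # span of "John Doe"
sample = "Patient John Doe was admitted."
labelled = replace_spans(sample, ents, names.__getitem__)
redacted = replace_spans(sample, ents, names.__getitem__, redact=True)
```

Processing spans from the end of the text backwards is a design choice of this sketch: it avoids recomputing offsets after each replacement changes the text length.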

deid_text_with_entities

deid_text_with_entities(text: str, redact: bool = False) -> tuple[str, Entities]

Deidentify text and optionally redact information, returning the entities as well.

If redaction is enabled, identifiable entities are replaced with stars (e.g. *****). Otherwise, each entity is replaced with its CUI, in other words the type of information that was hidden (e.g. [PATIENT]).

Parameters:

  • text

    (str) –

    The text to deidentify.

  • redact

    (bool, default: False ) –

    Whether to redact the information.

Returns:

  • tuple[str, Entities] –

    A tuple containing:

      - The deidentified text as a string.
      - The entities found and linked within the text.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def deid_text_with_entities(self, text: str, redact: bool = False
                                ) -> tuple[str, Entities]:
    """Deidentify text and potentially redact information.

    If redaction is enabled, identifiable entities are
    replaced with stars (e.g. `*****`).
    Otherwise, each entity is replaced with its CUI, in other words
    the type of information that was hidden (e.g. [PATIENT]).

    Args:
        text (str): The text to deidentify.
        redact (bool): Whether to redact the information.

    Returns:
        Tuple[str, Entities]: A tuple containing:
            - The deidentified text as a string.
            - The entities found and linked within the text.
    """
    entities = self.cat.get_entities(text)
    new_text = replace_entities_in_text(
        text,
        entities['entities'],
        self.cat.cdb.get_name,
        redact=redact,
    )
    return new_text, entities

load_model_pack classmethod

load_model_pack(model_pack_path: str, config: Optional[dict] = None) -> DeIdModel

Load DeId model from model pack.

The method first loads the CAT instance.

It then makes sure that the model pack corresponds to a valid DeId model.

Parameters:

  • config

    (Optional[dict], default: None ) –

    Config for DeId model pack (primarily for stride of overlap window)

  • model_pack_path

    (str) –

    The model pack path.

Raises:

  • ValueError

    If the model pack does not correspond to a DeId model.

Returns:

  • DeIdModel ( DeIdModel ) –

    The resulting DeId model.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
@classmethod
def load_model_pack(cls, model_pack_path: str,
                    config: Optional[dict] = None) -> 'DeIdModel':
    """Load DeId model from model pack.

    The method first loads the CAT instance.

    It then makes sure that the model pack corresponds to a
    valid DeId model.

    Args:
        config: Config for DeId model pack (primarily for stride of
            overlap window)
        model_pack_path (str): The model pack path.

    Raises:
        ValueError: If the model pack does not correspond to a DeId model.

    Returns:
        DeIdModel: The resulting DeId model.
    """
    ner_model = NerModel.load_model_pack(model_pack_path, config=config)
    cat = ner_model.cat
    if not cls._is_deid_model(cat):
        raise ValueError(
            f"The model saved at {model_pack_path} is not a deid model "
            f"({cls._get_reason_not_deid(cat)})")
    model = cls(ner_model.cat)
    return model
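
A call site might guard against loading a pack that is not a DeId model, since load_model_pack raises ValueError in that case. The sketch below assumes the import path given in this page's title and uses only methods documented here; deidentify_note and its arguments are illustrative names. The import is deferred into the function so the snippet can be defined without MedCAT installed.

```python
from typing import Optional


def deidentify_note(model_pack_path: str, text: str,
                    redact: bool = False) -> Optional[str]:
    """Load a DeId model pack and de-identify one text.

    Returns None if the pack is not a valid DeId model.
    """
    # Deferred import: module path taken from this page's title.
    from medcat.components.ner.trf.deid import DeIdModel

    try:
        deid = DeIdModel.load_model_pack(model_pack_path)
    except ValueError as err:
        # Raised when the pack does not correspond to a DeId model.
        print(f"Not a DeId model pack: {err}")
        return None
    return deid.deid_text(text, redact=redact)
```

In practice the model would more likely be loaded once and reused across many texts; loading inside the function only keeps the sketch short.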

train

train(json_path: Union[str, list, None], *args, **kwargs) -> tuple[Any, Any, Any]
Source code in medcat-v2/medcat/components/ner/trf/deid.py
def train(self, json_path: Union[str, list, None],
          *args, **kwargs) -> tuple[Any, Any, Any]:
    return super().train(json_path,
                         *args, **kwargs)  # type: ignore

match_rules

match_rules(rules: list[tuple[str, str]], texts: list[str], cui2preferred_name: dict[str, str]) -> list[list[Entity]]

Match a set of rules (pattern / CUI combos) against texts to produce post-processing labels. Each pattern is applied as a regular expression, so characters that should match literally need escaping. Pretty names are looked up in the CUI-to-name mapping of a CAT DeID model.

Parameters:

  • rules

    (list[tuple[str, str]]) –

    List of tuples of pattern and CUI.

  • texts

    (list[str]) –

    List of texts to match the rules on.

  • cui2preferred_name

    (dict[str, str]) –

    Dictionary of CUI to preferred name, likely to be cat.cdb.cui2preferred_name.

Examples:

    >>> cat = CAT.load_model_pack(model_pack_path)
    ...
    >>> rules = [
    ...     (re.escape('(123) 456-7890'), '134'),
    ...     ('1234567890', '134'),
    ...     (re.escape('123.456.7890'), '134'),
    ... ]
    >>> texts = [
    ...     'My phone number is (123) 456-7890',
    ...     'My phone number is 1234567890',
    ...     'My phone number is 123.456.7890',
    ... ]
    >>> matches = match_rules(rules, texts, cat.cdb.cui2preferred_name)

Returns:

  • list[list[Entity]] –

    One list of rule matches per input text.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def match_rules(rules: list[tuple[str, str]], texts: list[str],
                cui2preferred_name: dict[str, str]) -> list[list[Entity]]:
    """Match a set of rules - pat / cui combos as post processing labels.
    Uses a cat DeID model for pretty name mapping.
    Args:
        rules (list[tuple[str, str]]): List of tuples of pattern and cui
        texts (list[str]): List of texts to match rules on
        cui2preferred_name (dict[str, str]): Dictionary of CUI to
            preferred name, likely to be cat.cdb.cui2preferred_name.
    Examples:
        >>> cat = CAT.load_model_pack(model_pack_path)
        ...
        >>> rules = [
        ...     (re.escape('(123) 456-7890'), '134'),
        ...     ('1234567890', '134'),
        ...     (re.escape('123.456.7890'), '134'),
        ... ]
        >>> texts = [
        ...     'My phone number is (123) 456-7890',
        ...     'My phone number is 1234567890',
        ...     'My phone number is 123.456.7890',
        ... ]
        >>> matches = match_rules(rules, texts, cat.cdb.cui2preferred_name)
    Returns:
        list[list[Entity]]: One list of rule matches per input text.
    """
    # Iterate through each text and pattern combination
    rule_matches_per_text: list[list[Entity]] = []
    for i, text in enumerate(texts):
        matches_in_text: list[Entity] = []
        for pattern, concept in rules:
            # Find all matches of current pattern in current text
            text_matches = re.finditer(pattern, text, flags=re.M)
            # Add each match with its pattern and text info
            for match in text_matches:
                matches_in_text.append({
                    'source_value': match.group(),
                    'pretty_name': cui2preferred_name[concept],
                    'start': match.start(),
                    'end': match.end(),
                    'cui': concept,
                    'acc': 1.0
                })
        rule_matches_per_text.append(matches_in_text)
    return rule_matches_per_text
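
Because each pattern is handed to re.finditer as a regular expression, characters such as parentheses and dots keep their regex meaning. The sketch below uses only the standard library to show the difference re.escape makes for a literal phone-number pattern; find_spans is a hypothetical helper that mirrors the inner loop above and is not part of MedCAT.

```python
import re


def find_spans(pattern: str, text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched text) for every match of pattern."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, text, flags=re.M)]


text = "My phone number is (123) 456-7890"

# Unescaped: "(123)" is a capture group, so the literal brackets never match.
unescaped = find_spans("(123) 456-7890", text)

# Escaped: every character is matched literally.
escaped = find_spans(re.escape("(123) 456-7890"), text)
```

With the unescaped pattern the text yields no match at all, while the escaped pattern finds the full number including its brackets.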

merge_all_preds

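merge_all_preds(model_preds_by_text: list[list[Entity]], rule_matches_per_text: list[list[Entity]], accept_preds: bool = True) -> list[list[Entity]]

Convenience method to merge rule-based matches and DeID model predictions for several texts at once.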

Parameters:

  • model_preds_by_text

    (list[list[Entity]]) –

    Model predictions per text: the outputs of cat.get_entities(), converted with [list(m['entities'].values()) for m in model_preds].

  • rule_matches_per_text

    (list[list[Entity]]) –

    The output of running match_rules on the same texts, in the same order.

  • accept_preds

    (bool, default: True ) –

    If True, model predictions (model_preds_by_text) take precedence over rule matches where spans overlap; if False, rule matches take precedence. Defaults to True.

Returns:

  • list[list[Entity]] –

    The merged predictions, one list per text.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def merge_all_preds(model_preds_by_text: list[list[Entity]],
                    rule_matches_per_text: list[list[Entity]],
                    accept_preds: bool = True) -> list[list[Entity]]:
    """Convenience method to merge rule-based matches and DeID model
    predictions for several texts at once.

    Args:
        model_preds_by_text (list[list[Entity]]):
            List of predictions from `cat.get_entities()`, then
            `[list(m['entities'].values()) for m in model_preds]`
        rule_matches_per_text (list[list[Entity]]):
            List of predictions from output of running `match_rules`
        accept_preds (bool):
            Uses the predicted label from the model,
            model_preds_by_text, over the rule matches if they overlap.
            Defaults to using model preds over rules.
    Returns:
        list[list[Entity]]: List of lists of predictions from `merge_all_preds`
    """
    assert len(model_preds_by_text) == len(rule_matches_per_text), (
        "model_preds_by_text and rule_matches_per_text must have the same "
        "length as they should be CAT.get_entities and match_rules outputs of "
        "the same text")
    return [
        merge_preds(model_preds_by_text[i],
                    rule_matches_per_text[i],
                    accept_preds)
        for i in range(len(model_preds_by_text))]
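
The conversion described for model_preds_by_text can be sketched with hand-written data. The dictionaries below only imitate the shape implied by this page (an 'entities' mapping from an ID to an entity dict); all values are made up and no MedCAT call is involved.

```python
# Assumed shape of two cat.get_entities() results, one per text.
model_preds = [
    {"entities": {
        0: {"cui": "134", "start": 19, "end": 29, "acc": 0.98,
            "pretty_name": "Phone Number"},
    }},
    {"entities": {}},  # nothing found in the second text
]

# Drop the ID keys so each text maps to a plain list of entity dicts.
model_preds_by_text = [list(m["entities"].values()) for m in model_preds]

# match_rules already returns this list-of-lists shape (values made up).
rule_matches_per_text = [
    [],
    [{"cui": "134", "start": 19, "end": 33, "acc": 1.0,
      "pretty_name": "Phone Number", "source_value": "(123) 456-7890"}],
]

# merge_all_preds asserts that both lists have one entry per text.
same_length = len(model_preds_by_text) == len(rule_matches_per_text)
```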

merge_preds

merge_preds(model_preds: list[Entity], rule_matches: list[Entity], accept_preds: bool = True) -> list[Entity]

Merge rule-based matches and DeID model predictions for a single text.

Parameters:

  • model_preds

    (list[Entity]) –

    Predictions from cat.get_entities().

  • rule_matches

    (list[Entity]) –

    Predictions from the output of running match_rules on the same text.

  • accept_preds

    (bool, default: True ) –

    If True, model predictions (model_preds) take precedence over rule matches where spans overlap; if False, rule matches take precedence. Defaults to True.

Examples:

    >>> # predictions from cat.get_entities() for one text
    >>> model_preds = [
    ...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
    ...      'pretty_name': 'Phone Number'},
    ...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
    ...      'pretty_name': 'Phone Number'}
    ... ]
    >>> # predictions from match_rules for the same text
    >>> rule_matches = [
    ...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
    ...      'pretty_name': 'Phone Number'},
    ...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
    ...      'pretty_name': 'Phone Number'}
    ... ]
    >>> merged_preds = merge_preds(model_preds, rule_matches)

Returns:

  • list[Entity] –

    The merged predictions, sorted by start offset.

Source code in medcat-v2/medcat/components/ner/trf/deid.py
def merge_preds(model_preds: list[Entity],
                rule_matches: list[Entity],
                accept_preds: bool = True) -> list[Entity]:
    """Merge predictions from rule based and deID model predictions.
    Args:
        model_preds (list[Entity]): predictions from `cat.get_entities()`
        rule_matches (list[Entity]): predictions from output of running
            `match_rules` on the same text
        accept_preds (bool): uses the predicted label from the model,
            model_preds, over the rule matches if they overlap.
            Defaults to using model preds over rules.
    Examples:
        >>> # predictions from `cat.get_entities()` for one text
        >>> model_preds = [
        ...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
        ...      'pretty_name': 'Phone Number'},
        ...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
        ...      'pretty_name': 'Phone Number'}
        ... ]
        >>> # predictions from `match_rules` for the same text
        >>> rule_matches = [
        ...     {'cui': '134', 'start': 10, 'end': 20, 'acc': 1.0,
        ...      'pretty_name': 'Phone Number'},
        ...     {'cui': '134', 'start': 25, 'end': 35, 'acc': 1.0,
        ...      'pretty_name': 'Phone Number'}
        ... ]
        >>> merged_preds = merge_preds(model_preds, rule_matches)
    Returns:
        list[Entity]: List of predictions from `merge_preds`
    """
    if accept_preds:
        labels1 = model_preds
        labels2 = rule_matches
    else:
        labels1 = rule_matches
        labels2 = model_preds

    # Keep only the lower-priority spans that overlap no higher-priority span
    labels2 = [span2 for span2 in labels2
               if not any(not (span2['end'] <= span1['start'] or
                               span1['end'] <= span2['start'])
                          for span1 in labels1)]
    # merge preds and sort on start
    merged_preds = labels1 + labels2
    merged_preds.sort(key=lambda x: x['start'])
    return merged_preds
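
The overlap test in the list comprehension treats spans as half-open intervals: two spans overlap unless one ends at or before the other's start. The sketch below restates that rule as a standalone function and applies the same priority logic to made-up spans; spans_overlap and merge_by_priority are illustrative names, not MedCAT functions.

```python
def spans_overlap(a: dict, b: dict) -> bool:
    """True if the half-open spans [start, end) of a and b intersect."""
    return not (a["end"] <= b["start"] or b["end"] <= a["start"])


def merge_by_priority(preferred: list[dict], other: list[dict]) -> list[dict]:
    """Keep all preferred spans plus the other spans that overlap none."""
    kept = [o for o in other
            if not any(spans_overlap(o, p) for p in preferred)]
    return sorted(preferred + kept, key=lambda s: s["start"])


model = [{"cui": "134", "start": 10, "end": 20}]
rules = [
    {"cui": "999", "start": 15, "end": 25},  # overlaps the model span
    {"cui": "134", "start": 20, "end": 30},  # touches it, no overlap
]

model_first = merge_by_priority(model, rules)  # like accept_preds=True
rules_first = merge_by_priority(rules, model)  # like accept_preds=False
```

As in the function above, spans within the preferred list are never checked against each other, so the two rule spans both survive in rules_first even though they overlap.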