medcat.components.ner.vocab_based_annotator

I would just ignore this whole class. It is just a lot of rules that work nicely for the CDB; once the software is trained, the main thing is the context vectors.

Functions:

  • annotate_name
  • maybe_annotate_name

Attributes:

  • logger

logger module-attribute

logger = getLogger(__name__)

annotate_name

annotate_name(tokenizer: BaseTokenizer, name: str, tkns: list[MutableToken], doc: MutableDocument, cdb: CDB, cur_id: int | None, label: str)
Source code in medcat-v2/medcat/components/ner/vocab_based_annotator.py
def annotate_name(tokenizer: BaseTokenizer, name: str,
                  tkns: list[MutableToken],
                  doc: MutableDocument, cdb: CDB,
                  cur_id: int | None,
                  label: str):
    entity: MutableEntity = tokenizer.create_entity(
        doc, tkns[0].base.index, tkns[-1].base.index + 1, label=label)
    # Only set this property when using a vocab-based approach
    # and where this name matches a name in the CDB.
    # Standard named entity recognition models will not set this.
    entity.detected_name = name
    entity.link_candidates = list(cdb.name2info[name]['per_cui_status'])

    if cur_id is None:
        logger.warning(
            "`medcat.components.ner.vocab_based_annotator.annotate_name` "
            "was called with no `cur_id`. This behaviour is not fully "
            "supported anymore.")
        start_index = entity.base.start_char_index
        span_len = len(name)
        cur_id = start_index * _START_INDEX_MULT + span_len
        # NOTE: These will be unique if the maximum length of each
        #       entity does not exceed _START_INDEX_MULT (1000)
        logger.warning(
            "Using the text start index %d (multiplied by %d) and adding "
            "the span length %d to get the id of %d", start_index,
            _START_INDEX_MULT, span_len, cur_id)
        logger.warning(
            "Setting MutableDocument.ner_ents during the method "
            "`medcat.components.ner.vocab_based_annotator.annotate_name` "
            "because the old API (without an ID) was used")
        doc.ner_ents.append(entity)  # TODO: remove this

    entity.id = cur_id
    entity.confidence = -1  # This does not calculate confidence

    # Not necessary, but why not
    logger.debug("NER detected an entity.\n\tDetected name: %s" +
                 "\n\tLink candidates: %s\n", entity.detected_name,
                 entity.link_candidates)
    return entity
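
The fallback ID used when `cur_id` is `None` can be illustrated in isolation. The following is a minimal sketch, not the library's code: `fallback_id` and `START_INDEX_MULT` are hypothetical stand-ins, and the value 1000 is taken from the comment in the listing above.

```python
# Hypothetical stand-in for the module constant `_START_INDEX_MULT`;
# the value 1000 comes from the comment in the source listing.
START_INDEX_MULT = 1000


def fallback_id(start_char_index: int, name: str) -> int:
    """Mirror the fallback arithmetic: start index * multiplier + name length."""
    return start_char_index * START_INDEX_MULT + len(name)


# Names starting at different character offsets get different IDs
# as long as each name is shorter than the multiplier.
id_a = fallback_id(12, "kidney failure")  # 12 * 1000 + 14
id_b = fallback_id(40, "fever")           # 40 * 1000 + 5

# Once a name is longer than the multiplier, two different spans can collide.
id_c = fallback_id(0, "x" * 1001)  # 0 * 1000 + 1001
id_d = fallback_id(1, "x")         # 1 * 1000 + 1

print(id_a, id_b, id_c == id_d)
```

The collision in the last two lines is why the source comment ties uniqueness to the entity length staying below the multiplier, and why passing an explicit `cur_id` is the supported path.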

maybe_annotate_name

maybe_annotate_name(tokenizer: BaseTokenizer, name: str, tkns: list[MutableToken], doc: MutableDocument, cdb: CDB, config: Config, cur_id: int | None = None, label: str = 'concept') -> Optional[MutableEntity]

Given a name, checks whether it should be annotated based on the config rules. If so, an entity is created for it and returned; when no cur_id is given, the entity is also appended to doc.ner_ents.

Parameters:

  • tokenizer

    (BaseTokenizer) –

    The tokenizer (probably spaCy).

  • name

    (str) –

    The name found in the text of the document.

  • tkns

    (list[MutableToken]) –

    Tokens that belong to this name in the spaCy document.

  • doc

    (MutableDocument) –

    spaCy document to be annotated with named entities.

  • cdb

    (CDB) –

    Concept database.

  • config

    (Config) –

    Global config for MedCAT.

  • cur_id

    (int | None, default: None) –

    The potential ID for the entity. Defaults to None.

  • label

    (str, default: 'concept') –

    Label for this name (usually 'concept' when using a vocab-based approach).

Returns:

  • Optional[MutableEntity] –

    The entity, if relevant.

Source code in medcat-v2/medcat/components/ner/vocab_based_annotator.py
def maybe_annotate_name(tokenizer: BaseTokenizer, name: str,
                        tkns: list[MutableToken],
                        doc: MutableDocument, cdb: CDB, config: Config,
                        cur_id: int | None = None,
                        label: str = 'concept'
                        ) -> Optional[MutableEntity]:
    """Given a name, checks whether it should be annotated based on config rules.
    If so, an entity is created for it and returned; when no `cur_id` is
    given, the entity is also appended to `doc.ner_ents`.

    Args:
        tokenizer (BaseTokenizer):
            The tokenizer (probably spaCy).
        name (str):
            The name found in the text of the document.
        tkns (list[MutableToken]):
            Tokens that belong to this name in the spaCy document.
        doc (MutableDocument):
            spaCy document to be annotated with named entities.
        cdb (CDB):
            Concept database.
        config (Config):
            Global config for MedCAT.
        cur_id (int | None):
            The potential ID for the entity. Defaults to None.
        label (str):
            Label for this name (usually `concept` when using
            a vocab-based approach).

    Returns:
        Optional[MutableEntity]: The entity, if relevant.
    """

    logger.debug("Maybe annotating name: %s", name)

    # Check uppercase to distinguish uppercase and lowercase
    # words that have a different meaning.
    if config.components.ner.check_upper_case_names:
        # Check whether name is completely uppercase in CDB.
        is_upper = (cdb.name2info[name]['is_upper']
                    if name in cdb.name2info else False)
        if is_upper:
            # Check whether tokens are also in uppercase. If tokens
            # are not in uppercase, there is a mismatch.
            if not all([x.base.is_upper for x in tkns]):
                return None

    if len(name) >= config.components.ner.min_name_len:
        # Check the upper case limit, last part checks if it is
        # one token and uppercase
        if (len(name) >= config.components.ner.upper_case_limit_len or
                (len(tkns) == 1 and tkns[0].base.is_upper)):
            # Everything is fine, mark name
            return annotate_name(
                tokenizer, name, tkns, doc, cdb, cur_id, label)

    return None
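
The checks above are easier to follow on plain data. This is a simplified sketch that mirrors the control flow with hypothetical stand-ins: `NerRules`, `should_annotate`, tokens reduced to plain strings, `str.isupper` in place of the token's `is_upper` flag, and the CDB's `is_upper` entry passed in as a boolean. It does not use MedCAT's classes, and the threshold values are chosen for illustration only; they are not claimed to be the library defaults.

```python
from dataclasses import dataclass


@dataclass
class NerRules:
    """Hypothetical stand-in for the fields read from config.components.ner."""
    check_upper_case_names: bool
    min_name_len: int
    upper_case_limit_len: int


def should_annotate(name: str, tokens: list[str],
                    name_is_upper_in_cdb: bool, rules: NerRules) -> bool:
    """Return True where the listing above would go on to call annotate_name."""
    # Rule 1: a name stored as uppercase in the CDB must be uppercase in the text.
    if rules.check_upper_case_names and name_is_upper_in_cdb:
        if not all(tok.isupper() for tok in tokens):
            return False
    # Rule 2: the name must reach the minimum length.
    if len(name) < rules.min_name_len:
        return False
    # Rule 3: short names are kept only if they are a single uppercase token.
    return (len(name) >= rules.upper_case_limit_len
            or (len(tokens) == 1 and tokens[0].isupper()))


# Illustrative thresholds only.
rules = NerRules(check_upper_case_names=True,
                 min_name_len=3, upper_case_limit_len=4)

results = {
    "long name": should_annotate("fever", ["fever"], False, rules),
    "short, uppercase token": should_annotate("hiv", ["HIV"], False, rules),
    "short, lowercase token": should_annotate("hiv", ["hiv"], False, rules),
    "case mismatch with CDB": should_annotate("aids", ["aids"], True, rules),
    "below minimum length": should_annotate("mi", ["MI"], False, rules),
}

for case, accepted in results.items():
    print(f"{case}: {accepted}")
```

With these illustrative settings, only the first two cases are accepted: the long name passes on length alone, and the short name passes because it appears as a single uppercase token.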