medcat.components.ner.trf.tokenizer
Classes:

- TransformersTokenizer

Attributes:

- logger

TransformersTokenizer

TransformersTokenizer(hf_tokenizer: Optional[PreTrainedTokenizerBase] = None, max_len: int = 512, id2type: Optional[Dict] = None, cui2name: Optional[Dict] = None)

Parameters:

- hf_tokenizer – The huggingface tokenizer. Must be able to return token offsets.
- max_len – Max sequence length; longer inputs are split into multiple examples.
- id2type – Can be ignored in most cases. A map from token to 'start' or 'sub', i.e. whether the token is a subword or the start of a (full) word. For BERT, 'start' is everything that does not begin with ##.
- cui2name – Map from CUI to full name for labels.
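As a minimal sketch of what the `id2type` map described above might look like for a BERT-style WordPiece vocabulary: tokens beginning with `##` continue a word ('sub'), everything else starts one ('start'). The `build_id2type` helper and the toy vocab are illustrative assumptions, not part of the library; a real vocab would come from the huggingface tokenizer.

```python
# Sketch (assumption): deriving an id2type map from a BERT-style vocab,
# where WordPiece continuation tokens start with "##". The toy vocab is
# illustrative only.
def build_id2type(vocab: dict) -> dict:
    """Map each token ID to 'sub' if it continues a word, else 'start'."""
    return {
        token_id: "sub" if token.startswith("##") else "start"
        for token, token_id in vocab.items()
    }

toy_vocab = {"head": 0, "##ache": 1, "fever": 2, "[CLS]": 3}
id2type = build_id2type(toy_vocab)
# "##ache" continues "head", so it is a subword; the rest are word starts.
```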
Methods:

- calculate_label_map
- encode – Used with huggingface datasets map function to convert a medcat_ner dataset.
- ensure_tokenizer
- load
- save

Attributes:

- cui2name
- hf_tokenizer
- id2type
- label_map
- max_len
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py
Instance attributes:

- cui2name = cui2name
- hf_tokenizer = hf_tokenizer
- id2type = id2type
- label_map = {'O': 0, 'X': 1}
- max_len = max_len
calculate_label_map
calculate_label_map(dataset) -> None
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py
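The `label_map` attribute starts as `{'O': 0, 'X': 1}`, so a plausible reading of `calculate_label_map` is that it extends this base with one label per CUI found in the dataset. The sketch below is an assumption about that behaviour (the real method's dataset schema and iteration order may differ); `build_label_map` is a hypothetical helper, not the library API.

```python
# Sketch (assumption): extending the fixed {'O': 0, 'X': 1} base label map
# with one integer label per distinct CUI, in order of first appearance.
def build_label_map(cuis) -> dict:
    label_map = {"O": 0, "X": 1}
    for cui in cuis:
        if cui not in label_map:
            label_map[cui] = len(label_map)
    return label_map
```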
encode
encode(examples: Dict, ignore_subwords: bool = False) -> Dict
Used with huggingface datasets map function to convert medcat_ner dataset into the appropriate form for NER with BERT. It will split long text segments into max_len sequences (performs chunking).
Parameters:

- examples (Dict) – Stream of examples.
- ignore_subwords (bool, default: False) – If set to True, subwords of any token will get the special label X.

Returns:

- Dict – The same dict, modified.
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py
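The chunking and subword-labelling behaviour described above can be sketched on plain lists instead of a huggingface datasets batch. This is a simplified assumption about `encode`'s logic (the real method works on offset-mapped tokenizer output and whole batches); `chunk_and_label` is a hypothetical helper for illustration.

```python
# Sketch (assumption): split a token sequence longer than max_len into
# multiple examples; with ignore_subwords=True, tokens mapped to 'sub'
# in id2type get the special label 'X'.
def chunk_and_label(token_ids, labels, id2type, max_len, ignore_subwords=False):
    if ignore_subwords:
        labels = ["X" if id2type[t] == "sub" else lab
                  for t, lab in zip(token_ids, labels)]
    return [
        (token_ids[i:i + max_len], labels[i:i + max_len])
        for i in range(0, len(token_ids), max_len)
    ]
```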
ensure_tokenizer
ensure_tokenizer() -> PreTrainedTokenizerBase
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py
load
classmethod
load(path: str) -> TransformersTokenizer
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py
save
save(path: str) -> None
Source code in medcat-v2/medcat/components/ner/trf/tokenizer.py