medcat.tokenizing.tokenizers
Classes:
-
BaseTokenizer–The base tokenizer protocol.
-
SaveableTokenizer–
Functions:
-
create_tokenizer–Create the tokenizer given the init arguments.
-
get_tokenizer_creator–Get the creator method for the tokenizer.
-
list_available_tokenizers–Get the available tokenizers.
-
register_tokenizer–Register a new tokenizer.
Attributes:
TOKENIZER_PREFIX
module-attribute
TOKENIZER_PREFIX = 'tokenizer_internals_'
BaseTokenizer
Bases: Protocol
The base tokenizer protocol.
Methods:
-
create_entity–Create an entity from a document.
-
create_new_tokenizer–
-
entity_from_tokens–Get an entity from the list of tokens.
-
get_doc_class–Get the document implementation class used by the tokenizer.
-
get_entity_class–Get the entity implementation class used by the tokenizer.
create_entity
create_entity(doc: MutableDocument, token_start_index: int, token_end_index: int, label: str) -> MutableEntity
Create an entity from a document.
Parameters:
-
doc (MutableDocument) –The document to use.
-
token_start_index (int) –The token start index.
-
token_end_index (int) –The token end index.
-
label (str) –The label.
Returns:
-
MutableEntity–The resulting entity.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
create_new_tokenizer
classmethod
create_new_tokenizer(config: Config) -> Self
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
entity_from_tokens
entity_from_tokens(tokens: list[MutableToken]) -> MutableEntity
Get an entity from the list of tokens.
Parameters:
-
tokens (list[MutableToken]) –List of tokens.
Returns:
-
MutableEntity–The resulting entity.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
get_doc_class
get_doc_class() -> Type[MutableDocument]
Get the document implementation class used by the tokenizer.
This can be used (e.g.) to register addon paths.
Returns:
-
Type[MutableDocument]–The document class.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
get_entity_class
get_entity_class() -> Type[MutableEntity]
Get the entity implementation class used by the tokenizer.
Returns:
-
Type[MutableEntity]–The entity class.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
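Because BaseTokenizer is a Protocol, a class can satisfy it structurally without inheriting from it. The following is a minimal sketch of that idea, not MedCAT's implementation: `Token`, `Entity`, `Document`, `TokenizerLike`, `WhitespaceTokenizer` and `make_doc` are hypothetical stand-ins, and the sketch assumes `token_end_index` is exclusive (the reference above does not say which convention applies).

```python
from dataclasses import dataclass, field
from typing import Protocol, Type, runtime_checkable


@dataclass
class Token:  # hypothetical stand-in for MutableToken
    text: str
    index: int


@dataclass
class Entity:  # hypothetical stand-in for MutableEntity
    tokens: list[Token]
    label: str = ""

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)


@dataclass
class Document:  # hypothetical stand-in for MutableDocument
    tokens: list[Token] = field(default_factory=list)


@runtime_checkable
class TokenizerLike(Protocol):
    """Reduced, hypothetical mirror of the documented method set."""

    def create_entity(self, doc: Document, token_start_index: int,
                      token_end_index: int, label: str) -> Entity: ...

    def entity_from_tokens(self, tokens: list[Token]) -> Entity: ...

    def get_doc_class(self) -> Type[Document]: ...

    def get_entity_class(self) -> Type[Entity]: ...


class WhitespaceTokenizer:
    """Toy tokenizer; note that it does not inherit from the protocol."""

    def create_entity(self, doc, token_start_index, token_end_index, label):
        # Assumption: the end index is exclusive, as in a Python slice.
        return Entity(doc.tokens[token_start_index:token_end_index], label)

    def entity_from_tokens(self, tokens):
        return Entity(list(tokens))

    def get_doc_class(self):
        return Document

    def get_entity_class(self):
        return Entity


def make_doc(text: str) -> Document:
    # Hypothetical helper: split on whitespace and number the tokens.
    return Document([Token(t, i) for i, t in enumerate(text.split())])


tokenizer = WhitespaceTokenizer()
doc = make_doc("patient has chronic kidney disease")
entity = tokenizer.create_entity(doc, 2, 5, "DISEASE")
same_span = tokenizer.entity_from_tokens(doc.tokens[2:5])
```

The `isinstance(tokenizer, TokenizerLike)` check only verifies that the methods exist; signatures are checked by a static type checker, not at runtime.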
SaveableTokenizer
Bases: Protocol
Methods:
-
load_internals_from–Attempt to load internals from a folder path.
-
save_internals_to–Save tokenizer internals to specified folder.
load_internals_from
load_internals_from(folder_path: str) -> bool
Attempt to load internals from a folder path.
If the specified folder exists, internals will be loaded. If the folder doesn't exist, nothing will be loaded.
The given folder's basename should start with TOKENIZER_PREFIX.
Parameters:
-
folder_path (str) –The path to the folder to load internals from.
Returns:
-
bool–Whether the loading was successful.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
save_internals_to
save_internals_to(folder_path: str) -> str
Save tokenizer internals to specified folder.
The returned folder's basename should start with TOKENIZER_PREFIX.
Parameters:
-
folder_path (str) –The folder to use for the internals.
Returns:
-
str–The subfolder the internals were saved to.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
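The save/load contract described above can be sketched as follows. This is an illustration under assumptions, not the library's code: `VocabTokenizer`, its `name` attribute and the `vocab.json` file are hypothetical, and `TOKENIZER_PREFIX` is redefined locally with the documented value so the block runs on its own.

```python
import json
import os
import tempfile

# Mirrors the documented module attribute.
TOKENIZER_PREFIX = 'tokenizer_internals_'


class VocabTokenizer:
    """Hypothetical tokenizer whose only internal state is a vocabulary."""

    name = "vocab"  # hypothetical; used to build the subfolder name

    def __init__(self, vocab=None):
        self.vocab = list(vocab or [])

    def save_internals_to(self, folder_path: str) -> str:
        # Write into a subfolder whose basename starts with TOKENIZER_PREFIX.
        subfolder = os.path.join(folder_path, TOKENIZER_PREFIX + self.name)
        os.makedirs(subfolder, exist_ok=True)
        with open(os.path.join(subfolder, "vocab.json"), "w") as f:
            json.dump(self.vocab, f)
        return subfolder

    def load_internals_from(self, folder_path: str) -> bool:
        # A missing folder is not an error: nothing is loaded.
        if not os.path.isdir(folder_path):
            return False
        with open(os.path.join(folder_path, "vocab.json")) as f:
            self.vocab = json.load(f)
        return True


with tempfile.TemporaryDirectory() as model_dir:
    saved_to = VocabTokenizer(["kidney", "disease"]).save_internals_to(model_dir)
    saved_basename = os.path.basename(saved_to)

    restored = VocabTokenizer()
    loaded = restored.load_internals_from(saved_to)

    missing_path = os.path.join(model_dir, TOKENIZER_PREFIX + "absent")
    missing = VocabTokenizer().load_internals_from(missing_path)
```

Returning the subfolder from `save_internals_to` lets the caller pass that exact path back to `load_internals_from` later.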
create_tokenizer
create_tokenizer(tokenizer_name: str, config: Config) -> BaseTokenizer
Create the tokenizer given the init arguments.
Parameters:
-
tokenizer_name (str) –The tokenizer name.
-
config (Config) –The config to be passed to the constructor.
Returns:
-
BaseTokenizer–The created tokenizer.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
get_tokenizer_creator
get_tokenizer_creator(tokenizer_name: str) -> Callable[[Config], BaseTokenizer]
Get the creator method for the tokenizer.
While this is generally just the class itself (i.e. it refers
to __init__), another callable can be used internally.
Parameters:
-
tokenizer_name (str) –The name of the tokenizer.
Returns:
-
Callable[[Config], BaseTokenizer]–The creator for the tokenizer.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
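To illustrate why the return type is a Callable rather than a class, the sketch below shows that a class and a plain factory function both fit the same `Callable[[Config], ...]` shape. The `Config`, `SimpleTokenizer` and `tokenizer_factory` names here are hypothetical stand-ins, not MedCAT types.

```python
from typing import Callable


class Config:  # hypothetical stand-in for the MedCAT Config
    def __init__(self, lowercase: bool = False):
        self.lowercase = lowercase


class SimpleTokenizer:  # hypothetical tokenizer
    def __init__(self, config: Config, source: str = "class"):
        self.config = config
        self.source = source


def tokenizer_factory(config: Config) -> SimpleTokenizer:
    # A plain function can do extra setup before constructing the object.
    return SimpleTokenizer(config, source="factory")


# Calling the class runs __init__; calling the function runs its body.
# Both accept a single Config and return a tokenizer.
creators: list[Callable[[Config], SimpleTokenizer]] = [
    SimpleTokenizer,
    tokenizer_factory,
]
sources = [creator(Config()).source for creator in creators]
```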
list_available_tokenizers
list_available_tokenizers() -> list[tuple[str, str]]
Get the available tokenizers.
Returns:
-
list[tuple[str, str]]–A list of (name, class name) pairs, one per available tokenizer.
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
register_tokenizer
register_tokenizer(name: str, clazz: Type[BaseTokenizer]) -> None
Register a new tokenizer.
Parameters:
-
name (str) –The name of the tokenizer.
-
clazz (Type[BaseTokenizer]) –The class of the tokenizer (i.e. the creator).
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
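The four module-level functions together describe a name-based registry. The sketch below mirrors their documented signatures with a plain dictionary to show how they relate. The internals are assumed: the `_registry` dict, the overwrite-on-duplicate behaviour, the `KeyError` on unknown names, and `ToyTokenizer` are all hypothetical, and a plain dict stands in for Config.

```python
from typing import Callable

# Hypothetical storage; the real module may keep its registry differently.
_registry: dict[str, Callable[..., object]] = {}


def register_tokenizer(name: str, clazz: type) -> None:
    # Assumption: a repeated name simply overwrites the earlier entry.
    _registry[name] = clazz


def get_tokenizer_creator(tokenizer_name: str) -> Callable[..., object]:
    # Assumption: an unknown name raises KeyError.
    return _registry[tokenizer_name]


def create_tokenizer(tokenizer_name: str, config: dict) -> object:
    # Look up the creator, then call it with the config.
    return get_tokenizer_creator(tokenizer_name)(config)


def list_available_tokenizers() -> list[tuple[str, str]]:
    # Pair each registered name with the name of its creator.
    return [(name, creator.__name__) for name, creator in _registry.items()]


class ToyTokenizer:  # hypothetical tokenizer
    def __init__(self, config: dict):
        self.config = config


register_tokenizer("toy", ToyTokenizer)
tokenizer = create_tokenizer("toy", {"lowercase": True})
available = list_available_tokenizers()
```

Keeping creation behind a name lookup means callers only need the tokenizer name and a config, not an import of the concrete class.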