medcat.tokenizing.tokenizers

Classes:

  • BaseTokenizer – The base tokenizer protocol.

  • SaveableTokenizer

Functions:

  • create_tokenizer – Create the tokenizer given the init arguments.

  • get_tokenizer_creator – Get the creator method for the tokenizer.

  • list_available_tokenizers – Get the available tokenizers.

  • register_tokenizer – Register a new tokenizer.

Attributes:

  • TOKENIZER_PREFIX

  • logger

TOKENIZER_PREFIX module-attribute

TOKENIZER_PREFIX = 'tokenizer_internals_'

logger module-attribute

logger = getLogger(__name__)

BaseTokenizer

Bases: Protocol

The base tokenizer protocol.

Methods:

create_entity

Create an entity from a document.

Parameters:

  • doc

    (MutableDocument) –

    The document to use.

  • token_start_index

    (int) –

    The token start index.

  • token_end_index

    (int) –

    The token end index.

  • label

    (str) –

    The label.

Returns:

  • MutableEntity –

    The resulting entity.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def create_entity(self, doc: MutableDocument,
                  token_start_index: int, token_end_index: int,
                  label: str) -> MutableEntity:
    """Create an entity from a document.

    Args:
        doc (MutableDocument): The document to use.
        token_start_index (int): The token start index.
        token_end_index (int): The token end index.
        label (str): The label.

    Returns:
        MutableEntity: The resulting entity.
    """
    pass
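
To illustrate how a concrete class might satisfy this protocol method, here is a minimal, self-contained sketch. The `Tok`, `Ent`, and `Doc` classes below are hypothetical simplifications standing in for medcat's `MutableToken`, `MutableEntity`, and `MutableDocument`, and the exclusive-end slicing convention is an assumption of this sketch, not a guarantee of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Tok:  # hypothetical stand-in for MutableToken
    text: str


@dataclass
class Ent:  # hypothetical stand-in for MutableEntity
    tokens: list
    label: str


@dataclass
class Doc:  # hypothetical stand-in for MutableDocument
    tokens: list = field(default_factory=list)


class WhitespaceTokenizer:
    """Toy tokenizer that structurally matches the create_entity contract."""

    def __call__(self, text: str) -> Doc:
        return Doc(tokens=[Tok(t) for t in text.split()])

    def create_entity(self, doc: Doc, token_start_index: int,
                      token_end_index: int, label: str) -> Ent:
        # This sketch treats the end index as exclusive (Python slicing);
        # the real convention is defined by the implementing tokenizer.
        return Ent(tokens=doc.tokens[token_start_index:token_end_index],
                   label=label)


tokenizer = WhitespaceTokenizer()
doc = tokenizer("chronic kidney disease stage three")
ent = tokenizer.create_entity(doc, 0, 3, label="disorder")
```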

create_new_tokenizer classmethod

create_new_tokenizer(config: Config) -> Self
Source code in medcat-v2/medcat/tokenizing/tokenizers.py
@classmethod
def create_new_tokenizer(cls, config: Config) -> Self:
    pass

entity_from_tokens

entity_from_tokens(tokens: list[MutableToken]) -> MutableEntity

Get an entity from the list of tokens.

Parameters:

  • tokens

    (list[MutableToken]) –

    List of tokens.

Returns:

  • MutableEntity –

    The resulting entity.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def entity_from_tokens(self, tokens: list[MutableToken]) -> MutableEntity:
    """Get an entity from the list of tokens.

    Args:
        tokens (list[MutableToken]): List of tokens.

    Returns:
        MutableEntity: The resulting entity.
    """
    pass

get_doc_class

get_doc_class() -> Type[MutableDocument]

Get the document implementation class used by the tokenizer.

This can be used (e.g.) to register addon paths.

Returns:

  • Type[MutableDocument] –

    The document class.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def get_doc_class(self) -> Type[MutableDocument]:
    """Get the document implementation class used by the tokenizer.

    This can be used (e.g.) to register addon paths.

    Returns:
        Type[MutableDocument]: The document class.
    """
    pass

get_entity_class

get_entity_class() -> Type[MutableEntity]

Get the entity implementation class used by the tokenizer.

Returns:

  • Type[MutableEntity] –

    The entity class.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def get_entity_class(self) -> Type[MutableEntity]:
    """Get the entity implementation class used by the tokenizer.

    Returns:
        Type[MutableEntity]: The entity class.
    """
    pass

SaveableTokenizer

Bases: Protocol

Methods:

load_internals_from

load_internals_from(folder_path: str) -> bool

Attempt to load internals from a folder path.

If the specified folder exists, internals will be loaded. If the folder doesn't exist, nothing will be loaded.

The given folder's basename should start with TOKENIZER_PREFIX.

Parameters:

  • folder_path

    (str) –

    The path to the folder to load internals from.

Returns:

  • bool ( bool ) –

    Whether the loading was successful.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def load_internals_from(self, folder_path: str) -> bool:
    """Attempt to load internals from a folder path.

    If the specified folder exists, internals will be loaded.
    If the folder doesn't exist, nothing will be loaded.

    The given folder's basename should start with `TOKENIZER_PREFIX`.

    Args:
        folder_path (str): The path to the folder to load internals from.

    Returns:
        bool: Whether the loading was successful.
    """
    pass

save_internals_to

save_internals_to(folder_path: str) -> str

Save tokenizer internals to the specified folder.

The returned folder's basename should start with TOKENIZER_PREFIX.

Parameters:

  • folder_path

    (str) –

    The folder to use for the internals.

Returns:

  • str ( str ) –

    The subfolder the internals were saved to.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def save_internals_to(self, folder_path: str) -> str:
    """Save tokenizer internals to the specified folder.

    The returned folder's basename should start with `TOKENIZER_PREFIX`.

    Args:
        folder_path (str): The folder to use for the internals.

    Returns:
        str: The subfolder the internals were saved to.
    """
    pass

create_tokenizer

create_tokenizer(tokenizer_name: str, config: Config) -> BaseTokenizer

Create the tokenizer given the init arguments.

Parameters:

  • tokenizer_name

    (str) –

    The tokenizer name.

  • config

    (Config) –

    The config to be passed to the constructor.

Returns:

  • BaseTokenizer –

    The created tokenizer.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def create_tokenizer(tokenizer_name: str, config: Config) -> BaseTokenizer:
    """Create the tokenizer given the init arguments.

    Args:
        tokenizer_name (str): The tokenizer name.
        config (Config): The config to be passed to the constructor.

    Returns:
        BaseTokenizer: The created tokenizer.
    """
    return _TOKENIZERS_REGISTRY.get_component(tokenizer_name)(config)

get_tokenizer_creator

get_tokenizer_creator(tokenizer_name: str) -> Callable[[Config], BaseTokenizer]

Get the creator method for the tokenizer.

While this is generally just the class itself (i.e. a callable that invokes __init__), another callable can be used internally.

Parameters:

  • tokenizer_name

    (str) –

    The name of the tokenizer.

Returns:

  • Callable[[Config], BaseTokenizer] –

    The creator for the tokenizer.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def get_tokenizer_creator(tokenizer_name: str
                          ) -> Callable[[Config], BaseTokenizer]:
    """Get the creator method for the tokenizer.

    While this is generally just the class itself (i.e. a callable
    that invokes `__init__`), another callable can be used internally.

    Args:
        tokenizer_name (str): The name of the tokenizer.

    Returns:
        Callable[[Config], BaseTokenizer]: The creator for the tokenizer.
    """
    return _TOKENIZERS_REGISTRY.get_component(tokenizer_name)

list_available_tokenizers

list_available_tokenizers() -> list[tuple[str, str]]

Get the available tokenizers.

Returns:

  • list[tuple[str, str]] –

    The (name, class name) pairs of the available tokenizers.

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def list_available_tokenizers() -> list[tuple[str, str]]:
    """Get the available tokenizers.

    Returns:
        list[tuple[str, str]]: The (name, class name) pairs of the
            available tokenizers.
    """
    return _TOKENIZERS_REGISTRY.list_components()

register_tokenizer

register_tokenizer(name: str, clazz: Type[BaseTokenizer]) -> None

Register a new tokenizer.

Parameters:

  • name

    (str) –

    The name of the tokenizer.

  • clazz

    (Type[BaseTokenizer]) –

The class of the tokenizer (i.e. the creator).

Source code in medcat-v2/medcat/tokenizing/tokenizers.py
def register_tokenizer(name: str, clazz: Type[BaseTokenizer]) -> None:
    """Register a new tokenizer.

    Args:
        name (str): The name of the tokenizer.
        clazz (Type[BaseTokenizer]): The class of the tokenizer (i.e. the creator).
    """
    _TOKENIZERS_REGISTRY.register(name, clazz)
    logger.debug("Registered tokenizer '%s': '%s.%s'",
                 name, clazz.__module__, clazz.__name__)
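
Taken together, register_tokenizer, list_available_tokenizers, and create_tokenizer implement a simple name-to-class registry. The snippet below mimics that flow with a toy registry; `ToyConfig`, `ToyTokenizer`, and the dict-based `_REGISTRY` are illustrative stand-ins, not medcat internals (the real registry is the private `_TOKENIZERS_REGISTRY`).

```python
from typing import Type


class ToyConfig:  # hypothetical stand-in for medcat's Config
    lowercase = True


class ToyTokenizer:  # hypothetical stand-in for a BaseTokenizer implementation
    def __init__(self, config: ToyConfig):
        self.config = config


# A minimal registry mirroring the role of _TOKENIZERS_REGISTRY.
_REGISTRY: dict[str, Type] = {}


def register_tokenizer(name: str, clazz: Type) -> None:
    _REGISTRY[name] = clazz


def list_available_tokenizers() -> list[tuple[str, str]]:
    # (name, class name) pairs, as in the real API.
    return [(name, clazz.__name__) for name, clazz in _REGISTRY.items()]


def create_tokenizer(name: str, config: ToyConfig):
    # Look up the creator (generally the class itself) and call it,
    # which invokes __init__ with the config.
    return _REGISTRY[name](config)


register_tokenizer('toy', ToyTokenizer)
tok = create_tokenizer('toy', ToyConfig())
```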