medcat.tokenizing.tokens

Classes:

BaseDocument

Bases: Protocol

The base document protocol.

Represents the unchangeable parts of the whole document.

Methods:

  • isupper

    Whether the entire document is upper case.

Attributes:

  • text (str) –

    The document raw text.

text property

text: str

The document raw text.

isupper

isupper() -> bool

Whether the entire document is upper case.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def isupper(self) -> bool:
    """Whether the entire document is upper case."""
    pass
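Because BaseDocument is a Protocol, any class that exposes a `text` property and an `isupper()` method satisfies it structurally, with no inheritance needed. The following is a minimal sketch of that idea. `DocumentLike` is a local stand-in for the protocol and `SimpleDocument` is a hypothetical implementation, both introduced here so the example runs without MedCAT installed; neither name is part of the library.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DocumentLike(Protocol):
    """Local stand-in mirroring the documented BaseDocument members."""

    @property
    def text(self) -> str: ...

    def isupper(self) -> bool: ...


class SimpleDocument:
    """Hypothetical implementation; not a MedCAT class."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        return self._text

    def isupper(self) -> bool:
        # str.isupper() is True only when every cased character is upper
        # case and at least one cased character exists.
        return self._text.isupper()


doc = SimpleDocument("CHEST PAIN ON EXERTION")
lower_doc = SimpleDocument("Chest pain")

# The class never subclasses DocumentLike, yet it passes the structural check.
print(isinstance(doc, DocumentLike), doc.isupper(), lower_doc.isupper())
```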

BaseEntity

Bases: Protocol

Base entity protocol.

This describes the static (unchangeable) parts of an entity or sequence of tokens.

Attributes:

end_char_index property

end_char_index: int

The character index of the last token.

end_index property

end_index: int

The index of the last token in the entity.

label property

label: int

The label of the entity (NOTE: seems unused).

start_char_index property

start_char_index: int

The character index of the first token.

start_index property

start_index: int

The index of the first token in the entity.

text property

text: str

The text of the entire entity.

BaseToken

Bases: Protocol

Base token protocol.

This represents the static (unchangeable) parts of a token.

Attributes:

  • char_index (int) –

    The character index of the start of this token.

  • index (int) –

    The index (in terms of tokens) of this token in the document.

  • is_digit (bool) –

    Whether the token represents a digit.

  • is_stop (bool) –

    Whether the token represents a stop token.

  • is_upper (bool) –

    Whether the text is upper case.

  • lower (str) –

    The lower case text representation.

  • text (str) –

    The text represented by this token.

  • text_versions (list[str]) –

    The different versions of the text (e.g. normalised and lower case).

  • text_with_ws (str) –

    The text with trailing whitespace (where applicable).

char_index property

char_index: int

The character index of the start of this token.

index property

index: int

The index (in terms of tokens) of this token in the document.

is_digit property

is_digit: bool

Whether the token represents a digit.

is_stop property

is_stop: bool

Whether the token represents a stop token.

is_upper property

is_upper: bool

Whether the text is upper case.

lower property

lower: str

The lower case text representation.

text property

text: str

The text represented by this token.

text_versions property

text_versions: list[str]

The different versions of the text (e.g. normalised and lower case).

text_with_ws property

text_with_ws: str

The text with trailing whitespace (where applicable).
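To make the relationship between these attributes concrete, here is a minimal sketch of a whitespace tokenizer that produces immutable tokens. `SimpleToken` and `tokenize` are hypothetical names, not MedCAT classes, and splitting on whitespace is an assumption of the sketch. `is_stop` and `text_versions` are left out because they depend on a stop-word list and a normalisation step that this page does not describe.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical immutable token exposing some documented attributes."""

    text: str
    text_with_ws: str
    index: int        # position among the document's tokens
    char_index: int   # offset of the token's first character

    @property
    def lower(self) -> str:
        return self.text.lower()

    @property
    def is_upper(self) -> bool:
        return self.text.isupper()

    @property
    def is_digit(self) -> bool:
        return self.text.isdigit()


def tokenize(text: str) -> list[SimpleToken]:
    """Split on whitespace, keeping each token's trailing whitespace."""
    tokens = []
    for i, match in enumerate(re.finditer(r"(\S+)(\s*)", text)):
        tokens.append(SimpleToken(
            text=match.group(1),
            text_with_ws=match.group(0),
            index=i,
            char_index=match.start(1),
        ))
    return tokens


source = "BP 120 over 80"
tokens = tokenize(source)
for tok in tokens:
    print(tok.index, tok.char_index, repr(tok.text_with_ws))
```

Since the sample text has no leading whitespace, joining `text_with_ws` over all tokens reproduces it exactly, which is the usual reason for keeping that attribute alongside `text`.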

MutableDocument

Bases: Protocol

The mutable parts of the document.

Represents parts of the document that can / should be changed by the various components.

Methods:

Attributes:

base property

base: BaseDocument

The base document.

linked_ents property

linked_ents: list[MutableEntity]

The linked entities associated with the document.

This should be set by the linker.

ner_ents property

ner_ents: list[MutableEntity]

All entities recognised by NER.

This should be set by the NER component.

get_addon_data

get_addon_data(path: str) -> Any

Get data added to the document.

See set_addon_data for details.

Parameters:

  • path

    (str) –

    The data ID / path.

Returns:

  • Any ( Any ) –

    The stored value.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def get_addon_data(self, path: str) -> Any:
    """Get data added to the entity.

    See `add_data` for details.

    Args:
        path (str): The data ID / path.

    Returns:
        Any: The stored value.
    """
    pass

get_available_addon_paths

get_available_addon_paths() -> list[str]

Gets the available addon data paths for this document.

This will only include paths that have values set.

Returns:

  • list[str]

    List of available addon data paths.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def get_available_addon_paths(self) -> list[str]:
    """Gets the available addon data paths for this document.

    This will only include paths that have values set.

    Returns:
        list[str]: List of available addon data paths.
    """
    pass

get_tokens

get_tokens(start_index: int, end_index: int) -> list[MutableToken]

Get the tokens that span the specified character indices.

Parameters:

  • start_index

    (int) –

    The starting character index.

  • end_index

    (int) –

    The ending character index.

Returns:

  • list[MutableToken]

    The list of tokens.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def get_tokens(self, start_index: int, end_index: int
               ) -> list[MutableToken]:
    """Get the tokens that span the specified character indices.

    Args:
        start_index (int): The starting character index.
        end_index (int): The ending character index.

    Returns:
        list[MutableToken]:
            The list of tokens.
    """
    pass
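The protocol only declares get_tokens, so how an implementation selects tokens is not specified here. As a minimal sketch of one plausible approach, the function below returns every token that overlaps a character range. `Tok` and `tokens_in_span` are hypothetical names, and treating the end index as exclusive is an assumption of this sketch rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical token holding only what the lookup needs."""

    text: str
    char_index: int  # offset of the first character


def tokens_in_span(tokens: list[Tok], start_index: int,
                   end_index: int) -> list[Tok]:
    """Return the tokens overlapping [start_index, end_index).

    Treating the end index as exclusive is an assumption of this sketch.
    """
    selected = []
    for tok in tokens:
        tok_end = tok.char_index + len(tok.text)
        # Two ranges overlap when each starts before the other ends.
        if tok.char_index < end_index and tok_end > start_index:
            selected.append(tok)
    return selected


text = "acute kidney injury noted"
tokens = [Tok("acute", 0), Tok("kidney", 6), Tok("injury", 13),
          Tok("noted", 20)]
span = tokens_in_span(tokens, 6, 19)
print([tok.text for tok in span])
```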

has_addon_data

has_addon_data(path: str) -> bool

Checks whether the addon data for a specific path has been set.

Parameters:

  • path

    (str) –

    The path to check.

Returns:

  • bool ( bool ) –

    Whether the addon data has been set.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def has_addon_data(self, path: str) -> bool:
    """Checks whether the addon data for a specific path has been set.

    Args:
        path (str): The path to check.

    Returns:
        bool: Whether the addon data had been set.
    """
    pass

register_addon_path classmethod

register_addon_path(path: str, def_val: Any = None, force: bool = True) -> None

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the document for use in an addon (e.g. MetaCAT).

Note: when using this, namespace each path to the component that owns it in order to avoid conflicts.

Parameters:

  • path

    (str) –

    The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).

  • def_val

    (Any, default: None ) –

    Default value. Defaults to None.

  • force

    (bool, default: True ) –

    Whether to forcefully add the value. Defaults to True.

Source code in medcat-v2/medcat/tokenizing/tokens.py
@classmethod
def register_addon_path(cls, path: str, def_val: Any = None,
                        force: bool = True) -> None:
    """Register a custom/arbitrary data path.

    This can be used to store arbitrary data along with the entity for
    use in an addon (e.g MetaCAT).

    PS: If using this, it is important to use paths namespaced to the
    component you're using in order to avoid conflicts.

    Args:
        path (str): The path to be used. Should be prefixed by component
            name (e.g `meta_cat_id` for an ID tied to the `meta_cat` addon)
        def_val (Any): Default value. Defaults to `None`.
        force (bool): Whether to forcefully add the value.
            Defaults to True.
    """
    pass

set_addon_data

set_addon_data(path: str, val: Any) -> None

Used to add arbitrary data to the document.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:

  • path

    (str) –

    The data ID / path.

  • val

    (Any) –

    The value to be added.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def set_addon_data(self, path: str, val: Any) -> None:
    """Used to add arbitrary data to the entity.

    This is generally used by addons to keep track of their data.

    NB! The path used needs to be registered using the
    `register_addon_path` class method.

    Args:
        path (str): The data ID / path.
        val (Any): The value to be added.
    """
    pass
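The addon-data methods fit together as a register-then-use workflow: a path is registered once on the class, after which individual instances can set and read values under it. The sketch below illustrates that workflow with a hypothetical `AddonDataHolder` class and a stand-in `UnregisteredPathError`; neither is MedCAT code. Three behaviours are assumptions made for the sketch: that reading an unset path falls back to the registered default, that using an unregistered path raises, and that `force=False` rejects re-registration.

```python
from typing import Any


class UnregisteredPathError(ValueError):
    """Stand-in for UnregisteredDataPathException."""


class AddonDataHolder:
    """Hypothetical holder mimicking the documented addon-data methods."""

    # Registered paths and their defaults are shared by all instances,
    # which is why registration is a class method.
    _registered: dict[str, Any] = {}

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        # Assumption: without force, re-registering a path is an error.
        if path in cls._registered and not force:
            raise ValueError(f"Path '{path}' is already registered")
        cls._registered[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        if path not in self._registered:
            raise UnregisteredPathError(f"Unregistered path '{path}'")
        self._values[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._values

    def get_addon_data(self, path: str) -> Any:
        if path not in self._registered:
            raise UnregisteredPathError(f"Unregistered path '{path}'")
        # Assumption: fall back to the registered default when unset.
        return self._values.get(path, self._registered[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that have a value set on this instance.
        return list(self._values)


# The path is prefixed with the owning component's name, as recommended.
AddonDataHolder.register_addon_path("meta_cat_id", def_val=-1)
doc = AddonDataHolder()
before = doc.has_addon_data("meta_cat_id")
doc.set_addon_data("meta_cat_id", 7)
print(before, doc.get_addon_data("meta_cat_id"),
      doc.get_available_addon_paths())
```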

MutableEntity

Bases: Protocol

The mutable part of an entity.

This represents the changeable part of an entity, that is, the parts that should be changed by the various components.

Methods:

Attributes:

base property

base: BaseEntity

The base / static entity part.

confidence property writable

confidence: float

The confidence for the linked entity.

NOTE: This seems to be unused!

context_similarity property writable

context_similarity: float

The context similarity of the linked entity.

This should be set by the linker component.

cui property writable

cui: str

The CUI of the linked entity.

This should be set by the linker component.

detected_name property writable

detected_name: str

The detected name (if any) for this entity.

This should be set by the NER component.

id property writable

id: int

The ID of the entity within the document.

This counts all the entities recognised, not just ones that were successfully linked.

This should be set by the NER.

link_candidates property

link_candidates: list[str]

The candidates for the detected name (if any) for this entity.

This should be set by the NER component.
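The attribute descriptions above split responsibility between two components: NER sets `id`, `detected_name` and `link_candidates`, while the linker sets `cui` and `context_similarity`. The sketch below walks through that hand-off. `SimpleEntity`, `ner_step`, `link_step`, the lookup tables, the CUIs and the scores are all hypothetical and made up for illustration; a real pipeline computes them very differently.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleEntity:
    """Hypothetical entity with the writable fields documented above."""

    text: str
    id: int = -1
    detected_name: str = ""
    link_candidates: list[str] = field(default_factory=list)
    cui: str = ""
    context_similarity: float = 0.0


# Toy lookup tables; names, CUIs and scores are invented for this sketch.
NAME_TO_CUIS = {"kidney failure": ["C001", "C002"]}
CUI_SCORES = {"C001": 0.41, "C002": 0.87}


def ner_step(entities: list[SimpleEntity]) -> None:
    """Fill the NER-owned fields: id, detected_name, link_candidates."""
    for i, ent in enumerate(entities):
        ent.id = i  # every recognised entity gets an ID, linked or not
        name = ent.text.lower()
        if name in NAME_TO_CUIS:
            ent.detected_name = name
            ent.link_candidates = list(NAME_TO_CUIS[name])


def link_step(entities: list[SimpleEntity]) -> list[SimpleEntity]:
    """Fill the linker-owned fields; return only the linked entities."""
    linked = []
    for ent in entities:
        if not ent.link_candidates:
            continue
        best = max(ent.link_candidates, key=CUI_SCORES.__getitem__)
        ent.cui = best
        ent.context_similarity = CUI_SCORES[best]
        linked.append(ent)
    return linked


ner_ents = [SimpleEntity("Kidney failure"), SimpleEntity("yesterday")]
ner_step(ner_ents)
linked_ents = link_step(ner_ents)
print([(e.id, e.cui) for e in ner_ents], len(linked_ents))
```

Both entities receive an `id`, but only one ends up in `linked_ents`, which matches the note that IDs count all recognised entities and not just the linked ones.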

get_addon_data

get_addon_data(path: str) -> Any

Get data added to the entity.

See set_addon_data for details.

Parameters:

  • path

    (str) –

    The data ID / path.

Returns:

  • Any ( Any ) –

    The stored value.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def get_addon_data(self, path: str) -> Any:
    """Get data added to the entity.

    See `add_data` for details.

    Args:
        path (str): The data ID / path.

    Returns:
        Any: The stored value.
    """
    pass

get_available_addon_paths

get_available_addon_paths() -> list[str]

Gets the available addon data paths for this entity.

This will only include paths that have values set.

Returns:

  • list[str]

    List of available addon data paths.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def get_available_addon_paths(self) -> list[str]:
    """Gets the available addon data paths for this entity.

    This will only include paths that have values set.

    Returns:
        list[str]: List of available addon data paths.
    """
    pass

has_addon_data

has_addon_data(path: str) -> bool

Checks whether the addon data for a specific path has been set.

Parameters:

  • path

    (str) –

    The path to check.

Returns:

  • bool ( bool ) –

    Whether the addon data has been set.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def has_addon_data(self, path: str) -> bool:
    """Checks whether the addon data for a specific path has been set.

    Args:
        path (str): The path to check.

    Returns:
        bool: Whether the addon data had been set.
    """
    pass

register_addon_path classmethod

register_addon_path(path: str, def_val: Any = None, force: bool = True) -> None

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).

Note: when using this, namespace each path to the component that owns it in order to avoid conflicts.

Parameters:

  • path

    (str) –

    The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).

  • def_val

    (Any, default: None ) –

    Default value. Defaults to None.

  • force

    (bool, default: True ) –

    Whether to forcefully add the value. Defaults to True.

Source code in medcat-v2/medcat/tokenizing/tokens.py
@classmethod
def register_addon_path(cls, path: str, def_val: Any = None,
                        force: bool = True) -> None:
    """Register a custom/arbitrary data path.

    This can be used to store arbitrary data along with the entity for
    use in an addon (e.g MetaCAT).

    PS: If using this, it is important to use paths namespaced to the
    component you're using in order to avoid conflicts.

    Args:
        path (str): The path to be used. Should be prefixed by component
            name (e.g `meta_cat_id` for an ID tied to the `meta_cat` addon)
        def_val (Any): Default value. Defaults to `None`.
        force (bool): Whether to forcefully add the value.
            Defaults to True.
    """
    pass

set_addon_data

set_addon_data(path: str, val: Any) -> None

Used to add arbitrary data to the entity.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:

  • path

    (str) –

    The data ID / path.

  • val

    (Any) –

    The value to be added.

Source code in medcat-v2/medcat/tokenizing/tokens.py
def set_addon_data(self, path: str, val: Any) -> None:
    """Used to add arbitrary data to the entity.

    This is generally used by addons to keep track of their data.

    NB! The path used needs to be registered using the
    `register_addon_path` class method.

    Args:
        path (str): The data ID / path.
        val (Any): The value to be added.
    """
    pass

MutableToken

Bases: Protocol

The mutable part of a token.

This protocol describes all the parts of a token that could be expected to change.

Attributes:

base property

base: BaseToken

The base portion of the token.

is_punctuation property writable

is_punctuation: bool

Whether the token represents punctuation.

lemma property

lemma: str

The lemmatised version of the text.

norm property writable

norm: str

The normalised text.

tag property

tag: Optional[str]

Optional tag (e.g. for normalization).

to_skip property writable

to_skip: bool

Whether the token should be skipped.
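The writable attributes are meant to be filled in by pipeline components. As a minimal sketch of what such a component might do, the function below sets `norm`, `is_punctuation` and `to_skip` in place. `SimpleMutableToken`, `tag_tokens` and the tiny stop list are hypothetical; lower-casing as normalisation and skipping stop words are assumptions of the sketch, not documented MedCAT behaviour.

```python
import string
from dataclasses import dataclass


@dataclass
class SimpleMutableToken:
    """Hypothetical token with the writable fields documented above."""

    text: str
    norm: str = ""
    is_punctuation: bool = False
    to_skip: bool = False


# Tiny illustrative stop list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "of", "a"}


def tag_tokens(tokens: list[SimpleMutableToken]) -> None:
    """Set norm, is_punctuation and to_skip in place."""
    for tok in tokens:
        tok.norm = tok.text.lower()
        # A token counts as punctuation when it is non-empty and every
        # character is an ASCII punctuation mark.
        tok.is_punctuation = bool(tok.text) and all(
            ch in string.punctuation for ch in tok.text)
        tok.to_skip = tok.is_punctuation or tok.norm in STOP_WORDS


tokens = [SimpleMutableToken(t)
          for t in ["Fracture", "of", "the", "femur", "."]]
tag_tokens(tokens)
kept = [tok.norm for tok in tokens if not tok.to_skip]
print(kept)
```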

UnregisteredDataPathException

UnregisteredDataPathException(cls: Type, path: str)

Bases: ValueError

Attributes:

Source code in medcat-v2/medcat/tokenizing/tokens.py
def __init__(self, cls: Type, path: str):
    super().__init__(
        f"Unregistered path '{path}' for class: {cls}")
    self.cls = cls
    self.path = path

cls instance-attribute

cls = cls

path instance-attribute

path = path
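Since the exception subclasses ValueError and stores both `cls` and `path`, callers can catch it broadly and still report exactly which path was missing on which class. The sketch below copies the constructor shown above into a local class so it runs without MedCAT; `Entity` and `describe_failure` are hypothetical names used only for illustration.

```python
from typing import Type


class UnregisteredDataPathException(ValueError):
    """Local copy of the constructor shown above."""

    def __init__(self, cls: Type, path: str):
        super().__init__(
            f"Unregistered path '{path}' for class: {cls}")
        self.cls = cls
        self.path = path


class Entity:
    """Hypothetical class standing in for an entity implementation."""


def describe_failure(path: str) -> str:
    """Raise the exception and build a short report from its attributes."""
    try:
        raise UnregisteredDataPathException(Entity, path)
    except ValueError as err:  # works because the exception is a ValueError
        return f"{err.cls.__name__} has no registered path {err.path!r}"


message = describe_failure("meta_cat_id")
print(message)
```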