medcat.tokenizing.tokens

Classes:

BaseDocument –

The base document protocol.
BaseEntity –

Base entity protocol.
BaseToken –

Base token protocol.
MutableDocument –

The mutable parts of the document.
MutableEntity –

The mutable part of an entity.
MutableToken –

The mutable part of a token.
UnregisteredDataPathException –

BaseDocument

Bases: Protocol

The base document protocol.

Represents the unchangeable parts of the whole document.

Methods:

isupper –

Whether the entire document is upper case.

Attributes:

text (str) –

The document raw text.

text `property`

text: str

The document raw text.

isupper

isupper() -> bool

Whether the entire document is upper case.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def isupper(self) -> bool:
    """Whether the entire document is upper case."""
    pass
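
Because `BaseDocument` is a protocol, any class with a matching `text` property and `isupper` method satisfies it structurally. The following is a minimal sketch of that idea. `BaseDocumentLike` is a local stand-in that mirrors the members documented above, and `SimpleDocument` is a hypothetical implementation; neither is part of MedCAT.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class BaseDocumentLike(Protocol):
    """Local stand-in mirroring the documented BaseDocument members."""

    @property
    def text(self) -> str: ...

    def isupper(self) -> bool: ...


class SimpleDocument:
    """Hypothetical document class (illustration only, not MedCAT code)."""

    def __init__(self, text: str) -> None:
        self._text = text

    @property
    def text(self) -> str:
        # Read-only: there is no setter, matching the "unchangeable" intent.
        return self._text

    def isupper(self) -> bool:
        # Delegates to str.isupper (an assumption about the semantics).
        return self._text.isupper()


doc = SimpleDocument("CHEST PAIN")
```

No inheritance is involved here: `SimpleDocument` never subclasses the protocol, yet an `isinstance` check against the runtime-checkable stand-in succeeds.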

BaseEntity

Bases: Protocol

Base entity protocol.

This describes the static (unchangeable) parts of an entity or sequence of tokens.

Attributes:

end_char_index (int) –

The character index of the last token.
end_index (int) –

The index of the last token in the entity.
label (int) –

The label of the entity (NOTE: seems unused).
start_char_index (int) –

The character index of the first token.
start_index (int) –

The index of the first token in the entity.
text (str) –

The text of the entire entity.

end_char_index `property`

end_char_index: int

The character index of the last token.

end_index `property`

end_index: int

The index of the last token in the entity.

label `property`

label: int

The label of the entity (NOTE: seems unused).

start_char_index `property`

start_char_index: int

The character index of the first token.

start_index `property`

start_index: int

The index of the first token in the entity.

text `property`

text: str

The text of the entire entity.
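
An entity carries two kinds of offsets: token indices and character indices. The sketch below shows how they could relate for a two-token entity. `SimpleEntity` is a hypothetical stand-in, the whitespace tokenisation is purely illustrative, and treating `end_index` as inclusive and `end_char_index` as exclusive are assumptions that this page does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleEntity:
    """Hypothetical read-only entity exposing the documented attributes."""
    text: str
    start_index: int       # token index of the first token
    end_index: int         # token index of the last token (inclusive here)
    start_char_index: int  # character offset where the entity starts
    end_char_index: int    # character offset where it ends (exclusive here)
    label: int = 0


doc_text = "Patient has chest pain today"
tokens = doc_text.split()  # naive whitespace tokenisation, illustration only

# Entity covering tokens 2 and 3, i.e. "chest pain".
start_char = doc_text.index("chest")
end_char = start_char + len("chest pain")

ent = SimpleEntity(
    text=doc_text[start_char:end_char],
    start_index=2,
    end_index=3,
    start_char_index=start_char,
    end_char_index=end_char,
)
```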

BaseToken

Bases: Protocol

Base token protocol.

This represents the static (unchangeable) parts of a token.

Attributes:

char_index (int) –

The character index of the start of this token.
index (int) –

The index (in terms of tokens) of this token in the document.
is_digit (bool) –

Whether the token represents a digit.
is_stop (bool) –

Whether the token represents a stop token.
is_upper (bool) –

Whether the text is upper case.
lower (str) –

The lower case text representation.
text (str) –

The text represented by this token.
text_versions (list[str]) –

The different versions of text (e.g. normalised and lower).
text_with_ws (str) –

The text with trailing whitespace (where applicable).

char_index `property`

char_index: int

The character index of the start of this token.

index `property`

index: int

The index (in terms of tokens) of this token in the document.

is_digit `property`

is_digit: bool

Whether the token represents a digit.

is_stop `property`

is_stop: bool

Whether the token represents a stop token.

is_upper `property`

is_upper: bool

Whether the text is upper case.

lower `property`

lower: str

The lower case text representation.

text `property`

text: str

The text represented by this token.

text_versions `property`

text_versions: list[str]

The different versions of text (e.g. normalised and lower).

text_with_ws `property`

text_with_ws: str

The text with trailing whitespace (where applicable).
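
To make the token attributes concrete, here is a small sketch that derives them from a raw string. `SimpleToken` and `tokenize` are hypothetical names, the regular-expression tokenisation is an assumption, and the real tokenizer may compute these values differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical token exposing a subset of the documented attributes."""
    text: str
    text_with_ws: str  # text plus any whitespace that follows it
    index: int         # position of the token in the document
    char_index: int    # character offset where the token starts

    @property
    def lower(self) -> str:
        return self.text.lower()

    @property
    def is_upper(self) -> bool:
        return self.text.isupper()

    @property
    def is_digit(self) -> bool:
        return self.text.isdigit()


def tokenize(text: str) -> list[SimpleToken]:
    # Each match is a run of non-whitespace plus optional trailing whitespace.
    return [
        SimpleToken(m.group(1), m.group(0), i, m.start())
        for i, m in enumerate(re.finditer(r"(\S+)\s*", text))
    ]


tokens = tokenize("BP 120 over 80")
rebuilt = "".join(t.text_with_ws for t in tokens)
```

In this sketch, joining `text_with_ws` over all tokens reproduces the input, which is one practical reason to keep the trailing whitespace alongside `text`.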

MutableDocument

Bases: Protocol

The mutable parts of the document.

Represents parts of the document that can / should be changed by the various components.

Methods:

get_addon_data –

Get data added to the document.
get_available_addon_paths –

Gets the available addon data paths for this document.
get_tokens –

Get the tokens that span the specified character indices.
has_addon_data –

Checks whether the addon data for a specific path has been set.
register_addon_path –

Register a custom/arbitrary data path.
set_addon_data –

Used to add arbitrary data to the document.

Attributes:

base (BaseDocument) –

The base document.
linked_ents (list[MutableEntity]) –

The linked entities associated with the document.
ner_ents (list[MutableEntity]) –

All entities recognised by NER.

base `property`

base: BaseDocument

The base document.

linked_ents `property`

linked_ents: list[MutableEntity]

The linked entities associated with the document.

This should be set by the linker.

ner_ents `property`

ner_ents: list[MutableEntity]

All entities recognised by NER.

This should be set by the NER component.

get_addon_data

get_addon_data(path: str) -> Any

Get data added to the document.

See set_addon_data for details.

Parameters:

path
(str) –

The data ID / path.

Returns:

Any ( Any ) –

The stored value.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def get_addon_data(self, path: str) -> Any:
    """Get data added to the entity.

    See `add_data` for details.

    Args:
        path (str): The data ID / path.

    Returns:
        Any: The stored value.
    """
    pass

get_available_addon_paths

get_available_addon_paths() -> list[str]

Gets the available addon data paths for this document.

This will only include paths that have values set.

Returns:

list[str] –

List of available addon data paths.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def get_available_addon_paths(self) -> list[str]:
    """Gets the available addon data paths for this document.

    This will only include paths that have values set.

    Returns:
        list[str]: List of available addon data paths.
    """
    pass

get_tokens

get_tokens(start_index: int, end_index: int) -> list[MutableToken]

Get the tokens that span the specified character indices.

Parameters:

start_index
(int) –

The starting character index.
end_index
(int) –

The ending character index.

Returns:

list[MutableToken] –

The list of tokens.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def get_tokens(self, start_index: int, end_index: int
               ) -> list[MutableToken]:
    """Get the tokens that span the specified character indices.

    Args:
        start_index (int): The starting character index.
        end_index (int): The ending character index.

    Returns:
        list[MutableToken]:
            The list of tokens.
    """
    pass
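
The page does not say how tokens are matched against the character range, so the following is only a sketch of one plausible reading: keep every token that overlaps the half-open range `[start_index, end_index)`. `Tok` and `tokens_in_char_span` are hypothetical names, and the overlap rule is an assumption.

```python
import re
from typing import NamedTuple


class Tok(NamedTuple):
    """Hypothetical minimal token: start offset plus text."""
    char_index: int
    text: str


def tokens_in_char_span(tokens: list[Tok], start_index: int,
                        end_index: int) -> list[Tok]:
    # Keep tokens that overlap the half-open character range
    # [start_index, end_index). This overlap rule is an assumption.
    return [
        t for t in tokens
        if t.char_index < end_index
        and t.char_index + len(t.text) > start_index
    ]


text = "Patient has chest pain today"
toks = [Tok(m.start(), m.group()) for m in re.finditer(r"\S+", text)]

# Characters 12..22 cover "chest pain".
selected = tokens_in_char_span(toks, 12, 22)
```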

has_addon_data

has_addon_data(path: str) -> bool

Checks whether the addon data for a specific path has been set.

Parameters:

path
(str) –

The path to check.

Returns:

bool ( bool ) –

Whether the addon data has been set.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def has_addon_data(self, path: str) -> bool:
    """Checks whether the addon data for a specific path has been set.

    Args:
        path (str): The path to check.

    Returns:
        bool: Whether the addon data had been set.
    """
    pass

register_addon_path `classmethod`

register_addon_path(path: str, def_val: Any = None, force: bool = True) -> None

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the document for use in an addon (e.g. MetaCAT).

PS: If using this, it is important to use paths namespaced to the component you're using in order to avoid conflicts.

Parameters:

path
(str) –

The path to be used. Should be prefixed by component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val
(Any, default: None ) –

Default value. Defaults to None.
force
(bool, default: True ) –

Whether to forcefully add the value. Defaults to True.

Source code in medcat-v2/medcat/tokenizing/tokens.py

@classmethod
def register_addon_path(cls, path: str, def_val: Any = None,
                        force: bool = True) -> None:
    """Register a custom/arbitrary data path.

    This can be used to store arbitrary data along with the entity for
    use in an addon (e.g MetaCAT).

    PS: If using this, it is important to use paths namespaced to the
    component you're using in order to avoid conflicts.

    Args:
        path (str): The path to be used. Should be prefixed by component
            name (e.g `meta_cat_id` for an ID tied to the `meta_cat` addon)
        def_val (Any): Default value. Defaults to `None`.
        force (bool): Whether to forcefully add the value.
            Defaults to True.
    """
    pass

set_addon_data

set_addon_data(path: str, val: Any) -> None

Used to add arbitrary data to the document.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:

path
(str) –

The data ID / path.
val
(Any) –

The value to be added.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def set_addon_data(self, path: str, val: Any) -> None:
    """Used to add arbitrary data to the entity.

    This is generally used by addons to keep track of their data.

    NB! The path used needs to be registered using the
    `register_addon_path` class method.

    Args:
        path (str): The data ID / path.
        val (Any): The value to be added.
    """
    pass
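
The addon methods above follow a register, set, then read workflow. Below is a minimal sketch of how such a store could behave. `AddonStore` is hypothetical and is not MedCAT's implementation. Raising `UnregisteredDataPathException` for an unknown path, and returning the registered default before a value is set, are both assumptions. The `force` parameter is left out because its exact semantics are not described on this page.

```python
from typing import Any


class UnregisteredDataPathException(ValueError):
    """Local copy of the exception shown in the source listing on this page."""

    def __init__(self, cls: type, path: str):
        super().__init__(f"Unregistered path '{path}' for class: {cls}")
        self.cls = cls
        self.path = path


class AddonStore:
    """Hypothetical holder of addon data (illustration only)."""

    # Registered paths and their default values, shared by all instances.
    _defaults: dict[str, Any] = {}

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}  # values set on this instance

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None) -> None:
        cls._defaults[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        if path not in self._defaults:
            raise UnregisteredDataPathException(type(self), path)
        self._values[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._values

    def get_addon_data(self, path: str) -> Any:
        if path not in self._defaults:
            raise UnregisteredDataPathException(type(self), path)
        # Fall back to the registered default (assumed behaviour).
        return self._values.get(path, self._defaults[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that actually have a value set.
        return list(self._values)


AddonStore.register_addon_path("meta_cat_id", def_val=-1)
store = AddonStore()
before = store.get_addon_data("meta_cat_id")
store.set_addon_data("meta_cat_id", 7)
after = store.get_addon_data("meta_cat_id")
```

The path `meta_cat_id` follows the namespacing advice above: the component name comes first, so two addons are unlikely to collide on the same key.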

MutableEntity

Bases: Protocol

The mutable part of an entity.

This represents the changeable part of an entity. That is, parts that should be changed by the various components.

Methods:

get_addon_data –

Get data added to the entity.
get_available_addon_paths –

Gets the available addon data paths for this entity.
has_addon_data –

Checks whether the addon data for a specific path has been set.
register_addon_path –

Register a custom/arbitrary data path.
set_addon_data –

Used to add arbitrary data to the entity.

Attributes:

base (BaseEntity) –

The base / static entity part.
confidence (float) –

The confidence for the linked entity.
context_similarity (float) –

The context similarity of the linked entity.
cui (str) –

The CUI of the linked entity.
detected_name (str) –

The detected name (if any) for this entity.
id (int) –

The ID of the entity within the document.
link_candidates (list[str]) –

The candidates for the detected name (if any) for this entity.

base `property`

base: BaseEntity

The base / static entity part.

confidence `property` `writable`

confidence: float

The confidence for the linked entity.

NOTE: This seems to be unused!

context_similarity `property` `writable`

context_similarity: float

The context similarity of the linked entity.

This should be set by the linker component.

cui `property` `writable`

cui: str

The CUI of the linked entity.

This should be set by the linker component.

detected_name `property` `writable`

detected_name: str

The detected name (if any) for this entity.

This should be set by the NER component.

id `property` `writable`

id: int

The ID of the entity within the document.

This counts all the entities recognised, not just ones that were successfully linked.

This should be set by the NER.

link_candidates `property` `writable`

link_candidates: list[str]

The candidates for the detected name (if any) for this entity.

This should be set by the NER component.
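
The notes above split responsibility between two components: NER sets `id`, `detected_name` and `link_candidates`, while the linker sets `cui` and `context_similarity`. The sketch below illustrates that hand-off. `SimpleMutableEntity`, `ner_step` and `link_step` are hypothetical, the CUI strings are placeholders, and picking the candidate with the highest similarity is an assumption about how a linker might choose.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleMutableEntity:
    """Hypothetical entity with the writable attributes documented above."""
    id: int = -1
    detected_name: str = ""
    link_candidates: list[str] = field(default_factory=list)
    cui: str = ""
    context_similarity: float = 0.0


def ner_step(ent: SimpleMutableEntity, ent_id: int, name: str,
             candidates: list[str]) -> None:
    # Attributes documented as "set by the NER component".
    ent.id = ent_id
    ent.detected_name = name
    ent.link_candidates = list(candidates)


def link_step(ent: SimpleMutableEntity, scores: dict[str, float]) -> None:
    # Attributes documented as "set by the linker component".
    # Choosing the best-scoring candidate is an assumption.
    best = max(ent.link_candidates, key=lambda c: scores.get(c, 0.0))
    ent.cui = best
    ent.context_similarity = scores.get(best, 0.0)


ent = SimpleMutableEntity()
ner_step(ent, 0, "chest pain", ["CUI_A", "CUI_B"])  # placeholder CUIs
link_step(ent, {"CUI_A": 0.4, "CUI_B": 0.9})        # made-up scores
```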

get_addon_data

get_addon_data(path: str) -> Any

Get data added to the entity.

See set_addon_data for details.

Parameters:

path
(str) –

The data ID / path.

Returns:

Any ( Any ) –

The stored value.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def get_addon_data(self, path: str) -> Any:
    """Get data added to the entity.

    See `add_data` for details.

    Args:
        path (str): The data ID / path.

    Returns:
        Any: The stored value.
    """
    pass

get_available_addon_paths

get_available_addon_paths() -> list[str]

Gets the available addon data paths for this entity.

This will only include paths that have values set.

Returns:

list[str] –

List of available addon data paths.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def get_available_addon_paths(self) -> list[str]:
    """Gets the available addon data paths for this entity.

    This will only include paths that have values set.

    Returns:
        list[str]: List of available addon data paths.
    """
    pass

has_addon_data

has_addon_data(path: str) -> bool

Checks whether the addon data for a specific path has been set.

Parameters:

path
(str) –

The path to check.

Returns:

bool ( bool ) –

Whether the addon data has been set.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def has_addon_data(self, path: str) -> bool:
    """Checks whether the addon data for a specific path has been set.

    Args:
        path (str): The path to check.

    Returns:
        bool: Whether the addon data had been set.
    """
    pass

register_addon_path `classmethod`

register_addon_path(path: str, def_val: Any = None, force: bool = True) -> None

Register a custom/arbitrary data path.

This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).

PS: If using this, it is important to use paths namespaced to the component you're using in order to avoid conflicts.

Parameters:

path
(str) –

The path to be used. Should be prefixed by component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
def_val
(Any, default: None ) –

Default value. Defaults to None.
force
(bool, default: True ) –

Whether to forcefully add the value. Defaults to True.

Source code in medcat-v2/medcat/tokenizing/tokens.py

@classmethod
def register_addon_path(cls, path: str, def_val: Any = None,
                        force: bool = True) -> None:
    """Register a custom/arbitrary data path.

    This can be used to store arbitrary data along with the entity for
    use in an addon (e.g MetaCAT).

    PS: If using this, it is important to use paths namespaced to the
    component you're using in order to avoid conflicts.

    Args:
        path (str): The path to be used. Should be prefixed by component
            name (e.g `meta_cat_id` for an ID tied to the `meta_cat` addon)
        def_val (Any): Default value. Defaults to `None`.
        force (bool): Whether to forcefully add the value.
            Defaults to True.
    """
    pass

set_addon_data

set_addon_data(path: str, val: Any) -> None

Used to add arbitrary data to the entity.

This is generally used by addons to keep track of their data.

NB! The path used needs to be registered using the register_addon_path class method.

Parameters:

path
(str) –

The data ID / path.
val
(Any) –

The value to be added.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def set_addon_data(self, path: str, val: Any) -> None:
    """Used to add arbitrary data to the entity.

    This is generally used by addons to keep track of their data.

    NB! The path used needs to be registered using the
    `register_addon_path` class method.

    Args:
        path (str): The data ID / path.
        val (Any): The value to be added.
    """
    pass

MutableToken

Bases: Protocol

The mutable part of a token.

This protocol describes all the parts of a token that could be expected to change.

Attributes:

base (BaseToken) –

The base portion of the token.
is_punctuation (bool) –

Whether the token represents punctuation.
lemma (str) –

The lemmatised version of the text.
norm (str) –

The normalised text.
tag (Optional[str]) –

Optional tag (e.g. for normalization).
to_skip (bool) –

Whether the token should be skipped.

base `property`

base: BaseToken

The base portion of the token.

is_punctuation `property` `writable`

is_punctuation: bool

Whether the token represents punctuation.

lemma `property`

lemma: str

The lemmatised version of the text.

norm `property` `writable`

norm: str

The normalised text.

tag `property`

tag: Optional[str]

Optional tag (e.g. for normalization).

to_skip `property` `writable`

to_skip: bool

Whether the token should be skipped.
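
Of the attributes above, only `is_punctuation`, `norm` and `to_skip` are marked writable. As a sketch of how a preprocessing component might fill them in, consider the following. `SimpleMutableToken` and `preprocess` are hypothetical, and the rules used (lower-casing for `norm`, skipping punctuation) are assumptions rather than MedCAT's actual logic.

```python
import string
from dataclasses import dataclass


@dataclass
class SimpleMutableToken:
    """Hypothetical token with the writable attributes documented above."""
    text: str
    norm: str = ""
    is_punctuation: bool = False
    to_skip: bool = False


def preprocess(tokens: list[SimpleMutableToken]) -> None:
    for tok in tokens:
        # A token counts as punctuation if every character is punctuation.
        tok.is_punctuation = all(ch in string.punctuation for ch in tok.text)
        # Skip punctuation in later steps (assumed rule).
        tok.to_skip = tok.is_punctuation
        # Simplistic normalisation: lower-case the text (assumed rule).
        tok.norm = tok.text.lower()


tokens = [SimpleMutableToken(t) for t in ["Chest", "pain", ",", "mild", "."]]
preprocess(tokens)
kept = [t.norm for t in tokens if not t.to_skip]
```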

UnregisteredDataPathException

UnregisteredDataPathException(cls: Type, path: str)

Bases: ValueError

Attributes:

cls (Type) –

The class for which the path is not registered.
path (str) –

The unregistered path.

Source code in medcat-v2/medcat/tokenizing/tokens.py

def __init__(self, cls: Type, path: str):
    super().__init__(
        f"Unregistered path '{path}' for class: {cls}")
    self.cls = cls
    self.path = path

cls `instance-attribute`

cls = cls

path `instance-attribute`

path = path
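
Since the exception derives from `ValueError`, callers can catch it either by its own name or as a plain `ValueError`. The sketch below re-declares the class locally, copied from the source listing above, so that it runs without MedCAT installed. `Dummy` is a hypothetical placeholder class.

```python
class UnregisteredDataPathException(ValueError):
    """Local copy of the exception shown in the source listing above."""

    def __init__(self, cls: type, path: str):
        super().__init__(f"Unregistered path '{path}' for class: {cls}")
        self.cls = cls
        self.path = path


class Dummy:
    """Hypothetical class standing in for a document or entity type."""


caught = None
try:
    raise UnregisteredDataPathException(Dummy, "meta_cat_id")
except ValueError as err:
    # Keep a reference; `err` itself is cleared when the block ends.
    caught = err

message = str(caught)
```

The `cls` and `path` attributes remain available on the caught exception, which allows a handler to report exactly which path was missing for which class.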
