medcat.tokenizing.tokens
Classes:
- BaseDocument – The base document protocol.
- BaseEntity – Base entity protocol.
- BaseToken – Base token protocol.
- MutableDocument – The mutable parts of the document.
- MutableEntity – The mutable part of an entity.
- MutableToken – The mutable part of a token.
- UnregisteredDataPathException
BaseDocument
BaseEntity
Bases: Protocol
Base entity protocol.
This describes the static (unchangeable) parts of an entity or sequence of tokens.
Attributes:
- end_char_index (int) – The character index of the last token.
- end_index (int) – The index of the last token in the entity.
- label (int) – The label of the entity (NOTE: seems unused).
- start_char_index (int) – The character index of the first token.
- start_index (int) – The index of the first token in the entity.
- text (str) – The text of the entire entity.
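The relationship between the token indices and the character indices above can be illustrated with a small sketch. `ToySpan` and `make_span` are hypothetical names, not part of medcat, and the sketch assumes that `end_index` is inclusive and that `end_char_index` is an exclusive offset (so that slicing the document text yields `text`); the actual library may define these boundaries differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in carrying the attributes listed for BaseEntity."""
    start_index: int       # index of the first token in the entity
    end_index: int         # index of the last token (assumed inclusive)
    start_char_index: int  # character offset where the entity starts
    end_char_index: int    # character offset where the entity ends (assumed exclusive)
    text: str              # the text of the entire entity
    label: int = 0         # documented as apparently unused


def make_span(doc_text: str, offsets: list[tuple[int, int]],
              first: int, last: int) -> ToySpan:
    """Build a span over tokens first..last from per-token (start, end) offsets."""
    start_char = offsets[first][0]
    end_char = offsets[last][1]
    return ToySpan(first, last, start_char, end_char,
                   doc_text[start_char:end_char])


doc_text = "Patient has chronic kidney disease"
# (start, end) character offsets of each whitespace-separated token
offsets = [(0, 7), (8, 11), (12, 19), (20, 26), (27, 34)]
span = make_span(doc_text, offsets, 2, 4)
print(span.text)
```

The frozen dataclass mirrors the "static (unchangeable)" wording: attempts to assign to a field after construction fail.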
BaseToken
Bases: Protocol
Base token protocol.
This represents the static (unchangeable) parts of a token.
Attributes:
- char_index (int) – The character index of the start of this token.
- index (int) – The index (in terms of tokens) of this token in the document.
- is_digit (bool) – Whether the token represents a digit.
- is_stop (bool) – Whether the token represents a stop token.
- is_upper (bool) – Whether the text is upper case.
- lower (str) – The lower case text representation.
- text (str) – The text represented by this token.
- text_versions (list[str]) – The different versions of text (e.g. normalised and lower).
- text_with_ws (str) – The text with trailing whitespace (where applicable).
text_versions
property
The different versions of text (e.g. normalised and lower).
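As an illustration of how `index`, `char_index` and `text_with_ws` relate to each other, here is a minimal sketch using a whitespace tokenizer. `ToyToken` and `toy_tokenize` are hypothetical and do not reflect how medcat tokenizes text; only a subset of the BaseToken attributes is modelled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyToken:
    """Hypothetical token exposing a subset of the BaseToken attributes."""
    text: str          # the token text without trailing whitespace
    text_with_ws: str  # the token text plus any trailing whitespace
    index: int         # position of the token in the token sequence
    char_index: int    # character offset of the token start in the document

    @property
    def lower(self) -> str:
        return self.text.lower()

    @property
    def is_upper(self) -> bool:
        return self.text.isupper()

    @property
    def is_digit(self) -> bool:
        return self.text.isdigit()


def toy_tokenize(text: str) -> list[ToyToken]:
    """Split on whitespace while remembering offsets and trailing whitespace."""
    tokens = []
    for index, match in enumerate(re.finditer(r"\S+\s*", text)):
        chunk = match.group()
        tokens.append(ToyToken(chunk.rstrip(), chunk, index, match.start()))
    return tokens


tokens = toy_tokenize("BP 120 today")
for tok in tokens:
    print(tok.index, tok.char_index, repr(tok.text_with_ws))
```

Because each token keeps its trailing whitespace, concatenating `text_with_ws` over all tokens reproduces the input text in this sketch.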
MutableDocument
Bases: Protocol
The mutable parts of the document.
Represents parts of the document that can / should be changed by the various components.
Methods:
- get_addon_data – Get data added to the entity.
- get_available_addon_paths – Gets the available addon data paths for this document.
- get_tokens – Get the tokens that span the specified character indices.
- has_addon_data – Checks whether the addon data for a specific path has been set.
- register_addon_path – Register a custom/arbitrary data path.
- set_addon_data – Used to add arbitrary data to the entity.
Attributes:
- base (BaseDocument) – The base document.
- linked_ents (list[MutableEntity]) – The linked entities associated with the document.
- ner_ents (list[MutableEntity]) – All entities recognised by NER.
linked_ents
property
linked_ents: list[MutableEntity]
The linked entities associated with the document.
This should be set by the linker.
ner_ents
property
ner_ents: list[MutableEntity]
All entities recognised by NER.
This should be set by the NER component.
get_addon_data
Get data added to the entity.
See set_addon_data for details.
Parameters:
- path (str) – The data ID / path.
Returns:
- Any – The stored value.
Source code in medcat-v2/medcat/tokenizing/tokens.py
get_available_addon_paths
Gets the available addon data paths for this document.
This will only include paths that have values set.
Returns:
get_tokens
get_tokens(start_index: int, end_index: int) -> list[MutableToken]
Get the tokens that span the specified character indices.
Parameters:
Returns:
- list[MutableToken] – The list of tokens.
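Note that the parameters of get_tokens are named `start_index` and `end_index` but are described as character indices. A minimal sketch of that kind of lookup follows. It is not the library's implementation: `tokens_in_char_range` is a hypothetical helper, tokens are represented as plain `(char_index, text)` tuples, and the range is assumed to be half-open, `[start_index, end_index)`.

```python
def tokens_in_char_range(tokens: list[tuple[int, str]],
                         start_index: int,
                         end_index: int) -> list[tuple[int, str]]:
    """Return tokens whose character span overlaps [start_index, end_index)."""
    selected = []
    for char_index, text in tokens:
        token_end = char_index + len(text)
        # Two half-open ranges overlap when each starts before the other ends.
        if char_index < end_index and token_end > start_index:
            selected.append((char_index, text))
    return selected


# (char_index, text) pairs for "Patient has chronic kidney disease"
tokens = [(0, "Patient"), (8, "has"), (12, "chronic"),
          (20, "kidney"), (27, "disease")]
picked = tokens_in_char_range(tokens, 12, 26)
print(picked)
```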
has_addon_data
Checks whether the addon data for a specific path has been set.
Parameters:
- path (str) – The path to check.
Returns:
- bool – Whether the addon data has been set.
register_addon_path
classmethod
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you're using in order to avoid conflicts.
Parameters:
- path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
- def_val (Any, default: None) – Default value. Defaults to None.
- force (bool, default: True) – Whether to forcefully add the value. Defaults to True.
set_addon_data
Used to add arbitrary data to the entity.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
Parameters:
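The register, set, check and get workflow described for the addon data methods can be sketched with a toy class. This is a hypothetical stand-in, not medcat's implementation. It assumes that `force=True` overwrites an existing registration, that reading a registered but unset path returns the registered default, and that using an unregistered path raises a `ValueError`.

```python
from typing import Any


class ToyAddonHolder:
    """Hypothetical object mimicking the documented addon data methods."""
    # Registered paths and their default values, shared at class level.
    _addon_defaults: dict[str, Any] = {}

    def __init__(self) -> None:
        # Values explicitly set on this instance.
        self._addon_data: dict[str, Any] = {}

    @classmethod
    def register_addon_path(cls, path: str, def_val: Any = None,
                            force: bool = True) -> None:
        if path in cls._addon_defaults and not force:
            return  # assumption: keep the existing registration
        cls._addon_defaults[path] = def_val

    def set_addon_data(self, path: str, val: Any) -> None:
        if path not in self._addon_defaults:
            raise ValueError(f"Unregistered addon path: {path!r}")
        self._addon_data[path] = val

    def has_addon_data(self, path: str) -> bool:
        return path in self._addon_data

    def get_addon_data(self, path: str) -> Any:
        if path not in self._addon_defaults:
            raise ValueError(f"Unregistered addon path: {path!r}")
        # assumption: fall back to the registered default when unset
        return self._addon_data.get(path, self._addon_defaults[path])

    def get_available_addon_paths(self) -> list[str]:
        # Only paths that have values set, as the documentation describes.
        return list(self._addon_data)


# The path is namespaced with the addon name to avoid conflicts.
ToyAddonHolder.register_addon_path("meta_cat_id", def_val=None)
holder = ToyAddonHolder()
was_set_before = holder.has_addon_data("meta_cat_id")
holder.set_addon_data("meta_cat_id", 42)
print(holder.get_addon_data("meta_cat_id"))
```

Registration happens on the class while values live on the instance, which matches register_addon_path being a class method and the other methods being instance methods.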
MutableEntity
Bases: Protocol
The mutable part of an entity.
This represents the changeable part of an entity, that is, the parts that should be changed by the various components.
Methods:
- get_addon_data – Get data added to the entity.
- get_available_addon_paths – Gets the available addon data paths for this entity.
- has_addon_data – Checks whether the addon data for a specific path has been set.
- register_addon_path – Register a custom/arbitrary data path.
- set_addon_data – Used to add arbitrary data to the entity.
Attributes:
- base (BaseEntity) – The base / static entity part.
- confidence (float) – The confidence for the linked entity.
- context_similarity (float) – The context similarity of the linked entity.
- cui (str) – The CUI of the linked entity.
- detected_name (str) – The detected name (if any) for this entity.
- id (int) – The ID of the entity within the document.
- link_candidates (list[str]) – The candidates for the detected name (if any) for this entity.
confidence
property
writable
confidence: float
The confidence for the linked entity.
NOTE: This seems to be unused!
context_similarity
property
writable
context_similarity: float
The context similarity of the linked entity.
This should be set by the linker component.
cui
property
writable
cui: str
The CUI of the linked entity.
This should be set by the linker component.
detected_name
property
writable
detected_name: str
The detected name (if any) for this entity.
This should be set by the NER component.
id
property
writable
id: int
The ID of the entity within the document.
This counts all the entities recognised, not just ones that were successfully linked.
This should be set by the NER.
link_candidates
property
writable
The candidates for the detected name (if any) for this entity.
This should be set by the NER component.
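The division of responsibility described above (NER sets `id`, `detected_name` and `link_candidates`; the linker sets `cui` and `context_similarity`) can be sketched as two toy stages. Everything here is hypothetical: `ToyLinkableEntity`, `toy_ner`, `toy_linker`, the CUI strings and the scores are made up for illustration and do not reflect medcat's components.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLinkableEntity:
    """Hypothetical entity with some of the writable MutableEntity fields."""
    id: int = -1
    detected_name: str = ""
    link_candidates: list[str] = field(default_factory=list)
    cui: str = ""
    context_similarity: float = 0.0


def toy_ner(names_found: list[str],
            name_to_cuis: dict[str, list[str]]) -> list[ToyLinkableEntity]:
    """NER stage: assigns id, detected_name and link_candidates."""
    ents = []
    for ent_id, name in enumerate(names_found):
        ent = ToyLinkableEntity()
        ent.id = ent_id  # counts every recognised entity
        ent.detected_name = name
        ent.link_candidates = list(name_to_cuis.get(name, []))
        ents.append(ent)
    return ents


def toy_linker(ner_ents: list[ToyLinkableEntity],
               scores: dict[str, float]) -> list[ToyLinkableEntity]:
    """Linker stage: picks the best candidate, sets cui and context_similarity."""
    linked = []
    for ent in ner_ents:
        if not ent.link_candidates:
            continue  # nothing to link; the entity stays in ner_ents only
        best = max(ent.link_candidates, key=lambda cui: scores.get(cui, 0.0))
        ent.cui = best
        ent.context_similarity = scores.get(best, 0.0)
        linked.append(ent)
    return linked


# Made-up concept identifiers and similarity scores.
name_to_cuis = {"kidney disease": ["C001", "C002"]}
scores = {"C001": 0.4, "C002": 0.9}
ner_ents = toy_ner(["unknown term", "kidney disease"], name_to_cuis)
linked_ents = toy_linker(ner_ents, scores)
print(linked_ents[0].id, linked_ents[0].cui)
```

The linked entity keeps `id == 1` even though it is the only linked one, illustrating that the ID counts all recognised entities rather than only the linked ones.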
get_addon_data
Get data added to the entity.
See set_addon_data for details.
Parameters:
- path (str) – The data ID / path.
Returns:
- Any – The stored value.
get_available_addon_paths
Gets the available addon data paths for this entity.
This will only include paths that have values set.
Returns:
has_addon_data
Checks whether the addon data for a specific path has been set.
Parameters:
- path (str) – The path to check.
Returns:
- bool – Whether the addon data has been set.
register_addon_path
classmethod
Register a custom/arbitrary data path.
This can be used to store arbitrary data along with the entity for use in an addon (e.g. MetaCAT).
PS: If using this, it is important to use paths namespaced to the component you're using in order to avoid conflicts.
Parameters:
- path (str) – The path to be used. Should be prefixed by the component name (e.g. meta_cat_id for an ID tied to the meta_cat addon).
- def_val (Any, default: None) – Default value. Defaults to None.
- force (bool, default: True) – Whether to forcefully add the value. Defaults to True.
set_addon_data
Used to add arbitrary data to the entity.
This is generally used by addons to keep track of their data.
NB! The path used needs to be registered using the register_addon_path class method.
Parameters:
MutableToken
Bases: Protocol
The mutable part of a token.
This protocol describes all the parts of a token that could be expected to change.
Attributes:
- base (BaseToken) – The base portion of the token.
- is_punctuation (bool) – Whether the token represents punctuation.
- lemma (str) – The lemmatised version of the text.
- norm (str) – The normalised text.
- tag (Optional[str]) – Optional tag (e.g. for normalization).
- to_skip (bool) – Whether the token should be skipped.
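To show how a component might fill in the mutable token fields, here is a rough sketch of a normalisation pass that sets `norm`, `is_punctuation` and `to_skip`. `ToyMutableToken`, `toy_normalise` and the stop word set are hypothetical, and the rules used (lower-casing, skipping punctuation and stop words) are assumptions rather than medcat's actual behaviour.

```python
import string
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMutableToken:
    """Hypothetical token with some of the MutableToken fields."""
    text: str                  # stands in for base.text
    norm: str = ""
    lemma: str = ""
    is_punctuation: bool = False
    to_skip: bool = False
    tag: Optional[str] = None


def toy_normalise(tokens: list[ToyMutableToken],
                  stopwords: set[str]) -> list[ToyMutableToken]:
    """Fill in norm, is_punctuation and to_skip on each token in place."""
    for tok in tokens:
        tok.norm = tok.text.lower()
        tok.is_punctuation = all(ch in string.punctuation for ch in tok.text)
        # assumption: punctuation and stop words are skipped downstream
        tok.to_skip = tok.is_punctuation or tok.norm in stopwords
    return tokens


tokens = [ToyMutableToken(t) for t in ["The", "patient", ",", "stable"]]
toy_normalise(tokens, stopwords={"the"})
kept = [tok.norm for tok in tokens if not tok.to_skip]
print(kept)
```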
UnregisteredDataPathException
Bases: ValueError
Attributes:
cls
instance-attribute
cls = cls
path
instance-attribute
path = path
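Since the exception derives from ValueError and stores `cls` and `path` as instance attributes, a handler can catch it as a ValueError and still inspect which class and path were involved. The sketch below uses a hypothetical look-alike class. The constructor signature is inferred from the two attribute assignments, and the message format is an assumption.

```python
class ToyUnregisteredDataPathError(ValueError):
    """Hypothetical look-alike of UnregisteredDataPathException."""

    def __init__(self, cls: type, path: str) -> None:
        # The message wording is an assumption made for this sketch.
        super().__init__(
            f"Unregistered data path {path!r} for class {cls.__name__}")
        self.cls = cls
        self.path = path


class ToyDoc:
    """Placeholder for the class on which the path was not registered."""


try:
    raise ToyUnregisteredDataPathError(ToyDoc, "meta_cat_id")
except ValueError as err:
    # Catching the base class still gives access to the extra attributes.
    caught = err

print(caught.path, caught.cls.__name__)
```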