medcat.utils.vocab_utils
Functions:
- calc_matrix – Calculate the transformation matrix based on the word vectors in the Vocab.
- convert_context_vectors – Use the transformation matrix to convert the context vectors within the CDB.
- convert_vec – Helper function to convert the vector.
- convert_vocab – Use the transformation matrix to convert the word vectors.
- convert_vocab_vector_size – Convert the vocab vector size to a smaller one.
Attributes:
- logger
calc_matrix
Calculate the transformation matrix based on the word vectors in the Vocab.
Performs Principal Component Analysis (PCA).
This first centers all the word vectors in the Vocab by subtracting their mean.
It then computes the covariance matrix.
After that, the eigenvalues and eigenvectors are calculated.
Finally, the target_size eigenvectors corresponding to the largest
eigenvalues are selected to create the transformation matrix.
Args:
vocab (Vocab): The Vocab.
target_size (int): The target vector size.
Returns:
np.ndarray: The transformation matrix.
Source code in medcat-v2/medcat/utils/vocab_utils.py
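The PCA steps described for calc_matrix can be sketched in plain NumPy. This is a minimal illustration and not the library's code: the helper name `pca_matrix_sketch`, the use of `np.linalg.eigh`, and the random input data are all assumptions made for the example.

```python
import numpy as np


def pca_matrix_sketch(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Hypothetical helper: build a (target_size, N) PCA projection matrix.

    `vectors` is an (M, N) array with one word vector per row.
    """
    # Center the data by subtracting the per-dimension mean.
    centered = vectors - vectors.mean(axis=0)
    # Covariance matrix of the N dimensions, shape (N, N).
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Indices of the target_size largest eigenvalues, largest first.
    top = np.argsort(eigvals)[::-1][:target_size]
    # Eigenvectors are stored as columns, so transpose to one component per row.
    return eigvecs[:, top].T


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(50, 8))  # 50 made-up vectors of length 8
matrix = pca_matrix_sketch(word_vectors, target_size=3)
print(matrix.shape)  # (3, 8)
```

The rows of the resulting matrix are orthonormal, so multiplying a length-N vector by it yields its coordinates along the selected principal directions.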
convert_context_vectors
Use the transformation matrix to convert the context vectors within the CDB.
Parameters:
- cdb (CDB) – The Concept Database.
- matrix (ndarray) – The transformation matrix.
Source code in medcat-v2/medcat/utils/vocab_utils.py
convert_vec
convert_vec(cur: ndarray, matrix: ndarray, target_dtype: Type = float32) -> ndarray
Helper function to convert the vector.
This also guarantees uniform typing (np.float32 by default), since in our experience some vectors may previously have had a different type (e.g. np.float64).
Parameters:
- cur (ndarray) – The current vector.
- matrix (ndarray) – The transformation matrix.
- target_dtype (Type, default: float32) – The target element data type. Defaults to np.float32.
Returns:
np.ndarray: The transformed vector.
Source code in medcat-v2/medcat/utils/vocab_utils.py
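As a rough illustration of what convert_vec is described as doing, the sketch below projects a vector with the matrix and then casts the result. The function `convert_vec_sketch` is a hypothetical stand-in, and the assumption that the conversion is a plain matrix-vector product followed by a cast is inferred from the description rather than taken from the source code.

```python
import numpy as np


def convert_vec_sketch(cur, matrix, target_dtype=np.float32):
    """Hypothetical stand-in for convert_vec: project, then cast."""
    # (target_size, N) @ (N,) -> (target_size,)
    return np.dot(matrix, cur).astype(target_dtype)


# Toy (2, 3) matrix that simply keeps the first two dimensions.
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
vec64 = np.array([0.5, -1.5, 2.0], dtype=np.float64)

out = convert_vec_sketch(vec64, matrix)
print(out.shape, out.dtype)  # (2,) float32
```

Casting after the projection means every converted vector ends up with the same dtype, regardless of the dtype it had on input.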
convert_vocab
Use the transformation matrix to convert the word vectors.
Parameters:
- vocab (Vocab) – The Vocab.
- matrix (ndarray) – The transformation matrix.
Source code in medcat-v2/medcat/utils/vocab_utils.py
convert_vocab_vector_size
Convert the vocab vector size to a smaller one.
This uses Principal Component Analysis (PCA). The idea is that we
first center all the word vectors (in Vocab), then compute the
covariance matrix, then find the eigenvalues and eigenvectors,
and then we select the top vec_size eigenvectors.
This produces a transformation matrix of shape (vec_size, N),
where N is the current vector length in the vocab.
After that, we perform the transformation: first on all the vectors in the Vocab, and then on all the context vectors defined within the CDB.
NOTE: This requires the CDB as well, since the per-concept context vectors stored within it are based on the vectors in the vocab and thus also need to be transformed.
Parameters:
- cdb (CDB) – The Concept Database.
- vocab (Vocab) – The Vocab.
- vec_size (int) – The target vector size.
Source code in medcat-v2/medcat/utils/vocab_utils.py
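The overall flow can be sketched end to end with plain dictionaries standing in for the Vocab and the CDB. Everything here is hypothetical: the dictionary layout, the concept IDs, and the idea that a context vector is an average of word vectors are assumptions for illustration only and do not reflect the actual Vocab or CDB classes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins: word -> vector, and concept ID -> context vector.
vocab_vecs = {f"word{i}": rng.normal(size=6) for i in range(20)}
# Assumption: each context vector is an average of a few word vectors,
# so it lives in the same 6-dimensional space as the vocab.
context_vecs = {
    "C001": np.mean([vocab_vecs["word0"], vocab_vecs["word1"]], axis=0),
    "C002": np.mean([vocab_vecs["word2"], vocab_vecs["word3"]], axis=0),
}

vec_size = 3

# PCA on the vocab vectors: center, covariance, eigendecomposition, top components.
data = np.stack(list(vocab_vecs.values()))
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
matrix = eigvecs[:, np.argsort(eigvals)[::-1][:vec_size]].T  # shape (vec_size, N)

# Apply the same matrix to both collections so they stay in one shared space.
vocab_vecs = {w: (matrix @ v).astype(np.float32) for w, v in vocab_vecs.items()}
context_vecs = {c: (matrix @ v).astype(np.float32) for c, v in context_vecs.items()}

print(matrix.shape)  # (3, 6)
```

If only the vocab were converted, the context vectors would keep their original length and could no longer be compared against word vectors, which is why both are transformed with the same matrix.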