
medcat.utils.vocab_utils

logger module-attribute

logger = getLogger(__name__)

calc_matrix

calc_matrix(vocab: Vocab, target_size: int) -> ndarray

Calculate the transformation matrix based on the word vectors in the Vocab.

Performs Principal Component Analysis (PCA). This first centers all the word vectors in the Vocab by subtracting their mean. It then computes the covariance matrix. After that, the eigenvalues and eigenvectors are calculated. Finally, the target_size eigenvectors corresponding to the largest eigenvalues are selected to create the transformation matrix.

Parameters:

  • vocab

    (Vocab) –

    The Vocab.

  • target_size

    (int) –

    The target vector size.

Returns:

  • (ndarray) –

    The transformation matrix, of shape (target_size, N), where N is the current vector length in the vocab.

Source code in medcat-v2/medcat/utils/vocab_utils.py
def calc_matrix(vocab: Vocab, target_size: int) -> np.ndarray:
    """Calculate the transformation matrix based on the word vectors in the
    Vocab.

    Performs Principal Component Analysis (PCA).
    This first centers all the word vectors in the Vocab.
    It then finds the covariance matrix.
    After that, the eigenvalues and eigenvectors are calculated.
    And the `target_size` eigenvectors corresponding to the largest
    eigenvalues are selected to create the transformation matrix.
    Args:
        vocab (Vocab): The Vocab.
        target_size (int): The target vector size.
    Returns:
        np.ndarray: The transformation matrix.
    """
    all_vecs = np.vstack(
        [value['vector'] for value in vocab.vocab.values()
         if value['vector'] is not None]
    )
    logger.debug("Vocab vectors have a total shape of %s", np.shape(all_vecs))
    all_vecs_meaned = all_vecs - np.mean(all_vecs, axis=0)
    cov_matrix = np.cov(all_vecs_meaned, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    sorted_idx = np.argsort(eigenvalues)[::-1]
    logger.debug("The sorted eigenvalues are as follows: %s",
                 [f"{v:5.2f}" for v in eigenvalues[sorted_idx]])
    sorted_eigenvectors = eigenvectors[:, sorted_idx]
    transformation_matrix = sorted_eigenvectors[:, :target_size]
    return transformation_matrix.T
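
The same steps can be tried on toy data without a Vocab. The following is a minimal sketch, assuming a plain 2-D array of word vectors; `pca_matrix` and `toy_vectors` are hypothetical names used only for illustration and are not part of `medcat`.

```python
import numpy as np


def pca_matrix(vectors: np.ndarray, target_size: int) -> np.ndarray:
    """Hypothetical stand-in that mirrors the steps of calc_matrix."""
    # Center the vectors by subtracting the per-dimension mean
    centered = vectors - np.mean(vectors, axis=0)
    # Covariance between dimensions (columns are the variables)
    cov = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices; eigenvalues come back in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Reorder so the largest eigenvalues come first
    order = np.argsort(eigenvalues)[::-1]
    # Keep the top target_size eigenvectors and store them as rows
    return eigenvectors[:, order][:, :target_size].T


rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 6))  # 50 toy "words", N = 6
matrix = pca_matrix(toy_vectors, target_size=3)
print(matrix.shape)
```

Because the eigenvectors returned by `np.linalg.eigh` are orthonormal, the rows of the resulting matrix are orthonormal as well, so `matrix @ matrix.T` is (numerically) the identity.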

convert_context_vectors

convert_context_vectors(cdb: CDB, matrix: ndarray) -> None

Use the transformation matrix to convert the context vectors within the CDB.

Parameters:

  • cdb

    (CDB) –

    The Concept Database.

  • matrix

    (ndarray) –

    The transformation matrix.

Source code in medcat-v2/medcat/utils/vocab_utils.py
def convert_context_vectors(cdb: CDB, matrix: np.ndarray) -> None:
    """Use the transformation matrix to convert the context vectors within the
    CDB.

    Args:
        cdb (CDB): The Concept Database.
        matrix (np.ndarray): The transformation matrix.
    """
    for cuiinfo in cdb.cui2info.values():
        if 'context_vectors' not in cuiinfo:
            continue
        per_cui_dict = cuiinfo['context_vectors']
        if per_cui_dict is None:
            continue
        for type_name, cur_vec in list(per_cui_dict.items()):
            per_cui_dict[type_name] = convert_vec(cur_vec, matrix)
    cdb.is_dirty = True
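
To make the traversal concrete, here is a small sketch of the same loop over a stand-in object. It assumes only what the listing shows: a `cui2info` mapping whose values may hold a `context_vectors` dict, and an `is_dirty` flag. The `toy_cdb` object, the CUIs, and the context type names are hypothetical.

```python
from types import SimpleNamespace

import numpy as np

# Hypothetical stand-in for a CDB; a real CDB has many more fields
toy_cdb = SimpleNamespace(
    cui2info={
        "C0000001": {"context_vectors": {"short": np.ones(4),
                                         "long": np.arange(4.0)}},
        "C0000002": {"context_vectors": None},  # skipped: no vectors yet
        "C0000003": {},                         # skipped: key is missing
    },
    is_dirty=False,
)

# A (2, 4) matrix that keeps only the first two components
matrix = np.eye(4)[:2]

for info in toy_cdb.cui2info.values():
    vectors = info.get("context_vectors")
    if vectors is None:
        continue
    # Iterate over a copy of the items while replacing values in place
    for type_name, vec in list(vectors.items()):
        vectors[type_name] = (matrix @ vec).astype(np.float32)
# Mark the stand-in as changed, as the original function does
toy_cdb.is_dirty = True
```

Entries without context vectors are left untouched, and every stored vector ends up with the reduced length.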

convert_vec

convert_vec(cur: ndarray, matrix: ndarray, target_dtype: Type = float32) -> ndarray

Helper function to convert the vector.

This also guarantees a uniform element type (np.float32 by default), since in our experience some vectors may be stored with a different type beforehand (e.g. np.float64).

Parameters:

  • cur

    (ndarray) –

    The current vector.

  • matrix

    (ndarray) –

    The transformation matrix.

  • target_dtype

    (Type, default: float32 ) –

    The target element data type. Defaults to np.float32.

Returns:

  • (ndarray) –

    The transformed vector.

Source code in medcat-v2/medcat/utils/vocab_utils.py
def convert_vec(cur: np.ndarray, matrix: np.ndarray,
                target_dtype: Type = np.float32) -> np.ndarray:
    """Helper function to convert the vector.

    This also guarantees uniform typing (of np.float32) since in our
    experience some vectors may be of a different type before (e.g. np.float64).

    Args:
        cur (np.ndarray): The current vector.
        matrix (np.ndarray): The transformation matrix.
        target_dtype (Type): The target element data type.
            Defaults to np.float32.
    Returns:
        np.ndarray: The transformed vector.
    """
    return (matrix @ cur).astype(target_dtype)
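
As a quick illustration of the projection and the dtype cast, the sketch below applies the same expression to hand-written values. The matrix and vector are made up for the example.

```python
import numpy as np

# A (2, 3) matrix that simply keeps the first two components
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
# A float64 input vector of length 3
vec = np.array([0.5, -1.5, 2.0], dtype=np.float64)

# Same expression as in convert_vec: project, then cast
reduced = (matrix @ vec).astype(np.float32)
print(reduced.shape, reduced.dtype)
```

The result has one element per matrix row and the target dtype, regardless of the dtype of the input.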

convert_vocab

convert_vocab(vocab: Vocab, matrix: ndarray) -> None

Use the transformation matrix to convert the word vectors.

Parameters:

  • vocab

    (Vocab) –

    The Vocab.

  • matrix

    (ndarray) –

    The transformation matrix.

Source code in medcat-v2/medcat/utils/vocab_utils.py
def convert_vocab(vocab: Vocab, matrix: np.ndarray) -> None:
    """Use the transformation matrix to convert the word vectors.

    Args:
        vocab (Vocab): The Vocab.
        matrix (np.ndarray): The transformation matrix.
    """
    for d in vocab.vocab.values():
        cvec = d['vector']
        if cvec is None:
            continue
        d['vector'] = convert_vec(cvec, matrix)
    logger.info("Recalc cumulative sums (instead of unigram table)")
    vocab.init_cumsums()

convert_vocab_vector_size

convert_vocab_vector_size(cdb: CDB, vocab: Vocab, vec_size: int)

Convert the vocab vector size to a smaller one.

This uses Principal Component Analysis (PCA). The idea is that we first center all the word vectors (in Vocab), then compute the covariance matrix, then find the eigenvalues and eigenvectors, and then we select the top vec_size eigenvectors. This produces a transformation matrix of shape (vec_size, N), where N is the current vector length in the vocab.

After that, we perform the transformation. First we transform all the vectors in the Vocab. And then we transform all the context vectors defined within the CDB.

NOTE: This requires the CDB as well since the per concept context vectors stored within it are based on the vectors in the vocab and thus they also need to be transformed.

Parameters:

  • cdb

    (CDB) –

    The Concept Database.

  • vocab

    (Vocab) –

    The Vocab.

  • vec_size

    (int) –

    The target vector size.

Source code in medcat-v2/medcat/utils/vocab_utils.py
def convert_vocab_vector_size(cdb: CDB, vocab: Vocab, vec_size: int):
    """Convert the vocab vector size to a smaller one.

    This uses Principal Component Analysis (PCA). The idea is that we
    first center all the word vectors (in Vocab), then compute the
    covariance matrix, then find the eigenvalues and eigenvectors,
    and then we select the top `vec_size` eigenvectors.
    This produces a transformation matrix of shape (vec_size, N),
    where N is the current vector length in the vocab.

    After that, we perform the transformation. First we transform all
    the vectors in the Vocab. And then we transform all the context
    vectors defined within the CDB.

    NOTE: This requires the CDB as well since the per concept context
    vectors stored within it are based on the vectors in the vocab and
    thus they also need to be transformed.

    Args:
        cdb (CDB): The Concept Database.
        vocab (Vocab): The Vocab.
        vec_size (int): The target vector size.
    """
    logger.info("Converting Vocab and CDB to size %s. Calculating "
                "transformation matrix", vec_size)
    matrix = calc_matrix(vocab, vec_size)
    logger.info("Found transformation matrix with shape %s. "
                "Now converting vocab.", matrix.shape)
    convert_vocab(vocab, matrix)
    logger.info("Done converting vocab, now converting the per concept "
                "context vectors defined in the CDB.")
    convert_context_vectors(cdb, matrix)
    logger.info("Done with the conversion to vocab vector size %s.",
                vec_size)
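
The NOTE about the CDB can be illustrated with plain NumPy. The sketch below assumes, as a simplification, that a context vector is a plain average of some word vectors; the arrays are random stand-ins rather than real `Vocab` or `CDB` data. Because the transformation is linear, converting the stored context vector gives the same result as rebuilding it from the converted word vectors.

```python
import numpy as np

rng = np.random.default_rng(42)
word_vectors = rng.normal(size=(20, 8))         # stand-in Vocab vectors, N = 8
context_vector = word_vectors[:5].mean(axis=0)  # stand-in CDB context vector

# Transformation matrix, following the steps described above
centered = word_vectors - word_vectors.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
matrix = eigenvectors[:, order][:, :3].T        # shape (vec_size, N) = (3, 8)

# Convert the word vectors and the context vector with the same matrix
converted_words = (matrix @ word_vectors.T).T.astype(np.float32)
converted_context = (matrix @ context_vector).astype(np.float32)

# Rebuilding the context vector from converted words matches the conversion
rebuilt_context = converted_words[:5].mean(axis=0)
print(np.allclose(converted_context, rebuilt_context, atol=1e-5))
```

If only the word vectors were converted, the stored context vectors would keep length N and could no longer be compared with the shorter word vectors.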