API Reference¶
This is the API Reference of the Python library Contextual Encoders.
Note
Whenever the dependencies are updated, export them from poetry to refresh the requirements.txt within the doc directory: poetry export --dev -f requirements.txt --output requirements.txt.
Aggregator¶
Aggregators combine multiple matrices into a single matrix. They are used to merge the similarity or dissimilarity matrices of multiple attributes into one. Thus, an Aggregator \(\mathcal{A}\) is a mapping of the form \(\mathcal{A} : \mathbb{R}^{n \times n \times k} \rightarrow \mathbb{R}^{n \times n}\), with \(n\) being the number of features and \(k\) being the number of similarity or dissimilarity matrices of type \(D \in \mathbb{R}^{n \times n}\), i.e. the number of attributes/columns of the dataset.
Currently, the following Aggregators are implemented:
Name | Formula
mean | \(\mathcal{A} (D^1, D^2, ..., D^k) = \frac{1}{k} \sum_{i=1}^{k} D^i\)
median | \(\mathcal{A} (D^1, D^2, ..., D^k) = \left\{ \begin{array}{ll} D^{(\frac{k+1}{2})} & \mbox{, if } k \mbox{ is odd} \\ \frac{1}{2} \left( D^{(\frac{k}{2})} + D^{(\frac{k}{2}+1)} \right) & \mbox{, if } k \mbox{ is even} \end{array} \right.\), where \(D^{(l)}\) holds, entry by entry, the \(l\)-th smallest of the \(k\) values
max | \(\mathcal{A} (D^1, D^2, ..., D^k) = max_{ l} \; D_{i,j}^l\)
min | \(\mathcal{A} (D^1, D^2, ..., D^k) = min_{ l} \; D_{i,j}^l\)
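The four aggregations can be illustrated with plain NumPy. This is a minimal sketch, assuming the matrices are stacked along a new first axis and reduced entry by entry; the toy matrices and the `aggregated` dictionary are made up for illustration and are not part of the library.

```python
import numpy as np

# Three toy 2x2 dissimilarity matrices, one per attribute (k = 3, n = 2).
matrices = [
    np.array([[0.0, 0.2], [0.2, 0.0]]),
    np.array([[0.0, 0.6], [0.6, 0.0]]),
    np.array([[0.0, 0.4], [0.4, 0.0]]),
]

# Stack to shape (k, n, n) and reduce along axis 0, entry by entry.
stacked = np.stack(matrices, axis=0)
aggregated = {
    "mean": stacked.mean(axis=0),
    "median": np.median(stacked, axis=0),
    "max": stacked.max(axis=0),
    "min": stacked.min(axis=0),
}
```

Each result has shape (n, n) again, so it can be passed on like a single-attribute matrix.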
- class contextual_encoders.aggregator.Aggregator¶
Bases: abc.ABC
An abstract base class for Aggregators. To create a custom Aggregator, it is enough to derive from this class and use the instance wherever an Aggregator is needed.
- abstract aggregate(matrices)¶
The abstract method that is implemented by the concrete Aggregators.
- Parameters
matrices – a list of similarity or dissimilarity matrices as 2D numpy arrays.
- Returns
a single 2D numpy array.
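A custom Aggregator could look roughly like the following sketch. To keep it runnable without the library, it defines a local stand-in `Aggregator` base class that mirrors the documented `aggregate(matrices)` interface; `WeightedMeanAggregator` and its `weights` argument are hypothetical names, not part of Contextual Encoders.

```python
from abc import ABC, abstractmethod

import numpy as np


class Aggregator(ABC):
    """Local stand-in for the documented abstract base class (assumption)."""

    @abstractmethod
    def aggregate(self, matrices):
        """Combine a list of 2D numpy arrays into a single 2D numpy array."""


class WeightedMeanAggregator(Aggregator):
    """Hypothetical custom Aggregator: a weighted mean over the attributes."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def aggregate(self, matrices):
        # Stack to shape (k, n, n) and average along axis 0 with one weight per matrix.
        stacked = np.stack(matrices, axis=0)
        return np.average(stacked, axis=0, weights=self.weights)


result = WeightedMeanAggregator([3.0, 1.0]).aggregate(
    [np.zeros((2, 2)), np.full((2, 2), 0.4)]
)
```

With the real library, the class would derive from `contextual_encoders.aggregator.Aggregator` instead of the stand-in.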
- class contextual_encoders.aggregator.AggregatorFactory¶
Bases: object
The factory class for creating concrete instances of the implemented Aggregators with default values.
- static create(aggregator)¶
Creates an instance of the given Aggregator name.
- Parameters
aggregator – The name of the Aggregator, which can be mean, median, max or min.
- Returns
An instance of the Aggregator.
- Raises
ValueError – The given Aggregator does not exist.
- class contextual_encoders.aggregator.MaxAggregator¶
Bases: contextual_encoders.aggregator.Aggregator
This class aggregates similarity or dissimilarity matrices using the max. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MaxAggregator calculates \(\mathcal{A} (D^1, D^2, ..., D^k) = max_{ l} \; D_{i,j}^l\).
- aggregate(matrices)¶
Calculates the max of all given matrices along the zero axis.
- Parameters
matrices – A list of 2D numpy arrays.
- Returns
A 2D numpy array.
- class contextual_encoders.aggregator.MeanAggregator¶
Bases: contextual_encoders.aggregator.Aggregator
This class aggregates similarity or dissimilarity matrices using the mean. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MeanAggregator calculates \(\mathcal{A} (D^1, D^2, ..., D^k) = \frac{1}{k} \sum_{i=1}^{k} D^i\).
- aggregate(matrices)¶
Calculates the mean of all given matrices along the zero axis.
- Parameters
matrices – A list of 2D numpy arrays.
- Returns
A 2D numpy array.
- class contextual_encoders.aggregator.MedianAggregator¶
Bases: contextual_encoders.aggregator.Aggregator
This class aggregates similarity or dissimilarity matrices using the median. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MedianAggregator calculates \(\mathcal{A} (D^1, D^2, ..., D^k) = \left\{ \begin{array}{ll} D^{(\frac{k+1}{2})} & \mbox{, if } k \mbox{ is odd} \\ \frac{1}{2} \left( D^{(\frac{k}{2})} + D^{(\frac{k}{2}+1)} \right) & \mbox{, if } k \mbox{ is even} \end{array} \right.\), where \(D^{(l)}\) holds, entry by entry, the \(l\)-th smallest of the \(k\) values.
- aggregate(matrices)¶
Calculates the median of all given matrices along the zero axis.
- Parameters
matrices – A list of 2D numpy arrays.
- Returns
A 2D numpy array.
- class contextual_encoders.aggregator.MinAggregator¶
Bases: contextual_encoders.aggregator.Aggregator
This class aggregates similarity or dissimilarity matrices using the min. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MinAggregator calculates \(\mathcal{A} (D^1, D^2, ..., D^k) = min_{ l} \; D_{i,j}^l\).
- aggregate(matrices)¶
Calculates the min of all given matrices along the zero axis.
- Parameters
matrices – A list of 2D numpy arrays.
- Returns
A 2D numpy array.
MatrixComputer¶
The MatrixComputer combines the Measure with the Gatherer and calculates the similarity or
dissimilarity matrix for one attribute.
Thus, the MatrixComputer can be seen as a mapping \(\mathcal{M}: F \rightarrow \mathbb{R}^{n \times n}\),
with \(F\) being the feature space and \(n\) the number of features.
- class contextual_encoders.computer.MatrixComputer(measure, gatherer, separator_token)¶
Bases: object
The service class to compute a similarity or dissimilarity matrix.
- __init__(measure, gatherer, separator_token)¶
Initializes the MatrixComputer.
- Parameters
measure – The instance of the Similarity or Dissimilarity Measure. See SimilarityMeasure and DissimilarityMeasure.
gatherer – Either the name of a Gatherer or a concrete instance. See GathererFactory for implemented Gatherers and Gatherer for creating custom Gatherers. If the specified measure can handle multiple values (forms of an attribute), the IdentityGatherer is used regardless of this setting.
separator_token – A string for separating forms of categorical attributes.
- compute(data)¶
Computes the similarity or dissimilarity matrix based on the given data.
- Parameters
data – A single pandas series containing the data. Note that each entry can have multiple values (the forms of an attribute), separated by the separator_token.
- Returns
A 2D numpy array representing the similarity or dissimilarity matrix.
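The idea behind compute can be sketched as a pairwise loop over the entries of one column. This is an illustration under assumptions, not the library's implementation: `compute_matrix` and `jaccard` are hypothetical helpers, a plain list stands in for the pandas series, and the Jaccard index plays the role of a Measure that handles all forms of an attribute at once.

```python
import numpy as np


def compute_matrix(values, compare, separator_token=","):
    """Build an n x n comparison matrix for one attribute (hypothetical helper)."""
    # Split every entry into its forms using the separator token.
    forms = [[form.strip() for form in value.split(separator_token)] for value in values]
    n = len(forms)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            matrix[i, j] = compare(forms[i], forms[j])
    return matrix


def jaccard(first, second):
    """Stand-in similarity on lists of forms: shared forms over all forms."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b)


matrix = compute_matrix(["red,blue", "blue", "green"], jaccard)
```

The result is symmetric here because the stand-in measure is symmetric; a real implementation could exploit that to fill only one triangle.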
Context¶
The Context is the core part of the Contextual Encoders library.
It is used to measure the similarity or dissimilarity of attributes.
So far, two different Context types are implemented: GraphContext and TreeContext.
However, custom Contexts will very likely be needed.
For that purpose, the base classes Context and GraphBasedContext are provided;
they come with optimized import and export functions as well as caching.
- class contextual_encoders.context.Context(name)¶
Bases: abc.ABC
The abstract base class for all Contexts.
- __init__(name)¶
Initializes the Context.
- Parameters
name – The name of the Context.
- abstract export_to_file(path)¶
Exports the Context to the given file path.
- Parameters
path – The path to export the Context to.
- abstract import_from_file(path)¶
Imports the Context from the given file path.
- Parameters
path – The path to import the Context from.
- class contextual_encoders.context.GraphBasedContext(name)¶
Bases: contextual_encoders.context.Context
A base class for all graph based Contexts.
- __init__(name)¶
Initializes the GraphBasedContext.
- Parameters
name – The name of the Context.
- draw()¶
Draws the graph using matplotlib.
- export_to_file(path)¶
Exports the graph to the given file path.
- Parameters
path – The path to export the graph to.
- get_graph()¶
Returns the networkx DiGraph instance.
- Returns
A networkx DiGraph instance.
- import_from_file(path)¶
Imports the graph from the given file path.
- Parameters
path – The path to import the graph from.
- class contextual_encoders.context.GraphContext(name)¶
Bases: contextual_encoders.context.GraphBasedContext
A graph based Context that can be used for graph based measures.
- add_concept(node, neighbor=None, weight=1.0)¶
Adds a new node to the graph. If the neighbor does not exist, it will be added as new node. If the node already exists, the weight will be overwritten.
- Parameters
node – The name of the node.
neighbor – The name of the neighbor node.
weight – The weight of the edge between the node and the neighbor.
- class contextual_encoders.context.TreeContext(name)¶
Bases: contextual_encoders.context.GraphBasedContext
A graph based Context that can be used for tree based measures.
- add_concept(child, parent=None, weight=1.0)¶
Adds a new node to the tree, where the name of the context serves as the root node. If the parent does not exist, it will be added as new node. If the parent is None, the root node will serve as the parent. If the node already exists, the weight will be overwritten.
- Parameters
child – The name of the child node.
parent – The name of the parent node.
weight – The weight of the edge between the child and the parent.
- get_root()¶
Gets the name of the root, i.e. the name of the context.
- Returns
The name of the root.
- get_tree()¶
Returns the networkx DiGraph instance.
- Returns
A networkx DiGraph representing the tree.
ContextualEncoder¶
The ContextualEncoder is the actual interface for using the Contextual Encoders library. It is used to perform the contextual encoding of a given dataset. Moreover, it inherits from the scikit-learn BaseEstimator and TransformerMixin types and can thus be used in scikit-learn Pipelines.
Having a dataset \(X \subset \mathcal{F}\), with \(\mathcal{F}\) denoting the feature space, the ContextualEncoder can be seen as a map \(\mathcal{E} : X \subset \mathcal{F} \rightarrow \tilde{X} \subset \mathbb{R}^{m}\), with \(m \in \mathbb{N}\) being the (configurable) dimension of the encoding and \(\tilde{X}\) the encoded dataset as vectors.
In other words, let \(n \in \mathbb{N}\) be the number of features. The ContextualEncoder then takes \(n\) features that are either numerical, categorical or a mix of both and produces \(n\) vectors of dimension \(m \in \mathbb{N}\).
Note
Additionally, a similarity matrix \(S \in \mathbb{R}^{n \times n}\) and dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) will be calculated.
Note
Assume we have a dataset with \(k\) columns. Each column is called an attribute and each row is called a feature. One attribute of a particular feature can consist of multiple values. Those values are called the forms of the attribute. The forms can be separated e.g. with a comma. The contextual encoding then consists of the following steps:
1. Calculate a comparison value for each form of each attribute and feature using a Measure.
2. Combine the form comparison values to an attribute comparison value using a Gatherer.
3. Combine the attribute comparison values to a feature comparison value using an Aggregator.
4. Use an Inverter to either get a similarity value from a dissimilarity value or vice versa.
5. Collect all the feature comparison values and construct the similarity and dissimilarity matrix within the MatrixComputer.
6. Convert the similarity or dissimilarity matrix to a set of vectors using a Reducer.
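The steps above, up to the construction of the two matrices, can be traced on a toy dataset. This is only a sketch under assumptions: `measure` and `gather` are simplified stand-ins (exact match and best match over all form pairs), the data is made up, and the final Reducer step is left out.

```python
import numpy as np

# Toy dataset: 3 features (rows) with 2 attributes (columns) each.
rows = [("red,blue", "small"), ("blue", "small"), ("green", "large")]


def measure(a, b):
    """Stand-in similarity Measure between two forms."""
    return 1.0 if a == b else 0.0


def gather(first, second):
    """Stand-in Gatherer: best match over all pairs of forms."""
    return max(measure(a, b) for a in first for b in second)


n, k = len(rows), len(rows[0])
per_attribute = np.zeros((k, n, n))
for a in range(k):
    forms = [row[a].split(",") for row in rows]
    for i in range(n):
        for j in range(n):
            per_attribute[a, i, j] = gather(forms[i], forms[j])

similarity = per_attribute.mean(axis=0)    # Aggregator: mean over the attributes
dissimilarity = np.sqrt(1.0 - similarity)  # Inverter: sqrt
```

The first two features share a colour and a size, so they end up with similarity 1 and dissimilarity 0, while the third feature is maximally dissimilar to both.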
- class contextual_encoders.encoder.ContextualEncoder(measures, separator_token=',', gatherers='smm', aggregator='mean', inverters='sqrt', reducer='mds')¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
The interface for encoding contextual variables.
- __init__(measures, separator_token=',', gatherers='smm', aggregator='mean', inverters='sqrt', reducer='mds')¶
Initializes the ContextualEncoder.
Note
If no concrete instances but only names are specified for the components, an instance will be created with the default values.
- Parameters
measures – A list of Measures. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). See Measure for currently implemented Measures and how custom Measures can be implemented.
separator_token – A string for separating forms of attributes.
gatherers – A list of either Gatherer instances or Gatherer names. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). If only one Gatherer should be used for all columns, a single object is enough and a list is not needed. See Gatherer for currently implemented Gatherers and how custom Gatherers can be implemented. See GathererFactory for the names of the implemented Gatherers.
aggregator – Either an Aggregator instance or an Aggregator name. See Aggregator for currently implemented Aggregators and how custom Aggregators can be implemented. See AggregatorFactory for the names of the implemented Aggregators.
inverters – A list of either Inverter instances or Inverter names. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). If only one Inverter should be used for all columns, a single object is enough and a list is not needed. See Inverter for currently implemented Inverters and how custom Inverters can be implemented. See InverterFactory for the names of the implemented Inverters.
reducer – Either a Reducer instance or a Reducer name. See Reducer for currently implemented Reducers and how custom Reducers can be implemented. See ReducerFactory for the names of the implemented Reducers.
- get_dissimilarity_matrix()¶
Gets the dissimilarity matrix.
- Returns
The dissimilarity matrix as 2D numpy array.
- get_similarity_matrix()¶
Gets the similarity matrix.
- Returns
The similarity matrix as 2D numpy array.
- transform(x)¶
Encodes the given contextual variables.
- Parameters
x – The data as numpy array, pandas dataframe or python list format.
- Returns
The encoded data as numpy array.
Gatherer¶
A Gatherer is used to combine the form comparison values to an attribute comparison value. I.e. when an attribute of a feature contains multiple values (forms of an attribute), the Gatherer will combine the pairwise form comparison values to a single attribute comparison value.
Let \(x, y \in F\) be two features from the feature space \(F\). Each feature consists of \(k\) attributes and each attribute can have up to \(l\) forms. A form of an attribute of the feature \(x\) can then be denoted as \(x_{a,i}\), with \(a\) being the attribute and \(i\) being the form. For simplicity, we just denote it as \(\tilde{x}_i\) and skip the attribute index. A Measure is then defined as \(\mathcal{M} : (\tilde{x}_i, \tilde{y}_j) \rightarrow [0,1]\), i.e. it maps each pair of attribute forms to a similarity or dissimilarity value.
The Gatherer uses the Measure together with all the attribute forms and calculates a single attribute comparison value. Hence, it can be seen as a mapping \(\mathcal{G} : (x, y, \mathcal{M}) \mapsto g \in \mathbb{R}\).
Currently, the following Gatherers are implemented:
Name | Formula
id | \(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x, y)\)
first | \(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x_1, y_1)\)
smm | \(\mathcal{G} (x, y, \mathcal{M}) = \frac{1}{2} \Big( \frac{1}{|x|} \sum_{i=1}^{l_x} \mathcal{M}(x_i, \tilde{y_i}) + \frac{1}{|y|} \sum_{i=1}^{l_y} \mathcal{M}(\tilde{x_i}, y_i) \Big)\)
Note
If a Measure has the property multiple_values,
it accepts all forms of an attribute as input and can calculate an attribute comparison value,
rather than an attribute form comparison value.
In this case, a Gatherer is not needed.
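The smm formula can be read as: for every form on one side, keep its best match on the other side, average those best matches per side, and then average the two sides. A minimal sketch in plain Python, assuming a similarity Measure (so that the best match is the maximum); `sym_max_mean` and `exact_match` are hypothetical names used only for this illustration.

```python
def sym_max_mean(first, second, measure):
    """Symmetric max-mean over two lists of attribute forms (illustrative sketch)."""
    # Best match in `second` for every form of `first`, and the other way round.
    best_for_first = [max(measure(x, y) for y in second) for x in first]
    best_for_second = [max(measure(x, y) for x in first) for y in second]
    return 0.5 * (
        sum(best_for_first) / len(first) + sum(best_for_second) / len(second)
    )


def exact_match(a, b):
    """Stand-in similarity Measure on single forms."""
    return 1.0 if a == b else 0.0


value = sym_max_mean(["red", "blue"], ["blue"], exact_match)
```

Here `red` has no match (0) and `blue` matches (1), giving 0.5 for the first side and 1.0 for the second, so the gathered value is 0.75.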
- class contextual_encoders.gatherer.FirstValueGatherer¶
Bases: contextual_encoders.gatherer.Gatherer
A Gatherer that only uses the first forms of the attributes. It can be seen as a mapping
\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x_1, y_1)\),
with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure and \(x, y\) being the same attributes from different features.
- _gather(first, second)¶
Gathers the given attributes by measuring only their first values.
- Parameters
first – The value of the first attribute.
second – The value of the second attribute.
- Returns
The combined value.
- class contextual_encoders.gatherer.Gatherer¶
Bases: abc.ABC
The abstract base class of all Gatherers.
- __init__()¶
Initializes the Gatherer.
- abstract _gather(first, second)¶
The abstract method to gather all attribute form comparison values of two features. This method needs to be implemented by concrete Gatherers.
Note
Multiple values, e.g. comma separated, are explicitly allowed.
- Parameters
first – A list of the value(s) of the first attribute.
second – A list of the value(s) of the second attribute.
- Returns
The aggregated value.
- gather(first, second)¶
Combines the given attributes.
Note
Multiple values, e.g. comma separated, are explicitly allowed.
- Parameters
first – A list of the value(s) of the first attribute.
second – A list of the value(s) of the second attribute.
- Returns
The aggregated value.
- set_measure(measure)¶
Sets the Measure for the Gatherer.
- Parameters
measure – The Similarity or Dissimilarity Measure, see
SimilarityMeasureandDissimilarityMeasure.
- class contextual_encoders.gatherer.GathererFactory¶
Bases: object
The factory class for creating Gatherers.
- static create(gatherer)¶
Creates a Gatherer given the name.
- Parameters
gatherer – The name of the Gatherer, which can be id, first or smm.
- Returns
The concrete instance of the Gatherer.
- class contextual_encoders.gatherer.IdentityGatherer¶
Bases: contextual_encoders.gatherer.Gatherer
A Gatherer that lets the Measure decide how to handle multiple attribute forms. It can be seen as a mapping
\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x, y)\),
with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure and \(x, y\) being the same attributes from different features.
- _gather(first, second)¶
Calls the Measure without handling multiple values at Gatherer level.
- Parameters
first – The value of the first attribute.
second – The value of the second attribute.
- Returns
The value returned from the measure.
- class contextual_encoders.gatherer.SymMaxMeanGatherer¶
Bases: contextual_encoders.gatherer.Gatherer
A Gatherer that symmetrically measures all pairwise attribute forms, keeps only the maximum value for each form and averages these maxima. It can be seen as a mapping
\(\mathcal{G} (x, y, \mathcal{M}) = \frac{1}{2} \Big( \frac{1}{|x|} \sum_{i=1}^{l_x} \mathcal{M}(x_i, \tilde{y_i}) + \frac{1}{|y|} \sum_{i=1}^{l_y} \mathcal{M}(\tilde{x_i}, y_i) \Big)\)
with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure, \(x, y\) being the same attributes from different features, \(|x|\) the number of attribute forms of the attribute \(x\) and \(\tilde{x_i} = argmax_{j=1,...,l_x} \mathcal{M}(x_j, y_i)\); \(\tilde{y_i}\) is defined analogously.
- _gather(first, second)¶
Gathers two attributes symmetrically based on the maximum comparison value.
- Parameters
first – The value of the first attribute.
second – The value of the second attribute.
- Returns
The combined value.
Inverter¶
An Inverter is used to calculate a dissimilarity value given a similarity value and vice versa. It can be seen as a one-to-one mapping \(\mathcal{I} : [0,1] \rightarrow [0,1]\).
Currently, the following Inverters are implemented:
Name | Formula
lin | \(\mathcal{I} (s) = 1 - s\)
sqrt | \(\mathcal{I} (s) = \sqrt{1 - s}\)
exp | \(\mathcal{I} (s) = 2 - e^{ln(2) \cdot s}\)
cos | \(\mathcal{I} (s) = cos(\frac{\pi}{2} \cdot s)\)
Note
If a custom Inverter is implemented, make sure that the function is invertible and that both its domain and its range are \([0, 1]\).
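The four formulas and their inverses (as given in the class descriptions of this section) can be checked numerically with NumPy. The following is a sketch for illustration only; the `inverters` dictionary and the variable names are assumptions and do not reflect how the library organizes its Inverter classes.

```python
import numpy as np

# Each entry maps a name to (similarity -> dissimilarity, dissimilarity -> similarity).
inverters = {
    "lin": (lambda s: 1.0 - s, lambda d: 1.0 - d),
    "sqrt": (lambda s: np.sqrt(1.0 - s), lambda d: 1.0 - d**2),
    "exp": (
        lambda s: 2.0 - np.exp(np.log(2.0) * s),
        lambda d: np.log(2.0 - d) / np.log(2.0),
    ),
    "cos": (
        lambda s: np.cos(np.pi / 2.0 * s),
        lambda d: 2.0 / np.pi * np.arccos(d),
    ),
}

similarity = np.array([[1.0, 0.25], [0.25, 1.0]])

# Apply each inverter elementwise and convert back again.
dissimilarities = {name: forward(similarity) for name, (forward, _) in inverters.items()}
round_trips = {name: back(dissimilarities[name]) for name, (_, back) in inverters.items()}
```

All four functions map a similarity of 1 to a dissimilarity of 0 and a similarity of 0 to a dissimilarity of 1; they differ only in how they bend the values in between.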
- class contextual_encoders.inverter.CosineInverter¶
Bases: contextual_encoders.inverter.Inverter
An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a cosine ansatz. It can be used as the cos option.
- dissimilarity_to_similarity(dissimilarity_matrix)¶
Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = \frac{2}{\pi} acos(d)\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.
- Parameters
dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.
- Returns
A similarity matrix as 2D numpy array.
- similarity_to_dissimilarity(similarity_matrix)¶
Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = cos(\frac{\pi}{2} \cdot s)\), with \(s\) being the similarity matrix. All operations are applied elementwise.
- Parameters
similarity_matrix – A similarity matrix as 2D numpy array.
- Returns
A dissimilarity matrix as 2D numpy array.
- class contextual_encoders.inverter.ExponentialInverter¶
Bases: contextual_encoders.inverter.Inverter
An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using an exponential ansatz. It can be used as the exp option.
- dissimilarity_to_similarity(dissimilarity_matrix)¶
Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = \frac{1}{ln(2)} ln(2 - d)\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.
- Parameters
dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.
- Returns
A similarity matrix as 2D numpy array.
- similarity_to_dissimilarity(similarity_matrix)¶
Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = 2 - e^{ln(2) \cdot s}\), with \(s\) being the similarity matrix. All operations are applied elementwise.
- Parameters
similarity_matrix – A similarity matrix as 2D numpy array.
- Returns
A dissimilarity matrix as 2D numpy array.
- class contextual_encoders.inverter.Inverter¶
Bases: abc.ABC
An abstract base class for all concrete Inverter implementations.
- abstract dissimilarity_to_similarity(dissimilarity_matrix)¶
Calculates a similarity matrix given a dissimilarity matrix.
- Parameters
dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.
- Returns
A similarity matrix as 2D numpy array.
- abstract similarity_to_dissimilarity(similarity_matrix)¶
Calculates a dissimilarity matrix given a similarity matrix.
- Parameters
similarity_matrix – A similarity matrix as 2D numpy array.
- Returns
A dissimilarity matrix as 2D numpy array.
- class contextual_encoders.inverter.InverterFactory¶
Bases: object
A factory class to create concrete Inverter instances with default values.
- static create(inverter)¶
Creates an instance of the given Inverter name.
- Parameters
inverter – The name of the Inverter, which can be lin, sqrt, exp or cos.
- Returns
An instance of the Inverter.
- Raises
ValueError – The given Inverter does not exist.
- class contextual_encoders.inverter.LinearInverter¶
Bases: contextual_encoders.inverter.Inverter
An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a linear ansatz. It can be used as the lin option.
- dissimilarity_to_similarity(dissimilarity_matrix)¶
Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = 1 - d\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.
- Parameters
dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.
- Returns
A similarity matrix as 2D numpy array.
- similarity_to_dissimilarity(similarity_matrix)¶
Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = 1 - s\), with \(s\) being the similarity matrix. All operations are applied elementwise.
- Parameters
similarity_matrix – A similarity matrix as 2D numpy array.
- Returns
A dissimilarity matrix as 2D numpy array.
- class contextual_encoders.inverter.SqrtInverter¶
Bases: contextual_encoders.inverter.Inverter
An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a sqrt ansatz. It can be used as the sqrt option.
- dissimilarity_to_similarity(dissimilarity_matrix)¶
Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = 1 - d^2\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.
- Parameters
dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.
- Returns
A similarity matrix as 2D numpy array.
- similarity_to_dissimilarity(similarity_matrix)¶
Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = \sqrt{1 - s}\), with \(s\) being the similarity matrix. All operations are applied elementwise.
- Parameters
similarity_matrix – A similarity matrix as 2D numpy array.
- Returns
A dissimilarity matrix as 2D numpy array.
Measure¶
A Measure is used to calculate a comparison value between two attribute forms.
Let \(x, y \in F\) be two features from the feature space \(F\). Each feature consists of \(k\) attributes and each attribute can have up to \(l\) forms. A form of an attribute of the feature \(x\) can then be denoted as \(x_{a,i}\), with \(a\) being the attribute and \(i\) being the form. For simplicity, we just denote it as \(\tilde{x}_i\) and skip the attribute index. A Measure is then defined as \(\mathcal{M} : (\tilde{x}_i, \tilde{y}_j) \rightarrow [0,1]\), i.e. it maps each pair of attribute forms to a similarity or dissimilarity value.
Note
A Measure can also be defined on an attribute, rather than on attribute forms.
This can be done by setting multiple_values to True, see Measure.
Note
A Measure always needs to return values within the range \([0,1]\).
- class contextual_encoders.measure.DissimilarityMeasure(symmetric, multiple_values)¶
Bases: contextual_encoders.measure.Measure, abc.ABC
An abstract base class for calculating dissimilarity values.
- __init__(symmetric, multiple_values)¶
Initializes the Dissimilarity Measure.
- Parameters
symmetric – Defines whether the Dissimilarity Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).
multiple_values – Defines whether the Dissimilarity Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.
- class contextual_encoders.measure.Measure(symmetric, multiple_values)¶
Bases: abc.ABC
The abstract base class for all implementations of Measures.
- __generate_cache_key(second)¶
Generates a serializable cache key given the two attributes or attribute forms.
- Parameters
first – The first attribute or attribute form.
second – The second attribute or attribute form.
- Returns
A unique key representing the two attributes or attribute forms.
- __init__(symmetric, multiple_values)¶
Initializes the Measure.
- Parameters
symmetric – Defines whether the Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).
multiple_values – Defines whether the Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.
- __read_from_cache(first, second)¶
Reads the comparison value for the two given attributes or attribute forms from the cache. If the value cannot be found in the cache, None will be returned instead.
- Parameters
first – The first attribute or attribute form.
second – The second attribute or attribute form.
- Returns
The comparison value or None if it cannot be found.
- __write_to_cache(first, second, value)¶
Writes the comparison value with the two attributes or attribute forms into the cache.
- Parameters
first – The first attribute or attribute form.
second – The second attribute or attribute form.
value – The comparison value.
- abstract _compare(first, second)¶
Compares the two attributes or attribute forms. This is the abstract method that needs to be implemented by concrete Measure instances.
- Parameters
first – The first attribute or attribute form.
second – The second attribute or attribute form.
- Returns
The comparison value which is in \([0,1]\).
- can_handle_multiple_values()¶
Returns True if the Measure can handle multiple values. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.
- Returns
True if the Measure can handle multiple values.
- compare(first, second)¶
Compares the two attributes or attribute forms. This method caches precalculated values within an in-memory dictionary.
- Parameters
first – The first attribute or attribute form.
second – The second attribute or attribute form.
- Returns
The comparison value which is in \([0,1]\).
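The caching behaviour of compare can be pictured with a small wrapper around a comparison function. This sketch is an assumption about one reasonable design, not the library's code: `CachedMeasure`, its `calls` counter, and the idea of sharing one cache entry for both argument orders of a symmetric measure are all hypothetical.

```python
class CachedMeasure:
    """Hypothetical wrapper that caches comparison values in an in-memory dict."""

    def __init__(self, compare_fn, symmetric=True):
        self._compare_fn = compare_fn
        self._symmetric = symmetric
        self._cache = {}
        self.calls = 0  # counts how often the wrapped function actually ran

    def _key(self, first, second):
        pair = (str(first), str(second))
        # For a symmetric measure, (a, b) and (b, a) share one cache entry.
        return tuple(sorted(pair)) if self._symmetric else pair

    def compare(self, first, second):
        key = self._key(first, second)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._compare_fn(first, second)
        return self._cache[key]


measure = CachedMeasure(lambda a, b: 1.0 if a == b else 0.0)
first_result = measure.compare("dog", "cat")
second_result = measure.compare("cat", "dog")  # served from the cache
```

Caching pays off because the same pairs of forms typically recur many times while an n x n matrix is filled.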
- export_to_file(path)¶
Exports the Measure including the cache to the given path.
- Parameters
path – The path to export the Measure to.
- import_from_file(path)¶
Imports the Measure including the cache from the given path.
- Parameters
path – The path to import the Measure from.
- is_symmetric()¶
Returns True if the Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).
- Returns
True if the Measure is symmetric.
- class contextual_encoders.measure.PathLengthMeasure(context)¶
Bases: contextual_encoders.measure.SimilarityMeasure
A SimilarityMeasure based on counting the path length between two concepts.
- __init__(context)¶
Initializes the PathLengthMeasure.
- Parameters
context – The GraphContext used for comparison.
- _compare(first, second)¶
Compares the two attribute forms based on their path length in the Context. The Measure counts the shortest path length \(p\) going from the first to the second value and returns \(\frac{1}{1+p}\).
- Parameters
first – The first attribute form.
second – The second attribute form.
- Returns
The PathLength Similarity comparison value.
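The value \(\frac{1}{1+p}\) can be reproduced with a breadth-first search over a small concept graph. This is a sketch under assumptions: the graph is given as a plain edge list, edges are treated as undirected with unit weight, and unreachable concepts get a similarity of 0; how the library handles edge direction, weights and unreachable nodes is not stated here, and both function names are hypothetical.

```python
from collections import deque


def shortest_path_length(edges, start, goal):
    """Number of edges on the shortest path, or None if there is no path."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_length_similarity(edges, first, second):
    """Similarity 1 / (1 + p) for shortest path length p (0.0 if unreachable)."""
    p = shortest_path_length(edges, first, second)
    return 0.0 if p is None else 1.0 / (1.0 + p)


edges = [("animal", "dog"), ("animal", "cat")]
dog_cat = path_length_similarity(edges, "dog", "cat")
```

Identical concepts have path length 0 and therefore similarity 1; `dog` and `cat` are two edges apart, which gives 1/3.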
- class contextual_encoders.measure.SimilarityMeasure(symmetric, multiple_values)¶
Bases: contextual_encoders.measure.Measure, abc.ABC
An abstract base class for calculating similarity values.
- __init__(symmetric, multiple_values)¶
Initializes the Similarity Measure.
- Parameters
symmetric – Defines whether the Similarity Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).
multiple_values – Defines whether the Similarity Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.
- class contextual_encoders.measure.WuPalmer(context, offset=0.0)¶
Bases: contextual_encoders.measure.SimilarityMeasure
A tree based similarity measure based on the Wu-Palmer Similarity Measure.
- __init__(context, offset=0.0)¶
Initializes the WuPalmer Similarity Measure.
- Parameters
context – The TreeContext used for comparison.
offset – Either a real value or depth. If a real value is used, the distance between the root and a concept will always be at least the value of the offset. If depth is used, the offset will be \(\frac{1}{N}\), with \(N\) being the depth of the tree. Using an offset prevents a similarity of zero.
- _compare(first, second)¶
Compares the two given attribute forms using the WuPalmer Similarity Measure.
- Parameters
first – The first attribute form.
second – The second attribute form.
- Returns
The WuPalmer Similarity comparison value.
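For orientation, the textbook Wu-Palmer similarity is \(\frac{2 \cdot depth(lcs)}{depth(a) + depth(b)}\), where \(lcs\) is the deepest common ancestor of the two concepts. The sketch below implements that textbook form on a tree stored as a child-to-parent dictionary, with unit edge weights and the root at depth 0. It does not model the library's edge weights or its offset option, and all names in it are hypothetical.

```python
def ancestors(parent_of, node):
    """Path from the node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def depth(parent_of, node):
    """Number of edges between the node and the root."""
    return len(ancestors(parent_of, node)) - 1


def wu_palmer(parent_of, first, second):
    """Textbook Wu-Palmer similarity with the root at depth 0."""
    path_first = ancestors(parent_of, first)
    on_second_path = set(ancestors(parent_of, second))
    # The first shared node on the way up is the deepest common ancestor.
    common = next(node for node in path_first if node in on_second_path)
    total = depth(parent_of, first) + depth(parent_of, second)
    return 1.0 if total == 0 else 2.0 * depth(parent_of, common) / total


parent_of = {"mammal": "animal", "dog": "mammal", "cat": "mammal", "bird": "animal"}
dog_cat = wu_palmer(parent_of, "dog", "cat")
dog_bird = wu_palmer(parent_of, "dog", "bird")
```

In this plain form, two concepts whose only common ancestor is the root get a similarity of exactly 0 (as `dog` and `bird` do here), which is the situation the offset parameter described above is meant to avoid.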
Reducer¶
A Reducer transforms a similarity or dissimilarity matrix into a set of vectors. Mathematically, it can be seen as a map \(\mathcal{R} : D \in \mathbb{R}^{n \times n} \rightarrow \tilde{X} \subset \mathbb{R}^{m}\), with \(m \in \mathbb{N}\) being the (configurable) dimension of the encoding and \(\tilde{X}\) the encoded dataset as vectors.
In other words, let \(n \in \mathbb{N}\) be the number of features. A Reducer then takes the similarity or dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) and produces \(n\) Euclidean vectors of dimension \(m\).
Currently, the following Reducers are implemented:
Name | Description
mds | Creates a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space.
- class contextual_encoders.reducer.DissimilarityMatrixReducer(n_components)¶
Bases: contextual_encoders.reducer.Reducer, abc.ABC
An abstract base class for reducing dissimilarity matrices.
- class contextual_encoders.reducer.MultidimensionalScalingReducer(n_components=2, metric=True)¶
Bases: contextual_encoders.reducer.DissimilarityMatrixReducer
A reducer using the Multidimensional Scaling approach (MDS) from scikit-learn. It can be used with the mds option.
- __init__(n_components=2, metric=True)¶
Initializes the MultidimensionalScalingReducer.
- Parameters
n_components – The dimension of the output vectors.
metric – If True, perform metric MDS; otherwise, perform non-metric MDS.
- get_stress()¶
Gets the stress level for the performed MDS.
- Returns
The stress level of the MDS.
- reduce(dissimilarity_matrix)¶
Reduces the given dissimilarity matrix using the MDS approach.
- Parameters
dissimilarity_matrix – The dissimilarity matrix as 2D numpy array.
- Returns
Encoded vectors as 2D numpy array of size \(n \times m\), with \(n\) being the number of features and \(m\) the dimension of the vectors, i.e. n_components.
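To make the mapping from an \(n \times n\) dissimilarity matrix to \(n\) vectors of dimension \(m\) concrete, here is a sketch of classical (Torgerson) MDS in plain NumPy. Note that this is a different, simpler algorithm than the scikit-learn MDS the reducer is documented to use; it is swapped in only because it is short and deterministic, and `classical_mds` is a hypothetical helper.

```python
import numpy as np


def classical_mds(dissimilarity_matrix, n_components=2):
    """Classical MDS: embed points so that Euclidean distances approximate D."""
    d = np.asarray(dissimilarity_matrix, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d**2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Pairwise distances of a 3-4-5 right triangle, which fits exactly into 2D.
d = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
vectors = classical_mds(d, n_components=2)
recovered = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
```

For this input the embedding reproduces the given distances exactly; for general dissimilarity matrices the distances are only approximated, and a stress value such as the one returned by get_stress quantifies the mismatch.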
- class contextual_encoders.reducer.Reducer(n_components)¶
Bases: abc.ABC
The abstract base class for all Reducers.
- __init__(n_components)¶
Initializes the Reducer.
- Parameters
n_components – The dimension of the output vectors.
- abstract reduce(matrix)¶
The abstract method that is implemented by concrete instances of Reducers.
- Parameters
matrix – The similarity or dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) as 2D numpy array.
- Returns
The set of vectors \(\tilde{X} \in \mathbb{R}^{n \times m}\), with \(m\) being n_components.
- class contextual_encoders.reducer.ReducerFactory¶
Bases: object
The factory class for creating Reducers with default values.
- static create(reducer)¶
Creates a concrete Reducer instance given the name.
- Parameters
reducer – The name of the Reducer, which can be mds.
- Returns
The instance of the Reducer.
- class contextual_encoders.reducer.SimilarityMatrixReducer(n_components)¶
Bases: contextual_encoders.reducer.Reducer, abc.ABC
An abstract base class for reducing similarity matrices.