API Reference

This is the API Reference of the Python library Contextual Encoders.

Note

When the dependencies are updated, export them with poetry to refresh the requirements.txt within the doc directory: poetry export --dev -f requirements.txt --output requirements.txt.

Aggregator

Aggregators combine multiple matrices into a single matrix. They are used to combine the similarity or dissimilarity matrices of multiple attributes into a single one. Thus, an Aggregator \(\mathcal{A}\) is a mapping of the form \(\mathcal{A} : \mathbb{R}^{n \times n \times k} \rightarrow \mathbb{R}^{n \times n}\), with \(n\) being the number of features and \(k\) being the number of similarity or dissimilarity matrices of type \(D \in \mathbb{R}^{n \times n}\), i.e. the number of attributes/columns of the dataset.

Currently, the following Aggregators are implemented:

Name

Formula

mean

\(\mathcal{A} (D^1, D^2, ..., D^k) = \frac{1}{k} \sum_{i=1}^{k} D^i\)

median

\(\mathcal{A} (D^1, D^2, ..., D^k) = \left\{ \begin{array}{ll} D^{\frac{k}{2}} & \mbox{, if } k \mbox{ is even} \\ \frac{1}{2} \left( D^{\frac{k-1}{2}} + D^{\frac{k+1}{2}} \right) & \mbox{, if } k \mbox{ is odd} \end{array} \right.\)

max

\(\mathcal{A} (D^1, D^2, ..., D^k) = max_{ l} \; D_{i,j}^l\)

min

\(\mathcal{A} (D^1, D^2, ..., D^k) = min_{ l} \; D_{i,j}^l\)
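As a rough illustration of the table above, the four aggregations can be sketched with plain NumPy reductions along the zero axis. The matrices and the `aggregated` dictionary are made up for this example and are not part of the library; taking the median elementwise with `np.median` is an assumption about how the formula is meant to be read.

```python
import numpy as np

# Three hypothetical 2x2 dissimilarity matrices, one per attribute (k = 3).
matrices = [
    np.array([[0.0, 0.2], [0.2, 0.0]]),
    np.array([[0.0, 0.6], [0.6, 0.0]]),
    np.array([[0.0, 0.4], [0.4, 0.0]]),
]

# Stack to shape (k, n, n) so that axis 0 runs over the attributes.
stacked = np.stack(matrices, axis=0)

# Every reduction collapses axis 0 and returns a single n x n matrix.
aggregated = {
    "mean": np.mean(stacked, axis=0),
    "median": np.median(stacked, axis=0),
    "max": np.max(stacked, axis=0),
    "min": np.min(stacked, axis=0),
}
```

Each entry of `aggregated` has the same shape as one input matrix, which matches the mapping from \(\mathbb{R}^{n \times n \times k}\) to \(\mathbb{R}^{n \times n}\) described above.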

class contextual_encoders.aggregator.Aggregator

Bases: abc.ABC

An abstract base class for Aggregators. If custom Aggregators are created, it is enough to derive from this class and use it whenever an Aggregator is needed.

abstract aggregate(matrices)

The abstract method that is implemented by the concrete Aggregators.

Parameters

matrices – a list of similarity or dissimilarity matrices as 2D numpy arrays.

Returns

a single 2D numpy array.

class contextual_encoders.aggregator.AggregatorFactory

Bases: object

The factory class for creating concrete instances of the implemented Aggregators with default values.

static create(aggregator)

Creates an instance of the given Aggregator name.

Parameters

aggregator – The name of the Aggregator, which can be mean, median, max or min.

Returns

An instance of the Aggregator.

Raises

ValueError – The given Aggregator does not exist.

class contextual_encoders.aggregator.MaxAggregator

Bases: contextual_encoders.aggregator.Aggregator

This class aggregates similarity or dissimilarity matrices using the max. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MaxAggregator calculates

\(\mathcal{A} (D^1, D^2, ..., D^k) = max_{ l} \; D_{i,j}^l\).

aggregate(matrices)

Calculates the max of all given matrices along the zero axis.

Parameters

matrices – A list of 2D numpy arrays.

Returns

A 2D numpy array.

class contextual_encoders.aggregator.MeanAggregator

Bases: contextual_encoders.aggregator.Aggregator

This class aggregates similarity or dissimilarity matrices using the mean. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MeanAggregator calculates

\(\mathcal{A} (D^1, D^2, ..., D^k) = \frac{1}{k} \sum_{i=1}^{k} D^i\).

aggregate(matrices)

Calculates the mean of all given matrices along the zero axis.

Parameters

matrices – A list of 2D numpy arrays.

Returns

A 2D numpy array.

class contextual_encoders.aggregator.MedianAggregator

Bases: contextual_encoders.aggregator.Aggregator

This class aggregates similarity or dissimilarity matrices using the median. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MedianAggregator calculates

\(\mathcal{A} (D^1, D^2, ..., D^k) = \left\{ \begin{array}{ll} D^{\frac{k}{2}} & \mbox{, if } k \mbox{ is even} \\ \frac{1}{2} \left( D^{\frac{k-1}{2}} + D^{\frac{k+1}{2}} \right) & \mbox{, if } k \mbox{ is odd} \end{array} \right.\)

aggregate(matrices)

Calculates the median of all given matrices along the zero axis.

Parameters

matrices – A list of 2D numpy arrays.

Returns

A 2D numpy array.

class contextual_encoders.aggregator.MinAggregator

Bases: contextual_encoders.aggregator.Aggregator

This class aggregates similarity or dissimilarity matrices using the min. Given \(k\) similarity or dissimilarity matrices \(D^i \in \mathbb{R}^{n \times n}\), the MinAggregator calculates

\(\mathcal{A} (D^1, D^2, ..., D^k) = min_{ l} \; D_{i,j}^l\).

aggregate(matrices)

Calculates the min of all given matrices along the zero axis.

Parameters

matrices – A list of 2D numpy arrays.

Returns

A 2D numpy array.

MatrixComputer

The MatrixComputer combines the Measure with the Gatherer and calculates the similarity or dissimilarity matrix for one attribute. Thus, the MatrixComputer can be seen as a mapping \(\mathcal{M}: F \rightarrow \mathbb{R}^{n \times n}\), with \(F\) being the feature space and \(n\) the number of features.

class contextual_encoders.computer.MatrixComputer(measure, gatherer, separator_token)

Bases: object

The service class to compute a similarity or dissimilarity matrix.

__init__(measure, gatherer, separator_token)

Initializes the MatrixComputer.

Parameters
  • measure – The instance of the Similarity or Dissimilarity Measure. See SimilarityMeasure and DissimilarityMeasure.

  • gatherer – Either the name of a Gatherer or a concrete instance. See GathererFactory for implemented Gatherers and Gatherer for creating custom Gatherers. If the specified measure can handle multiple values (forms of an attribute), the IdentityGatherer is used regardless of this setting.

  • separator_token – A string for separating forms of categorical attributes.

compute(data)

Computes the similarity or dissimilarity matrix based on the given data.

Parameters

data – A single pandas series containing the data. Note that each entry can have multiple values (the forms of an attribute), which are separated by the separator_token.

Returns

A 2D numpy array representing the similarity or dissimilarity matrix.
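The following is a minimal sketch of the idea behind compute, not the library's implementation. It uses a plain Python list in place of a pandas series and nested lists in place of a numpy array; `compute_matrix`, `compare` and `jaccard` are hypothetical names introduced only for illustration.

```python
def compute_matrix(entries, compare, separator_token=","):
    """Build an n x n comparison matrix as nested lists.

    `compare` is a hypothetical stand-in for Gatherer plus Measure: it
    receives two lists of forms and returns a value in [0, 1].
    """
    # Split every entry into its forms, e.g. "rock,pop" -> ["rock", "pop"].
    forms = [[f.strip() for f in str(e).split(separator_token)] for e in entries]
    # Compare every feature with every other feature.
    return [[compare(a, b) for b in forms] for a in forms]


def jaccard(a, b):
    """Illustrative similarity: shared forms divided by all distinct forms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


matrix = compute_matrix(["rock,pop", "pop", "jazz"], jaccard)
```

Here `matrix[0][1]` compares the first entry (two forms) with the second entry (one form).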

Context

The Context is the core part of the Contextual Encoders library. It is used to measure the similarity or dissimilarity of attributes. So far, two Context types are implemented: GraphContext and TreeContext. However, it is very likely that custom Contexts need to be implemented. For that purpose, the base classes Context and GraphBasedContext provide optimized import and export functions as well as caching.

class contextual_encoders.context.Context(name)

Bases: abc.ABC

The abstract base class for all Contexts.

__init__(name)

Initializes the Context.

Parameters

name – The name of the Context.

abstract export_to_file(path)

Exports the Context to the given file path.

Parameters

path – The path to export the Context to.

abstract import_from_file(path)

Imports the Context from the given file path.

Parameters

path – The path to import the Context from.

class contextual_encoders.context.GraphBasedContext(name)

Bases: contextual_encoders.context.Context

A base class for all graph-based Contexts.

__init__(name)

Initializes the GraphBasedContext.

Parameters

name – The name of the Context.

draw()

Draws the graph using matplotlib.

export_to_file(path)

Exports the graph to the given file path.

Parameters

path – The path to export the graph to.

get_graph()

Returns the networkx DiGraph instance.

Returns

A networkx DiGraph instance.

import_from_file(path)

Imports the graph from the given file path.

Parameters

path – The path to import the graph from.

class contextual_encoders.context.GraphContext(name)

Bases: contextual_encoders.context.GraphBasedContext

A graph-based Context that can be used for graph-based measures.

add_concept(node, neighbor=None, weight=1.0)

Adds a new node to the graph. If the neighbor does not exist, it will be added as a new node. If the node already exists, the weight will be overwritten.

Parameters
  • node – The name of the node.

  • neighbor – The name of the neighbor node.

  • weight – The weight of the edge between the node and the neighbor.

class contextual_encoders.context.TreeContext(name)

Bases: contextual_encoders.context.GraphBasedContext

A graph-based Context that can be used for tree-based measures.

add_concept(child, parent=None, weight=1.0)

Adds a new node to the tree, where the name of the context serves as the root node. If the parent does not exist, it will be added as a new node. If the parent is None, the root node will serve as the parent. If the node already exists, the weight will be overwritten.

Parameters
  • child – The name of the child node.

  • parent – The name of the parent node.

  • weight – The weight of the edge between the child and the parent.

get_root()

Gets the name of the root, i.e. the name of the context.

Returns

The name of the root.

get_tree()

Returns the networkx DiGraph instance.

Returns

A networkx DiGraph representing the tree.

ContextualEncoder

The ContextualEncoder is the main interface for using the Contextual Encoders library. It performs the contextual encoding of a given dataset. Moreover, it inherits from the scikit-learn BaseEstimator and TransformerMixin types and can thus be used in scikit-learn Pipelines.

Having a dataset \(X \subset \mathcal{F}\), with \(\mathcal{F}\) denoting the feature space, the ContextualEncoder can be seen as a map \(\mathcal{E} : X \subset \mathcal{F} \rightarrow \tilde{X} \subset \mathbb{R}^{m}\), with \(m \in \mathbb{N}\) being the (configurable) dimension of the encoding and \(\tilde{X}\) the encoded dataset as vectors.

In other words, let \(n \in \mathbb{N}\) be the number of features. The ContextualEncoder then takes \(n\) features that are either numerical, categorical or a mix of both and produces \(n\) vectors of dimension \(m \in \mathbb{N}\).

Note

Additionally, a similarity matrix \(S \in \mathbb{R}^{n \times n}\) and dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) will be calculated.

Note

Assume we have a dataset with \(k\) columns and \(n\) rows. Each column is called an attribute and each row is called a feature. One attribute of a particular feature can consist of multiple values. Those values are called the forms of the attribute. The forms can be separated, e.g. with a comma. The contextual encoding then consists of the following steps:

  • Calculate a comparison value for each form of each attribute and feature using a Measure.

  • Combine the form comparison values to an attribute comparison value using a Gatherer.

  • Combine the attribute comparison values to a feature comparison value using an Aggregator.

  • Use an Inverter to either get a similarity value from a dissimilarity value or vice versa.

  • Collect all the feature comparison values and construct the similarity and dissimilarity matrix within the MatrixComputer.

  • Convert the similarity or dissimilarity matrix to a set of vectors using a Reducer.

class contextual_encoders.encoder.ContextualEncoder(measures, separator_token=',', gatherers='smm', aggregator='mean', inverters='sqrt', reducer='mds')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

The interface for encoding contextual variables.

__init__(measures, separator_token=',', gatherers='smm', aggregator='mean', inverters='sqrt', reducer='mds')

Initializes the ContextualEncoder.

Note

If no concrete instances but only names are specified for the components, an instance will be created with the default values.

Parameters
  • measures – A list of Measures. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). See Measure for currently implemented Measures and how custom Measures can be implemented.

  • separator_token – A string for separating forms of attributes.

  • gatherers – A list of either Gatherer instances or Gatherer names. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). If only one Gatherer should be used for all columns, a single object is enough and a list is not needed. See Gatherer for currently implemented Gatherers and how custom Gatherers can be implemented. See GathererFactory for the names of the implemented Gatherers.

  • aggregator – Either an Aggregator instance or an Aggregator name. See Aggregator for currently implemented Aggregators and how custom Aggregators can be implemented. See AggregatorFactory for the names of the implemented Aggregators.

  • inverters – A list of either Inverter instances or Inverter names. If \(k \in \mathbb{N}\) columns should be encoded, the list needs to be of size \(k\). If only one Inverter should be used for all columns, a single object is enough and a list is not needed. See Inverter for currently implemented Inverters and how custom Inverters can be implemented. See InverterFactory for the names of the implemented Inverters.

  • reducer – Either a Reducer instance or a Reducer name. See Reducer for currently implemented Reducers and how custom Reducers can be implemented. See ReducerFactory for the names of the implemented Reducers.

get_dissimilarity_matrix()

Gets the dissimilarity matrix.

Returns

The dissimilarity matrix as 2D numpy array.

get_similarity_matrix()

Gets the similarity matrix.

Returns

The similarity matrix as 2D numpy array.

transform(x)

Encodes the given contextual variables.

Parameters

x – The data as numpy array, pandas dataframe or python list format.

Returns

The encoded data as numpy array.

Gatherer

A Gatherer is used to combine the form comparison values into an attribute comparison value. That is, when an attribute of a feature contains multiple values (forms of an attribute), the Gatherer combines the pairwise form comparison values into a single attribute comparison value.

Let \(x, y \in F\) be two features from the feature space \(F\). Each feature consists of \(k\) attributes and each attribute can have up to \(l\) forms. A form of an attribute of the feature \(x\) can then be denoted as \(x_{a,i}\), with \(a\) being the attribute and \(i\) being the form. For simplicity, we just denote it as \(\tilde{x}_i\) and skip the attribute index. A Measure is then defined as \(\mathcal{M} : (\tilde{x}_i, \tilde{y}_j) \rightarrow [0,1]\), i.e. it maps each pair of attribute forms to a similarity or dissimilarity value.

The Gatherer uses the Measure together with all the attribute forms and calculates a single attribute comparison value. Hence, it can be seen as a mapping \(\mathcal{G} : (x, y, \mathcal{M}) \mapsto g \in \mathbb{R}\).

Currently, the following Gatherers are implemented:

Name

Formula

id

\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x, y)\)

first

\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x_1, y_1)\)

smm

\(\mathcal{G} (x, y, \mathcal{M}) = \frac{1}{2} \Big( \frac{1}{|x|} \sum_{i=1}^{l_x} \mathcal{M}(x_i, \tilde{y_i}) + \frac{1}{|y|} \sum_{i=1}^{l_y} \mathcal{M}(\tilde{x_i}, y_i) \Big)\)

Note

If a Measure has the property multiple_values, it accepts all forms of an attribute as input and can calculate an attribute comparison value, rather than an attribute form comparison value.

In this case, a Gatherer is not needed.
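The smm formula from the table above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the library's code: `sym_max_mean` and `exact_match` are hypothetical helpers, and the Measure is assumed to be a similarity, so that taking the maximum picks the best-matching form.

```python
def sym_max_mean(first, second, measure):
    """Symmetric max-mean over two lists of attribute forms."""
    # For every form of `first`, keep its best match among the forms of `second`.
    best_first = [max(measure(x, y) for y in second) for x in first]
    # Same in the other direction, which makes the result symmetric.
    best_second = [max(measure(x, y) for x in first) for y in second]
    return 0.5 * (sum(best_first) / len(first) + sum(best_second) / len(second))


def exact_match(a, b):
    """Illustrative Measure: 1.0 for identical forms, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


value = sym_max_mean(["rock", "pop"], ["pop"], exact_match)
```

In this example the first attribute has two forms and the second has one, so the two averages run over lists of different length before they are combined.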

class contextual_encoders.gatherer.FirstValueGatherer

Bases: contextual_encoders.gatherer.Gatherer

A Gatherer that only uses the first form of each attribute. It can be seen as a mapping

\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x_1, y_1)\),

with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure and \(x, y\) being the same attributes from different features.

_gather(first, second)

Gather the given attributes with only measuring their first values.

Parameters
  • first – The value of the first attribute.

  • second – The value of the second attribute.

Returns

The combined value.

class contextual_encoders.gatherer.Gatherer

Bases: abc.ABC

The abstract base class of all Gatherers.

__init__()

Initializes the Gatherer.

abstract _gather(first, second)

The abstract method to gather all attribute form comparison values of two features. This method needs to be implemented by concrete Gatherers.

Note

Multiple values, e.g. comma-separated, are explicitly possible.

Parameters
  • first – A list of the value(s) of the first attribute.

  • second – A list of the value(s) of the second attribute.

Returns

The aggregated value.

gather(first, second)

Combines the given attributes.

Note

Multiple values, e.g. comma-separated, are explicitly possible.

Parameters
  • first – A list of the value(s) of the first attribute.

  • second – A list of the value(s) of the second attribute.

Returns

The aggregated value.

set_measure(measure)

Sets the Measure for the Gatherer.

Parameters

measure – The Similarity or Dissimilarity Measure, see SimilarityMeasure and DissimilarityMeasure.

class contextual_encoders.gatherer.GathererFactory

Bases: object

The factory class for creating Gatherers.

static create(gatherer)

Creates a Gatherer given the name.

Parameters

gatherer – The name of the Gatherer, which can be id, first or smm.

Returns

The concrete instance of the Gatherer.

class contextual_encoders.gatherer.IdentityGatherer

Bases: contextual_encoders.gatherer.Gatherer

A Gatherer that lets the Measure decide how to handle multiple attribute forms. It can be seen as a mapping

\(\mathcal{G} (x, y, \mathcal{M}) = \mathcal{M}(x, y)\),

with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure and \(x, y\) being the same attributes from different features.

_gather(first, second)

Calling the Measure without handling multiple values at Gatherer level.

Parameters
  • first – The value of the first attribute.

  • second – The value of the second attribute.

Returns

The value returned from the measure.

class contextual_encoders.gatherer.SymMaxMeanGatherer

Bases: contextual_encoders.gatherer.Gatherer

A Gatherer that symmetrically measures all pairwise attribute forms but only uses the maximum value. It can be seen as a mapping

\(\mathcal{G} (x, y, \mathcal{M}) = \frac{1}{2} \Big( \frac{1}{|x|} \sum_{i=1}^{l_x} \mathcal{M}(x_i, \tilde{y_i}) + \frac{1}{|y|} \sum_{i=1}^{l_y} \mathcal{M}(\tilde{x_i}, y_i) \Big)\)

with \(\mathcal{M}\) being the Similarity or Dissimilarity Measure, \(x, y\) being the same attributes from different features, \(|x|\) the number of attribute forms of the attribute \(x\), \(\tilde{x_i} = argmax_{j=1,...,l_x} \mathcal{M}(x_j, y_i)\) and, analogously, \(\tilde{y_i} = argmax_{j=1,...,l_y} \mathcal{M}(x_i, y_j)\).

_gather(first, second)

Gathers two attributes symmetrically based on the maximum comparison value.

Parameters
  • first – The value of the first attribute.

  • second – The value of the second attribute.

Returns

The combined value.

Inverter

An Inverter is used to calculate a dissimilarity value given a similarity value and vice versa. It can be seen as a one-to-one mapping \(\mathcal{I} : [0,1] \rightarrow [0,1]\).

Currently, the following Inverters are implemented:

Name

Formula

lin

\(\mathcal{I} (s) = 1 - s\)

sqrt

\(\mathcal{I} (s) = \sqrt{1 - s}\)

exp

\(\mathcal{I} (s) = 2 - e^{ln(2) \cdot s}\)

cos

\(\mathcal{I} (s) = cos(\frac{\pi}{2} \cdot s)\)

Note

If a custom Inverter is implemented, make sure that the function is invertible and that both its domain and its range are \([0, 1]\).
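The formulas above, together with the inverse formulas given for each concrete Inverter in this reference, can be checked numerically. The sketch below works on scalars with the standard `math` module instead of numpy arrays; the `INVERTERS` dictionary is a hypothetical structure for this example and does not mirror the library's classes.

```python
import math

# Each entry maps a name to (similarity -> dissimilarity, dissimilarity -> similarity).
INVERTERS = {
    "lin": (lambda s: 1.0 - s, lambda d: 1.0 - d),
    "sqrt": (lambda s: math.sqrt(1.0 - s), lambda d: 1.0 - d ** 2),
    "exp": (lambda s: 2.0 - math.exp(math.log(2.0) * s),
            lambda d: math.log(2.0 - d) / math.log(2.0)),
    "cos": (lambda s: math.cos(math.pi / 2.0 * s),
            lambda d: 2.0 / math.pi * math.acos(d)),
}

for name, (to_dissimilarity, to_similarity) in INVERTERS.items():
    # Perfect similarity maps to zero dissimilarity and vice versa.
    assert math.isclose(to_dissimilarity(1.0), 0.0, abs_tol=1e-12)
    assert math.isclose(to_dissimilarity(0.0), 1.0, abs_tol=1e-12)
    # Applying the inverse recovers the original similarity.
    assert math.isclose(to_similarity(to_dissimilarity(0.3)), 0.3, abs_tol=1e-12)
```

The same round-trip check is a reasonable sanity test for a custom Inverter before plugging it into the encoder.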

class contextual_encoders.inverter.CosineInverter

Bases: contextual_encoders.inverter.Inverter

An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a cosine ansatz. It can be used as the cos option.

dissimilarity_to_similarity(dissimilarity_matrix)

Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = \frac{2}{\pi} acos(d)\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.

Parameters

dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.

Returns

A similarity matrix as 2D numpy array.

similarity_to_dissimilarity(similarity_matrix)

Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = cos(\frac{\pi}{2} \cdot s)\), with \(s\) being the similarity matrix. All operations are applied elementwise.

Parameters

similarity_matrix – A similarity matrix as 2D numpy array.

Returns

A dissimilarity matrix as 2D numpy array.

class contextual_encoders.inverter.ExponentialInverter

Bases: contextual_encoders.inverter.Inverter

An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using an exponential ansatz. It can be used as the exp option.

dissimilarity_to_similarity(dissimilarity_matrix)

Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = \frac{1}{ln(2)} ln(2 - d)\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.

Parameters

dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.

Returns

A similarity matrix as 2D numpy array.

similarity_to_dissimilarity(similarity_matrix)

Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = 2 - e^{ln(2) \cdot s}\), with \(s\) being the similarity matrix. All operations are applied elementwise.

Parameters

similarity_matrix – A similarity matrix as 2D numpy array.

Returns

A dissimilarity matrix as 2D numpy array.

class contextual_encoders.inverter.Inverter

Bases: abc.ABC

An abstract base class for all concrete Inverter implementations.

abstract dissimilarity_to_similarity(dissimilarity_matrix)

Calculates a similarity matrix given a dissimilarity matrix.

Parameters

dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.

Returns

A similarity matrix as 2D numpy array.

abstract similarity_to_dissimilarity(similarity_matrix)

Calculates a dissimilarity matrix given a similarity matrix.

Parameters

similarity_matrix – A similarity matrix as 2D numpy array.

Returns

A dissimilarity matrix as 2D numpy array.

class contextual_encoders.inverter.InverterFactory

Bases: object

A factory class to create concrete Inverter instances with default values.

static create(inverter)

Creates an instance of the given Inverter name.

Parameters

inverter – The name of the Inverter, which can be lin, sqrt, exp or cos.

Returns

An instance of the Inverter.

Raises

ValueError – The given Inverter does not exist.

class contextual_encoders.inverter.LinearInverter

Bases: contextual_encoders.inverter.Inverter

An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a linear ansatz. It can be used as the lin option.

dissimilarity_to_similarity(dissimilarity_matrix)

Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = 1 - d\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.

Parameters

dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.

Returns

A similarity matrix as 2D numpy array.

similarity_to_dissimilarity(similarity_matrix)

Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = 1 - s\), with \(s\) being the similarity matrix. All operations are applied elementwise.

Parameters

similarity_matrix – A similarity matrix as 2D numpy array.

Returns

A dissimilarity matrix as 2D numpy array.

class contextual_encoders.inverter.SqrtInverter

Bases: contextual_encoders.inverter.Inverter

An Inverter that converts a similarity matrix to a dissimilarity matrix and vice versa using a sqrt ansatz. It can be used as the sqrt option.

dissimilarity_to_similarity(dissimilarity_matrix)

Converts the given dissimilarity matrix to a similarity matrix according to \(\mathcal{I}^{-1} (d) = 1 - d^2\), with \(d\) being the dissimilarity matrix. All operations are applied elementwise.

Parameters

dissimilarity_matrix – A dissimilarity matrix as 2D numpy array.

Returns

A similarity matrix as 2D numpy array.

similarity_to_dissimilarity(similarity_matrix)

Converts the given similarity matrix to a dissimilarity matrix according to \(\mathcal{I} (s) = \sqrt{1 - s}\), with \(s\) being the similarity matrix. All operations are applied elementwise.

Parameters

similarity_matrix – A similarity matrix as 2D numpy array.

Returns

A dissimilarity matrix as 2D numpy array.

Measure

A Measure is used to calculate a comparison value between two attribute forms.

Let \(x, y \in F\) be two features from the feature space \(F\). Each feature consists of \(k\) attributes and each attribute can have up to \(l\) forms. A form of an attribute of the feature \(x\) can then be denoted as \(x_{a,i}\), with \(a\) being the attribute and \(i\) being the form. For simplicity, we just denote it as \(\tilde{x}_i\) and skip the attribute index. A Measure is then defined as \(\mathcal{M} : (\tilde{x}_i, \tilde{y}_j) \rightarrow [0,1]\), i.e. it maps each pair of attribute forms to a similarity or dissimilarity value.

Note

A Measure can also be defined on an attribute, rather than on attribute forms. This can be done by setting multiple_values to True, see Measure.

Note

A Measure always needs to return values within the range \([0,1]\).

class contextual_encoders.measure.DissimilarityMeasure(symmetric, multiple_values)

Bases: contextual_encoders.measure.Measure, abc.ABC

An abstract base class for calculating dissimilarity values.

__init__(symmetric, multiple_values)

Initializes the Dissimilarity Measure.

Parameters
  • symmetric – Defines whether the Dissimilarity Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).

  • multiple_values – Defines whether the Dissimilarity Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.

class contextual_encoders.measure.Measure(symmetric, multiple_values)

Bases: abc.ABC

The abstract base class for all implementations of Measures.

__generate_cache_key(second)

Generates a serializable cache key given the two attributes or attribute forms.

Parameters
  • first – The first attribute or attribute form.

  • second – The second attribute or attribute form.

Returns

A unique key representing the two attributes or attribute forms.

__init__(symmetric, multiple_values)

Initializes the Measure.

Parameters
  • symmetric – Defines whether the Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).

  • multiple_values – Defines whether the Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.

__read_from_cache(first, second)

Read the comparison value for the two given attributes or attribute forms from the cache. If the value cannot be found in the cache, None will be returned instead.

Parameters
  • first – The first attribute or attribute form.

  • second – The second attribute or attribute form.

Returns

The comparison value or None if it cannot be found.

__write_to_cache(first, second, value)

Writes the comparison value with the two attributes or attribute forms into the cache.

Parameters
  • first – The first attribute or attribute form.

  • second – The second attribute or attribute form.

  • value – The comparison value.

abstract _compare(first, second)

Compares the two attributes or attribute forms. This is the abstract method that needs to be implemented by concrete Measure instances.

Parameters
  • first – The first attribute or attribute form.

  • second – The second attribute or attribute form.

Returns

The comparison value which is in \([0,1]\).

can_handle_multiple_values()

Returns True if the Measure can handle multiple values. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.

Returns

True if the Measure can handle multiple values.

compare(first, second)

Compares the two attributes or attribute forms. This method caches precalculated values within an in-memory dictionary.

Parameters
  • first – The first attribute or attribute form.

  • second – The second attribute or attribute form.

Returns

The comparison value which is in \([0,1]\).
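The caching behaviour described for compare can be sketched as follows. This is a guess at the general pattern rather than the library's implementation: `CachedMeasure`, its `calls` counter and the JSON-based key are assumptions made for this example, including the idea that a symmetric Measure can share one cache entry for both argument orders.

```python
import json


class CachedMeasure:
    """Hypothetical stand-in for a Measure with an in-memory cache."""

    def __init__(self, compare_fn, symmetric):
        self._compare_fn = compare_fn
        self._symmetric = symmetric
        self._cache = {}
        self.calls = 0  # counts how often the underlying comparison runs

    def _key(self, first, second):
        pair = [str(first), str(second)]
        if self._symmetric:
            # For a symmetric measure (a, b) and (b, a) share one cache entry.
            pair.sort()
        # A JSON string is serializable and unambiguous for any two strings.
        return json.dumps(pair)

    def compare(self, first, second):
        key = self._key(first, second)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._compare_fn(first, second)
        return self._cache[key]


measure = CachedMeasure(lambda a, b: 1.0 if a == b else 0.0, symmetric=True)
measure.compare("rock", "pop")
measure.compare("pop", "rock")  # served from the cache
```

Because the keys are plain strings, such a cache could be written to and read from a file, which fits the export and import methods documented for Measure.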

export_to_file(path)

Exports the Measure including the cache to the given path.

Parameters

path – The path to export the Measure to.

import_from_file(path)

Imports the Measure including the cache from the given path.

Parameters

path – The path to import the Measure from.

is_symmetric()

Returns True if the Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).

Returns

True if the Measure is symmetric.

class contextual_encoders.measure.PathLengthMeasure(context)

Bases: contextual_encoders.measure.SimilarityMeasure

A SimilarityMeasure based on counting the path length between two concepts.

__init__(context)

Initializes the PathLengthMeasure.

Parameters

context – The GraphContext used for comparison.

_compare(first, second)

Compares the two attribute forms based on their path length in the Context. The Measure counts the shortest path length \(p\) going from the first to the second value and returns \(\frac{1}{1+p}\).

Parameters
  • first – The first attribute form.

  • second – The second attribute form.

Returns

The PathLength Similarity comparison value.
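A minimal sketch of the \(\frac{1}{1+p}\) rule, assuming an unweighted directed graph stored as an adjacency dictionary instead of a GraphContext. The function name, the example graph and the choice to return 0.0 for unreachable concepts are assumptions of this sketch; the library may treat edge weights and missing paths differently.

```python
from collections import deque


def path_length_similarity(graph, first, second):
    """Return 1 / (1 + p), with p the fewest edges from first to second."""
    queue = deque([(first, 0)])
    seen = {first}
    # Breadth-first search visits nodes in order of increasing path length.
    while queue:
        node, dist = queue.popleft()
        if node == second:
            return 1.0 / (1.0 + dist)
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return 0.0  # assumption: unreachable concepts get similarity 0


# Hypothetical concept graph: each key points to its direct neighbors.
graph = {"music": ["rock", "jazz"], "rock": ["metal"], "jazz": [], "metal": []}
similarity = path_length_similarity(graph, "music", "metal")
```

Identical concepts have path length zero and therefore similarity 1, and the value decreases as the concepts move further apart.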

class contextual_encoders.measure.SimilarityMeasure(symmetric, multiple_values)

Bases: contextual_encoders.measure.Measure, abc.ABC

An abstract base class for calculating similarity values.

__init__(symmetric, multiple_values)

Initializes the Similarity Measure.

Parameters
  • symmetric – Defines whether the Similarity Measure is symmetric, i.e. if \(\mathcal{M}(x,y) = \mathcal{M}(y,x)\).

  • multiple_values – Defines whether the Similarity Measure can compare full attributes, rather than only attribute forms. When this property is set to True, the _compare method will get the entire attribute value as input. If the property is set to False, a list with all attribute forms will be given as input.

class contextual_encoders.measure.WuPalmer(context, offset=0.0)

Bases: contextual_encoders.measure.SimilarityMeasure

A tree based similarity measure based on the Wu-Palmer Similarity Measure.

__init__(context, offset=0.0)

Initializes the WuPalmer Similarity Measure.

Parameters
  • context – The TreeContext used for comparison.

  • offset – Either a real value or depth. If a real value is used, the distance between the root and a concept will always be at least the value of the offset. If depth is used, the offset will be \(\frac{1}{N}\), with \(N\) being the depth of the tree. Using an offset prevents a similarity of zero.

_compare(first, second)

Compares the two given attribute forms using the WuPalmer Similarity Measure.

Parameters
  • first – The first attribute form.

  • second – The second attribute form.

Returns

The WuPalmer Similarity comparison value.
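For orientation, the textbook Wu-Palmer similarity is \(\frac{2 \cdot depth(lcs)}{depth(x) + depth(y)}\), with \(lcs\) being the deepest common ancestor of both concepts. The sketch below implements that textbook form on a child-to-parent dictionary. It is not the library's code: the offset parameter is left out, edge weights are ignored, and all names are hypothetical.

```python
def path_to_root(parents, node):
    """Return the path from node up to the root, node first."""
    path = [node]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


def depth(parents, node):
    """Number of edges between node and the root (the root has depth 0)."""
    return len(path_to_root(parents, node)) - 1


def wu_palmer(parents, first, second):
    """Textbook Wu-Palmer: 2 * depth(lcs) / (depth(first) + depth(second))."""
    ancestors_second = set(path_to_root(parents, second))
    # The first shared node on the way up is the least common subsumer.
    lcs = next(n for n in path_to_root(parents, first) if n in ancestors_second)
    total = depth(parents, first) + depth(parents, second)
    # Both concepts are the root: treat them as identical.
    return 1.0 if total == 0 else 2.0 * depth(parents, lcs) / total


# Hypothetical tree given as child -> parent, with "music" as the root.
parents = {"music": None, "rock": "music", "jazz": "music",
           "metal": "rock", "punk": "rock"}
similarity = wu_palmer(parents, "metal", "punk")
```

Without an offset, two concepts whose only common ancestor is the root get a similarity of zero (here "metal" and "jazz"), which is the case the offset parameter is described as preventing.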

Reducer

A Reducer transforms a similarity or dissimilarity matrix into a set of vectors. Mathematically, it can be seen as a map \(\mathcal{R} : D \in \mathbb{R}^{n \times n} \rightarrow \tilde{X} \subset \mathbb{R}^{m}\), with \(m \in \mathbb{N}\) being the (configurable) dimension of the encoding and \(\tilde{X}\) the encoded dataset as vectors.

In other words, let \(n \in \mathbb{N}\) be the number of features. A Reducer then takes the similarity or dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) and produces \(n\) Euclidean vectors of dimension \(m\).

Currently, the following Reducers are implemented:

Name

Description

mds

Creates a low-dimensional representation of the data in which the distances closely preserve the distances in the original high-dimensional space.

class contextual_encoders.reducer.DissimilarityMatrixReducer(n_components)

Bases: contextual_encoders.reducer.Reducer, abc.ABC

An abstract base class for reducing dissimilarity matrices.

class contextual_encoders.reducer.MultidimensionalScalingReducer(n_components=2, metric=True)

Bases: contextual_encoders.reducer.DissimilarityMatrixReducer

A reducer using the Multidimensional Scaling approach (MDS) from scikit-learn. It can be used with the mds option.

__init__(n_components=2, metric=True)

Initializes the MultidimensionalScalingReducer.

Parameters
  • n_components – The dimension of the output vectors.

  • metric – If True, perform metric MDS; otherwise, perform non-metric MDS.

get_stress()

Gets the stress level for the performed MDS.

Returns

The stress level of the MDS.

reduce(dissimilarity_matrix)

Reduces the given dissimilarity matrix using the MDS approach.

Parameters

dissimilarity_matrix – The dissimilarity matrix as 2D numpy array.

Returns

Encoded vectors as 2D numpy array of size \(n \times m\), with \(n\) being the number of features and \(m\) the dimension of the vectors, i.e. n_components.
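To make the step from an \(n \times n\) dissimilarity matrix to \(n\) vectors concrete, here is a sketch of classical (Torgerson) MDS in NumPy. Note that this is a different, closed-form variant chosen because it is short and self-contained; the reducer documented here relies on scikit-learn's MDS, so results will generally differ. The function name and the toy matrix are assumptions of this example.

```python
import numpy as np


def classical_mds(dissimilarity_matrix, n_components=2):
    """Embed n objects in n_components dimensions via classical MDS."""
    d = np.asarray(dissimilarity_matrix, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Pairwise distances of three points on a line at positions 0, 1 and 3.
d = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
vectors = classical_mds(d, n_components=2)
```

The result has one row per feature and n_components columns. For this toy input the pairwise Euclidean distances between the rows reproduce `d`, because the input distances are exactly Euclidean.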

class contextual_encoders.reducer.Reducer(n_components)

Bases: abc.ABC

The abstract base class for all Reducers.

__init__(n_components)

Initializes the Reducer.

Parameters

n_components – The dimension of the output vectors.

abstract reduce(matrix)

The abstract method that is implemented by concrete instances of Reducers.

Parameters

matrix – The similarity or dissimilarity matrix \(D \in \mathbb{R}^{n \times n}\) as 2D numpy array.

Returns

The set of vectors \(\tilde{X} \in \mathbb{R}^{n \times m}\), with \(m\) being n_components.

class contextual_encoders.reducer.ReducerFactory

Bases: object

The factory class for creating Reducers with default values.

static create(reducer)

Creates a concrete Reducer instance given the name.

Parameters

reducer – The name of the Reducer, which can be mds.

Returns

The instance of the Reducer.

class contextual_encoders.reducer.SimilarityMatrixReducer(n_components)

Bases: contextual_encoders.reducer.Reducer, abc.ABC

An abstract base class for reducing similarity matrices.