Back to my projects page
Large scale similarity calculations with support for:
- Thresholds & sparse representations of the similarity matrix (based on the
- Parallel calculations, locally or using a standard R cluster object (from
parallelpackage). We also include helper functions to decide how to split the calculations efficiently
- “Injection” of domain knowledge, in the form of external information on groups of nodes that belong to the same higher-level entity. The package provides efficient tools for quotient graph calculations.
A smart dev-ops engineer once told me:
Before I give you a cluster, show me you can fully utilize a single machine
With that in mind I created this package to share my experiences working on large scale similarity projects. The main problems I’ve encountered were
Scaling up similarity calculations and representation: Specifically how to better distribute calculations (focus on utilizing a single, multi-core machine as efficiently as possible) and efficiently store the result, especially when low values are not very interesting (making the similarity matrix sparse)
Injecting domain knowledge / quotient similarity: In many cases similarity is calculated at different levels - for example similarity between messages to find similar users or similarity between images to find similar products. These cases require a way to aggregate similarities between sets of arbitrary lower level entities (messages / images) to represent similarity between higher level entities (users / products).
This package contains tools for handling both of the above problems.
Data is numeric (binary, integers or real numbers). For categorical data please convert first (embedding, 1-hot encoding or other methods)
NxMmatrix of features (
Mfeatures) can be contained in memory, or at least expose a
devtools::install_github(repo = 'ytoren/simscaleR', build_vignettes = TRUE)
The package contains functions that automatically estimate resources of the local machine. You can read the vignette in
vignette('estimating-local-resources', package='simscaleR'). You can also control the calculation manually using lower level functions. See
Similarity matrix manipulation \ domain knowledge injection
Aggregate rows/columns of the similarity matrix using an aggregation function (default is a simple sum). See
?merge_by_partition. The end result is a similarity smaller similarity matrix that represents similarity between higher level entities.
Shuffle matrix rows, so that all rows that belong to the same entity are next to one another (in case they are not already sorted this way). See