
simscaleR

Large scale similarity calculations with support for:

Background

A smart dev-ops engineer once told me:

Before I give you a cluster, show me you can fully utilize a single machine

With that in mind, I created this package to share my experience working on large-scale similarity projects. The main problems I’ve encountered were:

  1. Scaling up similarity calculations and representation: how to distribute the calculations better (with a focus on using a single, multi-core machine as efficiently as possible) and how to store the result efficiently, especially when low values are not very interesting (which makes the similarity matrix sparse).

  2. Injecting domain knowledge / quotient similarity: In many cases similarity is calculated at different levels, for example similarity between messages to find similar users, or similarity between images to find similar products. These cases require a way to aggregate similarities between sets of arbitrary lower-level entities (messages / images) to represent similarity between higher-level entities (users / products).
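To make the first problem concrete, here is a minimal sketch of block-wise cosine similarity with thresholding, using only the `Matrix` and `parallel` packages that ship with R. The helper `sparse_cosine_blocks` and its arguments are hypothetical names chosen for illustration and are not part of simscaleR's API; the sketch also assumes that no row of the input is all zeros.

```r
library(Matrix)    # sparse matrix classes (recommended package, ships with R)
library(parallel)  # mclapply (part of base R)

# Hypothetical helper, NOT part of simscaleR: cosine similarity between the
# rows of `x`, computed one block of rows at a time, keeping only values
# that are at least `threshold`. Assumes no all-zero rows in `x`.
sparse_cosine_blocks <- function(x, block_size = 2L, threshold = 0.5, cores = 1L) {
  # Scale rows to unit length so that a dot product equals cosine similarity
  x_norm <- x / sqrt(rowSums(x^2))
  n <- nrow(x_norm)
  starts <- seq(1L, n, by = block_size)

  # mc.cores > 1 relies on forking, which is not available on Windows
  blocks <- mclapply(starts, function(s) {
    rows <- s:min(s + block_size - 1L, n)
    sim <- x_norm[rows, , drop = FALSE] %*% t(x_norm)  # dense block: length(rows) x n
    sim[sim < threshold] <- 0                           # drop uninteresting low values
    Matrix(sim, sparse = TRUE)                          # store only the non-zero entries
  }, mc.cores = cores)

  do.call(rbind, blocks)  # stack the sparse blocks into one n x n matrix
}

# Usage: four items with two features each
x <- rbind(c(1, 0), c(1, 0), c(0, 1), c(1, 1))
s <- sparse_cosine_blocks(x, block_size = 2L, threshold = 0.8)
```

Only one dense block of `block_size` rows is held per worker at any time, so memory use is governed by the block size rather than by the full n x n result.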

This package contains tools for handling both of the above problems.

Assumptions

Installation

devtools::install_github(repo = 'ytoren/simscaleR', build_vignettes = TRUE)

Usage

Similarity calculations

The package contains functions that automatically estimate the resources of the local machine; see vignette('estimating-local-resources', package='simscaleR'). You can also control the calculation manually using lower-level functions; see ?sim_blocksR.
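As a rough illustration of what estimating local resources can involve, the sketch below derives a block size from a core count and a memory budget supplied by the caller. The helper `plan_blocks` is a hypothetical name and is not how simscaleR itself works; it assumes each worker holds one dense block of 8-byte doubles at a time.

```r
library(parallel)  # detectCores (part of base R)

# Hypothetical helper, NOT part of simscaleR: choose how many rows each
# worker should process at once, given a memory budget in bytes.
plan_blocks <- function(n_rows, mem_budget_bytes, cores = detectCores()) {
  if (is.na(cores)) cores <- 1L  # detectCores() can return NA on some platforms
  # Each worker holds a dense block of rows_per_block x n_rows doubles (8 bytes each)
  rows_per_block <- floor(mem_budget_bytes / (cores * 8 * n_rows))
  rows_per_block <- max(1, min(rows_per_block, n_rows))
  list(
    cores = cores,
    rows_per_block = rows_per_block,
    n_blocks = ceiling(n_rows / rows_per_block)
  )
}

# Usage: 100,000 items with a 2 GiB budget for intermediate blocks
plan <- plan_blocks(n_rows = 1e5, mem_budget_bytes = 2 * 1024^3)
```

The memory budget is passed in explicitly because base R has no portable way to query free memory.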

Similarity matrix manipulation / domain knowledge injection
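One way to picture aggregating lower-level similarities (messages) into higher-level ones (users) is with a sparse group-indicator matrix. The sketch below uses the mean as the aggregation, which is only one possible choice and an assumption here; the helper `aggregate_similarity` is a hypothetical name, not part of simscaleR's API.

```r
library(Matrix)  # sparseMatrix (recommended package, ships with R)

# Hypothetical helper, NOT part of simscaleR: average the item-level
# similarities in `sim` over every pair of groups. `groups[i]` is the
# higher-level entity (e.g. user) that item i (e.g. message) belongs to.
aggregate_similarity <- function(sim, groups) {
  groups <- factor(groups)
  # Indicator matrix: one row per group, one column per item
  g <- sparseMatrix(
    i = as.integer(groups),
    j = seq_along(groups),
    x = 1,
    dims = c(nlevels(groups), length(groups)),
    dimnames = list(levels(groups), NULL)
  )
  sums <- as.matrix(g %*% sim %*% t(g))  # summed item similarities per pair of groups
  sizes <- as.numeric(table(groups))     # number of items in each group
  sums / tcrossprod(sizes)               # mean = sum / (size_a * size_b)
}

# Usage: four messages written by two users
sim <- matrix(c(1.0, 0.8, 0.2, 0.0,
                0.8, 1.0, 0.0, 0.4,
                0.2, 0.0, 1.0, 0.6,
                0.0, 0.4, 0.6, 1.0), nrow = 4, byrow = TRUE)
user_sim <- aggregate_similarity(sim, groups = c("a", "a", "b", "b"))
```

The within-group averages on the diagonal include each item's similarity to itself, so they are biased upward for small groups. The group-level result is returned as a dense matrix on the assumption that there are far fewer groups than items.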