API Reference

This page provides an overview of the public Python API exposed by synrfp. It focuses on:

  • core fingerprint builders in synrfp,

  • graph containers and reaction helpers in synrfp.graph,

  • tokenization backends in synrfp.tokenizers,

  • sketching backends in synrfp.sketchers, and

  • batch encoding utilities in synrfp.encoder.

If you are looking for high-level usage examples, see the README on GitHub or the Getting Started page.

Top-level API: synrfp

The synrfp module collects the main user-facing entry points: fingerprint engines, convenience wrappers, and similarity utilities.

class synrfp.BatchEncoder(tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, *, n_jobs: int = 1, batch_size: int | None = None, verbose: int | None = None, backend: str = 'loky')

Bases: object

Batch encoder for mapping reaction SMILES (RSMI) to SynRFP fingerprints.

The encoder wraps the top-level synrfp function and provides:

  • configuration stored on the instance (tokenizer, sketch, seed, etc.),

  • single-item and batched encoding,

  • optional parallelization via joblib.Parallel, and

  • optional chunked processing via batch_size for memory control.

Parameters:
  • tokenizer (str) – Tokenizer name (‘wl’, ‘nauty’, ‘morgan’, ‘path’, …).

  • radius (int) – Neighborhood / iteration radius for tokenizers that use it.

  • sketch (str) – Sketcher name (‘parity’, ‘minhash’, ‘cw’, ‘srp’, …).

  • bits (int) – Length of the final bit-vector for bit-based sketches.

  • m (int) – Sketch size parameter (e.g. number of hash samples/projections).

  • seed (int) – RNG seed forwarded to tokenizers/sketchers.

  • mode (str) – Token aggregation mode: ‘delta’ or ‘union’.

  • node_attrs (Sequence[str] | None) – Optional node attribute names passed to tokenizers.

  • edge_attrs (Sequence[str] | None) – Optional edge attribute names passed to tokenizers.

  • n_jobs (int) – Number of jobs for parallel encoding (1 = serial).

  • batch_size (int | None) – Maximum number of reactions per batch. If None, no chunking.

  • verbose (int | None) – Verbosity forwarded to joblib (int). If None, 0 is used.

  • backend (str) – joblib backend (e.g. ‘loky’, ‘threading’).

Example:

>>> from synrfp import BatchEncoder
>>> enc = BatchEncoder(tokenizer='wl', sketch='parity', bits=1024, n_jobs=1)
>>> single_fp = enc.encode_one('CCO>>CCO')           # numpy array shape (1024,)
>>> many = enc.encode_many(['CCO>>CCO', 'CCO>>CCO'])  # shape (2, 1024)

A small-batch example (useful for constrained memory / large inputs):

>>> enc = BatchEncoder(batch_size=128, n_jobs=2)
>>> X = enc.encode_many(list_of_rsmi)  # processes list_of_rsmi in chunks of 128
classmethod encode(rsmi_list: Sequence[str], *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, batch_size: int | None = None) ndarray

Convenience classmethod to encode many reactions (serial execution).

Parameters:
  • rsmi_list (Sequence[str]) – Sequence of reaction SMILES.

  • tokenizer (str) – Tokenizer name.

  • radius (int) – Neighborhood radius.

  • sketch (str) – Sketcher name.

  • bits (int) – Fingerprint length for bit-based sketches.

  • m (int) – Sketch size parameter.

  • seed (int) – RNG seed.

  • mode (str) – Token aggregation mode (‘delta’ or ‘union’).

  • node_attrs (Sequence[str] | None) – Node attribute names for tokenizer.

  • edge_attrs (Sequence[str] | None) – Edge attribute names for tokenizer.

  • batch_size (int | None) – Maximum chunk size for encode_many.

Returns:

2D numpy array of fingerprints.

Return type:

numpy.ndarray

Example:

>>> BatchEncoder.encode(['CCO>>CCO'], tokenizer='wl', sketch='parity', bits=64)
array([[...]], dtype=int)
encode_many(rsmi_list: Sequence[str]) ndarray

Encode a sequence of reaction SMILES into a 2D numpy array.

If batch_size is set and smaller than the input length, encoding proceeds in chunks of at most batch_size and results are concatenated.

Parameters:

rsmi_list (Sequence[str]) – Sequence of reaction SMILES strings.

Returns:

2D numpy array with shape (N, L), where N = len(rsmi_list) and L = fingerprint length.

Return type:

numpy.ndarray

Raises:

ValueError – If fingerprint lengths are inconsistent across batches.
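
The chunking behaviour described for encode_many can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: encode_in_chunks and toy_encode are hypothetical names, and plain lists stand in for NumPy arrays.

```python
from typing import Callable, List, Optional, Sequence


def encode_in_chunks(
    items: Sequence[str],
    encode_one: Callable[[str], List[int]],
    batch_size: Optional[int] = None,
) -> List[List[int]]:
    """Encode items chunk by chunk and concatenate the rows (hypothetical helper)."""
    if batch_size is not None and batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    if batch_size is None or batch_size >= len(items):
        chunks = [items]  # no chunking needed
    else:
        chunks = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    rows: List[List[int]] = []
    for chunk in chunks:
        rows.extend(encode_one(s) for s in chunk)

    # Mirror the documented ValueError on inconsistent fingerprint lengths.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("inconsistent fingerprint lengths across batches")
    return rows


def toy_encode(rsmi: str) -> List[int]:
    """Stand-in encoder: 4 bits derived from the string length (illustration only)."""
    n = len(rsmi)
    return [(n >> k) & 1 for k in range(4)]


reactions = ["CCO>>CCO", "CCO>>C=C.O", "C>>C"]
X = encode_in_chunks(reactions, toy_encode, batch_size=2)  # chunks of 2 and 1
```

The result has one row per input regardless of how the input was split, which is the property that makes batch_size a pure memory-control knob.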

encode_one(rsmi: str) ndarray

Encode a single reaction SMILES into a 1D numpy array.

Parameters:

rsmi (str) – Reaction SMILES string.

Returns:

1D numpy array of ints (length == bits).

Return type:

numpy.ndarray

class synrfp.SynRFP(tokenizer: BaseTokenizer, radius: int = 2, sketch: BaseSketch | None = None, weighted_sketch: WeightedSketch | None = None)

Bases: object

Build a SynRFP fingerprint for a single reaction:
  • one reactant Molecule

  • one product Molecule

Exactly one of sketch or weighted_sketch must be provided.

Parameters:
  • tokenizer (BaseTokenizer) – Tokenizer instance (e.g. WLTokenizer, NautyTokenizer, MorganTokenizer, PathTokenizer).

  • radius (int) – Neighborhood radius for the tokenizer.

  • sketch (BaseSketch | None) – Unweighted sketcher (e.g. ParityFold, MinHashSketch).

  • weighted_sketch (WeightedSketch | None) – Weighted sketcher (e.g. CWSketch, SRPSketch).

static describe() str

Example usage:

>>> fp = SynRFP(tokenizer=WLTokenizer(), radius=2, sketch=ParityFold(1024))
>>> res = fp.fingerprint(reactant_G, product_G)
Returns:

Example usage string.

Return type:

str

fingerprint(reactant: Molecule, product: Molecule, *, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) SynRFPResult

Compute the reaction fingerprint for a pair of molecules.

Parameters:
  • reactant (Molecule) – Reactant molecular graph.

  • product (Molecule) – Product molecular graph.

  • mode (str) – Token combination mode: 'delta' for the signed difference P−R (default), or 'union' for the union counts R+P.

  • node_attrs (Sequence[str] | None) – Optional node attribute names for tokenizer.

  • edge_attrs (Sequence[str] | None) – Optional edge attribute names for tokenizer.

Returns:

A SynRFPResult with tokens, support and sketch.

Return type:

SynRFPResult

Raises:
  • TypeError – If inputs are not Molecule instances.

  • ValueError – If mode is invalid.
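
The two combination modes can be illustrated with collections.Counter alone. This is a sketch of the arithmetic only, assuming that zero-count tokens are dropped and that the support is returned sorted; combine_tokens is a hypothetical helper, not part of synrfp.

```python
from collections import Counter
from typing import List, Tuple


def combine_tokens(
    tokens_r: Counter, tokens_p: Counter, mode: str = "delta"
) -> Tuple[Counter, List[int]]:
    """Combine reactant/product token multisets (hypothetical helper)."""
    if mode == "delta":
        combined = Counter(tokens_p)
        combined.subtract(tokens_r)  # signed difference P - R, negatives kept
    elif mode == "union":
        combined = tokens_r + tokens_p  # counts R + P
    else:
        raise ValueError(f"invalid mode: {mode!r}")
    # Assumption: tokens whose contribution cancels to zero are dropped.
    combined = Counter({t: c for t, c in combined.items() if c != 0})
    support = sorted(combined)  # token keys with nonzero contribution
    return combined, support


R = Counter({1: 2, 2: 1})
P = Counter({1: 1, 3: 1})
delta, delta_support = combine_tokens(R, P, "delta")
union, union_support = combine_tokens(R, P, "union")
```

Note that Counter.subtract is used rather than the `-` operator, because `P - R` on Counters silently discards negative counts.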

class synrfp.SynRFPResult(tokens_R: Counter, tokens_P: Counter, delta: Counter, support: List[int], sketch: object, mode: str = 'delta')

Bases: object

Container for outputs of a single fingerprinting call.

Parameters:
  • tokens_R (collections.Counter) – Token multiset for the reactant graph.

  • tokens_P (collections.Counter) – Token multiset for the product graph.

  • delta (collections.Counter) – Token counts summarising the transformation, depending on mode: the signed difference P−R if mode='delta', or the union counts R+P if mode='union'.

  • support (list[int]) – List of token keys with nonzero contribution (delta or union).

  • sketch (object) – Sketch object (bytes, list, or array) produced by the sketcher.

  • mode (str) – Fingerprint mode, either 'delta' or 'union'.

as_array() ndarray

Return the underlying sketch as a 1D numpy integer array.

This works for all sketcher types:
  • ParityFold: 0/1 array

  • MinHashSketch: hash values

  • CWSketch: sample indices

  • SRPSketch: sign pattern (+1/-1)

Returns:

1D numpy array representation of the sketch.

Return type:

numpy.ndarray

delta: Counter
static describe() str

Example usage:

>>> # assume `res` is a SynRFPResult
>>> print(res)
SynRFPResult(tokens_R=10 tokens, tokens_P=8 tokens,
support=3, mode='delta', sketch_type=bytearray)
Returns:

Example usage string.

Return type:

str

mode: str = 'delta'
sketch: object
support: List[int]
to_binary() List[int]

Return the sketch stored in this result as a plain list of 0/1 bits.

Only works for binary sketchers (e.g. ParityFold). For non-binary sketchers (MinHash, CWSketch, SRP) a TypeError is raised.

Returns:

Binary fingerprint as list of 0/1 bits.

Return type:

list[int]

Raises:

TypeError – If the underlying sketch cannot be interpreted as bits.

tokens_P: Counter
tokens_R: Counter
synrfp.build_graph_from_printout(nodes: Dict[int, Dict], edges: Dict[tuple[int, int], Dict]) Molecule

Helper to convert “printout” dicts directly into a Molecule.

Parameters:
  • nodes (Dict[int, Dict]) – Mapping from node ID to attribute dict.

  • edges (Dict[tuple[int, int], Dict]) – Mapping from (u, v) edges (with u < v) to attribute dict.

Returns:

A fresh Molecule instance.

Return type:

Molecule

Example:
>>> nodes = {0: {'element': 'C'}, 1: {'element': 'O'}}
>>> edges = {(0, 1): {'order': 1.5}}
>>> G = build_graph_from_printout(nodes, edges)
synrfp.jaccard_minhash(h1: list | tuple, h2: list | tuple) float

Estimate Jaccard similarity from two MinHash signature arrays.

Parameters:
  • h1 (list or tuple) – First MinHash hash‐value sequence.

  • h2 (list or tuple) – Second MinHash sequence (must be same length).

Returns:

Fraction of positions where h1[i] == h2[i].

Return type:

float
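
The estimator described above is simply the fraction of matching signature positions. A minimal stand-alone sketch follows; the function name is hypothetical, and the length check and the empty-input convention are assumptions rather than documented behaviour.

```python
from typing import Sequence


def jaccard_from_signatures(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Fraction of positions where two equal-length signatures agree."""
    if len(h1) != len(h2):
        raise ValueError("signatures must have the same length")
    if not h1:
        return 0.0  # assumption: empty signatures give similarity 0.0
    matches = sum(1 for a, b in zip(h1, h2) if a == b)
    return matches / len(h1)


estimate = jaccard_from_signatures([5, 9, 2, 7], [5, 1, 2, 8])  # 2 of 4 agree
```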

synrfp.synrfp(rsmi: str, *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) List[int]

Convert a reaction SMILES (RSMI) into a binary fingerprint bit-vector.

Internally:

  • A tokenizer (WL, Nauty, Morgan, Path) converts each side into multiset tokens.

  • Depending on mode, either token delta (P−R) or union (R+P) is computed.

  • A sketcher (parity, minhash, cw, srp) converts the token set/weights into a fixed-size sketch.

  • The sketch is finally mapped into a binary vector of length bits.

Parameters

rsmi : str

Reaction SMILES, e.g. "CCO>>C=C.O".

tokenizer : str, default “wl”

Which tokenizer to use:

  • "wl" : Weisfeiler–Lehman style tokenizer.

  • "nauty" : Nauty-based canonical labeling tokenizer.

  • "morgan" : Morgan/ECFP-style neighborhood tokenizer.

  • "path" : simple path-based tokenizer.

radius : int, default 2

Neighborhood radius for the tokenizer (ignored by some tokenizers if not applicable).

sketch : str, default “parity”

Which sketcher to use:

  • "parity" : parity-folding into a binary vector.

  • "minhash" : MinHash signature, then mapped to bits.

  • "cw" : consistent weighted sampling (CWS) sketch, then mapped to bits.

  • "srp" : signed random projection sketch (cosine-oriented).

bits : int, default 1024

Length of the final binary fingerprint. For sketch="parity", this is the internal bit-length of ParityFold. For "minhash", "cw" and "srp", it controls the final bin count used by signature_to_bits().

m : int, default 256

Number of hash samples/projections for MinHash, CWSketch, or SRP.

seed : int, default 1

Random seed for reproducibility.

mode : {“delta”, “union”}, default “delta”

Token combination mode:

  • "delta" : signed difference P−R.

  • "union" : union counts R+P over tokens appearing on either side.

node_attrs : Sequence[str] or None, optional

Node attribute names passed to the tokenizer (e.g. ["element"]).

edge_attrs : Sequence[str] or None, optional

Edge attribute names passed to the tokenizer (e.g. ["order"]).

Returns

list[int]

Fingerprint as a list of 0/1 bits of length bits.

Raises

ValueError

On invalid tokenizer, sketch, or mode names.

RuntimeError

If required dependencies (e.g. pynauty or datasketch) are missing.

synrfp.tanimoto_bits(b1: bytearray | List[int] | ndarray, b2: bytearray | List[int] | ndarray) float

Compute the Tanimoto (Jaccard) similarity between two binary‐bit sketches.

Accepts bytearray, list[int], or numpy.ndarray of 0/1.

Parameters:
  • b1 (bytearray or List[int] or numpy.ndarray) – First bit array.

  • b2 (bytearray or List[int] or numpy.ndarray) – Second bit array.

Returns:

Intersection size divided by union size, or 0.0 if union is zero.

Return type:

float
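
The similarity described above (intersection size over union size, with 0.0 for an empty union) can be sketched without NumPy. The name tanimoto_sketch is hypothetical, and the length check is an assumption; the sketch accepts any equal-length sequence of 0/1 values, including bytearray.

```python
from typing import Sequence


def tanimoto_sketch(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Tanimoto (Jaccard) similarity of two 0/1 sequences."""
    if len(b1) != len(b2):
        raise ValueError("bit arrays must have the same length")  # assumption
    intersection = sum(1 for x, y in zip(b1, b2) if x and y)
    union = sum(1 for x, y in zip(b1, b2) if x or y)
    return intersection / union if union else 0.0


sim = tanimoto_sketch([1, 1, 0, 0], [1, 0, 1, 0])       # 1 shared bit, 3 set overall
empty = tanimoto_sketch(bytearray(4), bytearray(4))      # no bits set on either side
```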

Key classes and functions

Typical responsibilities of the top-level API include:

  • converting reaction SMILES (RSMI) directly into fixed-length fingerprints,

  • configuring a SynRFP engine with a tokenizer and sketcher,

  • exposing similarity utilities for binary and MinHash sketches, and

  • providing a simple batch encoder for lists of reaction SMILES.

Graph utilities: synrfp.graph

The synrfp.graph subpackage defines light-weight graph containers and helpers for representing reactions.

class synrfp.graph.graph_data.GraphData(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict], _adj: Dict[int, List[int]] | None = None)

Bases: object

Lightweight labeled graph container.

Parameters:
  • nodes (Dict[NodeId, Dict]) – Mapping from node id to attribute dict (e.g., element, charge).

  • edges (Dict[Edge, Dict]) – Mapping from edge tuple (u, v) with u<v to attribute dict (e.g., order).

  • _adj (Optional[Dict[NodeId, List[NodeId]]]) – Internal adjacency cache, computed lazily.

property adj: Dict[int, List[int]]

Lazily compute and cache adjacency list.

Returns:

Mapping from node id to sorted neighbor list.

Return type:

Dict[int, List[int]]

degree(v: int) int

Get degree of node v.

Parameters:

v (int) – Node identifier.

Returns:

Degree count.

Return type:

int

edge_attr(u: int, v: int) Dict

Retrieve attribute dict for edge (u, v).

Parameters:
  • u (int) – First node.

  • v (int) – Second node.

Returns:

Edge attributes.

Return type:

Dict

Raises:

KeyError – If edge not present.

edges: Dict[Tuple[int, int], Dict]
static from_dicts(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict]) GraphData

Construct GraphData ensuring edge keys are ordered (u < v).

Parameters:
  • nodes (Dict[int, Dict]) – Node attribute mapping.

  • edges (Dict[Tuple[int, int], Dict]) – Edge attribute mapping.

Returns:

Initialized GraphData.

Return type:

GraphData
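
The edge-key ordering enforced by from_dicts and the sorted adjacency exposed by adj can be sketched with plain dictionaries. This is an illustration of the conventions only; normalize_edges and build_adjacency are hypothetical helpers, not GraphData methods.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def normalize_edges(edges: Dict[Edge, dict]) -> Dict[Edge, dict]:
    """Re-key every edge so that the smaller node id comes first (u < v)."""
    return {((u, v) if u < v else (v, u)): attrs for (u, v), attrs in edges.items()}


def build_adjacency(nodes: Dict[int, dict], edges: Dict[Edge, dict]) -> Dict[int, List[int]]:
    """Map each node id to its sorted neighbor list."""
    adj: Dict[int, List[int]] = {v: [] for v in nodes}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return {v: sorted(neighbors) for v, neighbors in adj.items()}


nodes = {0: {"element": "C"}, 1: {"element": "C"}, 2: {"element": "O"}}
raw_edges = {(1, 0): {"order": 1.0}, (1, 2): {"order": 1.0}}
edges = normalize_edges(raw_edges)
adj = build_adjacency(nodes, edges)
degree_of_1 = len(adj[1])  # degree is the neighbor count
```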

static from_nx_graph(G: Graph) GraphData

Construct GraphData from a NetworkX Graph.

Parameters:

G (nx.Graph) – NetworkX graph with node and edge attributes.

Returns:

Initialized GraphData.

Return type:

GraphData

nodes: Dict[int, Dict]
class synrfp.graph.reaction.Reaction(reactant: Molecule, product: Molecule)

Bases: object

Represents a chemical reaction with a single reactant and a single product graph.

Parameters:
  • reactant (Molecule) – Reactant molecular graph.

  • product (Molecule) – Product molecular graph.

static from_graph(reactant_graph: Graph, product_graph: Graph) Reaction

Create a Reaction from two NetworkX graphs.

Parameters:
  • reactant_graph (nx.Graph) – NetworkX Graph for reactant.

  • product_graph (nx.Graph) – NetworkX Graph for product.

Returns:

Reaction instance.

Return type:

Reaction

static from_rsmi(rsmi: str) Reaction

Create a Reaction from an RSMI string using synkit IO.

Parameters:

rsmi (str) – Reaction SMILES string.

Returns:

Reaction with reactant and product Molecule.

Return type:

Reaction

Raises:

ValueError – If parsing fails.

help() str

Show usage examples for Reaction.

Returns:

Usage guide.

Return type:

str

product: Molecule
reactant: Molecule
to_dataframe() DataFrame

Summarize reaction graphs as a pandas DataFrame.

Returns:

DataFrame with columns [‘side’, ‘n_nodes’, ‘n_edges’].

Return type:

pd.DataFrame

Key classes

Typical responsibilities include:

  • representing labeled molecular graphs via GraphData,

  • constructing Reaction objects from RSMI or from existing networkx graphs, and

  • providing small convenience methods for introspection and tabular summaries.

Tokenizers: synrfp.tokenizers

Tokenizers map molecular graphs to multisets of integer tokens (e.g. WL subtree hashes). They are responsible for the graph → tokens stage.

class synrfp.tokenizers.base.BaseTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)

Bases: ABC

Abstract base for graph tokenizers (graph → multiset of integer tokens).

Parameters:
  • node_attrs (Optional[Sequence[str]]) – Node attribute keys to include in labels.

  • edge_attrs (Optional[Sequence[str]]) – Edge attribute keys to include in labels.

Example

>>> class Dummy(BaseTokenizer):
...     def tokens_graph(self, G, radius): return Counter({0: len(G.nodes)})
...
static describe() str

Return a generic usage example for tokenizers.

Returns:

Example code snippet.

Return type:

str

abstractmethod tokens_graph(G: Molecule, radius: int) Counter

Generate tokens for a single Molecule instance.

Parameters:
  • G (Molecule) – Molecule instance to tokenize.

  • radius (int) – Non-negative neighborhood radius.

Returns:

Multiset of tokens (hashed neighborhood labels).

Return type:

Counter

tokens_side(graphs: Sequence[Molecule], radius: int) Counter

Generate tokens across multiple graphs (e.g., reaction sides).

Parameters:
  • graphs (Sequence[Molecule]) – Sequence of Molecule objects.

  • radius (int) – Non-negative neighborhood radius.

Returns:

Combined multiset of tokens for all graphs.

Return type:

Counter

class synrfp.tokenizers.wl.WLTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, use_nx: bool = False, require_networkx: bool = False)

Bases: BaseTokenizer

Weisfeiler–Lehman subtree tokenizer (edge-aware; k=0..r tokens).

Node labels: selected node attrs + degree. Edge labels: selected bond attrs (e.g., order, aromaticity).

The tokenizer supports two backends:

  • A lightweight in-house WL refinement loop (default).

  • An optional NetworkX-based WL implementation (networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes()), enabled via use_nx=True.

Example:
>>> tok = WLTokenizer(node_attrs=['element'], edge_attrs=['order'])
>>> isinstance(tok, WLTokenizer)
True
static describe() str

Return a usage example for the WLTokenizer.

Returns:

Example code snippet.

Return type:

str

tokens_graph(G: Molecule, radius: int) Counter

Tokenize a molecular graph via edge-aware WL subtree hashing.

The behaviour is controlled by the use_nx flag:

  • If use_nx=True and NetworkX WL hashing is available, this method uses networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes() on a temporary NetworkX graph with precomputed atom/bond labels, and folds the resulting hex hashes into integer tokens using synrfp.tokenizers.utils._h64().

  • Otherwise, a compact in-house WL implementation is used that performs the refinement directly on the Molecule object.

Parameters:
  • G (Molecule) – Molecular graph to tokenize.

  • radius (int) – Number of WL iterations (k >= 0). k=0 returns only the initial atom-level subtree labels; higher values add increasingly larger neighbourhoods.

Returns:

Counter mapping integer subtree-hash tokens to their multiplicities.

Return type:

collections.Counter

Raises:
  • ValueError – If radius is negative.

  • RuntimeError – If use_nx=True, require_networkx=True and NetworkX WL hashing is not available.
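
To make the k=0..r token scheme concrete, here is a minimal edge-aware WL refinement loop over plain dictionaries. It is a sketch under stated assumptions, not the in-house implementation: h64 is a stand-in built on hashlib.blake2b (the library's _h64 may differ), and wl_tokens takes ready-made node and bond labels instead of a Molecule.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def h64(obj) -> int:
    """Stable 64-bit hash of repr(obj); a stand-in for the library's _h64."""
    digest = hashlib.blake2b(repr(obj).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def wl_tokens(
    node_labels: Dict[int, str], bonds: Dict[Tuple[int, int], str], radius: int
) -> Counter:
    """Collect WL subtree tokens for iterations k = 0..radius (hypothetical helper)."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    adj: Dict[int, List[Tuple[int, str]]] = {v: [] for v in node_labels}
    for (u, v), bond in bonds.items():
        adj[u].append((v, bond))
        adj[v].append((u, bond))

    # k = 0: atom-level labels built from the node label plus the degree.
    labels = {v: h64((node_labels[v], len(adj[v]))) for v in node_labels}
    tokens = Counter(labels.values())

    for _ in range(radius):
        new_labels = {}
        for v in node_labels:
            # Sorting makes the label independent of neighbor enumeration order.
            neighborhood = sorted((bond, labels[u]) for u, bond in adj[v])
            new_labels[v] = h64((labels[v], tuple(neighborhood)))
        labels = new_labels
        tokens.update(labels.values())
    return tokens


# Heavy-atom graph of ethanol, C-C-O, written with two different node numberings.
tok_a = wl_tokens({0: "C", 1: "C", 2: "O"}, {(0, 1): "1", (1, 2): "1"}, radius=2)
tok_b = wl_tokens({10: "O", 11: "C", 12: "C"}, {(10, 11): "1", (11, 12): "1"}, radius=2)
```

Each iteration contributes one token per atom, so a graph with n atoms yields n·(r+1) tokens in total, and the multiset does not depend on how the atoms are numbered.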

class synrfp.tokenizers.nauty.NautyCanonicalizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)

Bases: object

Nauty-style canonicalizer implemented with NetworkX primitives.

This class computes a canonical labeling / signature for a NetworkX graph using equitable partition refinement + backtracking search similar to the nauty algorithm. It’s designed to be reasonably robust for small ego subgraphs (typical use-case: ego-subgraphs extracted around a node).

Parameters:
  • node_attrs (list[str] | None) – node attribute keys used for initial partitioning/refinement.

  • edge_attrs (list[str] | None) – edge attribute keys used when including edge attributes in the canonical label.

canonical_form(G: Graph, return_aut: bool = False, remap_aut: bool = False, return_orbits: bool = False, return_perm: bool = False, max_depth: int | None = None)

Compute canonical form of G.

By default, returns the canonicalized graph G_can. Optionally, it can also return the permutation, automorphisms, orbits, and an early-stop flag.

The algorithm:

  • Build the initial partition from node_attrs (or a single cell if none are given).

  • Repeatedly refine the partition by node signatures until it is stable.

  • If the partition is refined to singletons, build the label and update the best candidate.

  • Otherwise, pick a non-singleton cell and branch (backtracking) until the best canonical label is found.

This implementation is primarily intended for small graphs (ego subgraphs).

compute_orbits(aut_perms: List[List[int]])
edge_attrs
graph_signature(G: Graph) str
node_attrs
class synrfp.tokenizers.nauty.NautyTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, max_cache: int = 100000)

Bases: BaseTokenizer

Nauty-style canonical ego-subgraph tokenizer using a pure-NetworkX canonicalizer.

For each center node and each radius 0..r, the ego subgraph is canonicalized with respect to chosen node/edge attributes and the canonical signature is converted to an integer token via _h64().

Parameters:
  • node_attrs (list[str] | None) – list of node attribute keys to include in initial partitioning.

  • edge_attrs (list[str] | None) – list of edge attribute keys to include for edge distinctions.

  • max_cache (int) – Maximum number of ego-canonicalizations cached (simple LRU via dict).

tokens_graph(G: Molecule, radius: int) Counter

Tokenize a molecular graph by canonicalizing ego-subgraphs up to radius.

Parameters:
  • G (Molecule) – Molecular graph.

  • radius (int) – maximum radius (inclusive) for ego-subgraphs.

Returns:

Counter mapping canonical integer tokens to counts.

Return type:

collections.Counter
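
The "ego subgraph for each center node and each radius 0..r" step can be illustrated with a breadth-first search over an adjacency mapping. Only the node-set extraction is sketched here; the canonicalization and hashing steps are omitted, and ego_nodes is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List


def ego_nodes(adj: Dict[int, List[int]], center: int, radius: int) -> List[int]:
    """Return the sorted node ids within `radius` hops of `center`."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return sorted(seen)


# Linear chain 0-1-2-3; enumerate every (center, radius) pair for radius 0..1.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
egos = {(c, r): ego_nodes(chain, c, r) for c in chain for r in range(2)}
```

With n nodes and maximum radius r this enumerates n·(r+1) ego subgraphs, which is why caching canonical forms (max_cache) can pay off.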

synrfp.tokenizers.utils.atom_label_tuple(G: Molecule, v: int, node_attrs: List[str]) Tuple

Build node label tuple from selected attributes and degree.

Parameters:
  • G (Molecule) – Molecule with node data.

  • v (NodeId) – Node id.

  • node_attrs (List[str]) – Attribute keys to include.

Returns:

Label tuple.

Return type:

Tuple

synrfp.tokenizers.utils.batch_h64(items: Iterable[Any], *, seed: int = 0) List[int]

Hash a sequence deterministically.

Parameters:
  • items (Iterable[Any]) – Objects to hash.

  • seed (int) – Optional integer seed.

Returns:

List of 64-bit ints.

Return type:

List[int]
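
Python's built-in hash() is salted per process for strings, so a stable token hash has to come from a fixed digest. The following sketch shows one way to get seeded, reproducible 64-bit values; it assumes hashing of repr((seed, item)) with hashlib.blake2b, which is not necessarily how the library's helpers are implemented, and the function names are hypothetical.

```python
import hashlib
from typing import Any, Iterable, List


def stable_h64(item: Any, seed: int = 0) -> int:
    """Hash repr((seed, item)) to an unsigned 64-bit integer."""
    payload = repr((seed, item)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


def stable_batch_h64(items: Iterable[Any], *, seed: int = 0) -> List[int]:
    """Hash every item with the same seed, preserving input order."""
    return [stable_h64(item, seed) for item in items]


labels = [("C", 4), ("O", 2), ("C", 4)]
hashes = stable_batch_h64(labels, seed=1)
```

This approach is only deterministic for objects whose repr is itself stable (ints, strings, tuples of those), which is the case for label tuples.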

synrfp.tokenizers.utils.bond_label_tuple(G: Molecule, u: int, v: int, edge_attrs: List[str]) Tuple

Build edge label tuple from selected attributes.

Parameters:
  • G (Molecule) – Molecule with edge data.

  • u (NodeId) – First node.

  • v (NodeId) – Second node.

  • edge_attrs (List[str]) – Edge attribute keys to include.

Returns:

Label tuple.

Return type:

Tuple

Key classes and helpers

Typical responsibilities include:

  • defining the abstract BaseTokenizer interface,

  • providing concrete implementations such as WLTokenizer and NautyTokenizer,

  • specifying which node/edge attributes participate in labels, and

  • offering hashing helpers used to turn label tuples into stable 64-bit tokens.

Sketchers: synrfp.sketchers

Sketchers compress (signed) token multisets into fixed-size fingerprints. They implement the Δ/U → sketch stage.

class synrfp.sketchers.base.BaseSketch(seed: int = 1)

Bases: ABC

Abstract base class for set / multiset sketchers.

Subclasses must implement build() and may override describe().

Parameters:

seed (int) – Non-negative integer seed for reproducibility.

Raises:

ValueError – If seed is negative or not an integer.

Example

>>> class Dummy(BaseSketch):
...     def build(self, support): return Counter(support)
...
>>> sk = Dummy(seed=1)
>>> C = sk.build([1, 2, 2, 3]); C[2]
2
abstractmethod build(support: Iterable[int]) Any

Build a sketch from an unweighted iterable of integer tokens.

Parameters:

support (Iterable[int]) – Iterable of integer-encoded features (can repeat).

Returns:

Sketch object (type depends on subclass).

Return type:

Any

describe() str

Return a short usage example.

class synrfp.sketchers.base.WeightedSketch(m: int = 256, seed: int = 0, normalize: bool = True)

Bases: ABC

Abstract base for weighted (signed) sketchers.

Utilities provided:
  • input validation for pos/neg sparse multisets,

  • deterministic sparse→dense conversion,

  • signed / two-channel (pos,neg) representations,

  • exact reference similarities (weighted-Jaccard, cosine),

  • fluent config for normalization and dtype.

Concrete subclasses must implement build().

Parameters:
  • m (int) – (Optional) number of sketch samples (backend may use it).

  • seed (int) – RNG seed for deterministic behavior.

  • normalize (bool) – If True, helpers can L1-normalize outputs.

Raises:

ValueError – If parameters are invalid.

Example

>>> class Echo(WeightedSketch):
...     def build(self, pos, neg): return self.dicts_to_dense(pos, neg)[0]
...
>>> es = Echo(m=4, seed=0)
>>> vec, _ = es.dicts_to_dense({1:2},{2:1})
>>> vec.sum() != 0
True
abstractmethod build(pos: Mapping[int, int], neg: Mapping[int, int]) Any

Build a sketch for the signed multiset (pos - neg).

Parameters:
  • pos (Mapping[int, int]) – Mapping token -> non-negative count.

  • neg (Mapping[int, int]) – Mapping token -> non-negative count.

Returns:

Implementation-defined sketch object.

Return type:

Any

static cosine_similarity(vec_a: ndarray, vec_b: ndarray) float

Cosine similarity (safe with zeros).

Parameters:
  • vec_a (numpy.ndarray) – Vector A.

  • vec_b (numpy.ndarray) – Vector B.

Returns:

Cosine in [-1,1].

Return type:

float
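
A zero-safe cosine can be sketched as below. Returning 0.0 when either vector has zero norm is an assumption about what "safe with zeros" means here; safe_cosine is a hypothetical stand-alone function operating on plain sequences.

```python
import math
from typing import Sequence


def safe_cosine(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine similarity that returns 0.0 instead of dividing by zero."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


orthogonal = safe_cosine([1, 0], [0, 1])
opposite = safe_cosine([1, 2], [-1, -2])
with_zero = safe_cosine([0, 0], [1, 2])
```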

describe() str

Short usage snippet for subclasses.

Returns:

Example text.

Return type:

str

dicts_to_dense(pos: Mapping[int, int], neg: Mapping[int, int], index_map: Dict[int, int] | None = None, *, ensure_signed: bool = True) Tuple[ndarray, Dict[int, int]]

Convert sparse pos/neg dicts into a dense array.

Parameters:
  • pos (Mapping[int,int]) – Positive counts.

  • neg (Mapping[int,int]) – Negative counts.

  • index_map (Optional[Dict[int,int]]) – Optional precomputed token->index map.

  • ensure_signed (bool) – If True return (n,) signed array (pos-neg); else return (2,n) with [pos, neg] channels.

Returns:

(array, index_map).

Return type:

Tuple[numpy.ndarray, Dict[int,int]]
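
The sparse-to-dense conversion can be sketched with plain lists. The assumptions are labeled in the code: the index map is built from the sorted union of token keys (one deterministic choice), and lists stand in for NumPy arrays. to_dense is a hypothetical function, not the method itself.

```python
from typing import Dict, List, Mapping, Optional, Tuple


def to_dense(
    pos: Mapping[int, int],
    neg: Mapping[int, int],
    index_map: Optional[Dict[int, int]] = None,
    *,
    ensure_signed: bool = True,
) -> Tuple[list, Dict[int, int]]:
    """Convert sparse pos/neg counts into dense form plus the token->index map."""
    if index_map is None:
        # Assumption: sort the token keys so the layout is deterministic.
        index_map = {t: i for i, t in enumerate(sorted(set(pos) | set(neg)))}
    n = len(index_map)
    pos_dense: List[int] = [0] * n
    neg_dense: List[int] = [0] * n
    for token, count in pos.items():
        pos_dense[index_map[token]] = count  # KeyError if token is not in the map
    for token, count in neg.items():
        neg_dense[index_map[token]] = count
    if ensure_signed:
        return [p - q for p, q in zip(pos_dense, neg_dense)], index_map
    return [pos_dense, neg_dense], index_map  # two channels: [pos, neg]


signed, imap = to_dense({1: 2}, {2: 1})
channels, _ = to_dense({1: 2}, {2: 1}, ensure_signed=False)
```

Passing the same index_map to two calls keeps the vectors of two reactions aligned, which is what exact similarity computations need.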

exact_similarities_from_dicts(pos_a: Mapping[int, int], neg_a: Mapping[int, int], pos_b: Mapping[int, int], neg_b: Mapping[int, int], *, index_map: Dict[int, int] | None = None) Dict[str, float]

Compute exact similarities for two signed multisets.

Parameters:
  • pos_a (Mapping[int,int]) – Positive counts for A.

  • neg_a (Mapping[int,int]) – Negative counts for A.

  • pos_b (Mapping[int,int]) – Positive counts for B.

  • neg_b (Mapping[int,int]) – Negative counts for B.

  • index_map (Optional[Dict[int,int]]) – Optional shared index map.

Returns:

{“weighted_jaccard”: float, “cosine”: float}

Return type:

Dict[str,float]

last_index_map() Dict[int, int] | None

Return the last token→index map computed by dicts_to_dense().

Returns:

Shallow copy of index map or None.

Return type:

Optional[Dict[int,int]]

set_dtype(dtype: dtype) WeightedSketch

Configure dtype for dense arrays.

Parameters:

dtype (numpy.dtype) – NumPy dtype (e.g., np.float32, np.float64).

Returns:

self

Return type:

WeightedSketch

set_normalize(normalize: bool) WeightedSketch

Set whether helpers produce L1-normalized arrays.

Parameters:

normalize (bool) – True to enable L1 normalization.

Returns:

self

Return type:

WeightedSketch

static signed_to_pos_neg_arrays(vec: ndarray) Tuple[ndarray, ndarray]

Split a signed vector into non-negative positive/negative arrays.

Parameters:

vec (numpy.ndarray) – Signed vector.

Returns:

(pos, neg) arrays, both >= 0.

Return type:

Tuple[numpy.ndarray, numpy.ndarray]

validate_pos_neg(pos: Mapping[int, int], neg: Mapping[int, int]) None

Validate pos/neg dictionaries.

Parameters:
  • pos (Mapping[int,int]) – Positive token counts.

  • neg (Mapping[int,int]) – Negative token counts.

Raises:
  • TypeError – If types invalid.

  • ValueError – If keys not int or counts negative.

static weighted_jaccard_signed(vec_a: ndarray, vec_b: ndarray) float

Weighted-Jaccard for signed vectors via two-channel decomposition.

Parameters:
  • vec_a (numpy.ndarray) – Signed vector A.

  • vec_b (numpy.ndarray) – Signed vector B.

Returns:

Similarity in [0,1].

Return type:

float
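
The two-channel decomposition works by splitting each signed vector into non-negative positive and negative parts, concatenating them, and applying the usual weighted Jaccard (sum of minima over sum of maxima). A minimal sketch, assuming plain lists and a 0.0 result when both vectors are all zeros; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_pos_neg(vec: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a signed vector into non-negative (pos, neg) parts."""
    pos = [max(x, 0) for x in vec]
    neg = [max(-x, 0) for x in vec]
    return pos, neg


def weighted_jaccard_two_channel(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Weighted Jaccard of signed vectors via [pos | neg] concatenation."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    pos_a, neg_a = split_pos_neg(vec_a)
    pos_b, neg_b = split_pos_neg(vec_b)
    chan_a = pos_a + neg_a
    chan_b = pos_b + neg_b
    numerator = sum(min(x, y) for x, y in zip(chan_a, chan_b))
    denominator = sum(max(x, y) for x, y in zip(chan_a, chan_b))
    return numerator / denominator if denominator else 0.0  # assumption for 0/0


similarity = weighted_jaccard_two_channel([2, -1, 0], [1, -1, 3])
sign_flip = weighted_jaccard_two_channel([1], [-1])
```

Because positive and negative mass live in separate channels, a token gained in one reaction and lost in another contributes nothing to the numerator, as the sign_flip case shows.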

class synrfp.sketchers.parity_fold.ParityFold(bits: int = 2048, seed: int = 0)

Bases: BaseSketch

Parity-based folding sketcher (unweighted tokens → binary bit vector).

Each token t is mapped to a bit index via a deterministic 64-bit hash and the sketch’s seed, then the bit is toggled (XOR). If a token appears an even number of times, its contribution cancels out; odd multiplicities flip the corresponding bit.

The result is a compact binary fingerprint useful for fast similarity via Hamming distance or Tanimoto over bits.

Parameters:
  • bits (int) – Length of the binary sketch (number of bits).

  • seed (int) – Non-negative integer seed for the internal hash mapping.

Raises:

ValueError – If bits is not positive or seed is negative.

build(support: Iterable[int]) ndarray

Build a parity-folded binary sketch from an unweighted token stream.

Internally, this:

  1. Collapses support to counts via collections.Counter.

  2. Retains only tokens with odd multiplicity (parity 1).

  3. Maps each such token t to an index idx = _h64(('pf', t), seed=seed) % bits.

  4. Sets the corresponding bits, yielding a 0/1 vector.

Parameters:

support (Iterable[int]) – Iterable of integer tokens.

Returns:

Binary sketch of length bits (dtype uint8).

Return type:

numpy.ndarray
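
The steps above can be sketched with the standard library. Several details are assumptions: h64 is a blake2b-based stand-in for _h64, a list replaces the uint8 array, and bits are toggled with XOR as in the class description (so two odd-parity tokens that collide on one index cancel out).

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def h64(obj, seed: int = 0) -> int:
    """Stable seeded 64-bit hash; a stand-in for the library's _h64."""
    payload = repr((seed, obj)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


def parity_fold(support: Iterable[int], bits: int = 16, seed: int = 0) -> List[int]:
    """Fold a token stream into a 0/1 list of length `bits` by parity."""
    if bits <= 0 or seed < 0:
        raise ValueError("bits must be positive and seed non-negative")
    counts = Counter(support)                      # step 1: collapse to counts
    sketch = [0] * bits
    for token, count in counts.items():
        if count % 2 == 1:                         # step 2: keep odd multiplicity
            idx = h64(("pf", token), seed) % bits  # step 3: token -> bit index
            sketch[idx] ^= 1                       # step 4: toggle the bit
    return sketch


fp_with_pair = parity_fold([1, 2, 2, 3])  # token 2 appears twice and cancels
fp_without = parity_fold([1, 3])
```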

static describe() str

Return a brief usage example for ParityFold.

Returns:

Example code snippet.

Return type:

str

class synrfp.sketchers.minhash_sketch.MinHashSketch(m: int = 256, seed: int = 0, use_datasketch: bool = True)

Bases: BaseSketch

Set-based MinHash sketch for approximating Jaccard similarity.

This sketch treats the input as a set of tokens (multiplicities are ignored). It computes a fixed-length signature such that the fraction of shared components approximates the Jaccard index between two sets.

If datasketch is available and use_datasketch is True, the implementation delegates to datasketch.MinHash. Otherwise a deterministic fallback based on repeated 64-bit hashing is used.

Parameters:
  • m (int) – Number of hash permutations (length of the sketch).

  • seed (int) – Non-negative integer seed for all hash permutations.

  • use_datasketch (bool) – Whether to use datasketch if available.

Raises:

ValueError – If m is not positive or seed is negative.

build(support: Iterable[int]) List[int]

Build a MinHash signature from an unweighted token stream.

Multiplicities in support are ignored; only distinct tokens contribute to the sketch.

Parameters:

support (Iterable[int]) – Iterable of integer tokens.

Returns:

MinHash signature as a list of length m.

Return type:

list[int]
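
A fallback "based on repeated 64-bit hashing" can be sketched as one seeded hash function per signature position, taking the minimum over the distinct tokens. This is an illustration of the general MinHash idea, not the library's fallback: h64 is a blake2b-based stand-in, and the all-ones sentinel for an empty input is an assumption.

```python
import hashlib
from typing import Iterable, List


def h64(obj, seed: int = 0) -> int:
    """Stable seeded 64-bit hash; a stand-in for the library's _h64."""
    payload = repr((seed, obj)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


def minhash_signature(support: Iterable[int], m: int = 8, seed: int = 0) -> List[int]:
    """Length-m MinHash signature; multiplicities are ignored."""
    if m <= 0 or seed < 0:
        raise ValueError("m must be positive and seed non-negative")
    tokens = set(support)  # only distinct tokens contribute
    if not tokens:
        return [2 ** 64 - 1] * m  # assumption: sentinel for the empty set
    # Position i uses the hash family member indexed by i.
    return [min(h64((i, t), seed) for t in tokens) for i in range(m)]


sig_a = minhash_signature([1, 2, 2, 3])
sig_b = minhash_signature([3, 2, 1])
agreement = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two inputs with the same distinct tokens produce identical signatures, so their position-wise agreement, which is the quantity jaccard_minhash reports, equals 1.0.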

static describe() str

Return a brief usage example for MinHashSketch.

Returns:

Example code snippet.

Return type:

str

class synrfp.sketchers.cw_sketch.CWSketch(m: int = 256, seed: int = 0, normalize: bool = True)

Bases: WeightedSketch

Consistent Weighted Sampling (CWS) sketch for weighted Jaccard.

This sketch operates on signed sparse multisets (pos, neg). It converts them into a signed dense vector via dicts_to_dense(), splits into separate non-negative positive/negative channels, concatenates them, and applies Consistent Weighted Sampling.

If datasketch is available, it delegates to datasketch.WeightedMinHashGenerator. Otherwise it uses a deterministic ICWS-like fallback implementation.

Parameters:
  • m (int) – Number of CWS samples (length of the sketch).

  • seed (int) – Random seed for the sampler.

  • normalize (bool) – If True, dense helpers L1-normalize the signed vector prior to splitting. Scaling does not change the weighted Jaccard but can improve numerical stability.

Raises:

ValueError – If arguments are invalid.

build(pos: Mapping[int, int], neg: Mapping[int, int]) ndarray

Build a length-m CWS hash signature for a signed multiset.

Internally this:

  1. Uses dicts_to_dense() with ensure_signed=True to obtain a 1D signed dense vector.

  2. Splits it into non-negative positive/negative arrays using WeightedSketch.signed_to_pos_neg_arrays().

  3. Concatenates these into a non-negative weight vector.

  4. Applies either datasketch or a deterministic fallback to draw m CWS samples.

Parameters:
  • pos (Mapping[int,int]) – Positive token counts.

  • neg (Mapping[int,int]) – Negative token counts.

Returns:

Array of sampled indices (hash values) of length m.

Return type:

numpy.ndarray

static describe() str

Return a brief usage example for CWSketch.

Returns:

Example code snippet.

Return type:

str

Key classes

Typical responsibilities include:

  • defining unweighted and weighted sketcher interfaces (BaseSketch, WeightedSketch),

  • implementing binary parity-fold sketches (ParityFold),

  • implementing MinHash-based sketches for Jaccard estimation (MinHashSketch), and

  • implementing consistent weighted sampling for signed deltas (CWSketch).

Batch encoding: synrfp.encoder

The synrfp.encoder module provides a convenience wrapper for batch encoding lists of reaction SMILES into SynRFP fingerprints, with optional parallelization via joblib.

Key class

Typical responsibilities of SynRFPEncoder include:

  • turning a list of RSMI strings into a 2D NumPy array of fingerprints,

  • exposing the same configuration knobs as synrfp.synrfp() (tokenizer, radius, sketch type, bit length, seed), and

  • handling multi-process or multi-threaded execution transparently when n_jobs > 1.