API Reference

This page provides an overview of the public Python API exposed by synrfp. It focuses on:

core fingerprint builders in synrfp,
graph containers and reaction helpers in synrfp.graph,
tokenization backends in synrfp.tokenizers,
sketching backends in synrfp.sketchers, and
batch encoding utilities in synrfp.encoder.

If you are looking for high-level usage examples, see the README on GitHub or the Getting Started page.

Top-level API: `synrfp`

The synrfp module collects the main user-facing entry points: fingerprint engines, convenience wrappers, and similarity utilities.

class synrfp.BatchEncoder(tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, *, n_jobs: int = 1, batch_size: int | None = None, verbose: int | None = None, backend: str = 'loky')

Bases: object

Batch encoder for mapping reaction SMILES (RSMI) to SynRFP fingerprints.

The encoder wraps the top-level synrfp function and provides:

configuration stored on the instance (tokenizer/sketch/seed/etc.),
single-item and batched encoding,
optional parallelization via joblib.Parallel, and
optional chunked processing via batch_size for memory control.

Parameters:

tokenizer (str) – Tokenizer name (‘wl’, ‘nauty’, ‘morgan’, ‘path’, …).
radius (int) – Neighborhood / iteration radius for tokenizers that use it.
sketch (str) – Sketcher name (‘parity’, ‘minhash’, ‘cw’, ‘srp’, …).
bits (int) – Length of the final bit-vector for bit-based sketches.
m (int) – Sketch size parameter (e.g. number of hash samples/projections).
seed (int) – RNG seed forwarded to tokenizers/sketchers.
mode (str) – Token aggregation mode: ‘delta’ or ‘union’.
node_attrs (Sequence[str] | None) – Optional node attribute names passed to tokenizers.
edge_attrs (Sequence[str] | None) – Optional edge attribute names passed to tokenizers.
n_jobs (int) – Number of jobs for parallel encoding (1 = serial).
batch_size (int | None) – Maximum number of reactions per batch. If None, no chunking.
verbose (int | None) – Verbosity forwarded to joblib (int). If None, 0 is used.
backend (str) – joblib backend (e.g. ‘loky’, ‘threading’).

Example:

>>> from synrfp.encode.batch import BatchEncoder
>>> enc = BatchEncoder(tokenizer='wl', sketch='parity', bits=1024, n_jobs=1)
>>> single_fp = enc.encode_one('CCO>>CCO')           # numpy array shape (1024,)
>>> many = enc.encode_many(['CCO>>CCO', 'CCO>>CCO'])  # shape (2, 1024)

A small-batch example (useful for constrained memory / large inputs):

>>> enc = BatchEncoder(batch_size=128, n_jobs=2)
>>> X = enc.encode_many(list_of_rsmi)  # processes list_of_rsmi in chunks of 128

classmethod encode(rsmi_list: Sequence[str], *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, batch_size: int | None = None) → ndarray

Convenience classmethod to encode many reactions (serial execution).

Parameters:

rsmi_list (Sequence[str]) – Sequence of reaction SMILES.
tokenizer (str) – Tokenizer name.
radius (int) – Neighborhood radius.
sketch (str) – Sketcher name.
bits (int) – Fingerprint length for bit-based sketches.
m (int) – Sketch size parameter.
seed (int) – RNG seed.
mode (str) – Token aggregation mode: ‘delta’ or ‘union’.
node_attrs (Sequence[str] | None) – Node attribute names for tokenizer.
edge_attrs (Sequence[str] | None) – Edge attribute names for tokenizer.
batch_size (int | None) – Maximum chunk size for encode_many.

Returns:

2D numpy array of fingerprints.

Return type:

numpy.ndarray

Example:

>>> BatchEncoder.encode(['CCO>>CCO'], tokenizer='wl', sketch='parity', bits=64)
array([[...]], dtype=int)

encode_many(rsmi_list: Sequence[str]) → ndarray

Encode a sequence of reaction SMILES into a 2D numpy array.

If batch_size is set and smaller than the input length, encoding proceeds in chunks of at most batch_size and results are concatenated.

Parameters:: rsmi_list (Sequence[str]) – Sequence of reaction SMILES strings.
Returns:: 2D numpy array with shape (N, L), where N = len(rsmi_list) and L = fingerprint length.
Return type:: numpy.ndarray
Raises:: ValueError – If fingerprint lengths are inconsistent across batches.
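The chunking behaviour described for encode_many can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: encode_in_chunks and toy_encode are hypothetical names, rows are plain lists rather than NumPy arrays, and rejecting a batch_size below 1 is an assumption.

```python
from typing import Callable, List, Optional, Sequence


def encode_in_chunks(
    rsmi_list: Sequence[str],
    encode_one: Callable[[str], List[int]],
    batch_size: Optional[int] = None,
) -> List[List[int]]:
    """Encode items chunk by chunk and concatenate the resulting rows."""
    if batch_size is not None and batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    n = len(rsmi_list)
    # No chunking when batch_size is None or not smaller than the input.
    step = n if batch_size is None or batch_size >= n else batch_size
    rows: List[List[int]] = []
    expected_len: Optional[int] = None
    for start in range(0, n, max(step, 1)):
        chunk = rsmi_list[start:start + step]
        for fp in (encode_one(rsmi) for rsmi in chunk):
            if expected_len is None:
                expected_len = len(fp)
            elif len(fp) != expected_len:
                # Mirrors the documented ValueError for inconsistent lengths.
                raise ValueError("inconsistent fingerprint lengths across batches")
            rows.append(fp)
    return rows


def toy_encode(rsmi: str) -> List[int]:
    """Hypothetical stand-in encoder: 4 bits derived from the string length."""
    return [(len(rsmi) >> i) & 1 for i in range(4)]


X = encode_in_chunks(["CCO>>CCO", "CC>>CC", "C>>C"], toy_encode, batch_size=2)
X_unchunked = encode_in_chunks(["CCO>>CCO", "CC>>CC", "C>>C"], toy_encode)
```

Chunking only bounds how many items are processed at once; the concatenated result is the same as the unchunked one.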

encode_one(rsmi: str) → ndarray

Encode a single reaction SMILES into a 1D numpy array.

Parameters:: rsmi (str) – Reaction SMILES string.
Returns:: 1D numpy array of ints (length == bits).
Return type:: numpy.ndarray

class synrfp.SynRFP(tokenizer: BaseTokenizer, radius: int = 2, sketch: BaseSketch | None = None, weighted_sketch: WeightedSketch | None = None)

Bases: object

Build a SynRFP fingerprint for a single reaction:

one reactant Molecule
one product Molecule

Exactly one of sketch or weighted_sketch must be provided.

Parameters:

tokenizer (BaseTokenizer) – Tokenizer instance (e.g. WLTokenizer, NautyTokenizer, MorganTokenizer, PathTokenizer).
radius (int) – Neighborhood radius for the tokenizer.
sketch (BaseSketch | None) – Unweighted sketcher (e.g. ParityFold, MinHashSketch).
weighted_sketch (WeightedSketch | None) – Weighted sketcher (e.g. CWSketch, SRPSketch).

static describe() → str

Example usage:

>>> fp = SynRFP(tokenizer=WLTokenizer(), radius=2, sketch=ParityFold(1024))
>>> res = fp.fingerprint(reactant_G, product_G)

Returns:: Example usage string.
Return type:: str

fingerprint(reactant: Molecule, product: Molecule, *, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) → SynRFPResult

Compute the reaction fingerprint for a pair of molecules.

Parameters:

reactant (Molecule) – Reactant molecular graph.
product (Molecule) – Product molecular graph.
mode (str) – Token combination mode: 'delta' for the signed difference P−R (default), or 'union' for the union counts R+P.
node_attrs (Sequence[str] | None) – Optional node attribute names for tokenizer.
edge_attrs (Sequence[str] | None) – Optional edge attribute names for tokenizer.

Returns:

A SynRFPResult with tokens, support and sketch.

Return type:

SynRFPResult

Raises:

TypeError – If inputs are not Molecule instances.
ValueError – If mode is invalid.
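The two combination modes can be illustrated with collections.Counter. This is a minimal sketch, assuming tokens are integer keys with non-negative counts; combine_tokens is a hypothetical helper, not part of synrfp.

```python
from collections import Counter


def combine_tokens(tokens_r: Counter, tokens_p: Counter, mode: str = "delta") -> Counter:
    """Combine reactant/product token multisets as a signed delta or a union."""
    if mode == "delta":
        combined = Counter(tokens_p)
        # subtract() keeps negative counts, unlike the `-` operator on Counters.
        combined.subtract(tokens_r)
    elif mode == "union":
        combined = tokens_r + tokens_p
    else:
        raise ValueError(f"invalid mode: {mode!r}")
    # Drop tokens whose net contribution is zero.
    return Counter({t: c for t, c in combined.items() if c != 0})


tokens_r = Counter({101: 2, 202: 1})
tokens_p = Counter({101: 1, 303: 1})
delta = combine_tokens(tokens_r, tokens_p, "delta")
union = combine_tokens(tokens_r, tokens_p, "union")
support = sorted(delta)  # token keys with a nonzero contribution
```

In 'delta' mode, tokens present on both sides in equal numbers cancel, so only the changed part of the reaction contributes to the support.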

class synrfp.SynRFPResult(tokens_R: Counter, tokens_P: Counter, delta: Counter, support: List[int], sketch: object, mode: str = 'delta')

Bases: object

Container for outputs of a single fingerprinting call.

Parameters:

tokens_R (collections.Counter) – Token multiset for the reactant graph.
tokens_P (collections.Counter) – Token multiset for the product graph.
delta (collections.Counter) – Token counts summarising the transformation, depending on mode: the signed difference P−R if mode='delta', or the union counts R+P if mode='union'.
support (list[int]) – List of token keys with nonzero contribution (delta or union).
sketch (object) – Sketch object (bytes, list, or array) from the compressor.
mode (str) – Fingerprint mode, either 'delta' or 'union'.

as_array() → ndarray

Return the underlying sketch as a 1D numpy integer array.

This works for all sketcher types:

ParityFold: 0/1 array
MinHashSketch: hash values
CWSketch: sample indices
SRPSketch: sign pattern (+1/-1)

Returns:: 1D numpy array representation of the sketch.
Return type:: numpy.ndarray

delta: Counter

static describe() → str

Example usage:

>>> # assume `res` is a SynRFPResult
>>> print(res)
SynRFPResult(tokens_R=10 tokens, tokens_P=8 tokens,
support=3, mode='delta', sketch_type=bytearray)

Returns:: Example usage string.
Return type:: str

mode: str = 'delta'

sketch: object

support: List[int]

to_binary() → List[int]

Return the sketch stored in this result as a plain list of 0/1 bits.

Only works for binary sketchers (e.g. ParityFold). For non-binary sketchers (MinHash, CWSketch, SRP) a TypeError is raised.

Returns:: Binary fingerprint as list of 0/1 bits.
Return type:: list[int]
Raises:: TypeError – If the underlying sketch cannot be interpreted as bits.

tokens_P: Counter

tokens_R: Counter

synrfp.build_graph_from_printout(nodes: Dict[int, Dict], edges: Dict[tuple[int, int], Dict]) → Molecule

Helper to convert “printout” dicts directly into a Molecule.

Parameters:

nodes (Dict[int, Dict]) – Mapping from node ID to attribute dict.
edges (Dict[tuple[int, int], Dict]) – Mapping from (u, v) edges (with u < v) to attribute dict.

Returns:

A fresh Molecule instance.

Return type:

Molecule

Example:

>>> from synrfp import build_graph_from_printout
>>> nodes = {0: {'element': 'C'}, 1: {'element': 'O'}}
>>> edges = {(0, 1): {'order': 1.5}}
>>> G = build_graph_from_printout(nodes, edges)

synrfp.jaccard_minhash(h1: list | tuple, h2: list | tuple) → float

Estimate Jaccard similarity from two MinHash signature arrays.

Parameters:

h1 (list or tuple) – First MinHash hash‐value sequence.
h2 (list or tuple) – Second MinHash sequence (must be same length).

Returns:

Fraction of positions where h1[i] == h2[i].

Return type:

float
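The estimator described above amounts to a position-wise comparison. The following is a sketch only: jaccard_from_signatures is a hypothetical name, and raising on a length mismatch and returning 0.0 for empty signatures are assumptions not stated in this reference.

```python
from typing import Sequence


def jaccard_from_signatures(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length signatures agree."""
    if len(h1) != len(h2):
        raise ValueError("signatures must have the same length")
    if not h1:
        return 0.0  # assumption: empty signatures carry no evidence
    matches = sum(1 for a, b in zip(h1, h2) if a == b)
    return matches / len(h1)


sim = jaccard_from_signatures([3, 7, 7, 9], [3, 7, 1, 9])  # 3 of 4 positions agree
```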

synrfp.synrfp(rsmi: str, *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) → List[int]

Convert a reaction SMILES (RSMI) into a binary fingerprint bit-vector.

Internally:

A tokenizer (WL, Nauty, Morgan, Path) converts each side into multiset tokens.
Depending on mode, either token delta (P−R) or union (R+P) is computed.
A sketcher (parity, minhash, cw, srp) converts the token set/weights into a fixed-size sketch.
The sketch is finally mapped into a binary vector of length bits.

Parameters

rsmi (str)

Reaction SMILES, e.g. "CCO>>C=C.O".

tokenizer (str, default "wl")

Which tokenizer to use:

"wl" : Weisfeiler–Lehman style tokenizer.
"nauty" : Nauty-based canonical labeling tokenizer.
"morgan" : Morgan/ECFP-style neighborhood tokenizer.
"path" : simple path-based tokenizer.

radius (int, default 2)

Neighborhood radius for the tokenizer (ignored by some tokenizers if not applicable).

sketch (str, default "parity")

Which sketcher to use:

"parity" : parity-folding into a binary vector.
"minhash" : MinHash signature, then mapped to bits.
"cw" : consistent weighted sampling (CWS) sketch, then mapped to bits.
"srp" : signed random projection sketch (cosine-oriented).

bits (int, default 1024)

Length of the final binary fingerprint. For sketch="parity", this is the internal bit-length of ParityFold. For "minhash", "cw" and "srp", it controls the final bin count used by signature_to_bits().

m (int, default 256)

Number of hash samples/projections for MinHash, CWSketch, or SRP.

seed (int, default 1)

Random seed for reproducibility.

mode ({"delta", "union"}, default "delta")

Token combination mode:

"delta" : signed difference P−R.
"union" : union of tokens appearing on either side.

node_attrs (Sequence[str] or None, optional)

Node attribute names passed to the tokenizer (e.g. ["element"]).

edge_attrs (Sequence[str] or None, optional)

Edge attribute names passed to the tokenizer (e.g. ["order"]).

Returns

list[int]: Fingerprint as a list of 0/1 bits of length bits.

Raises

ValueError: On invalid tokenizer, sketch, or mode names.
RuntimeError: If required dependencies (e.g. pynauty or datasketch) are missing.

synrfp.tanimoto_bits(b1: bytearray | List[int] | ndarray, b2: bytearray | List[int] | ndarray) → float

Compute the Tanimoto (Jaccard) similarity between two binary‐bit sketches.

Accepts bytearray, list[int], or numpy.ndarray of 0/1.

Parameters:

b1 (bytearray or List[int] or numpy.ndarray) – First bit array.
b2 (bytearray or List[int] or numpy.ndarray) – Second bit array.

Returns:

Intersection size divided by union size, or 0.0 if union is zero.

Return type:

float
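As a reference point, the intersection-over-union rule stated above can be written in a few lines. This is an illustrative sketch (tanimoto is a hypothetical name); the length check is an assumption.

```python
from typing import Sequence


def tanimoto(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Intersection size over union size for two 0/1 sequences."""
    if len(b1) != len(b2):
        raise ValueError("bit arrays must have the same length")
    inter = sum(1 for x, y in zip(b1, b2) if x and y)
    union = sum(1 for x, y in zip(b1, b2) if x or y)
    # Two all-zero fingerprints have an empty union; return 0.0 as documented.
    return inter / union if union else 0.0


sim_bits = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])   # 1 shared bit, 3 set overall
sim_empty = tanimoto(bytearray(4), [0, 0, 0, 0])  # empty union
```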

Key classes and functions

Typical responsibilities of the top-level API include:

converting reaction SMILES (RSMI) directly into fixed-length fingerprints,
configuring a SynRFP engine with a tokenizer and sketcher,
exposing similarity utilities for binary and MinHash sketches, and
providing a simple batch encoder for lists of reaction SMILES.

Graph utilities: `synrfp.graph`

The synrfp.graph subpackage defines light-weight graph containers and helpers for representing reactions.

class synrfp.graph.graph_data.GraphData(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict], _adj: Dict[int, List[int]] | None = None)

Bases: object

Lightweight labeled graph container.

Parameters:

nodes (Dict[NodeId, Dict]) – Mapping from node id to attribute dict (e.g., element, charge).
edges (Dict[Edge, Dict]) – Mapping from edge tuple (u, v) with u<v to attribute dict (e.g., order).
_adj (Optional[Dict[NodeId, List[NodeId]]]) – Internal adjacency cache, computed lazily.

property adj: Dict[int, List[int]]

Lazily compute and cache adjacency list.

Returns:: Mapping from node id to sorted neighbor list.
Return type:: Dict[int, List[int]]

degree(v: int) → int

Get degree of node v.

Parameters:: v (int) – Node identifier.
Returns:: Degree count.
Return type:: int

edge_attr(u: int, v: int) → Dict

Retrieve attribute dict for edge (u, v).

Parameters:

u (int) – First node.
v (int) – Second node.

Returns:

Edge attributes.

Return type:

Dict

Raises:

KeyError – If edge not present.

edges: Dict[Tuple[int, int], Dict]

static from_dicts(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict]) → GraphData

Construct GraphData ensuring edge keys are ordered (u < v).

Parameters:

nodes (Dict[int, Dict]) – Node attribute mapping.
edges (Dict[Tuple[int, int], Dict]) – Edge attribute mapping.

Returns:

Initialized GraphData.

Return type:

GraphData
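The edge-key ordering enforced by from_dicts, together with the sorted adjacency mapping described under adj, can be sketched with plain dictionaries. normalize_edges and build_adjacency are hypothetical helpers written for illustration; they are not GraphData methods.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def normalize_edges(edges: Dict[Edge, dict]) -> Dict[Edge, dict]:
    """Re-key every edge so that the smaller node id comes first (u < v)."""
    out: Dict[Edge, dict] = {}
    for (u, v), attrs in edges.items():
        key = (u, v) if u < v else (v, u)
        out[key] = dict(attrs)
    return out


def build_adjacency(nodes: Dict[int, dict], edges: Dict[Edge, dict]) -> Dict[int, List[int]]:
    """Map each node id to its sorted neighbor list."""
    adj: Dict[int, List[int]] = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {n: sorted(nbrs) for n, nbrs in adj.items()}


nodes = {0: {"element": "C"}, 1: {"element": "C"}, 2: {"element": "O"}}
edges = normalize_edges({(1, 0): {"order": 1.0}, (2, 1): {"order": 1.0}})
adj = build_adjacency(nodes, edges)
degree_of_1 = len(adj[1])  # the degree is the neighbor count
```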

static from_nx_graph(G: Graph) → GraphData

Construct GraphData from a NetworkX Graph.

Parameters:: G (nx.Graph) – NetworkX graph with node and edge attributes.
Returns:: Initialized GraphData.
Return type:: GraphData

nodes: Dict[int, Dict]

class synrfp.graph.reaction.Reaction(reactant: Molecule, product: Molecule)

Bases: object

Represents a chemical reaction with a single reactant and a single product graph.

Parameters:

reactant (Molecule) – Molecular graph of the reactant.
product (Molecule) – Molecular graph of the product.

static from_graph(reactant_graph: Graph, product_graph: Graph) → Reaction

Create a Reaction from two NetworkX graphs.

Parameters:

reactant_graph (nx.Graph) – NetworkX Graph for reactant.
product_graph (nx.Graph) – NetworkX Graph for product.

Returns:

Reaction instance.

Return type:

Reaction

static from_rsmi(rsmi: str) → Reaction

Create a Reaction from an RSMI string using synkit IO.

Parameters:: rsmi (str) – Reaction SMILES string.
Returns:: Reaction with reactant and product Molecule.
Return type:: Reaction
Raises:: ValueError – If parsing fails.

help() → str

Show usage examples for Reaction.

Returns:: Usage guide.
Return type:: str

product: Molecule

reactant: Molecule

to_dataframe() → DataFrame

Summarize reaction graphs as a pandas DataFrame.

Returns:: DataFrame with columns ['side', 'n_nodes', 'n_edges'].
Return type:: pd.DataFrame

Key classes

Typical responsibilities include:

representing labeled molecular graphs via GraphData,
constructing Reaction objects from RSMI or from existing networkx graphs, and
providing small convenience methods for introspection and tabular summaries.

Tokenizers: `synrfp.tokenizers`

Tokenizers map molecular graphs to multisets of integer tokens (e.g. WL subtree hashes). They are responsible for the graph → tokens stage.

class synrfp.tokenizers.base.BaseTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)

Bases: ABC

Abstract base for graph tokenizers (graph → multiset of integer tokens).

Parameters:

node_attrs (Optional[Sequence[str]]) – Node attribute keys to include in labels.
edge_attrs (Optional[Sequence[str]]) – Edge attribute keys to include in labels.

Example

>>> from collections import Counter
>>> from synrfp.tokenizers.base import BaseTokenizer
>>> class Dummy(BaseTokenizer):
...     def tokens_graph(self, G, radius): return Counter({0: len(G.nodes)})
...

static describe() → str

Return a generic usage example for tokenizers.

Returns:: Example code snippet.
Return type:: str

abstractmethod tokens_graph(G: Molecule, radius: int) → Counter

Generate tokens for a single Molecule instance.

Parameters:

G (Molecule) – Molecule instance to tokenize.
radius (int) – Non-negative neighborhood radius.

Returns:

Multiset of tokens (hashed neighborhood labels).

Return type:

Counter

tokens_side(graphs: Sequence[Molecule], radius: int) → Counter

Generate tokens across multiple graphs (e.g., reaction sides).

Parameters:

graphs (Sequence[Molecule]) – Sequence of Molecule objects.
radius (int) – Non-negative neighborhood radius.

Returns:

Combined multiset of tokens for all graphs.

Return type:

Counter
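One plausible way to combine per-graph token multisets is to sum their counts. The sketch below assumes exactly that; the reference does not show how tokens_side is implemented, and toy_tokens_graph is a hypothetical tokenizer that emits one token per atom symbol.

```python
from collections import Counter
from typing import Callable, Sequence


def tokens_side(
    graphs: Sequence[Sequence[str]],
    tokens_graph: Callable[[Sequence[str], int], Counter],
    radius: int,
) -> Counter:
    """Sum the token multisets of all graphs on one side of a reaction."""
    total: Counter = Counter()
    for graph in graphs:
        total.update(tokens_graph(graph, radius))
    return total


def toy_tokens_graph(graph: Sequence[str], radius: int) -> Counter:
    """Hypothetical tokenizer: one token per atom, ignoring radius."""
    return Counter(sum(ord(ch) for ch in symbol) for symbol in graph)


# Two "molecules" given as lists of element symbols (ethanol heavy atoms, water).
side_tokens = tokens_side([["C", "C", "O"], ["O"]], toy_tokens_graph, radius=0)
```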

class synrfp.tokenizers.wl.WLTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, use_nx: bool = False, require_networkx: bool = False)

Bases: BaseTokenizer

Weisfeiler–Lehman subtree tokenizer (edge-aware; k=0..r tokens).

Node labels: selected node attrs + degree. Edge labels: selected bond attrs (e.g., order, aromaticity).

The tokenizer supports two backends:

A lightweight in-house WL refinement loop (default).
An optional NetworkX-based WL implementation (networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes()), enabled via use_nx=True.

Example:

>>> from synrfp.tokenizers.wl import WLTokenizer
>>> tok = WLTokenizer(node_attrs=['element'], edge_attrs=['order'])
>>> isinstance(tok, WLTokenizer)
True

static describe() → str

Return a usage example for the WLTokenizer.

Returns:: Example code snippet.
Return type:: str

tokens_graph(G: Molecule, radius: int) → Counter

Tokenize a molecular graph via edge-aware WL subtree hashing.

The behaviour is controlled by the use_nx flag:

If use_nx=True and NetworkX WL hashing is available, this method uses networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes() on a temporary NetworkX graph with precomputed atom/bond labels, and folds the resulting hex hashes into integer tokens using synrfp.tokenizers.utils._h64().
Otherwise, a compact in-house WL implementation is used that performs the refinement directly on the Molecule object.

Parameters:

G (Molecule) – Molecular graph to tokenize.
radius (int) – Number of WL iterations (k >= 0). k=0 returns only the initial atom-level subtree labels; higher values add increasingly larger neighbourhoods.

Returns:

Counter mapping integer subtree-hash tokens to their multiplicities.

Return type:

collections.Counter

Raises:

ValueError – If radius is negative.
RuntimeError – If use_nx=True, require_networkx=True and NetworkX WL hashing is not available.
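For intuition, an edge-aware WL refinement that emits tokens for every iteration k=0..radius can be sketched as follows. This is a simplified illustration, not the in-house backend: h64 is a stand-in built on hashlib (the actual _h64 is not documented here), graphs are plain node/edge dicts, and the labels use only 'element', degree, and 'order'.

```python
import hashlib
from collections import Counter
from typing import Any, Dict, List, Tuple


def h64(obj: Any) -> int:
    """Stable 64-bit hash of repr(obj); a stand-in, not synrfp's _h64."""
    digest = hashlib.blake2b(repr(obj).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def wl_tokens(nodes: Dict[int, dict], edges: Dict[Tuple[int, int], dict], radius: int) -> Counter:
    """Collect WL subtree tokens for iterations k = 0..radius."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    adj: Dict[int, List[Tuple[int, str]]] = {v: [] for v in nodes}
    for (u, v), attrs in edges.items():
        bond = repr(attrs.get("order"))  # bond label as a sortable string
        adj[u].append((v, bond))
        adj[v].append((u, bond))
    # k = 0: atom-level labels from the element symbol and the degree.
    labels = {v: h64((nodes[v].get("element"), len(adj[v]))) for v in nodes}
    tokens = Counter(labels.values())
    for _ in range(radius):
        # Each new label hashes the old label plus the sorted multiset of
        # (bond label, neighbor label) pairs; all nodes update simultaneously.
        labels = {
            v: h64((labels[v], tuple(sorted((bond, labels[u]) for u, bond in adj[v]))))
            for v in nodes
        }
        tokens.update(labels.values())
    return tokens


# Propane heavy-atom skeleton: C-C-C with single bonds.
nodes = {0: {"element": "C"}, 1: {"element": "C"}, 2: {"element": "C"}}
edges = {(0, 1): {"order": 1.0}, (1, 2): {"order": 1.0}}
propane_tokens = wl_tokens(nodes, edges, radius=1)
```

Each node contributes one token per iteration, so the multiset holds len(nodes) * (radius + 1) tokens in total, and the two symmetric end carbons share the same token at every iteration.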

class synrfp.tokenizers.nauty.NautyCanonicalizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)

Bases: object

Nauty-style canonicalizer implemented with NetworkX primitives.

This class computes a canonical labeling / signature for a NetworkX graph using equitable partition refinement + backtracking search similar to the nauty algorithm. It’s designed to be reasonably robust for small ego subgraphs (typical use-case: ego-subgraphs extracted around a node).

Parameters:

node_attrs (list[str] | None) – node attribute keys used for initial partitioning/refinement.
edge_attrs (list[str] | None) – edge attribute keys used when including edge attributes in the canonical label.

canonical_form(G: Graph, return_aut: bool = False, remap_aut: bool = False, return_orbits: bool = False, return_perm: bool = False, max_depth: int | None = None)

Compute canonical form of G.

By default, returns the canonicalized graph G_can. It can optionally also return the permutation, automorphisms, orbits, and an early-stop flag.

The algorithm:

Build the initial partition from node_attrs (or a single cell if none).
Repeatedly refine the partition by node signatures until it is stable.
If the partition is refined to singletons, build the label and update the best.
Otherwise, pick a non-singleton cell and branch (backtracking) until the best canonical label is found.

This implementation is primarily intended for small graphs (ego subgraphs).

compute_orbits(aut_perms: List[List[int]])

edge_attrs

graph_signature(G: Graph) → str

node_attrs

class synrfp.tokenizers.nauty.NautyTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, max_cache: int = 100000)

Bases: BaseTokenizer

Nauty-style canonical ego-subgraph tokenizer using pure NetworkX canonicalizer.

For each center node and each radius 0..r, the ego subgraph is canonicalized with respect to chosen node/edge attributes and the canonical signature is converted to an integer token via _h64().

Parameters:

node_attrs (list[str] | None) – list of node attribute keys to include in initial partitioning.
edge_attrs (list[str] | None) – list of edge attribute keys to include for edge distinctions.
max_cache (int) – maximum number of ego-canonicalizations cached (simple LRU via dict).

tokens_graph(G: Molecule, radius: int) → Counter

Tokenize a molecular graph by canonicalizing ego-subgraphs up to radius.

Parameters:

G (Molecule) – Molecular graph.
radius (int) – maximum radius (inclusive) for ego-subgraphs.

Returns:

Counter mapping canonical integer tokens to counts.

Return type:

collections.Counter

synrfp.tokenizers.utils.atom_label_tuple(G: Molecule, v: int, node_attrs: List[str]) → Tuple

Build node label tuple from selected attributes and degree.

Parameters:

G (Molecule) – Molecule with node data.
v (NodeId) – Node id.
node_attrs (List[str]) – Attribute keys to include.

Returns:

Label tuple.

Return type:

Tuple

synrfp.tokenizers.utils.batch_h64(items: Iterable[Any], *, seed: int = 0) → List[int]

Hash a sequence deterministically.

Parameters:

items (Iterable[Any]) – Objects to hash.
seed (int) – Optional integer seed.

Returns:

List of 64-bit ints.

Return type:

List[int]
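Deterministic hashing matters here because Python's built-in hash() is salted per process for strings, so it cannot produce tokens that are stable across runs. A minimal sketch of a seeded 64-bit hash follows; h64_sketch and batch_h64_sketch are hypothetical names, and hashing repr() with BLAKE2b is an assumption about one reasonable approach, not the library's actual scheme.

```python
import hashlib
from typing import Any, Iterable, List


def h64_sketch(item: Any, seed: int = 0) -> int:
    """Hash repr((seed, item)) to an unsigned 64-bit integer."""
    payload = repr((seed, item)).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def batch_h64_sketch(items: Iterable[Any], *, seed: int = 0) -> List[int]:
    """Hash every item with the same seed, preserving input order."""
    return [h64_sketch(item, seed) for item in items]


label_tuples = [("C", 4), ("O", 2)]
hashes = batch_h64_sketch(label_tuples, seed=1)
```

This approach is only stable for objects whose repr() is itself stable, such as tuples of ints and strings.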

synrfp.tokenizers.utils.bond_label_tuple(G: Molecule, u: int, v: int, edge_attrs: List[str]) → Tuple

Build edge label tuple from selected attributes.

Parameters:

G (Molecule) – Molecule with edge data.
u (NodeId) – First node.
v (NodeId) – Second node.
edge_attrs (List[str]) – Edge attribute keys to include.

Returns:

Label tuple.

Return type:

Tuple

Key classes and helpers

Typical responsibilities include:

defining the abstract BaseTokenizer interface,
providing concrete implementations such as WLTokenizer and NautyTokenizer,
specifying which node/edge attributes participate in labels, and
offering hashing helpers used to turn label tuples into stable 64-bit tokens.

Sketchers: `synrfp.sketchers`

Sketchers compress (signed) token multisets into fixed-size fingerprints. They implement the Δ/U → sketch stage.

class synrfp.sketchers.base.BaseSketch(seed: int = 1)

Bases: ABC

Abstract base class for set / multiset sketchers.

Subclasses must implement build() and may override describe().

Parameters:: seed (int) – Non-negative integer seed for reproducibility.
Raises:: ValueError – If seed is negative or not an integer.

Example

>>> from collections import Counter
>>> from synrfp.sketchers.base import BaseSketch
>>> class Dummy(BaseSketch):
...     def build(self, support): return Counter(support)
...
>>> sk = Dummy(seed=1)
>>> C = sk.build([1, 2, 2, 3]); C[2]
2

abstractmethod build(support: Iterable[int]) → Any

Build a sketch from an unweighted iterable of integer tokens.

Parameters:: support (Iterable[int]) – Iterable of integer-encoded features (can repeat).
Returns:: Sketch object (type depends on subclass).
Return type:: Any

describe() → str: Return a short usage example.

class synrfp.sketchers.base.WeightedSketch(m: int = 256, seed: int = 0, normalize: bool = True)

Bases: ABC

Abstract base for weighted (signed) sketchers.

Utilities provided:

input validation for pos/neg sparse multisets,
deterministic sparse→dense conversion,
signed / two-channel (pos,neg) representations,
exact reference similarities (weighted-Jaccard, cosine),
fluent config for normalization and dtype.

Concrete subclasses must implement build().

Parameters:

m (int) – (Optional) number of sketch samples (backend may use it).
seed (int) – RNG seed for deterministic behavior.
normalize (bool) – If True, helpers can L1-normalize outputs.

Raises:

ValueError – If parameters are invalid.

Example

>>> from synrfp.sketchers.base import WeightedSketch
>>> class Echo(WeightedSketch):
...     def build(self, pos, neg): return self.dicts_to_dense(pos, neg)[0]
...
>>> es = Echo(m=4, seed=0)
>>> vec, _ = es.dicts_to_dense({1:2},{2:1})
>>> vec.sum() != 0
True

abstractmethod build(pos: Mapping[int, int], neg: Mapping[int, int]) → Any

Build a sketch for the signed multiset (pos - neg).

Parameters:

pos (Mapping[int, int]) – Mapping token -> non-negative count.
neg (Mapping[int, int]) – Mapping token -> non-negative count.

Returns:

Implementation-defined sketch object.

Return type:

Any

static cosine_similarity(vec_a: ndarray, vec_b: ndarray) → float

Cosine similarity (safe with zeros).

Parameters:

vec_a (numpy.ndarray) – Vector A.
vec_b (numpy.ndarray) – Vector B.

Returns:

Cosine in [-1,1].

Return type:

float
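A zero-safe cosine can be sketched as below. Returning 0.0 when either vector has zero norm is an assumption about what "safe with zeros" means, and plain Python sequences are used instead of NumPy arrays to keep the example dependency-free.

```python
import math
from typing import Sequence


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: avoid division by zero by returning 0.0
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


cos_same = cosine_similarity([3, 4], [3, 4])
cos_opposite = cosine_similarity([1, -1], [-1, 1])
cos_zero = cosine_similarity([0, 0], [1, 2])
```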

describe() → str

Short usage snippet for subclasses.

Returns:: Example text.
Return type:: str

dicts_to_dense(pos: Mapping[int, int], neg: Mapping[int, int], index_map: Dict[int, int] | None = None, *, ensure_signed: bool = True) → Tuple[ndarray, Dict[int, int]]

Convert sparse pos/neg dicts into a dense array.

Parameters:

pos (Mapping[int,int]) – Positive counts.
neg (Mapping[int,int]) – Negative counts.
index_map (Optional[Dict[int,int]]) – Optional precomputed token->index map.
ensure_signed (bool) – If True return (n,) signed array (pos-neg); else return (2,n) with [pos, neg] channels.

Returns:

(array, index_map).

Return type:

Tuple[numpy.ndarray, Dict[int,int]]
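The sparse-to-dense conversion can be pictured with the following sketch. It assumes the index map assigns positions in sorted token order and skips normalization and dtype handling; the function is a standalone illustration using lists, not the method itself.

```python
from typing import Dict, List, Mapping, Optional, Tuple


def dicts_to_dense_sketch(
    pos: Mapping[int, int],
    neg: Mapping[int, int],
    index_map: Optional[Dict[int, int]] = None,
    *,
    ensure_signed: bool = True,
) -> Tuple[list, Dict[int, int]]:
    """Return a signed (n,) list or a two-channel [pos, neg] pair of lists."""
    if index_map is None:
        # Assumption: deterministic positions from the sorted token keys.
        tokens = sorted(set(pos) | set(neg))
        index_map = {t: i for i, t in enumerate(tokens)}
    n = len(index_map)
    pos_vec: List[float] = [0.0] * n
    neg_vec: List[float] = [0.0] * n
    for token, count in pos.items():
        pos_vec[index_map[token]] += count
    for token, count in neg.items():
        neg_vec[index_map[token]] += count
    if ensure_signed:
        return [p - q for p, q in zip(pos_vec, neg_vec)], index_map
    return [pos_vec, neg_vec], index_map


signed, imap = dicts_to_dense_sketch({5: 2, 9: 1}, {7: 3})
channels, _ = dicts_to_dense_sketch({5: 2, 9: 1}, {7: 3}, ensure_signed=False)
```

Passing the returned index map to a second call keeps two fingerprints aligned on the same token positions, which is what makes exact similarity comparisons meaningful.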

exact_similarities_from_dicts(pos_a: Mapping[int, int], neg_a: Mapping[int, int], pos_b: Mapping[int, int], neg_b: Mapping[int, int], *, index_map: Dict[int, int] | None = None) → Dict[str, float]

Compute exact similarities for two signed multisets.

Parameters:

pos_a (Mapping[int,int]) – Positive counts for A.
neg_a (Mapping[int,int]) – Negative counts for A.
pos_b (Mapping[int,int]) – Positive counts for B.
neg_b (Mapping[int,int]) – Negative counts for B.
index_map (Optional[Dict[int,int]]) – Optional shared index map.

Returns:

{"weighted_jaccard": float, "cosine": float}

Return type:

Dict[str,float]

last_index_map() → Dict[int, int] | None

Return the last token→index map computed by dicts_to_dense().

Returns:: Shallow copy of index map or None.
Return type:: Optional[Dict[int,int]]

set_dtype(dtype: dtype) → WeightedSketch

Configure dtype for dense arrays.

Parameters:: dtype (numpy.dtype) – NumPy dtype (e.g., np.float32, np.float64).
Returns:: self
Return type:: WeightedSketch

set_normalize(normalize: bool) → WeightedSketch

Set whether helpers produce L1-normalized arrays.

Parameters:: normalize (bool) – True to enable L1 normalization.
Returns:: self
Return type:: WeightedSketch

static signed_to_pos_neg_arrays(vec: ndarray) → Tuple[ndarray, ndarray]

Split a signed vector into non-negative positive/negative arrays.

Parameters:: vec (numpy.ndarray) – Signed vector.
Returns:: (pos, neg) arrays, both >= 0.
Return type:: Tuple[numpy.ndarray, numpy.ndarray]

validate_pos_neg(pos: Mapping[int, int], neg: Mapping[int, int]) → None

Validate pos/neg dictionaries.

Parameters:

pos (Mapping[int,int]) – Positive token counts.
neg (Mapping[int,int]) – Negative token counts.

Raises:

TypeError – If types invalid.
ValueError – If keys not int or counts negative.

static weighted_jaccard_signed(vec_a: ndarray, vec_b: ndarray) → float

Weighted-Jaccard for signed vectors via two-channel decomposition.

Parameters:

vec_a (numpy.ndarray) – Signed vector A.
vec_b (numpy.ndarray) – Signed vector B.

Returns:

Similarity in [0,1].

Return type:

float
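The two-channel decomposition can be illustrated as follows: each signed vector is split into non-negative positive and negative parts, the parts are concatenated, and the usual sum-of-minima over sum-of-maxima ratio is applied. This is a sketch on plain lists; returning 0.0 when both vectors are all zeros is an assumption.

```python
from typing import List, Sequence, Tuple


def split_signed(vec: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a signed vector into non-negative positive and negative parts."""
    pos = [max(x, 0.0) for x in vec]
    neg = [max(-x, 0.0) for x in vec]
    return pos, neg


def weighted_jaccard_signed(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Weighted Jaccard on the concatenated [pos, neg] channels."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    pos_a, neg_a = split_signed(vec_a)
    pos_b, neg_b = split_signed(vec_b)
    chan_a = pos_a + neg_a
    chan_b = pos_b + neg_b
    num = sum(min(x, y) for x, y in zip(chan_a, chan_b))
    den = sum(max(x, y) for x, y in zip(chan_a, chan_b))
    return num / den if den > 0 else 0.0  # assumption for the all-zero case


wj = weighted_jaccard_signed([2, -1, 0], [1, -1, 3])  # minima sum to 2, maxima to 6
wj_opposite = weighted_jaccard_signed([1], [-1])      # opposite signs never overlap
```

Because positive and negative weights live in separate channels, a token gained in one reaction and lost in another contributes nothing to the numerator.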

class synrfp.sketchers.parity_fold.ParityFold(bits: int = 2048, seed: int = 0)

Bases: BaseSketch

Parity-based folding sketcher (unweighted tokens → binary bit vector).

Each token t is mapped to a bit index via a deterministic 64-bit hash and the sketch’s seed, then the bit is toggled (XOR). If a token appears an even number of times, its contribution cancels out; odd multiplicities flip the corresponding bit.

The result is a compact binary fingerprint useful for fast similarity via Hamming distance or Tanimoto over bits.

Parameters:

bits (int) – Length of the binary sketch (number of bits).
seed (int) – Non-negative integer seed for the internal hash mapping.

Raises:

ValueError – If bits is not positive or seed is negative.

build(support: Iterable[int]) → ndarray

Build a parity-folded binary sketch from an unweighted token stream.

Internally, this:

Collapses support to counts via collections.Counter.
Retains only tokens with odd multiplicity (parity 1).
Maps each such token t to an index idx = _h64(('pf', t), seed=seed) % bits.
Sets the corresponding bits, yielding a 0/1 vector.

Parameters:: support (Iterable[int]) – Iterable of integer tokens.
Returns:: Binary sketch of length bits (dtype uint8).
Return type:: numpy.ndarray
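The steps above can be sketched in pure Python. Two points are assumptions: h64 is a hashlib-based stand-in for the undocumented _h64, and bits are toggled with XOR (following the class description), so two odd-multiplicity tokens that collide on the same index cancel each other.

```python
import hashlib
from collections import Counter
from typing import Any, Iterable, List


def h64(obj: Any, seed: int = 0) -> int:
    """Stable seeded 64-bit hash; a stand-in, not synrfp's _h64."""
    payload = repr((seed, obj)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


def parity_fold(support: Iterable[int], bits: int = 1024, seed: int = 0) -> List[int]:
    """Fold a token stream into a 0/1 list of length `bits` by parity."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    sketch = [0] * bits
    for token, count in Counter(support).items():
        if count % 2 == 1:  # even multiplicities cancel out
            idx = h64(("pf", token), seed) % bits
            sketch[idx] ^= 1  # toggle the bit
    return sketch


fp_cancelled = parity_fold([5, 5], bits=16)     # token 5 appears twice
fp_single = parity_fold([7], bits=16)           # exactly one bit is set
fp_mixed = parity_fold([1, 2, 2, 3], bits=16)   # same as folding [1, 3]
```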

static describe() → str

Return a brief usage example for ParityFold.

Returns:: Example code snippet.
Return type:: str

class synrfp.sketchers.minhash_sketch.MinHashSketch(m: int = 256, seed: int = 0, use_datasketch: bool = True)

Bases: BaseSketch

Set-based MinHash sketch for approximating Jaccard similarity.

This sketch treats the input as a set of tokens (multiplicities are ignored). It computes a fixed-length signature such that the fraction of shared components approximates the Jaccard index between two sets.

If datasketch is available and use_datasketch is True, the implementation delegates to datasketch.MinHash. Otherwise a deterministic fallback based on repeated 64-bit hashing is used.

Parameters:

m (int) – Number of hash permutations (length of the sketch).
seed (int) – Non-negative integer seed for all hash permutations.
use_datasketch (bool) – Whether to use datasketch if available.

Raises:

ValueError – If m is not positive or seed is negative.

build(support: Iterable[int]) → List[int]

Build a MinHash signature from an unweighted token stream.

Multiplicities in support are ignored; only distinct tokens contribute to the sketch.

Parameters:: support (Iterable[int]) – Iterable of integer tokens.
Returns:: MinHash signature as a list of length m.
Return type:: list[int]
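A fallback based on repeated 64-bit hashing could look like the sketch below: for each of the m positions, hash every distinct token together with the position index and keep the minimum. The hash construction and the sentinel returned for an empty input are assumptions; neither the datasketch path nor the library's own fallback is reproduced here.

```python
import hashlib
from typing import Any, Iterable, List


def h64(obj: Any, seed: int = 0) -> int:
    """Stable seeded 64-bit hash; a stand-in, not synrfp's _h64."""
    payload = repr((seed, obj)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")


def minhash_signature(support: Iterable[int], m: int = 256, seed: int = 0) -> List[int]:
    """Length-m MinHash signature of the distinct tokens in `support`."""
    if m <= 0:
        raise ValueError("m must be positive")
    distinct = set(support)  # multiplicities are ignored
    if not distinct:
        return [2 ** 64 - 1] * m  # assumption: sentinel for the empty set
    # Position i acts as an independent hash function by salting with i.
    return [min(h64((i, token), seed) for token in distinct) for i in range(m)]


sig_a = minhash_signature([1, 2, 2, 3], m=8)
sig_b = minhash_signature([3, 2, 1], m=8)
```

The cost is O(m · n) hash evaluations for n distinct tokens, which is the price of avoiding an external dependency.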

static describe() → str

Return a brief usage example for MinHashSketch.

Returns:: Example code snippet.
Return type:: str

class synrfp.sketchers.cw_sketch.CWSketch(m: int = 256, seed: int = 0, normalize: bool = True)

Bases: WeightedSketch

Consistent Weighted Sampling (CWS) sketch for weighted Jaccard.

This sketch operates on signed sparse multisets (pos, neg). It converts them into a signed dense vector via dicts_to_dense(), splits into separate non-negative positive/negative channels, concatenates them, and applies Consistent Weighted Sampling.

If datasketch is available, it delegates to datasketch.WeightedMinHashGenerator. Otherwise it uses a deterministic ICWS-like fallback implementation.

Parameters:

m (int) – Number of CWS samples (length of the sketch).
seed (int) – Random seed for the sampler.
normalize (bool) – If True, dense helpers L1-normalize the signed vector prior to splitting. Scaling does not change the weighted Jaccard but can improve numerical stability.

Raises:

ValueError – If arguments are invalid.

build(pos: Mapping[int, int], neg: Mapping[int, int]) → ndarray

Build a length-m CWS hash signature for a signed multiset.

Internally this:

Uses dicts_to_dense() with ensure_signed=True to obtain a 1D signed dense vector.
Splits it into non-negative positive/negative arrays using WeightedSketch.signed_to_pos_neg_arrays().
Concatenates these into a non-negative weight vector.
Applies either datasketch or a deterministic fallback to draw m CWS samples.

Parameters:

pos (Mapping[int,int]) – Positive token counts.
neg (Mapping[int,int]) – Negative token counts.

Returns:

Array of sampled indices (hash values) of length m.

Return type:

numpy.ndarray

static describe() → str

Return a brief usage example for CWSketch.

Returns:: Example code snippet.
Return type:: str

Key classes

Typical responsibilities include:

defining unweighted and weighted sketcher interfaces (BaseSketch, WeightedSketch),
implementing binary parity-fold sketches (ParityFold),
implementing MinHash-based sketches for Jaccard estimation (MinHashSketch), and
implementing consistent weighted sampling for signed deltas (CWSketch).

Batch encoding: `synrfp.encoder`

The synrfp.encoder module provides a convenience wrapper for batch encoding lists of reaction SMILES into SynRFP fingerprints, with optional parallelization via joblib.

Key class

Typical responsibilities of SynRFPEncoder include:

turning a list of RSMI strings into a 2D NumPy array of fingerprints,
exposing the same configuration knobs as synrfp.synrfp() (tokenizer, radius, sketch type, bit length, seed), and
handling multi-process or multi-threaded execution transparently when n_jobs > 1.
