API Reference
This page provides an overview of the public Python API exposed by
synrfp. It focuses on:
- core fingerprint builders in synrfp,
- graph containers and reaction helpers in synrfp.graph,
- tokenization backends in synrfp.tokenizers,
- sketching backends in synrfp.sketchers, and
- batch encoding utilities in synrfp.encoder.
If you are looking for high-level usage examples, see the README on GitHub or the Getting Started page.
Top-level API: synrfp
The synrfp module collects the main user-facing entry points:
fingerprint engines, convenience wrappers, and similarity utilities.
- class synrfp.BatchEncoder(tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, *, n_jobs: int = 1, batch_size: int | None = None, verbose: int | None = None, backend: str = 'loky')
Bases: object
Batch encoder for mapping reaction SMILES (RSMI) to SynRFP fingerprints.
The encoder wraps the top-level synrfp function and provides:
- configuration stored on the instance (tokenizer/sketch/seed/etc.),
- single-item and batched encoding,
- optional parallelization via joblib.Parallel,
- optional chunked processing via batch_size for memory control.
- Parameters:
tokenizer (str) – Tokenizer name (‘wl’, ‘nauty’, ‘morgan’, ‘path’, …).
radius (int) – Neighborhood / iteration radius for tokenizers that use it.
sketch (str) – Sketcher name (‘parity’, ‘minhash’, ‘cw’, ‘srp’, …).
bits (int) – Length of the final bit-vector for bit-based sketches.
m (int) – Sketch size parameter (e.g. number of hash samples/projections).
seed (int) – RNG seed forwarded to tokenizers/sketchers.
mode (str) – Token aggregation mode: ‘delta’ or ‘union’.
node_attrs (Sequence[str] | None) – Optional node attribute names passed to tokenizers.
edge_attrs (Sequence[str] | None) – Optional edge attribute names passed to tokenizers.
n_jobs (int) – Number of jobs for parallel encoding (1 = serial).
batch_size (int | None) – Maximum number of reactions per batch. If None, no chunking.
verbose (int | None) – Verbosity forwarded to joblib (int). If None, 0 is used.
backend (str) – joblib backend (e.g. ‘loky’, ‘threading’).
- Example:
>>> from synrfp.encode.batch import BatchEncoder
>>> enc = BatchEncoder(tokenizer='wl', sketch='parity', bits=1024, n_jobs=1)
>>> single_fp = enc.encode_one('CCO>>CCO')  # numpy array shape (1024,)
>>> many = enc.encode_many(['CCO>>CCO', 'CCO>>CCO'])  # shape (2, 1024)
A small-batch example (useful for constrained memory / large inputs):
>>> enc = BatchEncoder(batch_size=128, n_jobs=2)
>>> X = enc.encode_many(list_of_rsmi)  # processes list_of_rsmi in chunks of 128
- classmethod encode(rsmi_list: Sequence[str], *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, batch_size: int | None = None) ndarray
Convenience classmethod to encode many reactions (serial execution).
- Parameters:
rsmi_list (Sequence[str]) – Sequence of reaction SMILES.
tokenizer (str) – Tokenizer name.
radius (int) – Neighborhood radius.
sketch (str) – Sketcher name.
bits (int) – Fingerprint length for bit-based sketches.
m (int) – Sketch size parameter.
seed (int) – RNG seed.
mode (str) – Token aggregation mode: ‘delta’ or ‘union’.
node_attrs (Sequence[str] | None) – Node attribute names for tokenizer.
edge_attrs (Sequence[str] | None) – Edge attribute names for tokenizer.
batch_size (int | None) – Maximum chunk size for encode_many.
- Returns:
2D numpy array of fingerprints.
- Return type:
numpy.ndarray
- Example:
>>> BatchEncoder.encode(['CCO>>CCO'], tokenizer='wl', sketch='parity', bits=64)
array([[...]], dtype=int)
- encode_many(rsmi_list: Sequence[str]) ndarray
Encode a sequence of reaction SMILES into a 2D numpy array.
If batch_size is set and smaller than the input length, encoding proceeds in chunks of at most batch_size and results are concatenated.
- Parameters:
rsmi_list (Sequence[str]) – Sequence of reaction SMILES strings.
- Returns:
2D numpy array with shape (N, L), where N = len(rsmi_list) and L = fingerprint length.
- Return type:
numpy.ndarray
- Raises:
ValueError – If fingerprint lengths are inconsistent across batches.
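The chunking rule can be pictured with a short stand-alone sketch. It does not call synrfp: `toy_encode` and `encode_in_chunks` are hypothetical helpers, and plain lists stand in for NumPy arrays.

```python
from typing import Callable, List, Optional, Sequence


def toy_encode(rsmi: str, bits: int = 8) -> List[int]:
    """Hypothetical stand-in encoder: sets one bit per character code."""
    fp = [0] * bits
    for ch in rsmi:
        fp[ord(ch) % bits] = 1
    return fp


def encode_in_chunks(
    rsmi_list: Sequence[str],
    encode_one: Callable[[str], List[int]],
    batch_size: Optional[int] = None,
) -> List[List[int]]:
    """Encode items chunk by chunk and concatenate the resulting rows."""
    n = len(rsmi_list)
    # No chunking when batch_size is unset or not smaller than the input.
    step = n if not batch_size or batch_size >= n else batch_size
    rows: List[List[int]] = []
    for start in range(0, n, max(step, 1)):
        chunk = [encode_one(r) for r in rsmi_list[start:start + step]]
        rows.extend(chunk)
    # Mirror the documented ValueError on inconsistent lengths.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("inconsistent fingerprint lengths across batches")
    return rows


reactions = ["CCO>>CC=O", "CC>>CCO", "C>>CO"]
chunked = encode_in_chunks(reactions, toy_encode, batch_size=2)
unchunked = encode_in_chunks(reactions, toy_encode)
```

Under these assumptions chunking only bounds how many reactions are processed at once; the concatenated result is the same as the unchunked one.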
- encode_one(rsmi: str) ndarray
Encode a single reaction SMILES into a 1D numpy array.
- Parameters:
rsmi (str) – Reaction SMILES string.
- Returns:
1D numpy array of ints (length == bits).
- Return type:
numpy.ndarray
- class synrfp.SynRFP(tokenizer: BaseTokenizer, radius: int = 2, sketch: BaseSketch | None = None, weighted_sketch: WeightedSketch | None = None)
Bases: object
Build a SynRFP fingerprint for a single reaction:
- one reactant Molecule
- one product Molecule
Exactly one of sketch or weighted_sketch must be provided.
- Parameters:
tokenizer (BaseTokenizer) – Tokenizer instance (e.g. WLTokenizer, NautyTokenizer, MorganTokenizer, PathTokenizer).
radius (int) – Neighborhood radius for the tokenizer.
sketch (BaseSketch | None) – Unweighted sketcher (e.g. ParityFold, MinHashSketch).
weighted_sketch (WeightedSketch | None) – Weighted sketcher (e.g. CWSketch, SRPSketch).
- static describe() str
Example usage:
>>> fp = SynRFP(tokenizer=WLTokenizer(), radius=2, sketch=ParityFold(1024))
>>> res = fp.fingerprint(reactant_G, product_G)
- Returns:
Example usage string.
- Return type:
str
- fingerprint(reactant: Molecule, product: Molecule, *, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) SynRFPResult
Compute the reaction fingerprint for a pair of molecules.
- Parameters:
reactant (Molecule) – Reactant molecular graph.
product (Molecule) – Product molecular graph.
mode (str) – Token combination mode:
- 'delta': signed difference P−R (default)
- 'union': union counts R+P
node_attrs (Sequence[str] | None) – Optional node attribute names for tokenizer.
edge_attrs (Sequence[str] | None) – Optional edge attribute names for tokenizer.
- Returns:
A SynRFPResult with tokens, support and sketch.
- Return type:
SynRFPResult
- Raises:
TypeError – If inputs are not Molecule instances.
ValueError – If mode is invalid.
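The two combination modes can be illustrated with plain collections.Counter objects. This is a minimal sketch, not the library's code: `combine_tokens` and `support_of` are hypothetical names, and returning the support in sorted order is an assumption.

```python
from collections import Counter
from typing import List


def combine_tokens(tokens_r: Counter, tokens_p: Counter, mode: str = "delta") -> Counter:
    """Combine reactant/product token multisets according to `mode`."""
    if mode == "delta":
        combined = Counter(tokens_p)
        combined.subtract(tokens_r)  # signed P - R; negative counts are kept
    elif mode == "union":
        combined = Counter(tokens_r)
        combined.update(tokens_p)  # summed counts R + P
    else:
        raise ValueError(f"invalid mode: {mode!r}")
    # Drop tokens whose contribution cancels to zero.
    return Counter({t: c for t, c in combined.items() if c != 0})


def support_of(combined: Counter) -> List[int]:
    """Token keys with nonzero contribution (sorted order is an assumption)."""
    return sorted(combined)


tokens_r = Counter({11: 2, 22: 1})
tokens_p = Counter({22: 1, 33: 1})
delta = combine_tokens(tokens_r, tokens_p, "delta")  # token 22 cancels out
union = combine_tokens(tokens_r, tokens_p, "union")
```

In delta mode, tokens shared by both sides with equal counts vanish, so the support reflects only what the reaction changes.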
- class synrfp.SynRFPResult(tokens_R: Counter, tokens_P: Counter, delta: Counter, support: List[int], sketch: object, mode: str = 'delta')
Bases: object
Container for outputs of a single fingerprinting call.
- Parameters:
tokens_R (collections.Counter) – Token multiset for the reactant graph.
tokens_P (collections.Counter) – Token multiset for the product graph.
delta (collections.Counter) – Token counts summarising the transformation, depending on mode:
- if mode='delta': signed difference P−R
- if mode='union': union counts (R+P)
support (list[int]) – List of token keys with nonzero contribution (delta or union).
sketch (object) – Sketch object (bytes, list, or array) from the compressor.
mode (str) – Fingerprint mode, either 'delta' or 'union'.
- as_array() ndarray
Return the underlying sketch as a 1D numpy integer array.
- This works for all sketcher types:
ParityFold: 0/1 array
MinHashSketch: hash values
CWSketch: sample indices
SRPSketch: sign pattern (+1/-1)
- Returns:
1D numpy array representation of the sketch.
- Return type:
numpy.ndarray
- delta: Counter
- static describe() str
Example usage:
>>> # assume `res` is a SynRFPResult
>>> print(res)
SynRFPResult(tokens_R=10 tokens, tokens_P=8 tokens, support=3, mode='delta', sketch_type=bytearray)
- Returns:
Example usage string.
- Return type:
str
- mode: str = 'delta'
- sketch: object
- support: List[int]
- to_binary() List[int]
Return the sketch stored in this result as a plain list of 0/1 bits.
Only works for binary sketchers (e.g. ParityFold). For non-binary sketchers (MinHash, CWSketch, SRP) a TypeError is raised.
- Returns:
Binary fingerprint as list of 0/1 bits.
- Return type:
list[int]
- Raises:
TypeError – If the underlying sketch cannot be interpreted as bits.
- tokens_P: Counter
- tokens_R: Counter
- synrfp.build_graph_from_printout(nodes: Dict[int, Dict], edges: Dict[tuple[int, int], Dict]) Molecule
Helper to convert “printout” dicts directly into a Molecule.
- Parameters:
nodes (Dict[int, Dict]) – Mapping from node ID to attribute dict.
edges (Dict[tuple[int, int], Dict]) – Mapping from (u, v) edges (with u < v) to attribute dict.
- Returns:
A fresh Molecule instance.
- Return type:
Molecule
- Example:
>>> nodes = {0: {'element': 'C'}, 1: {'element': 'O'}}
>>> edges = {(0, 1): {'order': 1.5}}
>>> G = build_graph_from_printout(nodes, edges)
- synrfp.jaccard_minhash(h1: list | tuple, h2: list | tuple) float
Estimate Jaccard similarity from two MinHash signature arrays.
- Parameters:
h1 (list or tuple) – First MinHash hash‐value sequence.
h2 (list or tuple) – Second MinHash sequence (must be same length).
- Returns:
Fraction of positions where h1[i] == h2[i].
- Return type:
float
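The estimator described here amounts to counting matching positions. A minimal stand-alone sketch follows; `jaccard_from_signatures` is a hypothetical name, and both the length check and the 0.0 result for empty signatures are assumptions rather than documented behaviour.

```python
from typing import Sequence


def jaccard_from_signatures(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length signatures agree."""
    if len(h1) != len(h2):
        raise ValueError("signatures must have the same length")
    if not h1:
        return 0.0  # assumption: empty signatures carry no evidence
    matches = sum(1 for a, b in zip(h1, h2) if a == b)
    return matches / len(h1)


estimate = jaccard_from_signatures([1, 2, 3, 4], [1, 2, 0, 4])  # 3 of 4 positions agree
```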
- synrfp.synrfp(rsmi: str, *, tokenizer: str = 'wl', radius: int = 2, sketch: str = 'parity', bits: int = 1024, m: int = 256, seed: int = 1, mode: str = 'delta', node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None) List[int]
Convert a reaction SMILES (RSMI) into a binary fingerprint bit-vector.
Internally:
- A tokenizer (WL, Nauty, Morgan, Path) converts each side into multiset tokens.
- Depending on mode, either token delta (P−R) or union (R+P) is computed.
- A sketcher (parity, minhash, cw, srp) converts the token set/weights into a fixed-size sketch.
- The sketch is finally mapped into a binary vector of length bits.
Parameters
- rsmi : str
Reaction SMILES, e.g. "CCO>>C=C.O".
- tokenizer : str, default "wl"
Which tokenizer to use:
"wl": Weisfeiler–Lehman style tokenizer.
"nauty": Nauty-based canonical labeling tokenizer.
"morgan": Morgan/ECFP-style neighborhood tokenizer.
"path": simple path-based tokenizer.
- radius : int, default 2
Neighborhood radius for the tokenizer (ignored by tokenizers to which it does not apply).
- sketch : str, default "parity"
Which sketcher to use:
"parity": parity-folding into a binary vector.
"minhash": MinHash signature, then mapped to bits.
"cw": consistent weighted sampling (CWS) sketch, then mapped to bits.
"srp": signed random projection sketch (cosine-oriented).
- bits : int, default 1024
Length of the final binary fingerprint. For sketch="parity", this is the internal bit-length of ParityFold. For "minhash", "cw" and "srp", it controls the final bin count used by signature_to_bits().
- m : int, default 256
Number of hash samples/projections for MinHash, CWSketch, or SRP.
- seed : int, default 1
Random seed for reproducibility.
- mode : {"delta", "union"}, default "delta"
Token combination mode:
"delta": signed difference P−R.
"union": union counts R+P (tokens appearing on either side).
- node_attrs : Sequence[str] or None, optional
Node attribute names passed to the tokenizer (e.g. ["element"]).
- edge_attrs : Sequence[str] or None, optional
Edge attribute names passed to the tokenizer (e.g. ["order"]).
Returns
- list[int]
Fingerprint as a list of 0/1 bits of length bits.
Raises
- ValueError
On invalid tokenizer, sketch, or mode names.
- RuntimeError
If required dependencies (e.g. pynauty or datasketch) are missing.
- synrfp.tanimoto_bits(b1: bytearray | List[int] | ndarray, b2: bytearray | List[int] | ndarray) float
Compute the Tanimoto (Jaccard) similarity between two binary‐bit sketches.
Accepts bytearray, list[int], or numpy.ndarray of 0/1.
- Parameters:
b1 (bytearray or List[int] or numpy.ndarray) – First bit array.
b2 (bytearray or List[int] or numpy.ndarray) – Second bit array.
- Returns:
Intersection size divided by union size, or 0.0 if union is zero.
- Return type:
float
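For reference, the intersection-over-union rule can be written in a few lines of plain Python. This is a sketch under stated assumptions, not the library's implementation: `tanimoto` is a hypothetical name and the equal-length check is an assumption.

```python
from typing import Sequence


def tanimoto(b1: Sequence[int], b2: Sequence[int]) -> float:
    """Intersection size over union size for two 0/1 sequences."""
    if len(b1) != len(b2):
        raise ValueError("bit arrays must have the same length")
    intersection = sum(1 for x, y in zip(b1, b2) if x and y)
    union = sum(1 for x, y in zip(b1, b2) if x or y)
    # Documented convention: 0.0 when neither input has any bit set.
    return intersection / union if union else 0.0


score = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])  # 1 shared bit, 3 bits set overall
empty_score = tanimoto(bytearray(4), bytearray(4))
```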
Key classes and functions
Typical responsibilities of the top-level API include:
- converting reaction SMILES (RSMI) directly into fixed-length fingerprints,
- configuring a SynRFP engine with a tokenizer and sketcher,
- exposing similarity utilities for binary and MinHash sketches, and
- providing a simple batch encoder for lists of reaction SMILES.
Graph utilities: synrfp.graph
The synrfp.graph subpackage defines light-weight graph containers
and helpers for representing reactions.
- class synrfp.graph.graph_data.GraphData(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict], _adj: Dict[int, List[int]] | None = None)
Bases: object
Lightweight labeled graph container.
- Parameters:
nodes (Dict[NodeId, Dict]) – Mapping from node id to attribute dict (e.g., element, charge).
edges (Dict[Edge, Dict]) – Mapping from edge tuple (u, v) with u<v to attribute dict (e.g., order).
_adj (Optional[Dict[NodeId, List[NodeId]]]) – Internal adjacency cache, computed lazily.
- property adj: Dict[int, List[int]]
Lazily compute and cache adjacency list.
- Returns:
Mapping from node id to sorted neighbor list.
- Return type:
Dict[int, List[int]]
- degree(v: int) int
Get degree of node v.
- Parameters:
v (int) – Node identifier.
- Returns:
Degree count.
- Return type:
int
- edge_attr(u: int, v: int) Dict
Retrieve attribute dict for edge (u, v).
- Parameters:
u (int) – First node.
v (int) – Second node.
- Returns:
Edge attributes.
- Return type:
Dict
- Raises:
KeyError – If edge not present.
- edges: Dict[Tuple[int, int], Dict]
- static from_dicts(nodes: Dict[int, Dict], edges: Dict[Tuple[int, int], Dict]) GraphData
Construct GraphData ensuring edge keys are ordered (u < v).
- Parameters:
nodes (Dict[int, Dict]) – Node attribute mapping.
edges (Dict[Tuple[int, int], Dict]) – Edge attribute mapping.
- Returns:
Initialized GraphData.
- Return type:
GraphData
- static from_nx_graph(G: Graph) GraphData
Construct GraphData from a NetworkX Graph.
- Parameters:
G (nx.Graph) – NetworkX graph with node and edge attributes.
- Returns:
Initialized GraphData.
- Return type:
GraphData
- nodes: Dict[int, Dict]
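The container behaviour described above (ordered edge keys, a lazily cached adjacency list, degree and edge lookups) can be sketched without the library. `TinyGraph` is a hypothetical stand-in for illustration only; it is not GraphData and shares only the documented ideas.

```python
from typing import Dict, List, Optional, Tuple


class TinyGraph:
    """Hypothetical miniature of a labeled graph container."""

    def __init__(self, nodes: Dict[int, dict], edges: Dict[Tuple[int, int], dict]):
        self.nodes = nodes
        # Normalise edge keys so that u < v.
        self.edges = {(min(u, v), max(u, v)): attrs for (u, v), attrs in edges.items()}
        self._adj: Optional[Dict[int, List[int]]] = None

    @property
    def adj(self) -> Dict[int, List[int]]:
        """Build the adjacency list on first access, then reuse it."""
        if self._adj is None:
            adj: Dict[int, List[int]] = {v: [] for v in self.nodes}
            for u, v in self.edges:
                adj[u].append(v)
                adj[v].append(u)
            self._adj = {v: sorted(neighbors) for v, neighbors in adj.items()}
        return self._adj

    def degree(self, v: int) -> int:
        return len(self.adj[v])

    def edge_attr(self, u: int, v: int) -> dict:
        # Raises KeyError if the edge is absent.
        return self.edges[(min(u, v), max(u, v))]


g = TinyGraph(
    {0: {"element": "C"}, 1: {"element": "C"}, 2: {"element": "O"}},
    {(1, 0): {"order": 1}, (1, 2): {"order": 1}},
)
```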
- class synrfp.graph.reaction.Reaction(reactant: Molecule, product: Molecule)
Bases: object
Represents a chemical reaction with a single reactant and a single product graph.
- Parameters:
reactant (Molecule) – Molecular graph of the reactant.
product (Molecule) – Molecular graph of the product.
- static from_graph(reactant_graph: Graph, product_graph: Graph) Reaction
Create a Reaction from two NetworkX graphs.
- Parameters:
reactant_graph (nx.Graph) – NetworkX Graph for reactant.
product_graph (nx.Graph) – NetworkX Graph for product.
- Returns:
Reaction instance.
- Return type:
Reaction
- static from_rsmi(rsmi: str) Reaction
Create a Reaction from an RSMI string using synkit IO.
- Parameters:
rsmi (str) – Reaction SMILES string.
- Returns:
Reaction with reactant and product Molecule.
- Return type:
Reaction
- Raises:
ValueError – If parsing fails.
- help() str
Show usage examples for Reaction.
- Returns:
Usage guide.
- Return type:
str
- product: Molecule
- reactant: Molecule
- to_dataframe() DataFrame
Summarize reaction graphs as a pandas DataFrame.
- Returns:
DataFrame with columns ['side', 'n_nodes', 'n_edges'].
- Return type:
pd.DataFrame
Key classes
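Typical responsibilities include:
- storing labeled molecular graphs with a lazily cached adjacency list (GraphData), and
- pairing reactant and product graphs into a Reaction, built from NetworkX graphs or from reaction SMILES.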
Tokenizers: synrfp.tokenizers
Tokenizers map molecular graphs to multisets of integer tokens (e.g. WL subtree hashes). They are responsible for the graph → tokens stage.
- class synrfp.tokenizers.base.BaseTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)
Bases: ABC
Abstract base for graph tokenizers (graph → multiset of integer tokens).
- Parameters:
node_attrs (Optional[Sequence[str]]) – Node attribute keys to include in labels.
edge_attrs (Optional[Sequence[str]]) – Edge attribute keys to include in labels.
Example
>>> class Dummy(BaseTokenizer):
...     def tokens_graph(self, G, radius): return Counter({0: len(G.nodes)})
...
- static describe() str
Return a generic usage example for tokenizers.
- Returns:
Example code snippet.
- Return type:
str
- abstractmethod tokens_graph(G: Molecule, radius: int) Counter
Generate tokens for a single Molecule instance.
- Parameters:
G (Molecule) – Molecule instance to tokenize.
radius (int) – Non-negative neighborhood radius.
- Returns:
Multiset of tokens (hashed neighborhood labels).
- Return type:
Counter
- tokens_side(graphs: Sequence[Molecule], radius: int) Counter
Generate tokens across multiple graphs (e.g., reaction sides).
- Parameters:
graphs (Sequence[Molecule]) – Sequence of Molecule objects.
radius (int) – Non-negative neighborhood radius.
- Returns:
Combined multiset of tokens for all graphs.
- Return type:
Counter
- class synrfp.tokenizers.wl.WLTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, use_nx: bool = False, require_networkx: bool = False)
Bases: BaseTokenizer
Weisfeiler–Lehman subtree tokenizer (edge-aware; k=0..r tokens).
Node labels: selected node attrs + degree. Edge labels: selected bond attrs (e.g., order, aromaticity).
The tokenizer supports two backends:
- A lightweight in-house WL refinement loop (default).
- An optional NetworkX-based WL implementation (networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes()), enabled via use_nx=True.
- Example:
>>> tok = WLTokenizer(node_attrs=['element'], edge_attrs=['order'])
>>> isinstance(tok, WLTokenizer)
True
- static describe() str
Return a usage example for the WLTokenizer.
- Returns:
Example code snippet.
- Return type:
str
- tokens_graph(G: Molecule, radius: int) Counter
Tokenize a molecular graph via edge-aware WL subtree hashing.
The behaviour is controlled by the use_nx flag:
- If use_nx=True and NetworkX WL hashing is available, this method uses networkx.algorithms.graph_hashing.weisfeiler_lehman_subgraph_hashes() on a temporary NetworkX graph with precomputed atom/bond labels, and folds the resulting hex hashes into integer tokens using synrfp.tokenizers.utils._h64().
- Otherwise, a compact in-house WL implementation is used that performs the refinement directly on the Molecule object.
- Parameters:
G (Molecule) – Molecular graph to tokenize.
radius (int) – Number of WL iterations (k >= 0). k=0 returns only the initial atom-level subtree labels; higher values add increasingly larger neighbourhoods.
- Returns:
Counter mapping integer subtree-hash tokens to their multiplicities.
- Return type:
collections.Counter
- Raises:
ValueError – If radius is negative.
RuntimeError – If use_nx=True, require_networkx=True and NetworkX WL hashing is not available.
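To make the refinement loop concrete, here is a minimal edge-aware WL sketch on plain dicts. It is an illustration under assumptions, not the in-house implementation: `h64` is a hashlib-based stand-in for the library's hashing helper, `wl_tokens` is a hypothetical name, and the attribute keys "element" and "order" are chosen for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def h64(obj: object) -> int:
    """Stable 64-bit hash of repr(obj); a stand-in, not synrfp's _h64."""
    digest = hashlib.blake2b(repr(obj).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def wl_tokens(nodes: Dict[int, dict], edges: Dict[Tuple[int, int], dict], radius: int) -> Counter:
    """Collect subtree-hash tokens for iterations k = 0..radius."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    adj: Dict[int, List[Tuple[int, str]]] = {v: [] for v in nodes}
    for (u, v), attrs in edges.items():
        bond = repr(attrs.get("order"))
        adj[u].append((v, bond))
        adj[v].append((u, bond))
    # k = 0: initial label from the element and the degree.
    labels = {v: h64((nodes[v].get("element"), len(adj[v]))) for v in nodes}
    tokens = Counter(labels.values())
    for _ in range(radius):
        # Each new label hashes the old label plus sorted (bond, neighbour label) pairs.
        labels = {
            v: h64((labels[v], tuple(sorted((bond, labels[u]) for u, bond in adj[v]))))
            for v in nodes
        }
        tokens.update(labels.values())
    return tokens


ethanol_nodes = {0: {"element": "C"}, 1: {"element": "C"}, 2: {"element": "O"}}
ethanol_edges = {(0, 1): {"order": 1}, (1, 2): {"order": 1}}
tokens = wl_tokens(ethanol_nodes, ethanol_edges, radius=1)
```

Because labels depend only on attributes, degrees, and sorted neighbour labels, renumbering the atoms leaves the token multiset unchanged.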
- class synrfp.tokenizers.nauty.NautyCanonicalizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None)
Bases: object
Nauty-style canonicalizer implemented with NetworkX primitives.
This class computes a canonical labeling / signature for a NetworkX graph using equitable partition refinement + backtracking search similar to the nauty algorithm. It’s designed to be reasonably robust for small ego subgraphs (typical use-case: ego-subgraphs extracted around a node).
- Parameters:
node_attrs (list[str] | None) – node attribute keys used for initial partitioning/refinement.
edge_attrs (list[str] | None) – edge attribute keys used when including edge attributes in the canonical label.
- canonical_form(G: Graph, return_aut: bool = False, remap_aut: bool = False, return_orbits: bool = False, return_perm: bool = False, max_depth: int | None = None)
Compute canonical form of G.
By default, returns the canonicalized graph G_can. Optionally, it can also return the permutation, automorphisms, orbits, and an early-stop flag.
The algorithm:
- Build initial partition from node_attrs (or single cell if none)
- Repeatedly refine partition by node signatures until stable
- If partition refined to singletons, build label and update best
- Otherwise pick a non-singleton cell and branch (backtracking) until the best canonical label is found.
This implementation is primarily intended for small graphs (ego subgraphs).
- compute_orbits(aut_perms: List[List[int]])
- edge_attrs
- graph_signature(G: Graph) str
- node_attrs
- class synrfp.tokenizers.nauty.NautyTokenizer(node_attrs: Sequence[str] | None = None, edge_attrs: Sequence[str] | None = None, max_cache: int = 100000)
Bases: BaseTokenizer
Nauty-style canonical ego-subgraph tokenizer using a pure-NetworkX canonicalizer.
For each center node and each radius 0..r, the ego subgraph is canonicalized with respect to chosen node/edge attributes and the canonical signature is converted to an integer token via _h64().
- Parameters:
node_attrs (list[str] | None) – List of node attribute keys to include in initial partitioning.
edge_attrs (list[str] | None) – List of edge attribute keys to include for edge distinctions.
max_cache (int) – Maximum number of ego-canonicalizations cached (simple LRU via dict).
- tokens_graph(G: Molecule, radius: int) Counter
Tokenize a molecular graph by canonicalizing ego-subgraphs up to radius.
- Parameters:
G (Molecule) – Molecular graph.
radius (int) – maximum radius (inclusive) for ego-subgraphs.
- Returns:
Counter mapping canonical integer tokens to counts.
- Return type:
collections.Counter
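The "ego-subgraphs up to radius" step can be illustrated with a breadth-first search on a plain adjacency dict. This sketch covers only the node selection, not the canonicalization; `ego_nodes` is a hypothetical helper and does not appear in synrfp.

```python
from collections import deque
from typing import Dict, List


def ego_nodes(adj: Dict[int, List[int]], center: int, radius: int) -> List[int]:
    """Nodes within `radius` bonds of `center`, found by breadth-first search."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand beyond the requested radius
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return sorted(dist)


# A four-atom chain 0-1-2-3; one ego node set per (center, radius) pair.
path_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
egos = {(c, r): ego_nodes(path_adj, c, r) for c in path_adj for r in range(3)}
```

With radius r, each center contributes r + 1 ego subgraphs, which is why a cache of canonicalized egos can pay off on larger inputs.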
- synrfp.tokenizers.utils.atom_label_tuple(G: Molecule, v: int, node_attrs: List[str]) Tuple
Build node label tuple from selected attributes and degree.
- Parameters:
G (Molecule) – Molecule with node data.
v (NodeId) – Node id.
node_attrs (List[str]) – Attribute keys to include.
- Returns:
Label tuple.
- Return type:
Tuple
- synrfp.tokenizers.utils.batch_h64(items: Iterable[Any], *, seed: int = 0) List[int]
Hash a sequence deterministically.
- Parameters:
items (Iterable[Any]) – Objects to hash.
seed (int) – Optional integer seed.
- Returns:
List of 64-bit ints.
- Return type:
List[int]
- synrfp.tokenizers.utils.bond_label_tuple(G: Molecule, u: int, v: int, edge_attrs: List[str]) Tuple
Build edge label tuple from selected attributes.
- Parameters:
G (Molecule) – Molecule with edge data.
u (NodeId) – First node.
v (NodeId) – Second node.
edge_attrs (List[str]) – Edge attribute keys to include.
- Returns:
Label tuple.
- Return type:
Tuple
Key classes and helpers
Typical responsibilities include:
- defining the abstract BaseTokenizer interface,
- providing concrete implementations such as WLTokenizer and NautyTokenizer,
- specifying which node/edge attributes participate in labels, and
- offering hashing helpers used to turn label tuples into stable 64-bit tokens.
Sketchers: synrfp.sketchers
Sketchers compress (signed) token multisets into fixed-size fingerprints. They implement the Δ/U → sketch stage.
- class synrfp.sketchers.base.BaseSketch(seed: int = 1)
Bases: ABC
Abstract base class for set / multiset sketchers.
Subclasses must implement build() and may override describe().
- Parameters:
seed (int) – Non-negative integer seed for reproducibility.
- Raises:
ValueError – If seed is negative or not an integer.
Example
>>> class Dummy(BaseSketch):
...     def build(self, support): return Counter(support)
...
>>> sk = Dummy(seed=1)
>>> C = sk.build([1, 2, 2, 3]); C[2]
2
- abstractmethod build(support: Iterable[int]) Any
Build a sketch from an unweighted iterable of integer tokens.
- Parameters:
support (Iterable[int]) – Iterable of integer-encoded features (can repeat).
- Returns:
Sketch object (type depends on subclass).
- Return type:
Any
- describe() str
Return a short usage example.
- class synrfp.sketchers.base.WeightedSketch(m: int = 256, seed: int = 0, normalize: bool = True)
Bases: ABC
Abstract base for weighted (signed) sketchers.
- Utilities provided:
input validation for pos/neg sparse multisets,
deterministic sparse→dense conversion,
signed / two-channel (pos,neg) representations,
exact reference similarities (weighted-Jaccard, cosine),
fluent config for normalization and dtype.
Concrete subclasses must implement build().
- Parameters:
m (int) – (Optional) number of sketch samples (backend may use it).
seed (int) – RNG seed for deterministic behavior.
normalize (bool) – If True, helpers can L1-normalize outputs.
- Raises:
ValueError – If parameters are invalid.
Example
>>> class Echo(WeightedSketch):
...     def build(self, pos, neg): return self.dicts_to_dense(pos, neg)[0]
...
>>> es = Echo(m=4, seed=0)
>>> vec, _ = es.dicts_to_dense({1:2},{2:1})
>>> vec.sum() != 0
True
- abstractmethod build(pos: Mapping[int, int], neg: Mapping[int, int]) Any
Build a sketch for the signed multiset (pos - neg).
- Parameters:
pos (Mapping[int, int]) – Mapping token -> non-negative count.
neg (Mapping[int, int]) – Mapping token -> non-negative count.
- Returns:
Implementation-defined sketch object.
- Return type:
Any
- static cosine_similarity(vec_a: ndarray, vec_b: ndarray) float
Cosine similarity (safe with zeros).
- Parameters:
vec_a (numpy.ndarray) – Vector A.
vec_b (numpy.ndarray) – Vector B.
- Returns:
Cosine in [-1,1].
- Return type:
float
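A zero-safe cosine can be sketched in pure Python as follows. Returning 0.0 when either vector has zero norm is an assumption about what "safe with zeros" means, and `cosine_similarity` here is a stand-alone function on lists, not the static method itself.

```python
import math
from typing import Sequence


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, with a zero-norm guard."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


parallel = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```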
- describe() str
Short usage snippet for subclasses.
- Returns:
Example text.
- Return type:
str
- dicts_to_dense(pos: Mapping[int, int], neg: Mapping[int, int], index_map: Dict[int, int] | None = None, *, ensure_signed: bool = True) Tuple[ndarray, Dict[int, int]]
Convert sparse pos/neg dicts into a dense array.
- Parameters:
pos (Mapping[int,int]) – Positive counts.
neg (Mapping[int,int]) – Negative counts.
index_map (Optional[Dict[int,int]]) – Optional precomputed token->index map.
ensure_signed (bool) – If True return (n,) signed array (pos-neg); else return (2,n) with [pos, neg] channels.
- Returns:
(array, index_map).
- Return type:
Tuple[numpy.ndarray, Dict[int,int]]
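The sparse-to-dense conversion can be pictured with plain lists. This is a sketch under assumptions: the index map is built from sorted token keys (the documentation only says the conversion is deterministic), lists replace NumPy arrays, and normalization and dtype handling are left out.

```python
from typing import Dict, List, Mapping, Optional, Tuple


def dicts_to_dense(
    pos: Mapping[int, int],
    neg: Mapping[int, int],
    index_map: Optional[Dict[int, int]] = None,
    ensure_signed: bool = True,
) -> Tuple[list, Dict[int, int]]:
    """Turn sparse pos/neg counts into a dense signed or two-channel layout."""
    if index_map is None:
        # Assumption: sorted token order gives a deterministic column layout.
        keys = sorted(set(pos) | set(neg))
        index_map = {token: i for i, token in enumerate(keys)}
    n = len(index_map)
    pos_dense: List[int] = [0] * n
    neg_dense: List[int] = [0] * n
    for token, count in pos.items():
        pos_dense[index_map[token]] += count
    for token, count in neg.items():
        neg_dense[index_map[token]] += count
    if ensure_signed:
        return [p - q for p, q in zip(pos_dense, neg_dense)], index_map
    return [pos_dense, neg_dense], index_map


signed, imap = dicts_to_dense({1: 2}, {2: 1})
channels, _ = dicts_to_dense({1: 2}, {2: 1}, ensure_signed=False)
```

Passing the same index_map to two calls keeps their columns aligned, which is what makes the resulting vectors comparable.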
- exact_similarities_from_dicts(pos_a: Mapping[int, int], neg_a: Mapping[int, int], pos_b: Mapping[int, int], neg_b: Mapping[int, int], *, index_map: Dict[int, int] | None = None) Dict[str, float]
Compute exact similarities for two signed multisets.
- Parameters:
pos_a (Mapping[int,int]) – Positive counts for A.
neg_a (Mapping[int,int]) – Negative counts for A.
pos_b (Mapping[int,int]) – Positive counts for B.
neg_b (Mapping[int,int]) – Negative counts for B.
index_map (Optional[Dict[int,int]]) – Optional shared index map.
- Returns:
{“weighted_jaccard”: float, “cosine”: float}
- Return type:
Dict[str,float]
- last_index_map() Dict[int, int] | None
Return the last token→index map computed by dicts_to_dense().
- Returns:
Shallow copy of index map or None.
- Return type:
Optional[Dict[int,int]]
- set_dtype(dtype: dtype) WeightedSketch
Configure dtype for dense arrays.
- Parameters:
dtype (numpy.dtype) – NumPy dtype (e.g., np.float32, np.float64).
- Returns:
self
- Return type:
WeightedSketch
- set_normalize(normalize: bool) WeightedSketch
Set whether helpers produce L1-normalized arrays.
- Parameters:
normalize (bool) – True to enable L1 normalization.
- Returns:
self
- Return type:
WeightedSketch
- static signed_to_pos_neg_arrays(vec: ndarray) Tuple[ndarray, ndarray]
Split a signed vector into non-negative positive/negative arrays.
- Parameters:
vec (numpy.ndarray) – Signed vector.
- Returns:
(pos, neg) arrays, both >= 0.
- Return type:
Tuple[numpy.ndarray, numpy.ndarray]
- validate_pos_neg(pos: Mapping[int, int], neg: Mapping[int, int]) None
Validate pos/neg dictionaries.
- Parameters:
pos (Mapping[int,int]) – Positive token counts.
neg (Mapping[int,int]) – Negative token counts.
- Raises:
TypeError – If types invalid.
ValueError – If keys not int or counts negative.
- static weighted_jaccard_signed(vec_a: ndarray, vec_b: ndarray) float
Weighted-Jaccard for signed vectors via two-channel decomposition.
- Parameters:
vec_a (numpy.ndarray) – Signed vector A.
vec_b (numpy.ndarray) – Signed vector B.
- Returns:
Similarity in [0,1].
- Return type:
float
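One plausible reading of the two-channel decomposition is sketched below: each signed vector is split into non-negative positive and negative parts, the parts are concatenated, and the usual sum-of-minima over sum-of-maxima ratio is taken. The function names are stand-alone illustrations, and returning 0.0 for two all-zero vectors is an assumption.

```python
from typing import List, Sequence, Tuple


def signed_to_pos_neg(vec: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a signed vector into non-negative positive and negative parts."""
    pos = [max(x, 0) for x in vec]
    neg = [max(-x, 0) for x in vec]
    return pos, neg


def weighted_jaccard_signed(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Weighted Jaccard on the concatenated (pos, neg) channels."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    pos_a, neg_a = signed_to_pos_neg(vec_a)
    pos_b, neg_b = signed_to_pos_neg(vec_b)
    chan_a = pos_a + neg_a
    chan_b = pos_b + neg_b
    numerator = sum(min(x, y) for x, y in zip(chan_a, chan_b))
    denominator = sum(max(x, y) for x, y in zip(chan_a, chan_b))
    return numerator / denominator if denominator else 0.0


similarity = weighted_jaccard_signed([2, -1, 0], [1, -1, 3])  # minima sum to 2, maxima to 6
```

Keeping gains and losses in separate channels means a token gained in one reaction and lost in another contributes no overlap.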
- class synrfp.sketchers.parity_fold.ParityFold(bits: int = 2048, seed: int = 0)
Bases: BaseSketch
Parity-based folding sketcher (unweighted tokens → binary bit vector).
Each token t is mapped to a bit index via a deterministic 64-bit hash and the sketch’s seed, then the bit is toggled (XOR). If a token appears an even number of times, its contribution cancels out; odd multiplicities flip the corresponding bit.
The result is a compact binary fingerprint useful for fast similarity via Hamming distance or Tanimoto over bits.
- Parameters:
bits (int) – Length of the binary sketch (number of bits).
seed (int) – Non-negative integer seed for the internal hash mapping.
- Raises:
ValueError – If bits is not positive or seed is negative.
- build(support: Iterable[int]) ndarray
Build a parity-folded binary sketch from an unweighted token stream.
Internally, this:
- Collapses support to counts via collections.Counter.
- Retains only tokens with odd multiplicity (parity 1).
- Maps each such token t to an index idx = _h64(('pf', t), seed=seed) % bits.
- Sets the corresponding bits, yielding a 0/1 vector.
- Parameters:
support (Iterable[int]) – Iterable of integer tokens.
- Returns:
Binary sketch of length bits (dtype uint8).
- Return type:
numpy.ndarray
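The parity-folding steps can be sketched in pure Python. Assumptions are marked in the code: `h64` is a hashlib-based stand-in for the library's hash, the output is a list rather than a uint8 array, and colliding odd tokens toggle the bit (XOR) as in the class description.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def h64(obj: object, seed: int = 0) -> int:
    """Stable seeded 64-bit hash; a stand-in, not synrfp's _h64."""
    data = repr((seed, obj)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def parity_fold(support: Iterable[int], bits: int = 16, seed: int = 0) -> List[int]:
    """Fold tokens into a 0/1 vector; even multiplicities cancel out."""
    if bits <= 0 or seed < 0:
        raise ValueError("bits must be positive and seed non-negative")
    out = [0] * bits
    for token, count in Counter(support).items():
        if count % 2 == 1:  # keep only odd-parity tokens
            # Assumption: colliding tokens toggle the bit (XOR).
            out[h64(("pf", token), seed) % bits] ^= 1
    return out


cancelled = parity_fold([5, 5])       # token 5 appears twice and cancels
single = parity_fold([5, 5, 7])       # only token 7 survives
```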
- static describe() str
Return a brief usage example for ParityFold.
- Returns:
Example code snippet.
- Return type:
str
- class synrfp.sketchers.minhash_sketch.MinHashSketch(m: int = 256, seed: int = 0, use_datasketch: bool = True)
Bases: BaseSketch
Set-based MinHash sketch for approximating Jaccard similarity.
This sketch treats the input as a set of tokens (multiplicities are ignored). It computes a fixed-length signature such that the fraction of shared components approximates the Jaccard index between two sets.
If datasketch is available and use_datasketch is True, the implementation delegates to datasketch.MinHash. Otherwise a deterministic fallback based on repeated 64-bit hashing is used.
- Parameters:
m (int) – Number of hash permutations (length of the sketch).
seed (int) – Non-negative integer seed for all hash permutations.
use_datasketch (bool) – Whether to use datasketch if available.
- Raises:
ValueError – If m is not positive or seed is negative.
- build(support: Iterable[int]) List[int]
Build a MinHash signature from an unweighted token stream.
Multiplicities in support are ignored; only distinct tokens contribute to the sketch.
- Parameters:
support (Iterable[int]) – Iterable of integer tokens.
- Returns:
MinHash signature as a list of length m.
- Return type:
list[int]
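A fallback in the spirit of "repeated 64-bit hashing" might look like the sketch below. It is an assumption-laden illustration rather than the library's code: `h64` is a hashlib-based stand-in, slot i uses the hash of (seed, i, token), and an empty input yields the maximum 64-bit value in every slot.

```python
import hashlib
from typing import Iterable, List


def h64(obj: object) -> int:
    """Stable 64-bit hash of repr(obj); a stand-in, not synrfp's _h64."""
    digest = hashlib.blake2b(repr(obj).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(support: Iterable[int], m: int = 64, seed: int = 0) -> List[int]:
    """Per slot, keep the smallest hash over the distinct tokens."""
    if m <= 0 or seed < 0:
        raise ValueError("m must be positive and seed non-negative")
    distinct = set(support)  # multiplicities are ignored
    if not distinct:
        return [2**64 - 1] * m  # assumption: sentinel for an empty set
    return [min(h64((seed, i, token)) for token in distinct) for i in range(m)]


sig_a = minhash_signature([1, 2, 2, 3])
sig_b = minhash_signature([3, 2, 1, 1])   # same set, different order and counts
sig_c = minhash_signature([10, 11, 12])   # disjoint from the first set
matches_ac = sum(1 for a, c in zip(sig_a, sig_c) if a == c)
```

Comparing two signatures position by position then gives the Jaccard estimate described for jaccard_minhash.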
- static describe() str
Return a brief usage example for MinHashSketch.
- Returns:
Example code snippet.
- Return type:
str
- class synrfp.sketchers.cw_sketch.CWSketch(m: int = 256, seed: int = 0, normalize: bool = True)
Bases: WeightedSketch
Consistent Weighted Sampling (CWS) sketch for weighted Jaccard.
This sketch operates on signed sparse multisets (pos, neg). It converts them into a signed dense vector via dicts_to_dense(), splits into separate non-negative positive/negative channels, concatenates them, and applies Consistent Weighted Sampling.
If datasketch is available, it delegates to datasketch.WeightedMinHashGenerator. Otherwise it uses a deterministic ICWS-like fallback implementation.
- Parameters:
m (int) – Number of CWS samples (length of the sketch).
seed (int) – Random seed for the sampler.
normalize (bool) – If True, dense helpers L1-normalize the signed vector prior to splitting. Scaling does not change the weighted Jaccard but can improve numerical stability.
- Raises:
ValueError – If arguments are invalid.
- build(pos: Mapping[int, int], neg: Mapping[int, int]) ndarray
Build a length-m CWS hash signature for a signed multiset.
Internally this:
- Uses dicts_to_dense() with ensure_signed=True to obtain a 1D signed dense vector.
- Splits it into non-negative positive/negative arrays using WeightedSketch.signed_to_pos_neg_arrays().
- Concatenates these into a non-negative weight vector.
- Applies either datasketch or a deterministic fallback to draw m CWS samples.
- Parameters:
pos (Mapping[int,int]) – Positive token counts.
neg (Mapping[int,int]) – Negative token counts.
- Returns:
Array of sampled indices (hash values) of length m.
- Return type:
numpy.ndarray
Key classes
Typical responsibilities include:
- defining unweighted and weighted sketcher interfaces (BaseSketch, WeightedSketch),
- implementing binary parity-fold sketches (ParityFold),
- implementing MinHash-based sketches for Jaccard estimation (MinHashSketch), and
- implementing consistent weighted sampling for signed deltas (CWSketch).
Batch encoding: synrfp.encoder
The synrfp.encoder module provides a convenience wrapper for batch
encoding lists of reaction SMILES into SynRFP fingerprints, with optional
parallelization via joblib.
Key class
Typical responsibilities of SynRFPEncoder include:
- turning a list of RSMI strings into a 2D NumPy array of fingerprints,
- exposing the same configuration knobs as synrfp.synrfp() (tokenizer, radius, sketch type, bit length, seed), and
- handling multi-process or multi-threaded execution transparently when n_jobs > 1.