Tutorials and Examples
This page collects practical, copy-and-paste examples showing how to:
compute and inspect SynRFP fingerprints for single reactions,
explore tokenizer tokens and Δ (delta / support) statistics,
encode reaction lists in batch (CPU-friendly),
persist fingerprints to disk (NumPy / Parquet) for ML,
run a tiny supervised baseline (scikit-learn) using SynRFP bits,
compare sketchers (ParityFold, MinHash, CWSketch) and compute similarities.
If you have not installed SynRFP yet, see Getting Started.
Single reaction — token + fingerprint
This example shows the common, low-level flow: parse reaction SMILES, create graph objects, tokenize, aggregate Δ (net change), and sketch.
from synrfp.graph.reaction import Reaction
from synrfp.tokenizers.wl import WLTokenizer
from synrfp.sketchers.parity_fold import ParityFold
from synrfp.synrfp import SynRFP
# 1) parse RSMI -> GraphData pair (reactant_graph, product_graph)
reactant_G, product_G = Reaction.from_rsmi("CCO>>C=C.O")
# 2) instantiate tokenizer + sketcher
tokenizer = WLTokenizer() # WL-style subtree tokens
sketcher = ParityFold(bits=1024, seed=42) # XOR parity fold to 1024 bits
# 3) build engine (composes tokenizer + sketcher)
engine = SynRFP(tokenizer=tokenizer, radius=1, sketch=sketcher)
# 4) compute fingerprint result (object contains tokens, Δ, and sketch)
result = engine.fingerprint(reactant_G, product_G)
# Print a human-friendly summary
print(result) # e.g., tokens_R=..., tokens_P=..., support=..., sketch_type=...
bits = result.to_binary() # list/array of 0/1 for downstream use
print("bits length:", len(bits), "sample:", bits[:16])
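To build intuition for what an XOR parity fold does, here is a minimal toy sketch. It is not SynRFP's actual implementation: the function name `toy_parity_fold`, the use of BLAKE2b for hashing, and the way the seed is mixed in are all assumptions made for illustration. Each token occurrence flips one bit, so a token seen an even number of times cancels out.

```python
import hashlib

def toy_parity_fold(tokens, bits=1024, seed=42):
    """Illustrative XOR parity fold (hypothetical helper, not the SynRFP API)."""
    fp = [0] * bits
    for tok in tokens:
        # Use a stable hash of (seed, token); Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(f"{seed}:{tok}".encode("utf-8"), digest_size=8).digest()
        pos = int.from_bytes(digest, "big") % bits
        fp[pos] ^= 1  # XOR flip: two hits on the same position cancel
    return fp

# "C-O" appears twice and cancels; "C=C" appears once and leaves one bit set.
toy_fp = toy_parity_fold(["C-O", "C=C", "C-O"], bits=64)
print("bits set:", sum(toy_fp))
```

Because the fold only keeps parity, it is compact and deterministic for a fixed seed, but it discards multiplicities beyond odd/even.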
Inspect tokenizer tokens and Δ (deltas)
It is often useful to inspect the local tokens and the signed Δ (per-token count changes) to validate the representation.
# Token extraction (tokenizer API: tokens_graph or equivalent)
# NOTE: API name may vary by version; common pattern: tokenizer.tokens_graph(graph)
toks_R = tokenizer.tokens_graph(reactant_G, radius=1)
toks_P = tokenizer.tokens_graph(product_G, radius=1)
print("Reactant tokens (sample):", list(toks_R)[:20])
print("Product tokens (sample):", list(toks_P)[:20])
# If your SynRFP engine exposes a delta / support view:
# The exact attribute names differ by release; typical pattern:
print("Delta (signed counts):", result.delta) # token -> integer (neg/pos)
print("Total support (unique tokens):", result.support)
# If method names differ, inspect result.__dict__ or repr(result) to find fields.
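If your release does not expose a delta view, the idea can be reproduced by hand. The sketch below is a standalone illustration using only the standard library; the helper name `signed_delta`, the product-minus-reactant sign convention, and reading "support" as the number of tokens with a non-zero change are assumptions, not confirmed SynRFP behaviour.

```python
from collections import Counter

def signed_delta(tokens_r, tokens_p):
    """Per-token count change, product minus reactant (assumed convention)."""
    counts_r = Counter(tokens_r)
    counts_p = Counter(tokens_p)
    delta = {}
    for tok in counts_r.keys() | counts_p.keys():
        diff = counts_p[tok] - counts_r[tok]  # Counter returns 0 for missing tokens
        if diff != 0:
            delta[tok] = diff  # keep only tokens whose count actually changed
    return delta

# Toy tokens: "a" loses one occurrence, "b" disappears, "c" appears.
demo_delta = signed_delta(["a", "a", "b"], ["a", "c"])
demo_support = len(demo_delta)  # tokens with a non-zero net change
print(sorted(demo_delta.items()), demo_support)
```

Tokens that occur equally often on both sides drop out entirely, which is why the delta focuses on the part of the molecule graph that changes.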
Batch encoding
Encode many reactions in one call. Use the BatchEncoder convenience helper for throughput and batching.
from synrfp import BatchEncoder
rxn_smiles = [
    "CCO>>C=C.O",
    "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
    # ...
]
fps = BatchEncoder.encode(
    rxn_smiles,
    tokenizer="wl",    # accepts string aliases: "wl", "nauty", ...
    radius=1,
    sketch="parity",   # "parity", "minhash", "cw"
    bits=1024,
    seed=42,
    batch_size=64,     # tune to your RAM/CPU
    n_jobs=4,          # if BatchEncoder supports parallel backends
)
# fps is an (N, B) NumPy array or a list of lists, depending on version
import numpy as np
fps = np.asarray(fps)
print("fingerprints shape:", fps.shape)
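To make the effect of `batch_size` concrete, the following generic helper splits any iterable of reactions into fixed-size chunks. It is a plain-Python sketch that does not depend on SynRFP; the name `chunked` is hypothetical, and how BatchEncoder batches internally is not shown in this page.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # input exhausted
        yield batch

# Five placeholder reaction IDs split into batches of two.
demo_batches = list(chunked(["r1", "r2", "r3", "r4", "r5"], batch_size=2))
print(demo_batches)
```

The last batch may be shorter than `batch_size`; memory use per step is bounded by the batch size rather than by the full dataset.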
Persisting fingerprints
Save fingerprints and metadata for reuse (NumPy + Pandas / Parquet recommended).
import numpy as np
import pandas as pd
# fps: (N, B) binary array, meta: list of rxn smiles
np.save("fps.npy", fps) # fast binary dump
np.savez_compressed("fps_compressed.npz", fps=fps)
# for ML: save in Parquet with metadata (index aligned)
# `labels` is your own length-N sequence of targets; drop the "label" column if unsupervised
meta = pd.DataFrame({"rxn": rxn_smiles, "label": labels})
df_bits = pd.DataFrame(fps.astype(int), index=meta.index)
output = pd.concat([meta, df_bits.add_prefix("b_")], axis=1)
output.to_parquet("fps_table.parquet")
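Binary fingerprints can also be stored more compactly by packing eight bits into each byte. The sketch below uses random stand-in data (`fps_demo`) rather than real SynRFP output and writes to an in-memory buffer instead of a file; it shows a pack, save, load, and unpack round trip with NumPy only.

```python
import io
import numpy as np

# Stand-in for an (N, B) array of 0/1 fingerprints.
rng = np.random.default_rng(0)
fps_demo = rng.integers(0, 2, size=(5, 1024), dtype=np.uint8)

# Pack 8 bits per byte along the bit axis: (5, 1024) -> (5, 128).
packed = np.packbits(fps_demo, axis=1)

# Save and reload through an in-memory buffer (use a filename in practice).
buf = io.BytesIO()
np.save(buf, packed)
buf.seek(0)
loaded = np.load(buf)

# Unpack; `count` trims padding when B is not a multiple of 8.
restored = np.unpackbits(loaded, axis=1, count=fps_demo.shape[1])
print(packed.shape, bool(np.array_equal(restored, fps_demo)))
```

Keep the original bit width alongside the packed array, because it is needed to unpack correctly when the width is not a multiple of eight.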
Quick ML baseline: RandomForest classifier
A minimal pipeline: encode, train/test split, train a RandomForest and evaluate.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef, f1_score
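# assume fps is an (N, B) array and labels a length-N array of class labels;
# stratify=labels requires at least two samples per class
X_train, X_test, y_train, y_test = train_test_split(
    fps, labels, test_size=0.2, random_state=0, stratify=labels
)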
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("MCC:", matthews_corrcoef(y_test, y_pred))
print("F1 (weighted):", f1_score(y_test, y_pred, average="weighted"))
Comparing sketchers and computing similarity
You may want to compare different sketchers (binary parity vs MinHash vs weighted CWSketch) or compute similarity (Tanimoto/Jaccard for bits).
import numpy as np
def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    # expects binary 0/1 arrays
    a = a.astype(bool)
    b = b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 0.0
# Example: compute pairwise similarity for the first 10 fps (or fewer if fps is shorter)
n = min(10, len(fps))
sims = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sims[i, j] = tanimoto(fps[i], fps[j])
print(sims[:3, :3])
# To compare sketchers: create engines with different sketch classes and encode the same reactions,
# then compare the resulting pairwise-similarity matrices (e.g. by rank correlation). Use Tanimoto for
# binary sketches; for MinHash signatures, the fraction of matching positions estimates Jaccard similarity.
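For more than a handful of fingerprints, the double loop becomes slow. The following sketch computes the full Tanimoto matrix with one matrix product, and adds the standard MinHash similarity estimate (fraction of agreeing signature positions). Both helper names are hypothetical, and the MinHash helper assumes signatures are equal-length integer vectors, which may not match the layout SynRFP's MinHash sketcher returns.

```python
import numpy as np

def pairwise_tanimoto(fps_bin):
    """All-pairs Tanimoto for an (N, B) 0/1 array; empty-vs-empty pairs score 0.0."""
    x = np.asarray(fps_bin).astype(bool).astype(np.int64)
    inter = x @ x.T                          # shared on-bits for every pair
    counts = x.sum(axis=1)                   # on-bits per fingerprint
    union = counts[:, None] + counts[None, :] - inter
    sims = np.zeros(inter.shape, dtype=float)
    np.divide(inter, union, out=sims, where=union > 0)
    return sims

def minhash_agreement(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    sig_a = np.asarray(sig_a)
    sig_b = np.asarray(sig_b)
    if sig_a.shape != sig_b.shape:
        raise ValueError("signatures must have the same shape")
    return float(np.mean(sig_a == sig_b))

# Toy data: three 4-bit fingerprints and two 4-slot signatures.
demo_bits = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]])
demo_sims = pairwise_tanimoto(demo_bits)
demo_mh = minhash_agreement([3, 7, 9, 1], [3, 8, 9, 2])
print(demo_sims.round(3), demo_mh)
```

The vectorized version follows the same convention as `tanimoto` above (0.0 when both fingerprints are empty), so results from the two can be compared directly.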