.. _tutorials-and-examples:

Tutorials and Examples
======================

This page collects practical, copy-and-paste examples showing how to:

- compute and inspect **SynRFP** fingerprints for single reactions,
- explore tokenizer tokens and Δ (delta / support) statistics,
- encode reaction lists in batch (CPU-friendly),
- persist fingerprints to disk (NumPy / Parquet) for ML,
- run a tiny supervised baseline (scikit-learn) using SynRFP bits,
- compare sketchers (ParityFold, MinHash, CWSketch) and compute similarities.

If you have not installed **SynRFP** yet, see :doc:`Getting Started `.

Single reaction: token + fingerprint
------------------------------------

This example shows the common, low-level flow: parse a reaction SMILES string,
create graph objects, tokenize, aggregate Δ (net change), and sketch.

.. code-block:: python

    from synrfp.graph.reaction import Reaction
    from synrfp.tokenizers.wl import WLTokenizer
    from synrfp.sketchers.parity_fold import ParityFold
    from synrfp.synrfp import SynRFP

    # 1) parse RSMI -> GraphData pair (reactant_graph, product_graph)
    reactant_G, product_G = Reaction.from_rsmi("CCO>>C=C.O")

    # 2) instantiate tokenizer + sketcher
    tokenizer = WLTokenizer()                  # WL-style subtree tokens
    sketcher = ParityFold(bits=1024, seed=42)  # XOR parity fold to 1024 bits

    # 3) build engine (composes tokenizer + sketcher)
    engine = SynRFP(tokenizer=tokenizer, radius=1, sketch=sketcher)

    # 4) compute fingerprint result (object contains tokens, Δ, and sketch)
    result = engine.fingerprint(reactant_G, product_G)

    # Print a human-friendly summary
    print(result)  # e.g., tokens_R=..., tokens_P=..., support=..., sketch_type=...

    bits = result.to_binary()  # list/array of 0/1 for downstream use
    print("bits length:", len(bits), "sample:", bits[:16])

Inspect tokenizer tokens and Δ (deltas)
---------------------------------------

To validate the representation, it often helps to inspect the local tokens and
the signed delta (Δ) counts directly.

.. code-block:: python

    # Continues from the previous example: reuses tokenizer, reactant_G,
    # product_G, and result.

    # Token extraction (tokenizer API: tokens_graph or equivalent)
    # NOTE: API name may vary by version; common pattern: tokenizer.tokens_graph(graph)
    toks_R = tokenizer.tokens_graph(reactant_G, radius=1)
    toks_P = tokenizer.tokens_graph(product_G, radius=1)
    print("Reactant tokens (sample):", list(toks_R)[:20])
    print("Product tokens (sample):", list(toks_P)[:20])

    # If your SynRFP engine exposes a delta / support view:
    # The exact attribute names differ by release; typical pattern:
    print("Delta (signed counts):", result.delta)  # token -> integer (neg/pos)
    print("Total support (unique tokens):", result.support)

    # If method names differ, inspect result.__dict__ or repr(result) to find fields.

Batch encoding
--------------

Use the ``BatchEncoder`` convenience helper to encode many reactions in one
call, with batching for throughput.

.. code-block:: python

    import numpy as np

    from synrfp import BatchEncoder

    rxn_smiles = [
        "CCO>>C=C.O",
        "CO.O[C@@H]1CCNC1.[C-]#[N+]CC(=O)OC>>[C-]#[N+]CC(=O)N1CC[C@@H](O)C1",
        # ...
    ]

    fps = BatchEncoder.encode(
        rxn_smiles,
        tokenizer="wl",   # accepts string aliases: "wl", "nauty", ...
        radius=1,
        sketch="parity",  # "parity", "minhash", "cw"
        bits=1024,
        seed=42,
        batch_size=64,    # tune to your RAM/CPU
        n_jobs=4,         # if BatchEncoder supports parallel backends
    )

    # fps is a (N, B) NumPy array or list-of-lists depending on version
    fps = np.asarray(fps)
    print("fingerprints shape:", fps.shape)

Persisting fingerprints
-----------------------

Save fingerprints together with their metadata for reuse (NumPy plus
Pandas / Parquet is recommended).

.. code-block:: python

    import numpy as np
    import pandas as pd

    # fps: (N, B) binary array from the batch example
    # rxn_smiles: list of N reaction SMILES
    # labels: length-N sequence of class labels (only needed for supervised use;
    #         it must be defined before this point)
    np.save("fps.npy", fps)                             # fast binary dump
    np.savez_compressed("fps_compressed.npz", fps=fps)  # smaller, slower

    # for ML: save in Parquet with metadata (index aligned)
    # NOTE: to_parquet needs a Parquet engine such as pyarrow or fastparquet
    meta = pd.DataFrame({"rxn": rxn_smiles, "label": labels})
    df_bits = pd.DataFrame(fps.astype(int), index=meta.index)
    output = pd.concat([meta, df_bits.add_prefix("b_")], axis=1)
    output.to_parquet("fps_table.parquet")

Quick ML baseline: RandomForest classifier
------------------------------------------

A minimal pipeline: encode, split into train and test sets, train a
RandomForest, and evaluate.

.. code-block:: python

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, matthews_corrcoef
    from sklearn.model_selection import train_test_split

    # assume fps (N, B) and labels (N,); stratify needs at least two samples
    # per class
    X_train, X_test, y_train, y_test = train_test_split(
        fps, labels, test_size=0.2, random_state=0, stratify=labels
    )

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print("MCC:", matthews_corrcoef(y_test, y_pred))
    print("F1 (weighted):", f1_score(y_test, y_pred, average="weighted"))

Comparing sketchers and computing similarity
--------------------------------------------

You may want to compare different sketchers (binary parity vs MinHash vs
weighted CWSketch) or compute similarity (Tanimoto/Jaccard for bits).

.. code-block:: python

    import numpy as np

    def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
        # expects binary 0/1 arrays of equal length
        a = a.astype(bool)
        b = b.astype(bool)
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return float(inter) / union if union > 0 else 0.0

    # Example: pairwise similarity for the first 10 fps (or fewer if N < 10)
    n = min(10, len(fps))
    sims = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sims[i, j] = tanimoto(fps[i], fps[j])
    print(sims[:3, :3])

    # To compare sketchers: create engines with different sketch classes and
    # encode the same reactions; then compare the resulting representations
    # pairwise (e.g. for MinHash integer vectors, the fraction of matching
    # slots estimates Jaccard similarity).
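
The double loop above is fine for a handful of fingerprints but slows down
quickly as N grows. The sketch below shows one way to compute the same Tanimoto
matrix with a single matrix product, plus a slot-matching estimate for
MinHash-style signatures. It uses only NumPy and toy data. The names
``pairwise_tanimoto`` and ``minhash_similarity`` are illustrative helpers, not
part of the SynRFP API, and the MinHash part assumes the sketcher returns
fixed-length integer signatures, which you should check against your installed
version.

.. code-block:: python

    import numpy as np


    def pairwise_tanimoto(bits: np.ndarray) -> np.ndarray:
        """Return the (N, N) Tanimoto matrix for an (N, B) array of 0/1 bits."""
        x = np.asarray(bits).astype(bool).astype(np.int64)
        inter = x @ x.T                # |a AND b| for every pair of rows
        counts = x.sum(axis=1)         # number of set bits per row
        union = counts[:, None] + counts[None, :] - inter
        sims = np.zeros(inter.shape, dtype=float)
        # Pairs with an empty union keep the 0.0 default, matching tanimoto()
        np.divide(inter, union, out=sims, where=union > 0)
        return sims


    def minhash_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Fraction of matching slots between two equal-length signatures."""
        a = np.asarray(a)
        b = np.asarray(b)
        if a.shape != b.shape:
            raise ValueError("signatures must have the same shape")
        return float(np.mean(a == b)) if a.size else 0.0


    # Toy data standing in for real SynRFP output (hypothetical values)
    toy_bits = np.array(
        [
            [1, 1, 0, 0],
            [1, 0, 1, 0],
            [0, 0, 0, 0],
        ],
        dtype=np.uint8,
    )
    toy_sims = pairwise_tanimoto(toy_bits)
    print(toy_sims)

    sig_a = np.array([3, 7, 7, 1])
    sig_b = np.array([3, 2, 7, 9])
    print(minhash_similarity(sig_a, sig_b))

With real data, ``pairwise_tanimoto(fps)`` replaces the nested loop. Memory
grows as N squared, so for large collections process the rows in chunks.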