Agents API Reference
cs_copilot.agents
Cs_copilot Agents Package
This package provides a comprehensive system for creating and managing AI agents specialized in cheminformatics tasks.
Public API:
Agent Creation (Recommended)
create_agent(agent_type, model, **kwargs) - Create agents by type
list_available_agent_types() - List all available agent types
Team Coordination
get_cs_copilot_agent_team(model, **kwargs) - Multi-agent team with intelligent coordination
Utilities
get_last_agent_reply(agent) - Extract last message from agent
Available Agent Types (5-Agent Architecture):
Core Agents:
- "chembl_downloader" - Download and process bioactivity data from ChEMBL database
- "gtm_agent" - Unified GTM operations (build, load, density, activity, project) with smart caching
- "chemoinformatician" - Comprehensive chemoinformatics (chemotype, clustering, SAR, similarity, QSAR)
- "report_generator" - Universal presentation layer for all analysis types
- "molecular_designer" - Small-molecule design via autoencoder and LLM engines
- "peptide_designer" - Peptide design via WAE and LLM engines plus latent-space GTM workflows
Testing/Evaluation:
- "robustness_evaluation" - Analyze robustness test results and metrics
Agent Capabilities Breakdown:
Chemoinformatician (Most Versatile):
- Chemotype/Scaffold Analysis: Extract and analyze molecular frameworks
- Clustering: Group molecules by structural similarity (k-means, hierarchical, DBSCAN)
- SAR Analysis: Structure-Activity Relationships, activity cliffs, matched molecular pairs
- Similarity/Diversity: Molecular similarity, diversity metrics, nearest neighbors
- QSAR Modeling: Extensible framework for predictive modeling (tools to be added)
AgentConfig
dataclass
Configuration for creating an agent.
Source code in src/cs_copilot/agents/factories.py
validate()
Validate the agent configuration.
Source code in src/cs_copilot/agents/factories.py
AgentCreationError
BaseAgentFactory
Bases: ABC
Base class for creating agents with common configuration and error handling.
Source code in src/cs_copilot/agents/factories.py
get_agent_config()
abstractmethod
create_agent(model, markdown=True, debug_mode=False, enable_mlflow_tracking=True, **kwargs)
Create an agent with error handling and validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Model` | Model to use for the agent | required |
| `markdown` | `bool` | Whether to enable markdown formatting | `True` |
| `debug_mode` | `bool` | Whether to enable debug mode | `False` |
| `enable_mlflow_tracking` | `bool` | Whether to enable MLflow tracking for this agent | `True` |
| `**kwargs` | | Additional keyword arguments for agent creation | `{}` |
Returns:
| Type | Description |
|---|---|
| `Agent` | Created agent instance |
Source code in src/cs_copilot/agents/factories.py
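As a rough illustration of the pattern described above, the following sketch shows an abstract factory whose `create_agent` validates its configuration and wraps any failure in `AgentCreationError`. The `AgentConfig` fields (`name`, `instructions`), the plain dict used in place of a real `Agent`, and the `EchoFactory` and `BrokenFactory` subclasses are hypothetical stand-ins, not the package's actual implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


class AgentCreationError(Exception):
    """Stand-in for the package's AgentCreationError."""


@dataclass
class AgentConfig:
    # Hypothetical fields; the real dataclass may define different ones.
    name: str
    instructions: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError when the configuration is unusable."""
        if not self.name.strip():
            raise ValueError("agent name must not be empty")
        if not self.instructions:
            raise ValueError("agent needs at least one instruction")


class BaseAgentFactory(ABC):
    @abstractmethod
    def get_agent_config(self) -> AgentConfig:
        """Return the configuration for the agent this factory builds."""

    def create_agent(self, model: Any, markdown: bool = True,
                     debug_mode: bool = False, **kwargs: Any) -> Dict[str, Any]:
        try:
            config = self.get_agent_config()
            config.validate()
            # A plain dict stands in for the real Agent object.
            return {"name": config.name, "model": model,
                    "markdown": markdown, "debug_mode": debug_mode, **kwargs}
        except Exception as exc:
            # Surface every failure under one exception type.
            raise AgentCreationError(f"could not create agent: {exc}") from exc


class EchoFactory(BaseAgentFactory):
    def get_agent_config(self) -> AgentConfig:
        return AgentConfig(name="echo", instructions=["Repeat the input."])


class BrokenFactory(BaseAgentFactory):
    def get_agent_config(self) -> AgentConfig:
        return AgentConfig(name="", instructions=[])
```

In this sketch a subclass only supplies `get_agent_config()`; validation and error wrapping live in the base class, so callers need to catch a single exception type.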
create_agent(agent_type, model, **kwargs)
Create an agent by type using the global registry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_type` | `str` | The type of agent to create | required |
| `model` | `Model` | The language model to use | required |
| `**kwargs` | | Additional arguments passed to the agent factory | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Agent` | `Agent` | The created agent instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If agent_type is not registered |
| `AgentCreationError` | If agent creation fails |
Source code in src/cs_copilot/agents/registry.py
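The lookup-then-create behaviour documented above can be pictured with a small type-keyed registry. This is a sketch under assumptions: the `_REGISTRY` dict, the `register` helper, and the two example factories are hypothetical, and the real registry in `registry.py` may be organised differently. Only the documented contract is mirrored: an unknown type raises `ValueError`, and a failing factory raises `AgentCreationError`.

```python
from typing import Any, Callable, Dict, List


class AgentCreationError(Exception):
    """Stand-in for the package's AgentCreationError."""


# Hypothetical registry: maps an agent type name to a factory callable.
_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(agent_type: str, factory: Callable[..., Any]) -> None:
    """Make a factory available under the given type name."""
    _REGISTRY[agent_type] = factory


def list_available_agent_types() -> List[str]:
    """Return the registered type names in sorted order."""
    return sorted(_REGISTRY)


def create_agent(agent_type: str, model: Any, **kwargs: Any) -> Any:
    """Create an agent by type, mirroring the documented error contract."""
    if agent_type not in _REGISTRY:
        known = ", ".join(list_available_agent_types())
        raise ValueError(f"unknown agent type {agent_type!r}; known types: {known}")
    try:
        return _REGISTRY[agent_type](model, **kwargs)
    except Exception as exc:
        raise AgentCreationError(f"failed to create {agent_type!r}: {exc}") from exc


def _failing_factory(model: Any, **kwargs: Any) -> Any:
    raise RuntimeError("toolkit unavailable")


# Example registrations; a dict stands in for a real agent object.
register("gtm_agent", lambda model, **kw: {"type": "gtm_agent", "model": model, **kw})
register("broken", _failing_factory)
```

Calling `list_available_agent_types()` before `create_agent(...)` is a cheap way to avoid the `ValueError` path when the type name comes from user input.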
get_registry()
list_available_agent_types()
get_cs_copilot_agent_team(model, *, markdown=True, debug_mode=False, show_members_responses=True, enable_memory=True, db_file=None, enable_mlflow_tracking=True)
Create a coordinated team of cs_copilot agents using Agno.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Model` | Agno Model instance used for team coordination and member agents | required |
| `markdown` | `bool` | Format output in markdown | `True` |
| `debug_mode` | `bool` | Enable debug logs | `False` |
| `show_members_responses` | `bool` | Print member responses during coordination | `True` |
| `enable_memory` | `bool` | Enable persistent session history (default: True). Cross-session user/agentic memories stay disabled to prevent state leakage. | `True` |
| `db_file` | `str` | Custom database file path. If not provided, uses CS_COPILOT_MEMORY_DB. Use unique paths for session isolation in testing. | `None` |
| `enable_mlflow_tracking` | `bool` | Enable MLflow tracking for agents (default: True). Set to False to disable tracking. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Team` | `Team` | Configured Cs_copilot team |
Raises:
| Type | Description |
|---|---|
| `AgentCreationError` | If one or more agents fail to initialize |
Source code in src/cs_copilot/agents/teams.py
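The `db_file` fallback and the advice to use unique paths in tests can be sketched as below. Two things here are assumptions: that `CS_COPILOT_MEMORY_DB` is read from the environment (it may instead be a constant in the `config` module), and the fallback filename `cs_copilot_memory.db`. Both helper functions are hypothetical and not part of the package.

```python
import os
import tempfile
import uuid
from typing import Optional

# Assumption: CS_COPILOT_MEMORY_DB is an environment variable; in the real
# package it may be a constant defined in the config module instead.
DEFAULT_DB_FILE = "cs_copilot_memory.db"  # hypothetical fallback name


def resolve_db_file(db_file: Optional[str] = None) -> str:
    """Prefer an explicit path, then the configured default."""
    if db_file is not None:
        return db_file
    return os.environ.get("CS_COPILOT_MEMORY_DB", DEFAULT_DB_FILE)


def isolated_db_file(prefix: str = "cs_copilot_test_") -> str:
    """Build a unique database path so test sessions never share history."""
    return os.path.join(tempfile.gettempdir(), f"{prefix}{uuid.uuid4().hex}.db")
```

A test could then pass `db_file=isolated_db_file()` to `get_cs_copilot_agent_team` so that each run starts from an empty session history.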
get_last_agent_reply(agent)
Extract the content of the last message from an agent's session.
config
Configuration module for cs_copilot agents. Contains path constants and database configuration settings. Agent instructions and prompts are now in prompts.py.
factories
Agent factory classes for creating specialized cs_copilot agents. Contains the base factory class and all specialized factory implementations.
ChEMBLDownloaderFactory
Bases: BaseAgentFactory
Factory for creating ChEMBL downloader agents.
Source code in src/cs_copilot/agents/factories.py
ChemoinformaticianFactory
Bases: BaseAgentFactory
Factory for creating comprehensive chemoinformatics analysis agents.
This agent is a versatile chemoinformatician capable of:
- Chemotype Analysis: Scaffold extraction, chemotype profiling, structural diversity
- Clustering: Molecular clustering using various methods (k-means, hierarchical, DBSCAN)
- SAR Analysis: Structure-Activity Relationship analysis, activity cliffs, matched molecular pairs
- Similarity Analysis: Molecular similarity, diversity metrics, nearest neighbor searches
GTM-Integrated Design:
- Primary use case: Downstream analysis after GTM agents (nodes as clusters)
- Also works with ANY data source: t-SNE clusters, user CSVs, ChEMBL families
- Standardized input: DataFrame with 'smiles' + optional 'cluster_id' + optional 'activity'
- Produces structured data output (DataFrames, dicts); NO report generation
- Report generation handled by separate ReportGeneratorAgent
Tools:
- ChemicalSimilarityToolkit: Fingerprints, similarity metrics, scaffold extraction
- PointerPandasTools: DataFrame operations with S3 support
- GTMToolkit: Access to GTM data (source_mols, node projections)
Source code in src/cs_copilot/agents/factories.py
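The standardized input contract described above (a required 'smiles' column plus optional 'cluster_id' and 'activity') could be checked before handing data to the agent. The helper below is a hypothetical sketch that works on column names only, so it can be called with `df.columns` from a pandas DataFrame or with a CSV header row. It is not part of the package.

```python
from typing import Dict, Iterable

REQUIRED_COLUMNS = ("smiles",)
OPTIONAL_COLUMNS = ("cluster_id", "activity")


def check_input_columns(columns: Iterable[str]) -> Dict[str, bool]:
    """Check column names against the 'smiles' + optional columns contract.

    Returns which optional columns are present; raises ValueError when a
    required column is missing. Matching is case-sensitive.
    """
    present = set(columns)
    missing = [name for name in REQUIRED_COLUMNS if name not in present]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    return {name: name in present for name in OPTIONAL_COLUMNS}
```

The returned mapping tells the caller which analyses are possible: clustering-based summaries need 'cluster_id', and SAR analysis needs 'activity'.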
MolecularDesignerFactory
Bases: BaseAgentFactory
Factory for creating small-molecule design agents.
Supports three modes:
- Engine-driven design: Use autoencoder or LLM engines behind a common facade
- Standalone autoencoder: Encode/decode SMILES, sample latent space, interpolate, explore neighborhoods
- GTM-guided: Combine GTM maps with generative engines for targeted molecular design from specific map regions (by density, activity, or coordinates)
Enhanced with GTM cache awareness to avoid redundant GTM loading when working with the GTM Agent in the same session.
Source code in src/cs_copilot/agents/factories.py
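To make the "interpolate" capability concrete, here is a minimal sketch of interpolation between two latent vectors. It assumes straight-line (linear) interpolation; the engine may use a different scheme, and the function name is hypothetical. In a real workflow each intermediate vector would then be decoded back to a SMILES string by the autoencoder.

```python
from typing import List, Sequence


def interpolate_latent(start: Sequence[float], end: Sequence[float],
                       steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced points from start to end, inclusive."""
    if len(start) != len(end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # goes from 0.0 at the start to 1.0 at the end
        path.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return path
```

For example, `interpolate_latent([0.0, 2.0], [1.0, 4.0], 3)` yields the two endpoints and their midpoint `[0.5, 3.0]`.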
GTMAgentFactory
Bases: BaseAgentFactory
Factory for creating unified GTM agents (consolidates optimization, loading, density, activity, projection).
This factory creates a single agent that handles all GTM-related operations via mode-based dispatch:
- optimize: Build and optimize new GTM maps
- load: Load existing GTM models from S3/local/HuggingFace
- density: Analyze compound distributions and neighborhood preservation
- activity: Create activity-density landscapes for SAR analysis
- project: Project external datasets onto existing GTM maps
Features smart caching to avoid redundant GTM loading across operations.
Source code in src/cs_copilot/agents/factories.py
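Mode-based dispatch with a shared cache can be sketched roughly as follows. The `GTMDispatcher` class, its `loader` callable, and the returned status strings are hypothetical illustrations. The real agent is driven by LLM instructions and session state rather than by a class like this. What the sketch does show is the intended effect: several operations in one session trigger only one model load.

```python
from typing import Any, Callable, Dict

MODES = ("optimize", "load", "density", "activity", "project")


class GTMDispatcher:
    """Route a request to a per-mode branch, loading the map at most once."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # hypothetical: builds or loads a GTM model
        self._cache: Dict[str, Any] = {}
        self.load_count = 0

    def _get_model(self) -> Any:
        # Reuse the cached model when one exists; load it otherwise.
        if "model" not in self._cache:
            self._cache["model"] = self._loader()
            self.load_count += 1
        return self._cache["model"]

    def run(self, mode: str) -> str:
        if mode not in MODES:
            raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
        if mode == "optimize":
            # Building a new map replaces whatever was cached.
            self._cache["model"] = self._loader()
            self.load_count += 1
            return "optimize: built new map"
        model = self._get_model()
        return f"{mode}: used {model}"
```

Running `load`, `density`, and `activity` in sequence on one dispatcher leaves `load_count` at 1, which is the redundancy the smart caching is meant to avoid.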
ReportGeneratorFactory
Bases: BaseAgentFactory
Factory for creating report generation agents.
This agent handles ALL report generation and visualization across different analysis types:
- Chemotype analysis reports
- GTM density reports
- GTM activity/SAR reports
- Molecular designer generation reports
- Combined/custom reports
Separation of Concerns: Analysis agents produce structured data, Report Generator handles presentation.
This architecture enables:
- Consistent formatting across all report types
- Reusable visualization patterns
- Easy updates to report styles (change in one place)
- Clean separation: data processing vs visualization/formatting
Source code in src/cs_copilot/agents/factories.py
RobustnessEvaluationFactory
Bases: BaseAgentFactory
Factory for creating robustness test evaluation agents.
Source code in src/cs_copilot/agents/factories.py
SynPlannerFactory
Bases: BaseAgentFactory
Factory for creating retrosynthetic planning agents powered by SynPlanner.
This agent wraps the official SynPlanner package to perform retrosynthetic analysis on target molecules. It accepts SMILES strings or molecule names, resolves them to canonical SMILES (via PubChem / RDKit), runs the MCTS-based retrosynthesis search, and returns structured route descriptions with optional SVG/PNG visualizations.
Source code in src/cs_copilot/agents/factories.py
PeptideDesignerFactory
Bases: BaseAgentFactory
Factory for creating peptide design agents.
This agent exposes a Peptide Designer facade over multiple peptide design engines. The default WAE engine encodes, decodes, samples, and interpolates amino acid sequences; the LLM engine proposes sequence candidates from natural-language objectives. The WAE model can generate arbitrary peptides, but activity landscape data comes from DBAASP and therefore covers antimicrobial peptides only.
Key capabilities:
- Encoding: Convert peptide sequences to 100-dimensional latent vectors
- Decoding: Generate peptide sequences from latent vectors
- Sampling: Generate novel peptides from Gaussian prior
- Interpolation: Smooth transitions between peptides in latent space
- Neighborhood exploration: Generate peptide analogs
- GTM integration: Train GTMs on latent space, create activity landscapes
- Activity landscapes: Use DBAASP data (specific to antimicrobial peptides)
Input format: Space-separated single-letter amino acid codes
Example: "M L L L L L A L A L L A L L L A L L L"
Source code in src/cs_copilot/agents/factories.py
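Peptide sequences are often written without spaces, so a small conversion step may be needed before calling the designer. The helper below is a hypothetical sketch of that step. It assumes the accepted alphabet is the 20 standard amino acids, which the documentation above does not confirm; the WAE vocabulary may be wider or narrower.

```python
# Assumption: only the 20 standard amino acids are accepted.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def to_designer_format(sequence: str) -> str:
    """Convert a peptide string to space-separated single-letter codes."""
    # Drop any existing whitespace and normalise to upper case.
    residues = [ch for ch in sequence.upper() if not ch.isspace()]
    if not residues:
        raise ValueError("empty peptide sequence")
    unknown = sorted(set(residues) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unsupported residue code(s): {', '.join(unknown)}")
    return " ".join(residues)
```

Both compact input such as `"MLLA"` and already spaced input such as `"m l l a"` normalise to `"M L L A"`.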
prompts
Prompt templates and instructions for cs_copilot agents. Contains all the step-by-step instructions used by various specialized agents.
CHEMBL_INSTRUCTIONS = [
    "Step 1: Analyze the user's request and identify the biological target or compound type they want to explore.",
    ' - Distinguish whether the user is asking about a *protein target* (e.g., CDK2, BRAF) or an *organism-level target* (e.g., HIV-1, Influenza A).',
    " - Record the target_type as either 'protein' or 'organism' for downstream filtering.",
    " - If an organism is specified (e.g., 'HIV', 'E. coli'), keep that exact string for filtering assays by target_organism.",
    "Step 2: Extract the core target name from the user's request, removing generic terms like 'inhibitor', 'activity', 'compound', 'effect'. For example:",
    " - 'cyclin dependent kinase 2 inhibitors' → core target: 'cyclin dependent kinase 2'",
    " - 'BRAF inhibitors' → core target: 'BRAF'",
    ' - Focus on identifying the specific biological target or protein name for protein-level queries; for organism-level queries, preserve the organism name.',
    'Step 3: Apply the following required checks before proceeding. Each requirement MUST be satisfied by explicit user confirmation. If ANY requirement fails, DO NOT proceed — return control to the Team agent listing ALL unsatisfied requirements.',
    '',
    ' **Requirement 1 — Target Specificity & Abbreviation Confirmation (mandatory)**',
    ' Before asking anything else, verify the target the user named passes BOTH sub-checks below. Both must pass before you proceed to the other requirements.',
    '',
    ' **Sub-check 1a — Specificity Floor.** The target must be either:',
    " (a) a full canonical protein name — e.g., 'epidermal growth factor receptor', 'phosphodiesterase 4A', 'peroxisome proliferator-activated receptor gamma', 'serotonin receptor 2A', 'cyclin-dependent kinase 2'; OR",
    " (b) a recognized gene symbol or protein abbreviation — e.g., 'CDK2', 'EGFR', 'JAK2', 'BRAF', 'PDE4', 'DPP4', 'PPARG', '5-HT2A', 'mTOR', 'PTP1B', 'CYP3A4'.",
    ' A target is NOT specific enough if it is a **generic family word plus an index or descriptor** that does not uniquely identify a protein. REJECT these:',
    " - 'kinase 2' (could be CDK2, JAK2, MAP2K2/MEK2, CHK2, PKC2, STK2, …)",
    " - 'kinase 3', 'kinase alpha', 'kinase II'",
    " - 'receptor 5', 'receptor alpha', 'receptor 2'",
    " - 'protein 2', 'protein kinase'",
    " - 'phosphatase 1', 'phosphodiesterase' (bare family)",
    " - bare family names: 'kinase', 'receptor', 'phosphatase', 'GPCR', 'nuclear receptor', 'ion channel', 'transporter'",
    " **Test to apply**: strip generic suffixes like 'inhibitor(s)', 'activity', 'compound(s)', 'data', 'ligand(s)', 'modulator(s)'. What remains must be either a recognized gene abbreviation (a token like 'EGFR' or 'egfr') or a full phrase containing a specific protein name. A bare family word with only a digit or Greek letter appended FAILS the test.",
    ' If the target fails sub-check 1a, you MUST refuse to search and ask the user for a canonical gene/protein name. Example clarifications:',
    " User: 'Fetch kinase 2 inhibitor data'",
    ' You: \'The query "kinase 2" is too generic — it could mean CDK2 (cyclin-dependent kinase 2), JAK2 (Janus kinase 2), MAP2K2/MEK2, CHK2, or others. Please specify a gene symbol (e.g., CDK2, JAK2, MEK2) or a full canonical protein name.\'',
    " User: 'Download receptor 5 ligands'",
    ' You: \'The query "receptor 5" is too generic — it could refer to many different receptor families (5-HT1F, 5-HT5A, TAS2R5, GPR5, OR5, …). Please specify a gene symbol or a full canonical receptor name.\'',
    '',
    " **Sub-check 1b — Abbreviation Confirmation.** If the target name provided by the user is ONLY an abbreviation or acronym (e.g., 'CDK2', 'PDE4', 'EGFR', 'BRAF', 'HIV1', 'JAK2', 'DPP4'), you MUST ask the user to confirm or provide the full target name.",
    " - Example: 'CDK2' → Ask: 'CDK2 stands for cyclin dependent kinase 2 — is that the target you mean?'",
    " - Example: 'PDE4' → Ask: 'PDE4 can refer to phosphodiesterase 4A/4B/4C/4D — which isoform(s) do you need?'",
    " - **Anti-bypass rule**: Even if the user says 'just get me CDK2 data' or 'you know what CDK2 is', you MUST still ask for confirmation. No shortcut is allowed.",
    '',
    " **Order of operations**: sub-check 1a runs FIRST. A recognized gene symbol like 'BRAF' passes 1a and then triggers 1b (you still confirm the full name 'B-Raf proto-oncogene'). A term like 'kinase 2' fails 1a — ask for a canonical name before applying 1b.",
    '',
    ' **Requirement 2 — Organism Check (mandatory for protein targets)**',
    ' If the query is about a *protein target* and no organism has been explicitly specified, you MUST ask which organism to filter for.',
    ' - NEVER default to Homo sapiens or any other organism.',
    " - Example: 'CDK2 inhibitors' → Ask: 'Which organism? (e.g., Homo sapiens, Mus musculus, or all species)'",
    " - This requirement does NOT apply to organism-level queries (e.g., 'HIV-1 compounds') where the organism IS the target.",
    '',
    ' **Requirement 3 — Assay Type Check (mandatory)**',
    ' If the user has not explicitly stated the assay type(s) (binding, functional, ADMET), you MUST ask which assay type(s) to include.',
    " - NEVER default to any combination (e.g., do NOT silently assume 'binding + functional').",
    " - Example: 'EGFR data' → Ask: 'Which assay types? Binding (IC50/Ki), functional, ADMET, or a combination?'",
    '',
    ' **Requirement 4 — Mechanism of Action Check (mandatory to ASK, optional to APPLY)**',
    " You MUST ask the user whether they want to filter assays by a mechanism of action (e.g., 'agonist', 'antagonist', 'inverse agonist', 'allosteric modulator', 'ATP-competitive inhibitor', 'covalent inhibitor', 'partial agonist').",
    ' - Example question: \'Do you want to filter assays to a specific mechanism of action (agonist, antagonist, modulator, ATP-competitive inhibitor, allosteric modulator, …)? Answer with a specific mechanism, or say "unspecified" / "no preference" / "any" to keep all mechanisms.\'',
    " - **Unspecified is a VALID answer**: if the user explicitly says 'unspecified', 'no preference', 'any', 'I don't care', 'all', or similar, you MUST call `fetch_compounds` with `mechanism=None` (omit the filter entirely). DO NOT invent, guess, or default to a mechanism.",
    " - **Anti-bypass rule**: the question is mandatory. You MUST NOT skip it even if the user's initial prompt contains words like 'inhibitor' — 'inhibitor' is a generic term, not a mechanism. Only an explicit mechanism keyword (agonist / antagonist / modulator / inverse / allosteric / ATP-competitive / covalent / partial …) counts as a specified mechanism.",
    ' - Examples:',
    " • User: 'EGFR data' → Ask: 'Any specific mechanism (ATP-competitive, covalent, allosteric) or unspecified?'",
    " • User: 'PPARG compounds' → Ask: 'Any specific mechanism (agonist, partial agonist, antagonist, modulator) or unspecified?'",
    " • User: '5-HT2A ligands' → Ask: 'Any specific mechanism (agonist, antagonist, inverse agonist, partial agonist) or unspecified?'",
    " - When the user specifies a mechanism, pass it verbatim to `fetch_compounds(mechanism=…)`. When the user answers 'unspecified' / 'any' / 'no preference', call `fetch_compounds` WITHOUT passing the `mechanism` argument.",
    '',
    " **Additional notes**: Requirement 1 above already covers broad or generic terms and family-word + index fragments. If the user nevertheless insists on a vague target after clarification ('just give me any kinase data'), politely re-explain and re-ask for a canonical name.",
    '',
    ' **Multi-requirement failure examples:**',
    " - 'kinase 2 inhibitors' → Requirement 1 (sub-check 1a) fails: 'kinase 2' is a generic family word plus an index, not a unique target. Ask for a canonical gene/protein name BEFORE asking the other requirements.",
    " - 'BRAF inhibitors' → Requirements 1, 2, 3, 4 fail: abbreviation not confirmed (sub-check 1b), no organism, no assay type, no mechanism answer. Ask all four in one message.",
    " - 'EGFR data for human' → Requirements 1, 3, and 4 fail: abbreviation not confirmed, no assay type, no mechanism answer.",
    " - 'Download binding data for phosphodiesterase 4A' → Requirements 2 and 4 fail: no organism specified, no mechanism answer.",
    " - 'Get me JAK2 binding data for Homo sapiens' → Requirements 1 and 4 fail: abbreviation not confirmed, no mechanism answer.",
    " - 'Fetch human PPARG binding agonist data, full name peroxisome proliferator-activated receptor gamma' → ALL requirements satisfied: Req 1 passes (canonical name + full name), Req 2 Homo sapiens, Req 3 binding, Req 4 agonist. Proceed.",
    " - 'Fetch human EGFR binding data, full name epidermal growth factor receptor, any mechanism' → ALL requirements satisfied: Req 4 answered with 'any' → call fetch_compounds with mechanism=None.",
    '',
    ' **Procedure when requirements fail:**',
    ' - Combine ALL unsatisfied requirements into a SINGLE clarification message.',
    " - Return control to the Team agent with: 'The query needs clarification: [list all unsatisfied requirements]. Returning to Team agent for user input.'",
    " - Once the user provides clarification, pass the details to fetch_compounds using the appropriate parameters: 'query' for target name, 'organism' for species filter, 'assay_types' for data type, and 'mechanism' for mechanism of action. If the user explicitly said 'unspecified' / 'any' / 'no preference' for mechanism, pass `mechanism=None` (or omit the parameter entirely).",
    ' - It is ALWAYS better to ask for precision than to fetch incorrect or irrelevant data.',
    'Step 4: Use the `convert_to_chembl_query` tool with the identified core target to generate multiple SEMANTIC keyword variations (abbreviations, synonyms, greek-letter replacements) for ChEMBL search.',
    ' - The tool will generate 2-4 semantic keywords per target (abbreviations and full names).',
    " - Punctuation/spacing variants ('phosphodiesterase 4A' vs 'phosphodiesterase-4A' vs 'phosphodiesterase4A') are matched AUTOMATICALLY by `fetch_compounds` via regex — you do NOT need to include them in the keyword list.",
    " - The same automatic regex matching guarantees 'epidermal growth factor receptor' and 'epidermal-growth factor receptor' are searched identically, so you never need to worry about hyphen vs space spellings.",
    " - Example: For 'phosphodiesterase 4A', the tool will return: 'pde4a, phosphodiesterase 4A' (fetch_compounds matches all hyphen/space variants via regex internally).",
    ' - When the query is organism-level, include the organism name as one of the keywords to ensure assays for that organism are retrieved.',
    " - Determine assay type preferences: map 'binding' → B, 'functional' → F, 'ADMET' → A. The user MUST have explicitly specified assay type(s) before reaching this step (enforced by the mandatory requirements above). NEVER apply a default.",
    "Step 5: Use the `fetch_compounds` tool with the semantic keywords from Step 4 (comma-separated, e.g., 'pde4a, phosphodiesterase 4A') to download bioactivity data from ChEMBL. The tool will:",
    " - Pass the organism filter when the query is organism-level so assays are constrained to that species/strain (e.g., organism='HIV-1').",
    " - Pass the assay_types filter (e.g., ['binding', 'functional', 'ADMET']) to control whether you retrieve binding, functional, or ADMET assays.",
    " - Pass the `mechanism` filter ONLY if the user explicitly specified a mechanism of action (e.g., mechanism='allosteric modulator' for a PDE4 query, mechanism='antagonist' for a dopamine D2 query, mechanism='ATP-competitive inhibitor' for a BRAF query). If the user answered 'unspecified', 'no preference', 'any', or similar, pass `mechanism=None` (or omit the argument) — do NOT fabricate a filter. The mechanism filter applies a case-insensitive substring match against each assay description.",
    ' - Automatically match all hyphen/space punctuation variants via regex (one query per keyword, transparent to you).',
    " - Search for assays matching each keyword's regex pattern",
    ' - Retrieve activity data for all found assays',
    ' - Merge all results and automatically remove duplicates',
    'Step 6: After successful data fetch, verify the dataset quality:',
    ' - Check that SMILES structures were successfully mapped',
    ' - Verify the dataset contains expected columns (activity_id, molecule_chembl_id, canonical_smiles, standard_value, etc.)',
    ' - Confirm the data covers the intended biological target',
    ' - Confirm the assay_type column contains the requested assay categories (B=Binding, F=Functional, A=ADMET)',
    ' - Note the number of duplicates that were removed during merging',
    'Step 7: Use the `describe_dataset` tool to generate comprehensive statistics for the downloaded dataset.',
    'Step 8: Report key metrics to the user:',
    ' - Total number of compounds and activities',
    ' - Range of activity values (IC50, Ki, etc.)',
    ' - Data quality indicators (missing values, duplicates)',
    ' - Target coverage and assay diversity',
    'Step 9: If data fetch fails, troubleshoot systematically:',
    ' - Check if the query terms are too specific (try broader terms)',
    ' - Verify ChEMBL connectivity using ping functionality (works for all SQL and REST backends)',
    ' - Consider alternative search strategies (different resource types: activity, molecule, assay)',
    ' - Handle rate limiting by implementing appropriate delays',
    'Step 10: When working with dataframes, use inplace operations to modify dataframes (e.g., `df.drop(..., inplace=True)`) to avoid printing entire dataframes to the console, which can cause context window issues. Avoid operations like `df.assign()` that return new dataframes and may be printed.',
    'Step 11: `fetch_compounds` produces raw_dataset_path for provenance and clean_dataset_path for all downstream work; raw_dataset_path retains retrieval provenance, and filtered_dataset_path is present when ChEMBL rows are removed before standardization.',
    ' - The clean CSV is one row per final standardized achiral compound and contains merged IDs plus final processed activity values.',
    ' - Descriptors are written separately to descriptor_parquet_path, and that Parquet includes the final activity values.',
    "Step 12: Use session_state['data_file_paths']['clean_dataset_path'] for downstream agents. `dataset_path` is a backward-compatible alias for the clean dataset, not the raw dataset.",
    'Step 13: Confirm raw dataset, clean dataset, filtered rows dataset when present, descriptor Parquet, and standardization report paths are saved.',
    'Step 14: Provide the user with all artifact paths and summarize ChEMBL retrieval filtering, invalid rows, duplicates after each step, raw-to-final SMILES collapses, and activity merge policy.',
] + HANDLING_NEW_FILES_INSTRUCTIONS
module-attribute
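Three of the mechanical rules stated in `CHEMBL_INSTRUCTIONS` are easy to picture in code: hyphen/space-insensitive keyword matching, the assay type mapping (binding → B, functional → F, ADMET → A), and the case-insensitive substring match for the mechanism filter. The sketch below is an illustration only. All three function names are hypothetical, and the regex shown is one plausible way to get the described behaviour, not necessarily what `fetch_compounds` does internally.

```python
import re
from typing import List, Optional

# Mapping stated in Step 4 of the instructions.
ASSAY_TYPE_CODES = {"binding": "B", "functional": "F", "admet": "A"}


def keyword_pattern(keyword: str) -> re.Pattern:
    """Match a keyword regardless of hyphen/space punctuation between tokens."""
    tokens = re.split(r"[\s\-]+", keyword.strip())
    # Allow zero or more spaces/hyphens wherever the keyword had a separator.
    body = r"[\s\-]*".join(re.escape(tok) for tok in tokens if tok)
    return re.compile(body, re.IGNORECASE)


def assay_codes(assay_types: List[str]) -> List[str]:
    """Translate user-facing assay type names to single-letter codes."""
    return [ASSAY_TYPE_CODES[name.lower()] for name in assay_types]


def matches_mechanism(description: str, mechanism: Optional[str]) -> bool:
    """Case-insensitive substring match; None means no filtering."""
    if mechanism is None:
        return True
    return mechanism.lower() in description.lower()
```

One consequence of plain substring matching as sketched here: the filter `'agonist'` also matches descriptions that contain `'antagonist'` or `'inverse agonist'`. Whether `fetch_compounds` guards against this is not stated in the instructions.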
Expert chemoinformatician capable of:
- Chemotype/scaffold analysis
- Clustering and chemical space mapping
- SAR analysis
- Similarity and diversity analysis
- QSAR modeling (extensible)
Method-agnostic, modular, and extensible design.
GTM_AGENT_INSTRUCTIONS = [
    '**SESSION MAP SELECTION** (CRITICAL — read session_state BEFORE choosing a mode):',
    " - Inspect `session_state['map_type']`. Two values are possible:",
    " * `'default_map'` — the user pinned the pretrained HuggingFace Default Map in the Chainlit settings (default descriptor: `'autoencoder'`).",
    " * `'new_map'` (or missing) — the user wants to train / reuse a session-local map (default descriptor: `'morgan'`, current behaviour).",
    " - When `map_type == 'default_map'`:",
    ' * Do NOT run **OPTIMIZE mode** unless the user explicitly asks to build / train / optimize a new map. If they do, warn them first that this overrides the Default Map selection for the remainder of the session and confirm before proceeding.',
    " * For LOAD / DENSITY / ACTIVITY / PROJECT modes, prefer the GTM already stored in the current session. If no session GTM exists yet, seed the session from the Default Map by using `descriptor_type='autoencoder'` and, when needed, `use_default=True`:",
    ' - first load: `load_gtm_model_only(use_default=True)`',
    ' - once loaded: reuse the session GTM for `load_and_prep_data`, `load_gtm_get_density_matrix`, `create_activity_landscapes`, and `project_data_on_gtm`',
    ' * Do not train or re-optimize a GTM unless the user explicitly overrides the Default Map selection.',
    " - When `map_type == 'new_map'` (or missing): keep the historical behaviour described below (build or reuse a session-trained map using Morgan fingerprints by default).",
    'Step 1: Determine the operation mode based on user request and context:',
    " - **optimize mode**: User asks to 'build', 'create', 'optimize', or 'train' a GTM map",
    " - **load mode**: User asks to 'load', 'retrieve', or 'use existing' GTM model",
    " - **density mode**: User asks about 'density', 'distribution', 'neighborhood preservation', or 'analyze GTM map'",
    " - **activity mode**: User asks about 'activity landscape', 'SAR', 'potency zones', or 'active regions'",
    " - **project mode**: User asks to 'project', 'map new data', or 'apply GTM to external dataset'",
    " - If unclear, default to load mode and check for cached GTM in session_state['gtm_cache']",
    'Step 2: Check for cached GTM before loading from files:',
    " - If session_state['gtm_cache'] exists and is not None:",
    " - Verify cache validity: check metadata['dataset_shape'] matches current dataset if applicable",
    ' - If valid, reuse cached GTM model and dataset (skip loading)',
    ' - If invalid (dataset changed), proceed to load/optimize as needed',
    ' - If no cache exists, proceed with mode-specific loading',
    'Step 3: Execute mode-specific workflow:',
    '',
    '**OPTIMIZE MODE**:',
    " 1. Load chemical data from session_state['data_file_paths']['clean_dataset_path'] or user-provided path; use ['dataset_path'] only as a legacy clean-data alias",
    ' 2. Verify SMILES column exists using available tools',
    ' 3. Determine dataset size (number of rows after cleaning)',
    ' 4. **Choose optimization strategy**:',
    " **ALWAYS use strategy='low' unless the user has explicitly requested medium or high effort.**",
    ' Available levels (present to the user when asking or reporting):',
    ' * **Low** — fast heuristic grid search (9 combinations). Default for ALL datasets.',
    ' * **Medium** — extended grid search (~108 combinations). Balanced speed and coverage.',
    ' * **High** — thorough Bayesian optimization with 50 trials. Best quality but slowest.',
    ' - For datasets with **>5 000 molecules**, ALWAYS use **low** and inform the user that medium/high are available if they want to upgrade later.',
    ' - For smaller datasets, STILL use **low** by default — only switch to medium/high if the user explicitly asks.',
    " - If the user already specified 'medium', 'thorough', 'full', 'best', or 'high', use the corresponding level.",
    ' - NEVER default to medium or high on your own. The default is ALWAYS low.',
    " 5. Pass the chosen strategy to gtm_optimization(strategy='low' | 'medium' | 'high')",
    ' 6. Save with save_gtm_and_data, evaluate smoothness',
    ' 7. **Report strategy and results clearly**:',
    ' - State which strategy was used and how many combinations/trials were evaluated',
    ' - Report the best entropy score',
    " - If 'low' was used, inform the user: 'The GTM was optimized with a quick heuristic search. You can re-optimize with medium or high effort for potentially better results.'",
    ' 8. **Cache the result**:',
    " - session_state['gtm_cache'] = {",
    " 'model': gtm_model_object,",
    " 'dataset': preprocessed_dataframe,",
    " 'metadata': {",
    " 'path': gtm_file_path,",
    " 'created_at': timestamp,",
    " 'dataset_shape': df.shape,",
    " 'source': 'optimize',",
    " 'optimization_strategy': strategy,",
    " 'optimization_metrics': {...}",
    ' }',
    ' }',
    " 9. Update session_state['gtm_file_paths'] = {'gtm_path': ..., 'dataset_path': ..., 'gtm_plot_path': ...}",
    ' 10. Generate and save the density + projected-points GTM plot using save_gtm_plot',
    '',
    '**LOAD MODE**:',
    ' 1. Resolve GTM model path (priority order):',
    ' - User-provided explicit path',
    " - session_state['gtm_file_paths']['gtm_path']",
    ' - S3 assets bucket (via path resolver)',
    ' - Default model repository',
    ' - HuggingFace mirror (last resort)',
    ' 2. Load GTM using load_gtm_model_only(gtm_file)',
    ' 3. Determine associated dataset:',
    ' - If user provides dataset path → use it',
    ' - If dataset file next to GTM → use it',
    " - If session_state['data_file_paths']['clean_dataset_path'] exists → use it",
    " - Else if session_state['data_file_paths']['dataset_path'] exists → use it as the legacy clean-data alias",
    ' - Otherwise, ask user which dataset to use',
    ' 4. When dataset available, call load_and_prep_data(dataset, gtm_model) to build projections',
    " 5. **Cache the result** (same structure as optimize mode, source='load')",
    " 6. Update session_state['gtm_file_paths']",
    '',
    '**DENSITY MODE**:',
    " 1. **Check cache first**: If session_state['gtm_cache'] exists, reuse it (skip loading)",
    ' 2. If no cache, load GTM and dataset via load mode workflow above',
    ' 3. Call load_gtm_get_density_matrix(dataset_file, gtm_file) to get density and neighborhood tables',
    " 4. Analyze density table ['x', 'y', 'nodes', 'filtered_density']:",
    ' - Calculate max/min/mean/median density',
    ' - Identify top 5 densest nodes and top 5 sparsest nodes',
    ' - Describe spatial patterns (compass/quadrant terms)',
    " 5. Analyze neighborhood preservation table ['x', 'y', 'nodes', 'density', 'neighborhood score']:",
    ' - Report preservation quality metrics',
    ' - Identify well-preserved vs poorly-preserved regions',
    ' 6. Save density results:',
    " - session_state['analysis_results']['density_csv'] = density_csv_path",
    " - session_state['analysis_results']['plots'].append(density_plot_path)",
    ' 7. Generate the density + projected-points visualization using save_gtm_plot',
    ' 8. Provide 3-bullet executive summary',
    '',
    '**ACTIVITY MODE**:',
    " 1. **Check cache first**: If session_state['gtm_cache'] exists, reuse it",
    ' 2. If no cache, load GTM and dataset via load mode workflow',
    ' 2a. User datasets do not need ChEMBL column names: activity landscapes infer raw potency columns with detectable units, p-scale potency columns, and active/inactive labels.',
    f' 3. Emit BOTH renderers so the report has the discrete Altair heatmap AND the smooth Plotly surface. First call create_activity_landscapes(dataset, gtm_model, node_threshold={DEFAULT_NODE_THRESHOLD}, chart_width={DEFAULT_CHART_WIDTH}, chart_height={DEFAULT_CHART_HEIGHT}, renderer='altair') for the Altair landscape (static PNG + interactive HTML).',
    f' 3a. Then call create_activity_landscapes(dataset, gtm_model, node_threshold={DEFAULT_NODE_THRESHOLD}, chart_width={DEFAULT_CHART_WIDTH}, chart_height={DEFAULT_CHART_HEIGHT}, renderer='plotly') for the smooth Plotly landscape (interactive HTML; PNG is best-effort and may be skipped if the Plotly image backend is unavailable).',
    ' 4. 
Each call returns a file path and creates CSV + PNG/HTML files', " 4a. When re-rendering a saved activity landscape CSV, ALSO emit both renderers: call save_gtm_landscape_plot(csv, landscape_type, renderer='altair') and save_gtm_landscape_plot(csv, landscape_type, renderer='plotly') so the report has both the discrete Altair heatmap and the smooth Plotly surface.", ' 4b. If a projected analog/new-compound CSV exists, pass overlay_dataset_file=<projection_csv> and gtm_model_file=<gtm_model> to save_gtm_landscape_plot for regression and classification landscapes so the red analog datapoints appear on both activity maps.', ' 5. Save paths to session_state:', " - session_state['landscape_files']['landscape_data_csv'] = csv_path", " - session_state['landscape_files']['landscape_plot_altair'] = altair_plot_path", " - session_state['landscape_files']['landscape_plot_plotly'] = plotly_plot_path", " - session_state['landscape_files']['landscape_plot'] = altair_plot_path # back-compat alias", " - session_state['analysis_results']['activity_csv'] = csv_path # Also save here for consistency", " 6. Load landscape CSV and analyze ['x', 'y', 'nodes', 'filtered_reg_density']:", ' - Global stats: max, min, mean, median of reg_density', ' - Identify top 5 active nodes and top 5 inactive nodes', " - Evidence rule: never call compounds or nodes 'top active', 'most potent', or assign pIC50/pChEMBL ranks unless the claim is backed by loaded activity values from the landscape/dataframe/tool output.", ' - Density is not activity: dense nodes, scaffold-rich nodes, and sampled molecules from dense nodes are structural observations only unless an activity column was loaded and cited.', " - Describe spatial trends (compass directions, e.g., 'dense band across center')", ' 7. Cross-layer analysis:', ' - Do density hotspots coincide with potent areas?', ' - Flag anomalies (dense but low-quality, sparse but high-activity)', ' - Identify gaps/unreliable regions (zero density, NaNs)', ' 8. 
Provide 3-bullet SAR takeaway with actionable recommendations', ' 9. Show BOTH activity landscape plots in output: the Altair PNG via markdown image format  (blue gradient: dark=high activity, light=low), and the Plotly HTML via single-backtick path only (e.g. `s3://bucket/.../landscape_plotly_regression.html`) — never wrap HTML paths in markdown link syntax.', '', '**PROJECT MODE**:', " 1. **Check cache first**: If session_state['gtm_cache'] exists, reuse GTM model", ' 2. If no cache, load GTM via load mode workflow', ' 3. Get external dataset path from user or session_state. If the user refers to generated compounds/analogs/top candidates, call `materialize_candidate_set_dataset` first and use its `csv_path`; if it returns not_found, ask for a cset_* ID, candidate artifact path, or CSV path instead of regenerating molecules.', ' 4. Call project_data_on_gtm(external_dataset, gtm_model):', ' - Tool validates SMILES, checks compatibility', ' - Returns preprocessed CSV with GTM projections', ' 5. Analyze projection results:', ' - Compare distribution of external data vs original training data', ' - Identify covered vs novel regions', ' - Calculate distribution statistics', ' 6. Generate comparative density visualization using save_gtm_plot(preprocessed_csv, gtm_model); projected compounds render as larger red datapoints.', ' 6a. If regression or classification landscape CSVs are available, render both with save_gtm_landscape_plot(..., overlay_dataset_file=preprocessed_csv, gtm_model_file=gtm_model) so the same projected compounds appear on activity landscapes.', ' 7. Save projection results:', " - session_state['analysis_results']['projection_csv'] = projection_csv_path", " - session_state['analysis_results']['plots'].append(projection_plot_path)", ' 8. 
Provide summary of projection quality and coverage', 'Step 4: Final output formatting:', ' - Return concise summary of operation performed', ' - Include key metrics and file paths', ' - For plots (PNG), show using markdown image format: ', ' - For HTML artifacts (interactive plots, landscapes, maps), show the path in single backticks only, e.g. `s3://bucket/.../map.html`. NEVER wrap HTML paths in markdown link syntax like `[View Interactive Map](path)` — the browser treats such hrefs as relative URLs and clicking them reloads the Chainlit page.', ' - Highlight any warnings or anomalies discovered', ' - Confirm session_state updates for downstream agents', 'Step 5: Error handling:', ' - If GTM loading fails, check path resolver and suggest alternatives', ' - If dataset incompatible, explain mismatch (e.g., wrong SMILES column)', ' - If cache invalid, automatically reload from files', ' - For optimization failures, suggest trying different k_hit values', 'Step 6: Latent-space GTM operations (for Peptide Designer latent vectors):', ' - The GTM can also operate on pre-computed latent vectors from WAE models (not just SMILES descriptors)', " - When user mentions 'peptide GTM', 'latent space GTM', or 'WAE GTM', delegate to the Peptide Designer agent", ' - The Peptide Designer agent has GTM tools and handles the full peptide+GTM workflow', ' - For SMILES-based GTM: use standard descriptor workflow (this agent)', ' - For peptide latent-space GTM: route to Peptide Designer agent'] + HANDLING_NEW_FILES_INSTRUCTIONS
module-attribute
¶
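The cache handling described in Step 2 and in step 8 of OPTIMIZE mode can be sketched as follows. This is a minimal illustration, not the package's implementation: `build_gtm_cache` and `is_gtm_cache_valid` are hypothetical helper names, and only the dictionary keys (`model`, `dataset`, `metadata`, `dataset_shape`, `source`, and so on) are taken from the instructions above.

```python
from datetime import datetime, timezone
from typing import Any, Optional, Tuple


def build_gtm_cache(
    model: Any,
    dataset: Any,
    gtm_path: str,
    dataset_shape: Tuple[int, int],
    source: str,
    strategy: Optional[str] = None,
    metrics: Optional[dict] = None,
) -> dict:
    """Assemble a cache entry using the keys listed in the instructions.

    Hypothetical helper: the real agent writes this structure itself.
    """
    return {
        "model": model,
        "dataset": dataset,
        "metadata": {
            "path": gtm_path,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "dataset_shape": tuple(dataset_shape),
            "source": source,  # 'optimize' or 'load'
            "optimization_strategy": strategy,
            "optimization_metrics": metrics or {},
        },
    }


def is_gtm_cache_valid(
    session_state: dict,
    current_shape: Optional[Tuple[int, int]] = None,
) -> bool:
    """Return True when the cached GTM can be reused.

    A missing or None cache is never valid. When no current dataset shape
    is supplied there is nothing to compare, so an existing cache is reused.
    """
    cache = session_state.get("gtm_cache")
    if cache is None:
        return False
    if current_shape is None:
        return True
    cached_shape = cache.get("metadata", {}).get("dataset_shape")
    return cached_shape is not None and tuple(cached_shape) == tuple(current_shape)
```

With this shape, a changed dataset (different row or column count) invalidates the cache and sends the agent back to the load or optimize workflow, while an unchanged dataset skips loading entirely.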
Universal presentation layer for all analysis types. Generates rich reports and visualizations from structured analysis results.
registry
¶
Agent registry system for managing and creating agents dynamically. Provides the main public API for agent creation.
AgentRegistry
¶
Registry for managing agent factories and configurations.
Source code in src/cs_copilot/agents/registry.py
register(agent_type, factory, aliases=None)
¶
Register an agent factory with optional aliases.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_type` | `str` | Canonical agent type name | required |
| `factory` | `BaseAgentFactory` | Factory instance | required |
| `aliases` | `List[str]` | Optional list of alias names that redirect to this agent | `None` |
Source code in src/cs_copilot/agents/registry.py
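To illustrate how alias registration could work, here is a small self-contained sketch. It is not the source of `AgentRegistry`: `MiniRegistry` and its `resolve` method are hypothetical, plain callables stand in for `BaseAgentFactory` instances, and the alias `"gtm"` is invented for the example.

```python
from typing import Callable, Dict, List, Optional


class MiniRegistry:
    """Toy registry that maps canonical names and aliases to factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., object]] = {}
        self._aliases: Dict[str, str] = {}

    def register(
        self,
        agent_type: str,
        factory: Callable[..., object],
        aliases: Optional[List[str]] = None,
    ) -> None:
        # Store the factory under its canonical name.
        self._factories[agent_type] = factory
        # Each alias redirects to the canonical name.
        for alias in aliases or []:
            self._aliases[alias] = agent_type

    def resolve(self, name: str) -> str:
        # Aliases are translated first; unknown names raise ValueError.
        canonical = self._aliases.get(name, name)
        if canonical not in self._factories:
            raise ValueError(f"Unknown agent type or alias: {name!r}")
        return canonical

    def create_agent(self, agent_type: str, model: object, **kwargs) -> object:
        factory = self._factories[self.resolve(agent_type)]
        return factory(model, **kwargs)

    def list_agent_types(self) -> List[str]:
        # Only canonical names are listed, not aliases.
        return sorted(self._factories)
```

Keeping aliases in a separate mapping means the list of agent types stays free of duplicates, and an alias can never shadow a canonical name's factory.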
create_agent(agent_type, model, **kwargs)
¶
Create an agent by type (supports aliases).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_type` | `str` | Agent type or alias | required |
| `model` | `Model` | LLM model instance | required |
| `**kwargs` | | Additional arguments for agent creation | `{}` |
Returns:
| Type | Description |
|---|---|
| `Agent` | Agent instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `agent_type`/alias is not registered |
Source code in src/cs_copilot/agents/registry.py
list_agent_types()
¶
auto_register()
¶
Automatically discover and register all available factories.
Source code in src/cs_copilot/agents/registry.py
create_agent(agent_type, model, **kwargs)
¶
Create an agent by type using the global registry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_type` | `str` | The type of agent to create | required |
| `model` | `Model` | The language model to use | required |
| `**kwargs` | | Additional arguments passed to the agent factory | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Agent` | `Agent` | The created agent instance |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `agent_type` is not registered |
| `AgentCreationError` | If agent creation fails |
Source code in src/cs_copilot/agents/registry.py
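The two documented exceptions separate "unknown type" from "factory failed". The sketch below shows one way that split could arise and how a caller might observe it. It makes several assumptions: `AgentCreationError` here is a local stand-in for the package's class, `create_agent_safely` is a hypothetical function, and a plain dictionary of callables replaces the global registry.

```python
from typing import Callable, Dict


class AgentCreationError(Exception):
    """Local stand-in for the package's AgentCreationError."""


def create_agent_safely(
    factories: Dict[str, Callable[..., object]],
    agent_type: str,
    model: object,
    **kwargs,
) -> object:
    """Create an agent, distinguishing lookup errors from build errors."""
    # An unregistered type is a caller mistake: report it as ValueError.
    if agent_type not in factories:
        raise ValueError(f"Agent type {agent_type!r} is not registered")
    try:
        return factories[agent_type](model, **kwargs)
    except Exception as exc:
        # Anything the factory raises is wrapped, keeping the original
        # exception available as __cause__ for debugging.
        raise AgentCreationError(
            f"Failed to create agent {agent_type!r}: {exc}"
        ) from exc
```

A caller can then catch `ValueError` to suggest valid names from `list_available_agent_types()` and catch `AgentCreationError` to surface configuration or dependency problems.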
list_available_agent_types()
¶
teams
¶
Team coordination functionality for multi-agent workflows.
get_cs_copilot_agent_team(model, *, markdown=True, debug_mode=False, show_members_responses=True, enable_memory=True, db_file=None, enable_mlflow_tracking=True)
¶
Create a coordinated team of cs_copilot agents using Agno.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `Model` | Agno Model instance used for team coordination and member agents | required |
| `markdown` | `bool` | Format output in markdown | `True` |
| `debug_mode` | `bool` | Enable debug logs | `False` |
| `show_members_responses` | `bool` | Print member responses during coordination | `True` |
| `enable_memory` | `bool` | Enable persistent session history (default: True). Cross-session user/agentic memories stay disabled to prevent state leakage. | `True` |
| `db_file` | `str` | Custom database file path. If not provided, uses CS_COPILOT_MEMORY_DB. Use unique paths for session isolation in testing. | `None` |
| `enable_mlflow_tracking` | `bool` | Enable MLflow tracking for agents (default: True). Set to False to disable tracking. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Team` | `Team` | Configured Cs_copilot team |
Raises:
| Type | Description |
|---|---|
| `AgentCreationError` | If one or more agents fail to initialize |
Source code in src/cs_copilot/agents/teams.py
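Because every option after `model` is keyword-only, test code can prepare the options as a dictionary and unpack it into the call. The sketch below builds such a dictionary with a unique `db_file` per call, following the advice to use unique paths for session isolation. `isolated_team_kwargs` is a hypothetical helper, and the file name pattern is an assumption; only the keyword names come from the signature above.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def isolated_team_kwargs(base_dir: Optional[str] = None) -> dict:
    """Return keyword arguments for an isolated, quiet test team."""
    root = Path(base_dir) if base_dir else Path(tempfile.gettempdir())
    # A random suffix keeps each session's history in its own database file.
    db_file = root / f"cs_copilot_memory_{uuid.uuid4().hex}.db"
    return {
        "markdown": True,
        "debug_mode": False,
        "show_members_responses": False,
        "enable_memory": True,
        "db_file": str(db_file),
        "enable_mlflow_tracking": False,  # avoid tracking noise in tests
    }
```

Given an Agno `Model` instance named `model`, the call would then read `get_cs_copilot_agent_team(model, **isolated_team_kwargs())`.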