BioChirp Help & Documentation
BioChirp lets you ask plain-language questions about drugs, genes, diseases, pathways and biomarkers and receive back full, structured tables drawn directly from established biomedical repositories. Unlike chat-based tools that may drop rows, ignore alternative spellings or give different answers each time, BioChirp hands the actual data-fetching work to rule-based algorithms while reserving AI only for understanding what you asked. This guide walks you through every aspect of the platform.
What BioChirp does
Pulls relationships between drugs, targets, genes, diseases, pathways and biomarkers out of four well-known reference databases and hands you every matching record as a downloadable table—nothing skipped, nothing summarised away.
Who benefits most
Bench scientists, bioinformaticians, pharmacologists and graduate students who want an exhaustive view of known associations rather than the handful of examples a chatbot might recite from memory. If your downstream pipeline depends on a complete catalogue of associations, BioChirp is built for that use case.
Safety notice
This platform is intended for research use only. It is not certified as a medical device and should never replace expert clinical judgement. Always verify important findings against original sources before acting on them.
Guiding philosophy
AI understands what you want; fixed-logic code fetches the data. Once your question has been translated into database-ready terms, every subsequent step—table selection, joining, filtering, deduplication—runs without any neural-network involvement. Asking the same question twice therefore returns the same rows, even though the plain-English explanation on top may read slightly differently each time.
2. How BioChirp works
The platform is organised into four successive stages. The first two involve AI; the last two do not. That boundary is deliberate and is the main reason outputs stay stable across repeated runs.
A. Gateway
Your question reaches BioChirp over a combined HTTP / WebSocket connection. At this point the system only manages routing and streams live progress indicators back to your browser—nothing about the biology is touched yet.
B. Interpretation (AI-assisted)
Four lightweight AI models from separate vendors read your question simultaneously. Each proposes an interpretation; a fifth “judge” model picks the best one or merges the strongest parts. Running several models in parallel guards against single-vendor quirks such as expanding an abbreviation incorrectly or misclassifying a gene name as a disease.
The winning interpretation is a structured record listing which biomedical fields your question mentions—for instance, drug: (return all) and disease: tuberculosis—together with any scope or filter constraints you specified. Abbreviations are expanded (TB → tuberculosis), and the system decides whether your question falls inside or outside the scope of the connected databases.
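As a rough illustration, the winning interpretation for "drugs for TB" could be pictured as a record like the one below. The field and status names here are hypothetical stand-ins; the real internal schema is not part of this guide.

```python
# Hypothetical sketch of the interpreter's structured output for
# "drugs for TB". Slot and status names are illustrative only.
interpretation = {
    "drug_name":       {"status": "return_all"},  # return every matching drug
    "disease_name":    {"status": "constrained", "value": "tuberculosis"},  # TB expanded
    "target_name":     {"status": "not_relevant"},
    "pathway_name":    {"status": "not_relevant"},
    "biomarker_name":  {"status": "not_relevant"},
    "approval_status": {"status": "not_relevant"},
    "in_scope": True,  # the question falls inside the connected databases
}
```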
C. Data retrieval (fully rule-based)
BioChirp talks to its data sources in two ways. Open Targets is reached through its public GraphQL endpoint at query time, with automatic pagination so no records are left behind. CTD, HCDT and TTD live as pre-processed local files. For these offline sources a schema-graph planner figures out which tables need to be joined by modelling every table as a node and every declared foreign-key link as an edge, then walking the shortest path between the tables your question needs.
Because this planner follows strict graph-traversal rules rather than generating SQL on the fly, the same input always triggers the same table selections, the same join order and the same output columns. No neural network is consulted at any point during this stage.
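The "no records left behind" pagination contract can be sketched as a simple loop: keep requesting pages until one comes back short. The stub below stands in for a single GraphQL request; it is a minimal sketch, not BioChirp's actual client code.

```python
def fetch_all(fetch_page, page_size=50):
    """Collect every record from a paginated endpoint.

    fetch_page(index, size) is a stand-in for one API request;
    the loop stops only when a page comes back short, so no
    records are skipped.
    """
    rows, index = [], 0
    while True:
        page = fetch_page(index, page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        index += 1

# Stub "endpoint" holding 125 records, in place of the live API.
data = list(range(125))
result = fetch_all(lambda i, n: data[i * n:(i + 1) * n])
assert result == data  # every record collected, in order
```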
D. Summary & delivery
An AI model writes a short human-friendly narrative from a snapshot of the result table (by default the first 50 rows; you can change this). The narrative sits above the table for convenience, but the downloadable CSV underneath is the authoritative output. Nothing in this last step can alter, reorder or remove rows from the structured data.
Schema-graph planning in detail
Every offline database (CTD, HCDT, TTD) is stored as a collection of “node tables” (one per entity type: drug, gene, disease, pathway, biomarker) and “edge tables” (one per relationship type, e.g. drug–gene). These tables and their foreign-key links form an undirected graph. When your question involves concepts that span more than one table, the planner runs a shortest-path search on this graph to find the fewest tables that connect everything together, then chains them into a fixed join tree.
Take the question “drugs for TB.” The system maps “drug” and “disease” to their respective node tables, discovers they are linked through a drug–disease edge table, and builds a three-table join tree. Disease rows are filtered to tuberculosis and its known sub-types before the join runs, keeping intermediate result sets small. Only joins that pass a foreign-key validation check are executed; anything that would produce an unconstrained cross-product is rejected outright.
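The shortest-path step can be sketched as a breadth-first search over a toy schema graph. The table names below are illustrative, not BioChirp's real schema; the point is that the fewest-tables path for "drugs for TB" is found deterministically.

```python
from collections import deque

# Toy schema graph: node tables plus the edge tables linking them.
# Names are illustrative, not BioChirp's actual table layout.
SCHEMA = {
    "drug":         ["drug_disease", "drug_gene"],
    "disease":      ["drug_disease", "gene_disease"],
    "gene":         ["drug_gene", "gene_disease"],
    "drug_disease": ["drug", "disease"],
    "drug_gene":    ["drug", "gene"],
    "gene_disease": ["gene", "disease"],
}

def join_path(start, goal):
    """Breadth-first search: fewest tables connecting start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in SCHEMA[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# "drugs for TB" -> the three-table join tree described above.
assert join_path("drug", "disease") == ["drug", "drug_disease", "disease"]
```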
3. Data sources
BioChirp currently draws on four widely used biomedical repositories. One is consulted live over the internet; the other three ship as pre-processed local copies.
Open Targets (live)
A large-scale genetics and genomics resource for drug-target identification. BioChirp queries it in real time through its GraphQL interface, paginating until every matching record has been collected.
For disease queries the system also walks the disease ontology hierarchy, pulling in child terms automatically. A search for a broad condition therefore picks up associations catalogued under specific sub-types as well.
CTD (offline)
The Comparative Toxicogenomics Database catalogues how chemicals interact with genes and how those interactions relate to diseases. BioChirp stores its chemical–gene, chemical–disease, gene–disease, gene–pathway and disease–pathway relationships locally.
HCDT (offline)
The Highly Confident Drug–Target Database focuses on experimentally verified compound–target pairs. It covers drug–gene, drug–disease, drug–pathway and pathway–gene links, each backed by direct experimental evidence.
TTD (offline)
The Therapeutic Target Database provides target druggability profiles, drug–target–disease mappings, mechanism-of-action annotations, regulatory approval statuses and biomarker–disease links.
Entity types you can query
Your question can reference any combination of the following fields: drug, target/gene, disease, pathway, biomarker and approval status. BioChirp will map them onto the right tables automatically.
When your question falls outside these databases
If BioChirp determines that none of its connected repositories can answer your question, it will redirect to a targeted biomedical web search. Results from this fallback route are always clearly marked so you can distinguish them from curated database outputs.
4. Getting started
You can go from a blank chat window to a complete, downloadable results table in under a minute.
- Open the chat from the main navigation bar.
- Type a biomedical question that mentions at least one entity type (drug, gene, disease, pathway or biomarker) together with the relationship you want to explore. Adding a specific constraint—a disease name, an approval status, a mechanism—helps the system narrow down results.
- Press send. Progress indicators will light up as BioChirp moves through interpretation, name matching, data retrieval and summarisation.
- Inspect the output. You will see a short narrative explaining what the system did, followed by a preview of the association table. Expand the collapsible tool-level panels to see exactly what happened at each step.
- Download the CSV. The full, untruncated table is available for download so you can feed it into your own scripts, enrichment tools or visualisation software.
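Once downloaded, the CSV loads with standard tooling. A minimal sketch using Python's standard library follows; the filename and the sample rows are illustrative, though the column headers match the examples shown later in this guide.

```python
import csv
import io

# Illustrative stand-in for a downloaded BioChirp CSV; in practice
# you would use open("biochirp_results.csv") instead of StringIO.
csv_text = """drug_id,drug_name,disease_id,disease_name
D001,isoniazid,MESH:D014376,tuberculosis
D002,rifampicin,MESH:D014376,tuberculosis
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
drug_names = [r["drug_name"] for r in rows]
assert drug_names == ["isoniazid", "rifampicin"]
```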
Good starting questions
- Which drugs are prescribed for tuberculosis?
- List the genes linked to chronic myeloid leukemia
- What diseases does aspirin treat?
- Show me pathways involving TP53
- Which compounds act on EGFR?
- What molecular targets does imatinib bind?
Questions that may not work well
- Purely clinical scenarios with no specific drug, gene or disease name (“How should I manage this patient?”)
- Requests for molecular formats (SMILES, InChI), 3-D structures or sequence data not held in the supported databases
- Clinical trial logistics, regulatory filing questions, or insurance coverage queries
- Anything that does not map onto at least one of the supported fields listed above
5. Writing effective queries
The clearer you are about which entity type you want back and which entity you are constraining on, the sharper your results will be.
Be direct about what you want
Phrases like “list drugs”, “find genes”, “show pathways” or “which diseases” tell the system what the output columns should be.
- List drug–target pairs for HER2-positive breast cancer
- Show pathways that TNF participates in
- Which diseases have been linked to BRCA1?
Abbreviations, brand names & aliases are fine
The name-matching engine is designed to handle everyday biomedical shorthand. All of the following resolve correctly:
- TB → tuberculosis (plus subtypes such as pulmonary TB, meningeal TB)
- CML → chronic myeloid leukemia / chronic myelogenous leukemia
- Vazalore → aspirin (brand-to-generic mapping)
- ERBB1 → EGFR (gene alias lookup)
Behind the scenes: the multi-model interpreter
Your free-text question is parsed into a structured record with slots for drug_name, target_name, disease_name, pathway_name, biomarker_name and approval_status. Each slot gets one of three labels: “not relevant,” “return all matches” or “constrained to a specific term.”
When a question does not map onto any database field, the system reroutes it to a web-based literature lookup rather than attempting an incomplete database search.
6. Reading BioChirp outputs
Every answer has two layers: a short narrative on top and a structured table below. The table is the definitive result; the narrative is a convenience aid produced by an AI summariser.
Narrative section
- Tells you how BioChirp understood your question, which entities were recognised and which repositories were consulted.
- Reports the total row count and flags any deduplication that was performed.
- Highlights known gaps or caveats rather than glossing over them.
- May include direct links to the originating database pages.
Association table
- Fixed column headers (e.g. drug_id, drug_name, disease_id, disease_name, indication_name).
- A scrollable preview appears in the chat; the full dataset is available via the CSV download button.
- Row counts and wall-clock retrieval times are displayed for transparency.
Step-by-step transparency
Each internal operation (interpretation, name matching, database query) is surfaced as a collapsible panel showing elapsed time and a digest of what went in and came out. If you ever need to debug or audit a result, these panels let you trace the entire journey from raw question to final table.
Data provenance
Every row in the output carries the canonical identifier assigned by its source database. The summariser cannot add, remove or reorder rows; the CSV you download is always the unmodified output of the rule-based pipeline.
7. How name matching works
A drug may be listed under its brand name in one database and its generic name in another. A gene can have half a dozen aliases. A disease might appear under an ICD code, an ontology term or a colloquial name. Getting all of these to resolve to the right database entries is critical for returning a complete result set. BioChirp tackles this with three parallel strategies.
Approximate string matching
Catches typos, punctuation differences, reordered tokens and minor spelling variants. Four different string-similarity measures run in parallel, and any candidate scoring above 90 (on a 0–100 scale) is kept for further vetting.
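The documentation does not name the four measures, so as one illustrative example only, here is how a single similarity measure from Python's standard library could apply the 90-point cutoff described above.

```python
from difflib import SequenceMatcher

def score(a, b):
    """One illustrative string-similarity measure on the 0-100
    scale; BioChirp's actual four measures are not specified here."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = ["aspirin", "asparagine", "aspartame"]
query = "asprin"  # typo for "aspirin"
kept = [c for c in candidates if score(query, c) > 90]
assert kept == ["aspirin"]  # the typo still resolves; near-misses fall below 90
```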
Embedding-based search
Three biomedical text-embedding models encode your query term and compare it against pre-indexed vocabularies. A data-driven “knee point” in the similarity curve sets the cutoff for each query, avoiding a one-size-fits-all threshold that would be too strict for some terms and too lax for others.
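One simple way to picture a data-driven knee point is to cut at the largest drop in the sorted similarity curve. This is a stand-in heuristic, assumed here for illustration; the platform's actual knee-detection method may differ.

```python
def knee_cutoff(scores):
    """Place a per-query threshold at the largest drop ("knee")
    in the sorted similarity curve. Illustrative stand-in for the
    knee-point method described above."""
    s = sorted(scores, reverse=True)
    drops = [s[i] - s[i + 1] for i in range(len(s) - 1)]
    knee = drops.index(max(drops))
    return s[knee + 1]  # candidates strictly above this are kept

sims = [0.93, 0.91, 0.90, 0.62, 0.58]
cut = knee_cutoff(sims)
kept = [x for x in sims if x > cut]
assert kept == [0.93, 0.91, 0.90]  # the cutoff adapts to this query's curve
```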
Curated alias & ontology lookup
Draws on established nomenclature services (PubChem, ChEMBL, HGNC, UniProt, Disease Ontology, EBI OLS and others) to pull in known aliases, trade names and ontology children. Gene-family queries are expanded to their individual members so that a family-level term returns all constituent genes in the database.
Dual-judge false-positive filter
Because the three strategies above are tuned for high recall, they inevitably propose some wrong matches (e.g. “nitroaspirin” for an “aspirin” query). To weed these out, two independent AI evaluators from different vendors score every candidate in parallel. Only names approved by both pass through immediately; contested names go to a third tie-breaking model. This voting scheme keeps precision high without sacrificing the broad recall of the upstream strategies.
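The voting scheme reduces to a small piece of control flow. In this sketch the judges are plain predicates standing in for the AI evaluators; the stub rules for the "aspirin" query are invented for illustration.

```python
def dual_judge(candidate, judge_a, judge_b, tiebreaker):
    """Voting scheme described above: unanimous approval passes,
    unanimous rejection fails, disagreement goes to a third model."""
    a, b = judge_a(candidate), judge_b(candidate)
    if a and b:
        return True
    if not a and not b:
        return False
    return tiebreaker(candidate)

# Stub judges for an "aspirin" query (invented rules, for illustration).
judge_a = lambda name: name != "nitroaspirin"
judge_b = lambda name: "aspirin" in name          # deliberately too permissive
tiebreak = lambda name: name == "aspirin"

assert dual_judge("aspirin", judge_a, judge_b, tiebreak) is True
assert dual_judge("nitroaspirin", judge_a, judge_b, tiebreak) is False  # contested, tiebroken
```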
Open Targets–specific behaviour
When querying Open Targets, entity names are resolved against that platform’s own search index, and the top-scoring hit is selected in a fixed, repeatable way. Disease queries additionally traverse the Experimental Factor Ontology (EFO) hierarchy downward, automatically pulling in associations stored under narrower child terms. A search for “tuberculosis,” for example, also retrieves records filed under pulmonary tuberculosis, meningeal tuberculosis and other sub-types.
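The downward ontology traversal amounts to collecting a term together with all of its descendants. The fragment below uses a toy parent-to-children map; the terms shown are from the tuberculosis example above, but the structure is illustrative rather than real EFO data.

```python
# Toy fragment of a disease ontology (illustrative, not real EFO data).
CHILDREN = {
    "tuberculosis": ["pulmonary tuberculosis", "meningeal tuberculosis"],
    "pulmonary tuberculosis": ["miliary tuberculosis"],
}

def expand(term):
    """Collect a term plus all of its ontology descendants."""
    terms = [term]
    for child in CHILDREN.get(term, []):
        terms.extend(expand(child))
    return terms

# A broad search also retrieves records filed under narrower sub-types.
assert expand("tuberculosis") == [
    "tuberculosis",
    "pulmonary tuberculosis",
    "miliary tuberculosis",
    "meningeal tuberculosis",
]
```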
8. Consistency & coverage
Returning the same answer to the same question is non-negotiable for any tool that feeds into a research pipeline. Here is what you can expect from BioChirp.
Fully stable across runs
- Schema-graph planning, table selection and join order
- Row-level filtering and deduplication logic
- Open Targets API pagination, sorting and ID selection
- Column names, row content and CSV file layout
May show minor variation
- Wording of the plain-English summary (written by an AI model)
- Occasionally, the embedding-based name matcher may surface a slightly different set of candidates; the dual-judge filter dampens this effect
- Web-fallback content for out-of-scope questions
How BioChirp compares on coverage
In internal benchmarking on 70 diverse questions—each posed five times to nine different systems—BioChirp’s structured back-ends achieved perfect run-to-run agreement (Jaccard index = 1.0), while general-purpose chat models fluctuated considerably. The Open Targets connector alone returned a median of roughly 580 associations per question; comparable chat-model answers typically contained fewer than 10 entries. Curated databases captured almost everything the chat models produced, but the reverse was far from true.
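For reference, the Jaccard index used in that benchmark measures run-to-run agreement as intersection over union of two result sets; a value of 1.0 means the runs returned exactly the same records.

```python
def jaccard(a, b):
    """Jaccard index between two result sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

run1 = {"isoniazid", "rifampicin", "ethambutol"}
run2 = {"isoniazid", "rifampicin", "ethambutol"}
assert jaccard(run1, run2) == 1.0            # perfect run-to-run agreement
assert jaccard(run1, {"isoniazid"}) == 1 / 3  # partial overlap scores lower
```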
9. Boundaries & known limitations
What BioChirp can do
- Retrieve drug–target, drug–disease, gene–disease, gene–pathway, disease–pathway and biomarker–disease associations across four databases
- Return mechanism-of-action labels and approval statuses (from TTD)
- Expand disease queries to include ontology child terms (via Open Targets)
- Resolve abbreviations, brand names, gene aliases and close misspellings
- Break down gene-family names into their constituent individual genes
What it cannot (yet) do
- Results are only as comprehensive as the underlying repositories; very new indications or experimental compounds may not appear
- Offline snapshots (CTD, HCDT, TTD) are refreshed periodically, not in real time
- Highly ambiguous terms that map plausibly to more than one canonical ID can still trip up the resolver
- Molecular structure formats, sequence data, clinical-trial logistics and regulatory filings are outside the supported schema
A note on conflicting records
- BioChirp reports what each database contains without editorial triage. If two sources disagree, both entries appear in the output.
- Thin evidence or unexpectedly few results are flagged in the narrative rather than hidden.
- For high-stakes decisions, always consult the original data provider or peer-reviewed literature.
10. Troubleshooting
Something not working as expected? The collapsible tool panels in each response are your first diagnostic tool. Below are the most common situations.
“Out of scope” or “invalid query”
- The interpreter could not find any drug, gene, disease, pathway or biomarker mention to act on.
- Add at least one concrete entity and a relationship, e.g. "drugs targeting JAK2 in myelofibrosis".
- Narrative-only clinical questions without named entities will be routed to web search instead.
“No matching records”
- The databases genuinely may not catalogue that combination (e.g. a newly approved therapy).
- Try broader wording—a parent disease term instead of a rare sub-type.
- Check the tool panels to see whether the name matcher found the right canonical ID.
Timeouts or service errors
- The Open Targets API or one of the internal micro-services may be temporarily unreachable.
- Wait a couple of minutes and try again. If the problem persists, note the error text and timestamp and file a report.
Results look wrong or surprising
- Expand the tool panels to check the interpreter’s parsed output and the resolver’s chosen canonical IDs.
- Run the same question again—if the table changes, the issue is in name matching; if it stays the same, the data source itself may contain the unexpected entry.
- Verify individual rows against the originating database at its own website.
- Open a GitHub issue with the query, the output and what you expected.
11. Frequently asked questions
How does BioChirp differ from general-purpose chatbots like ChatGPT or Claude?
Chatbots rely on their training corpus to recall biomedical facts, which means they can miss records, confuse alternative names and give different answers on different days. BioChirp flips that arrangement: AI handles only the “what did the user mean?” step, while hard-coded logic fetches every qualifying row from established databases. The structured table is never authored by an AI, so it cannot hallucinate entries or silently leave rows out.
How does BioChirp compare to MCP-style agentic retrieval?
Agentic setups that route tool calls through an AI planner can hit payload limits, run into invocation timeouts or let the model choose a sub-optimal call sequence. In head-to-head tests on exhaustive biomedical queries, none of the tested agent-based configurations managed to return a full result set, whereas BioChirp’s fixed-logic pipeline delivered every record for every query.
Can I use short-hand, brand names or gene nicknames?
Absolutely. The name-matching engine runs three parallel resolution strategies (approximate string comparison, embedding similarity and curated alias lists) followed by a dual-judge AI filter. This means brand names such as “Vazalore” map to aspirin, gene aliases such as “ERBB1” map to EGFR, and common abbreviations such as “CML” expand to both “chronic myeloid leukemia” and “chronic myelogenous leukemia.”
Why does the summary text change slightly between runs while the table stays the same?
The table is assembled by rule-based code and is therefore identical on every run (given the same database snapshot). The human-readable summary, by contrast, is written by an AI model that may phrase things a little differently each time. For any quantitative or downstream use, rely on the CSV, not the narrative.
Which databases are included?
Four: Open Targets (queried live), CTD, HCDT and TTD (stored locally as pre-processed snapshots). If none of these can answer your question the system falls back to a biomedical web search, with results clearly labelled as web-sourced.
What prevents BioChirp from making up associations?
AI participation ends after the interpretation and name-matching stages. Every association in the output table is read directly from a curated database through rule-based joins. If no qualifying rows exist, BioChirp returns “no result” instead of inventing plausible-sounding entries.
How fresh is the data?
Open Targets is queried in real time, so it reflects whatever version the platform currently serves. CTD, HCDT and TTD are held as periodic snapshots that are refreshed when major upstream releases become available. Snapshot dates are noted in the system and any significant changes will be communicated.
Can I download and reuse the output?
Yes. Every in-scope query produces a CSV that you can feed straight into R, Python, Excel or any other tool. Column headers and identifier formats are stable across runs, so automated pipelines can depend on them. If you publish work that uses BioChirp data, please cite the associated paper.
Does BioChirp store my queries or personal information?
No login is required and no personally identifiable data is collected. Queries are processed in real time and are not stored for profiling, advertising or user tracking.
Is the code open source?
Yes. Application code, prompt templates, schema definitions, pre-processing scripts, benchmark setups and analysis notebooks are all available on GitHub at github.com/abhi1238/biochirp.
12. Contact & support
Reasons to get in touch
- Repeated errors or service outages you cannot resolve by retrying.
- Output that clearly contradicts what the original database shows.
- Feature ideas, user-interface feedback or roadmap questions.
- Concerns about data accuracy, naming errors or potential misinterpretation of evidence.
How to reach us
Email: abhishekh@iiitd.ac.in, debarka@iiitd.ac.in
GitHub: Open an issue
When filing a report please include: the exact query, approximate time, a screenshot if possible, and whether the problem reproduces on a second attempt.
Reminder: BioChirp is a research aid, not a clinical decision-support system. For medical emergencies, contact local emergency services immediately.