PubChemCrawler
PubChemCrawler makes it easier to search the PubChem database from Julia. You can use it to access information about particular compounds or query substructures.
The package supports only a subset of the available functionality, but it is fairly straightforward to expand to other types of query. Pull requests are welcome! If you do want to make improvements to this package, this tutorial might help you get started.
Before you start: be aware of search limits
PubChem places significant limits on requests:
- No more than 5 requests per second
- No more than 400 requests per minute
- No more than 300 seconds of running time per minute
- Requests made via REST time out after 30 s. The PUG XML interface does not have this limitation. For substructure searches, query_substructure_pug is recommended.
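One way to stay under the per-second limit is to space calls out on the client side. The following is a minimal sketch, assuming a hypothetical helper named `throttled` (it is not part of PubChemCrawler); the default spacing of 0.2 s corresponds to at most 5 calls per second.

```julia
# Hypothetical helper, not part of PubChemCrawler: apply `f` to each item,
# waiting so that successive calls start at least `min_interval` seconds apart.
function throttled(f, items; min_interval=0.2)
    results = Any[]
    last_start = -Inf            # no previous call yet, so the first call never waits
    for item in items
        wait_time = min_interval - (time() - last_start)
        wait_time > 0 && sleep(wait_time)
        last_start = time()
        push!(results, f(item))
    end
    return results
end

# Intended use (requires network access and PubChemCrawler):
# cids = throttled(name -> get_cid(name=name), ["aspirin", "glucose"])
```

The helper only spaces out the start of each call; it does not track the per-minute limits, which would need a separate counter.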
Getting started
Some queries make use of the CID, the Compound IDentifier, which you can obtain in a variety of ways. Let's get the CID for aspirin:
julia> cid = get_cid(name="aspirin")
2244
julia> cid = get_cid(smiles="CC(=O)OC1=CC=CC=C1C(=O)O") # use the SMILES string
2244
You can then retrieve individual properties:
julia> smiles = chomp(String(get_for_cids(2244, properties="CanonicalSMILES", output="TXT")))
"CC(=O)OC1=CC=CC=C1C(=O)O"
or a list of properties:
julia> using CSV, DataFrames
julia> df = CSV.File(get_for_cids(2244; properties="MolecularFormula,MolecularWeight,XLogP,IsomericSMILES", output="CSV")) |> DataFrame
1×5 DataFrame
│ Row │ CID   │ MolecularFormula │ MolecularWeight │ XLogP   │ IsomericSMILES           │
│     │ Int64 │ String           │ Float64         │ Float64 │ String                   │
├─────┼───────┼──────────────────┼─────────────────┼─────────┼──────────────────────────┤
│ 1   │ 2244  │ C9H8O4           │ 180.16          │ 1.2     │ CC(=O)OC1=CC=CC=C1C(=O)O │
You can query properties for a whole list of cids.
If your query returns multiple cids, you need to use get_cids:
julia> cids = get_cids(cas_number="50-78-2")
4-element Vector{Int64}:
2244
67252
3434975
12280114
You can also download structure data and save it to a file. This saves a 3D conformer for aspirin:
julia> open("/tmp/aspirin.sdf", "w") do io
    write(io, get_for_cids(2244, output="SDF", record_type="3d"))
end
3637
Finally, you can perform substructure searches. Let's retrieve up to 10 bicyclic compounds using a SMARTS search:
julia> cids = query_substructure_pug(smarts = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]", maxhits = 10)
┌ Warning: maxhits was hit, results are partial
└ @ PubChemCrawler ~/.julia/dev/PubChemCrawler/src/pugxml.jl:164
10-element Vector{Int64}:
135398658
5280795
5430
5143
54675779
5280961
5280804
5280793
5280343
3034034
Note that Julia (not this package) requires each $ character in the SMARTS string to be escaped.
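The escaping is needed because `$` starts interpolation in ordinary Julia string literals. As a small illustration that needs no network access, a raw string literal avoids the backslashes and yields the same pattern:

```julia
# In a regular string literal, `$` starts interpolation, so it must be escaped.
escaped = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]"

# A raw string literal disables interpolation, so `$` can be written as-is.
unescaped = raw"[$([*R2]([*R])([*R])([*R]))].[$([*R2]([*R])([*R])([*R]))]"

# Both literals spell the same SMARTS pattern.
same_pattern = escaped == unescaped
```

Either form can be passed as the `smarts` keyword; the raw form is easier to copy from other tools.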
API
Queries
PubChemCrawler.get_cid — Function
get_cid(; name=nothing, smiles=nothing, cas_number=nothing, kwargs...)
Return the PubChem compound identification number for the specified compound.
Examples:
julia> cid = get_cid(name="glucose")
5793
julia> cid = get_cid(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
5793
PubChemCrawler.get_cids — Function
get_cids(; name=nothing, smiles=nothing, cas_number=nothing, kwargs...)
Return all the PubChem compound identification numbers for the specified compound.
- get_cid returns a single identifier and fails if there are multiple results.
- get_cids returns a vector containing all the identifiers that match.
Queries on cas_number often return multiple cids.
Examples:
julia> get_cids(name="2-nonenal")
3-element Vector{Int64}:
5283335
17166
5354833
julia> get_cid(name="2-nonenal")
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element
julia> get_cids(cas_number="50-78-2")
4-element Vector{Int64}:
2244
67252
3434975
12280114
PubChemCrawler.query_substructure_pug — Function
cids = query_substructure_pug(;cid=nothing, smiles=nothing, smarts=nothing, # specifier for the substructure to search for
                              maxhits=200_000, poll_interval=10)
Retrieve a list of compounds containing a substructure specified via its cid, the SMILES string, or a SMARTS string.
Example
julia> using PubChemCrawler
julia> cids = query_substructure_pug(smarts="[r13]Br") # query brominated 13-atom rings
66-element Vector{Int64}:
54533707
153064026
152829033
...
PUG searches can take a while to run (they poll for completion), but in exchange they allow more complex, long-running searches to succeed. See also query_substructure.
PubChemCrawler.query_substructure — Function
msg = query_substructure(;cid=nothing, smiles=nothing, smarts=nothing, # specifier for the substructure to search for
                         properties="MolecularFormula,MolecularWeight,XLogP,", # properties to retrieve
                         output="CSV") # output format
Perform a substructure search of the entire database. You can specify the target via its cid, the SMILES string, or a SMARTS string. Specify the properties you want to retrieve as a comma-separated list from among the choices in http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest, "Compound Property Tables". Requesting more properties takes more time.
The output is a Vector{UInt8}. For output="CSV", a good choice to generate a manipulable result is DataFrame(CSV.File(msg)) from the DataFrames and CSV packages, respectively. Alternatively String(msg) will convert it to a string, which you can write to a file.
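The byte-vector handling can be tried without a network connection. The sketch below uses a made-up stand-in for `msg` (it is not real query output) and shows one detail worth knowing: calling `String` directly on a `Vector{UInt8}` takes ownership of the data and leaves the vector empty, so copy first if you still need the bytes.

```julia
# Stand-in for the bytes returned by a CSV query (made-up sample, not real output).
msg = Vector{UInt8}("CID,MolecularFormula\n2244,C9H8O4\n")

# `String(copy(msg))` leaves `msg` intact; `String(msg)` would empty the vector.
text = String(copy(msg))
lines = split(chomp(text), '\n')
header = split(lines[1], ',')

# Write the raw bytes to a file and read them back unchanged.
path = joinpath(mktempdir(), "result.csv")
write(path, msg)
roundtrip = read(path)   # a Vector{UInt8} again
```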
Example
julia> using PubChemCrawler, CSV, DataFrames
julia> cid = get_cid(name="estriol")
5756
julia> df = CSV.File(query_substructure(;cid)) |> DataFrame # on Julia 1.0, use `(;cid=cid)`
11607×4 DataFrame
│ Row  │ CID       │ MolecularFormula │ MolecularWeight │ XLogP    │
│      │ Int64     │ String           │ Float64         │ Float64? │
├──────┼───────────┼──────────────────┼─────────────────┼──────────┤
│ 1    │ 5756      │ C18H24O3         │ 288.4           │ 2.5      │
│ 2    │ 5281904   │ C24H32O9         │ 464.5           │ 1.1      │
│ 3    │ 27125     │ C18H24O4         │ 304.4           │ 1.5      │
...
This example queries for derivatives of estriol.
For complex queries that risk timing out, consider query_substructure_pug in combination with get_for_cids.
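The combination can be sketched as a two-step pipeline: collect CIDs with query_substructure_pug, then fetch their properties with get_for_cids in chunks. The helper `fetch_in_batches` below is hypothetical (not part of PubChemCrawler), and the batch size and pause are arbitrary illustrative values rather than documented PubChem limits.

```julia
# Hypothetical helper, not part of PubChemCrawler: call `fetch` on successive
# chunks of `cids`, pausing between calls, and collect one result per chunk.
function fetch_in_batches(fetch, cids; batchsize=100, pause=0.2)
    results = Any[]
    for chunk in Iterators.partition(cids, batchsize)
        push!(results, fetch(collect(chunk)))
        sleep(pause)             # crude spacing between consecutive requests
    end
    return results
end

# Intended use (requires network access and PubChemCrawler):
# cids = query_substructure_pug(smarts="[r13]Br")
# parts = fetch_in_batches(cids) do chunk
#     get_for_cids(chunk; properties="MolecularFormula,XLogP", output="CSV")
# end
```

Each element of `parts` would then be a separate CSV byte vector that can be parsed on its own.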
PubChemCrawler.get_for_cids — Function
msg = get_for_cids(cids; properties|xrefs|cids_type|record_type, output="CSV")
Retrieve the given properties, xrefs, CIDs, or records, respectively, for a list of compounds specified by their cids. The documentation for these traits can be found at http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest; this URL will be referred to as PUGREST below.
- properties include structural features like the molecular formula, number of undefined stereocenters, and so on. Specify these as a comma-separated list from among the choices in PUGREST under "Compound Property Tables".
- xrefs ("cross-references") include identifiers used by other databases, e.g., the CAS (Registry) number, PubMedID, and so on. The supported values for xrefs are available at PUGREST under "XRefs".
- cids_type is used to retrieve CIDs for compounds related to those specified in cids; see PUGREST under "SIDS / CIDS / AIDS".
- record_type is used to retrieve data files and to specify options for these files, e.g., 2d or 3d SDF files. See PUGREST under "Full-record Retrieval".
output specifies the output format. Not all options are applicable to all queries; for example, "CSV" is appropriate for properties queries but "SDF" might be used for a record_type query. See PUGREST, "Output".
Examples
julia> using PubChemCrawler, CSV, DataFrames, JSON3
julia> cids = [get_cid(name="cyclic guanosine monophosphate"), get_cid(name="aspirin")]
2-element Array{Int64,1}:
135398570
2244
julia> CSV.File(get_for_cids(cids; properties="MolecularFormula,XLogP", output="CSV")) |> DataFrame
2×3 DataFrame
 Row │ CID        MolecularFormula  XLogP
     │ Int64      String            Float64
─────┼──────────────────────────────────────
   1 │ 135398570  C10H12N5O7P          -3.4
   2 │      2244  C9H8O4                1.2
julia> open("/tmp/aspirin_3d.sdf", "w") do io # save the 3d SDF file for aspirin (CID 2244)
    write(io, get_for_cids(2244; record_type="3d", output="SDF"))
end
4055
julia> dct = JSON3.read(get_for_cids(cids; xrefs="RN,", output="JSON")); # get the Registry Number(s) (CAS)
julia> dct[:InformationList][:Information]
2-element JSON3.Array{JSON3.Object,Array{UInt8,1},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}:
{
  "CID": 135398570,
  "RN": [
    "40732-48-7",
    "7665-99-8"
  ]
}
{
  "CID": 2244,
  "RN": [
    "11126-35-5",
    "156865-15-5",
    "50-78-2",
    "52080-78-1",
    "921943-73-9",
    "98201-60-6",
    "99512-66-0"
  ]
}
PubChemCrawler.pug — Function
pug(args...; silent = true, escape_args = true, return_text = true, status_exception = false, kwargs...)
Generate a PUG endpoint and call it. The details about PUG endpoints are described here: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
Keyword arguments:
- escape_args = true: URL-encode each argument before generating the endpoint. Setting this to false is useful when copy-pasting an existing PUG endpoint, e.g. from documentation.
- silent = false: print the PUG URL called.
- return_text = true: call String on the output to return a string rather than a byte vector.
- status_exception = false: tell HTTP.jl not to throw an exception on return codes >= 300.
Other keyword arguments are passed on to HTTP.request.
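To make the endpoint shape concrete, here is a rough sketch of how positional arguments could map onto a URL, matching the `[ Info: ...]` lines printed in this function's examples. The name `pug_url` is hypothetical and this is not the package's implementation; in particular it skips URL encoding, so it corresponds to the `escape_args = false` case only.

```julia
# Hypothetical sketch, not the package's implementation: join the arguments
# with "/" and append them to the PUG REST base URL. No URL encoding is done.
function pug_url(args...; base="https://pubchem.ncbi.nlm.nih.gov/rest/pug/")
    return base * join(string.(args), "/")
end

url1 = pug_url(:compound, :name, "ethanol", :cids, :txt)
url2 = pug_url("compound/cid/2244", :cids, :txt)
```

Symbols and strings are interchangeable here because every argument is converted with `string` before joining.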
Examples:
julia> pug(:compound, :name, "ethanol", :cids, :txt, silent = false, return_text = true)
[ Info: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ethanol/cids/txt
"702"
julia> pug("compound/cid/2244", :cids, :txt, escape_args = false, silent = false, return_text = true)
[ Info: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/cids/txt
"2244"
julia> pug(:compound, :smiles, "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O", :cids, :txt, return_text = true)
"5793"
julia> pug(:compound, :cid, 708, :txt, return_text = true, status_exception = false)
"Status: 400
Code: PUGREST.BadRequest
Message: Invalid output format
Detail: Full-record output format must be one of ASNT/B, XML, JSON(P), SDF, or PNG"
PubChemCrawler.get_synonyms — Function
synonyms = get_synonyms(name="glucose")
synonyms = get_synonyms(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
synonyms = get_synonyms(cid=5793)
Return a list of substance or compound synonyms.
Utilities
PubChemCrawler.parse_formula — Function
atomcounts = parse_formula(str::AbstractString)
Parse str as a chemical formula and return a list of atom=>multiplicity pairs.
Example
julia> parse_formula("C2CaH2O6")
4-element Vector{Pair{String, Int64}}:
"C" => 2
"Ca" => 1
"H" => 2
"O" => 6
PubChemCrawler.atomregex — Function
rex = atomregex(chemicalsymbol)
Create a regular expression for detecting how many atoms of type chemicalsymbol are in a molecular formula.
Examples
The formula for calcium bicarbonate is Ca(HCO3)2, i.e., C2CaH2O6.
julia> match(atomregex("C"), "C2CaH2O6")
RegexMatch("C2", 1="2")
julia> match(atomregex("Ca"), "C2CaH2O6")
RegexMatch("Ca", 1="")
julia> match(atomregex("H"), "C2CaH2O6")
RegexMatch("H2", 1="2")
julia> match(atomregex("O"), "C2CaH2O6")
RegexMatch("O6", 1="6")
Note that the regex for "C" does not match "Ca".
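The matching behaviour above can be reproduced with a negative lookahead. The following is a sketch only: `sketch_atomregex` and `count_atoms` are hypothetical names, and the package's actual regular expression may be built differently. It assumes each element appears at most once in the formula, as in the C2CaH2O6 example.

```julia
# Hypothetical re-implementation for illustration; the package's regex may differ.
# Match `symbol` only when it is not followed by a lowercase letter, then
# capture the (possibly empty) run of digits giving the multiplicity.
sketch_atomregex(symbol) = Regex(symbol * "(?![a-z])(\\d*)")

# Count atoms of `symbol` in `formula`, using the first match only.
# An empty capture means a single atom; no match means zero atoms.
function count_atoms(symbol, formula)
    m = match(sketch_atomregex(symbol), formula)
    m === nothing && return 0
    num = m.captures[1]
    return isempty(num) ? 1 : parse(Int, num)
end

carbon = count_atoms("C", "C2CaH2O6")
calcium = count_atoms("Ca", "C2CaH2O6")
```

The lookahead is what keeps "C" from matching the "C" in "Ca": the candidate match is rejected whenever a lowercase letter follows the symbol.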