PubChemCrawler
PubChemCrawler makes it easier to search the PubChem database from Julia. You can use it to access information about particular compounds or query substructures.
The package supports only a subset of the available functionality, but it is fairly straightforward to extend it to other types of queries. Pull requests are welcome! If you do want to make improvements to this package, this tutorial might help you get started.
Before you start: be aware of search limits
PubChem places significant limits on requests:
- No more than 5 requests per second
- No more than 400 requests per minute
- No more than 300 seconds of running time per minute
- Requests made via REST time out after 30s. The PUG XML interface does not have this limitation. For substructure searches, query_substructure_pug is recommended.
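One way to stay under the per-second limit when looping over many compounds is to space out the calls yourself. The following is a minimal sketch, not part of PubChemCrawler: the helper name throttled_map and its min_interval keyword are hypothetical, and the default of 0.2 seconds is simply 1/5 of a second, which also keeps the rate at 300 requests per minute, below the 400-per-minute limit.

```julia
# Hypothetical helper (not provided by PubChemCrawler): apply `f` to each item,
# leaving at least `min_interval` seconds between the starts of consecutive calls.
function throttled_map(f, items; min_interval=0.2)
    results = []
    last_start = -Inf                      # no previous call yet, so the first call never waits
    for item in items
        wait_time = min_interval - (time() - last_start)
        wait_time > 0 && sleep(wait_time)  # pause only if the previous call started too recently
        last_start = time()
        push!(results, f(item))
    end
    return results
end

# Intended use (requires PubChemCrawler and network access):
#   cids = throttled_map(name -> get_cid(name=name), ["aspirin", "glucose"])
```

This only limits the request rate; it does nothing about the 30s REST timeout or the running-time limit.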
Getting started
Some queries make use of the CID, the Compound IDentifier, which you can obtain in a variety of ways. Let's get the CID for aspirin:
julia> cid = get_cid(name="aspirin")
2244
julia> cid = get_cid(smiles="CC(=O)OC1=CC=CC=C1C(=O)O") # use the SMILES string
2244
You can then retrieve individual properties:
julia> smiles = chomp(String(get_for_cids(2244, properties="CanonicalSMILES", output="TXT")))
"CC(=O)OC1=CC=CC=C1C(=O)O"
or a list of properties:
julia> using CSV, DataFrames
julia> df = CSV.File(get_for_cids(2244; properties="MolecularFormula,MolecularWeight,XLogP,IsomericSMILES", output="CSV")) |> DataFrame
1×5 DataFrame
│ Row │ CID   │ MolecularFormula │ MolecularWeight │ XLogP   │ IsomericSMILES           │
│     │ Int64 │ String           │ Float64         │ Float64 │ String                   │
├─────┼───────┼──────────────────┼─────────────────┼─────────┼──────────────────────────┤
│ 1   │ 2244  │ C9H8O4           │ 180.16          │ 1.2     │ CC(=O)OC1=CC=CC=C1C(=O)O │
You can query properties for a whole list of cids (see get_for_cids).
You can also download structure data and save it to a file. This saves a 3d conformer for aspirin:
julia> open("/tmp/aspirin.sdf", "w") do io
           write(io, get_for_cids(2244, output="SDF", record_type="3d"))
       end
3637
Finally, you can perform substructure searches. Let's retrieve up to 10 bicyclic compounds using a SMARTS search:
julia> cids = query_substructure_pug(smarts = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]", maxhits = 10)
┌ Warning: maxhits was hit, results are partial
└ @ PubChemCrawler ~/.julia/dev/PubChemCrawler/src/pugxml.jl:164
10-element Vector{Int64}:
135398658
5280795
5430
5143
54675779
5280961
5280804
5280793
5280343
3034034
Note that Julia (not this package) requires that each $ character in a SMARTS string be escaped as \$, because $ otherwise triggers string interpolation.
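If the backslashes get in the way, Julia's raw string literals are an alternative, since they perform no interpolation. The sketch below only compares the two spellings of the SMARTS pattern used above; the variable names are illustrative.

```julia
# The escaped form from the example above: each `$` is written as `\$`.
escaped = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]"

# The same pattern as a raw string literal: `$` needs no escaping here.
rawform = raw"[$([*R2]([*R])([*R])([*R]))].[$([*R2]([*R])([*R])([*R]))]"

# Both spellings produce the same string, so either can be passed as `smarts`.
@assert escaped == rawform
```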
API
Queries
PubChemCrawler.get_cid — Function
cid = get_cid(name="glucose")
cid = get_cid(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
Return the PubChem compound identification number for the specified compound.
PubChemCrawler.query_substructure_pug — Function
cids = query_substructure_pug(;cid=nothing, smiles=nothing, smarts=nothing, # specifier for the substructure to search for
                              maxhits=200_000, poll_interval=10)
Retrieve a list of compounds containing a substructure specified via its cid, the SMILES string, or a SMARTS string.
Example
julia> using PubChemCrawler
julia> cids = query_substructure_pug(smarts="[r13]Br") # query brominated 13-atom rings
66-element Vector{Int64}:
54533707
153064026
152829033
...
PUG searches can take a while to run (they poll for completion), but conversely they allow more complex, long-running searches to succeed. See also query_substructure.
PubChemCrawler.query_substructure — Function
msg = query_substructure(;cid=nothing, smiles=nothing, smarts=nothing, # specifier for the substructure to search for
                         properties="MolecularFormula,MolecularWeight,XLogP,", # properties to retrieve
                         output="CSV") # output format
Perform a substructure search of the entire database. You can specify the target via its cid, the SMILES string, or a SMARTS string. Specify the properties you want to retrieve as a comma-separated list from among the choices in http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest, "Compound Property Tables". Requesting more properties takes more time.
The output is a Vector{UInt8}. For output="CSV", a good choice to generate a manipulable result is DataFrame(CSV.File(msg)) from the DataFrames and CSV packages, respectively. Alternatively String(msg) will convert it to a string, which you can write to a file.
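As a small illustration of handling the returned bytes, the sketch below uses a hand-written stand-in for msg (the CSV text is made up for the example, not fetched from PubChem). It relies on one piece of general Julia behavior worth knowing: String(v) on a Vector{UInt8} takes over the buffer and leaves v empty, so copy first if you want to reuse the bytes.

```julia
# Stand-in for the Vector{UInt8} returned by a query (illustrative data only).
msg = Vector{UInt8}("CID,MolecularFormula\n2244,C9H8O4\n")

# Convert a copy to text so that `msg` itself stays intact.
text = String(copy(msg))

# The raw bytes can be written to a file directly; no conversion is needed.
path = joinpath(mktempdir(), "result.csv")
nbytes = write(path, msg)

# Calling String on the vector itself consumes it: `msg` is empty afterwards.
consumed = String(msg)
```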
Example
julia> using PubChemCrawler, CSV, DataFrames
julia> cid = get_cid(name="estriol")
5756
julia> df = CSV.File(query_substructure(;cid)) |> DataFrame # on Julia 1.0, use `(;cid=cid)`
11607×4 DataFrame
│ Row  │ CID       │ MolecularFormula │ MolecularWeight │ XLogP    │
│      │ Int64     │ String           │ Float64         │ Float64? │
├──────┼───────────┼──────────────────┼─────────────────┼──────────┤
│ 1    │ 5756      │ C18H24O3         │ 288.4           │ 2.5      │
│ 2    │ 5281904   │ C24H32O9         │ 464.5           │ 1.1      │
│ 3    │ 27125     │ C18H24O4         │ 304.4           │ 1.5      │
...
This queries for derivatives of estriol.
For complex queries that risk timing out, consider query_substructure_pug in combination with get_for_cids.
PubChemCrawler.get_for_cids — Function
msg = get_for_cids(cids; properties|xrefs|cids_type|record_type, output="CSV")
Retrieve the given properties, xrefs, CIDs, or records, respectively, for a list of compounds specified by their cids. The documentation for these traits can be found at http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest; this URL will be referred to as PUGREST below.
- properties include structural features like the molecular formula, number of undefined stereocenters, and so on. Specify these as a comma-separated list from among the choices in PUGREST under "Compound Property Tables".
- xrefs ("cross-references") include identifiers used by other databases, e.g., the CAS (Registry) number, PubMedID, and so on. The supported values for xrefs are available at PUGREST under "XRefs".
- cids_type is used to retrieve CIDs for compounds related to those specified in cids; see PUGREST under "SIDS / CIDS / AIDS".
- record_type is used to retrieve data files and to specify options for these files, e.g., 2d or 3d SDF files. See PUGREST under "Full-record Retrieval".
output specifies the output format. Not all options are applicable to all queries; for example, "CSV" is appropriate for properties queries but "SDF" might be used for a record_type query. See PUGREST, "Output".
Examples
julia> using PubChemCrawler, CSV, DataFrames, JSON3
julia> cids = [get_cid(name="cyclic guanosine monophosphate"), get_cid(name="aspirin")]
2-element Array{Int64,1}:
135398570
2244
julia> CSV.File(get_for_cids(cids; properties="MolecularFormula,XLogP", output="CSV")) |> DataFrame
2×3 DataFrame
 Row │ CID        MolecularFormula  XLogP
     │ Int64      String            Float64
─────┼──────────────────────────────────────
   1 │ 135398570  C10H12N5O7P          -3.4
   2 │      2244  C9H8O4                1.2
julia> open("/tmp/aspirin_3d.sdf", "w") do io # save the 3d SDF file for aspirin (CID 2244)
           write(io, get_for_cids(2244; record_type="3d", output="SDF"))
       end
4055
julia> dct = JSON3.read(get_for_cids(cids; xrefs="RN,", output="JSON")); # get the Registry Number(s) (CAS)
julia> dct[:InformationList][:Information]
2-element JSON3.Array{JSON3.Object,Array{UInt8,1},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}:
 {
    "CID": 135398570,
     "RN": [
             "40732-48-7",
             "7665-99-8"
           ]
 }
 {
    "CID": 2244,
     "RN": [
             "11126-35-5",
             "156865-15-5",
             "50-78-2",
             "52080-78-1",
             "921943-73-9",
             "98201-60-6",
             "99512-66-0"
           ]
 }
Utilities
PubChemCrawler.parse_formula — Function
atomcounts = parse_formula(str::AbstractString)
Parse str as a chemical formula and return a list of atom=>multiplicity pairs.
Example
julia> parse_formula("C2CaH2O6")
4-element Vector{Pair{String, Int64}}:
"C" => 2
"Ca" => 1
"H" => 2
"O" => 6
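A list of atom=>multiplicity pairs is convenient for simple bookkeeping. As a sketch, the block below copies the pairs from the example output above into a literal (so it runs without the package) and sums a molar mass. The atomic_mass table is an assumption of this example, holding rounded standard atomic weights for just the four elements involved; it is not something PubChemCrawler provides.

```julia
# Pairs as shown in the parse_formula("C2CaH2O6") example output above.
counts = ["C" => 2, "Ca" => 1, "H" => 2, "O" => 6]

# Assumed lookup table (rounded standard atomic weights, g/mol); not part of the package.
atomic_mass = Dict("H" => 1.008, "C" => 12.011, "O" => 15.999, "Ca" => 40.078)

# Molar mass = sum over elements of (atomic weight × number of atoms).
molar_mass = sum(atomic_mass[sym] * n for (sym, n) in counts)

# Total number of atoms in the formula unit.
natoms = sum(last, counts)
```

For calcium bicarbonate this comes to roughly 162.11 g/mol and 11 atoms.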
PubChemCrawler.atomregex — Function
rex = atomregex(chemicalsymbol)
Create a regular expression for detecting how many atoms of type chemicalsymbol are in a molecular formula.
Examples
The formula for calcium bicarbonate is Ca(HCO3)2, i.e., C2CaH2O6.
julia> match(atomregex("C"), "C2CaH2O6")
RegexMatch("C2", 1="2")
julia> match(atomregex("Ca"), "C2CaH2O6")
RegexMatch("Ca", 1="")
julia> match(atomregex("H"), "C2CaH2O6")
RegexMatch("H2", 1="2")
julia> match(atomregex("O"), "C2CaH2O6")
RegexMatch("O6", 1="6")
Note that the regex for "C" does not match "Ca".
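To make the behavior above concrete, here is one way such a regex could be built. This is a sketch under assumptions, not the package's actual implementation: sketch_atomregex and sketch_atomcount are hypothetical names, and the pattern assumes element symbols are one uppercase letter optionally followed by lowercase letters, with each element appearing at most once in the formula (as in the formulas PubChem returns in the examples above).

```julia
# Hypothetical stand-in for atomregex (not the package's own code).
function sketch_atomregex(symbol::AbstractString)
    # (?![a-z]) rejects a following lowercase letter, so "C" cannot match the "C" of "Ca";
    # (\d*) captures the optional atom count that follows the symbol.
    return Regex(symbol * "(?![a-z])(\\d*)")
end

# Turn the first match into a count: no match means 0, an empty capture means 1.
function sketch_atomcount(symbol::AbstractString, formula::AbstractString)
    m = match(sketch_atomregex(symbol), formula)
    m === nothing && return 0
    return isempty(m[1]) ? 1 : parse(Int, m[1])
end

carbons = sketch_atomcount("C", "C2CaH2O6")    # the "C2" group, not the "C" in "Ca"
calciums = sketch_atomcount("Ca", "C2CaH2O6")  # "Ca" has no explicit count
```

The negative lookahead is what keeps "C" from matching inside "Ca"; without it, the count for carbon in a formula such as "CaCO3" would be taken from the wrong position.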