PubChemCrawler

PubChemCrawler makes it easier to search the PubChem database from Julia. You can use it to access information about particular compounds or query substructures.

The package supports only a subset of the available functionality, but it is fairly straightforward to expand to other types of query. Pull requests are welcome! If you do want to make improvements to this package, this tutorial might help you get started.

Before you start: be aware of search limits

PubChem places significant limits on requests:

No more than 5 requests per second
No more than 400 requests per minute
No more than 300 seconds of running time per minute
Requests made via REST time out after 30s. The PUG XML interface does not have this limitation. For substructure searches, query_substructure_pug is recommended.
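One way to stay under the per-second limit is to space calls out on the client side. The following is a minimal sketch, not part of PubChemCrawler: Throttle and throttled are hypothetical names, and the sketch only addresses the 5-requests-per-second rule, not the per-minute limits.

```julia
# Hypothetical client-side throttle; not part of PubChemCrawler.
mutable struct Throttle
    min_interval::Float64   # minimum number of seconds between calls
    last_call::Float64      # value of time() at the most recent call
end

# Default to PubChem's limit of 5 requests per second.
Throttle(; per_second::Real=5) = Throttle(1 / per_second, -Inf)

# Run `f()`, sleeping first if the previous call was too recent.
function throttled(f, t::Throttle)
    wait_s = t.last_call + t.min_interval - time()
    wait_s > 0 && sleep(wait_s)
    t.last_call = time()
    return f()
end

t = Throttle(per_second=5)
# `i^2` is a stand-in for a real query such as `get_cid(name=...)`.
results = [throttled(() -> i^2, t) for i in 1:3]
```

In real use the closure would wrap a call to one of the query functions described below; sharing one Throttle object across all calls keeps them collectively under the limit.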

Getting started

Some queries make use of the CID, the Compound IDentifier, which you can obtain in a variety of ways. Let's get the CID for aspirin:

julia> cid = get_cid(name="aspirin")
2244

julia> cid = get_cid(smiles="CC(=O)OC1=CC=CC=C1C(=O)O")   # use the SMILES string
2244

You can then retrieve individual properties:

julia> smiles = chomp(String(get_for_cids(2244, properties="CanonicalSMILES", output="TXT")))
"CC(=O)OC1=CC=CC=C1C(=O)O"

or a list of properties:

julia> using CSV, DataFrames

julia> df = CSV.File(get_for_cids(2244; properties="MolecularFormula,MolecularWeight,XLogP,IsomericSMILES", output="CSV")) |> DataFrame
1×5 DataFrame
│ Row │ CID   │ MolecularFormula │ MolecularWeight │ XLogP   │ IsomericSMILES           │
│     │ Int64 │ String           │ Float64         │ Float64 │ String                   │
├─────┼───────┼──────────────────┼─────────────────┼─────────┼──────────────────────────┤
│ 1   │ 2244  │ C9H8O4           │ 180.16          │ 1.2     │ CC(=O)OC1=CC=CC=C1C(=O)O │

You can query properties for a whole list of cids.

If your query returns multiple cids, you need to use get_cids:

julia> cids = get_cids(cas_number="50-78-2")
4-element Vector{Int64}:
     2244
    67252
  3434975
 12280114

You can also download structure data and save it to a file. This saves a 3d conformer for aspirin:

julia> open("/tmp/aspirin.sdf", "w") do io
           write(io, get_for_cids(2244, output="SDF", record_type="3d"))
       end
3637

Finally, you can perform substructure searches. Let's retrieve up to 10 bicyclic compounds using a SMARTS search:

julia> cids = query_substructure_pug(smarts = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]", maxhits = 10)
┌ Warning: maxhits was hit, results are partial
└ @ PubChemCrawler ~/.julia/dev/PubChemCrawler/src/pugxml.jl:164
10-element Vector{Int64}:
 135398658
   5280795
      5430
      5143
  54675779
   5280961
   5280804
   5280793
   5280343
   3034034

Note that Julia (not this package) requires each $ character in the SMARTS string to be escaped as \$, because an unescaped $ inside a string literal starts string interpolation.
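As an alternative to escaping, Julia's raw string literals disable interpolation entirely. The short sketch below only compares the two spellings of the SMARTS string used above and performs no query.

```julia
# Escaped form, as used in the query above.
escaped = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]"

# Raw string literal: `$` needs no escaping here.
raw_form = raw"[$([*R2]([*R])([*R])([*R]))].[$([*R2]([*R])([*R])([*R]))]"

# Both spellings produce the same string.
same = escaped == raw_form
```

Either form can be passed as the smarts keyword; the raw form is easier to copy from other tools.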

API

Queries

PubChemCrawler.get_cid — Function

get_cid(; name=nothing, smiles=nothing, cas_number=nothing, kwargs...)

Return the PubChem compound identification number for the specified compound.

Examples:

julia> cid = get_cid(name="glucose")
5793

julia> cid = get_cid(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
5793


PubChemCrawler.get_cids — Function

get_cids(; name=nothing, smiles=nothing, cas_number=nothing, kwargs...)

Return all the PubChem compound identification numbers for the specified compound.

get_cid returns a single identifier and fails if there are multiple results.
get_cids returns a vector containing all the identifiers that match.

Queries on cas_number often return multiple cids.

Examples:

julia> get_cids(name="2-nonenal")
3-element Vector{Int64}:
 5283335
   17166
 5354833

julia> get_cid(name="2-nonenal")
ERROR: ArgumentError: Collection has multiple elements, must contain exactly 1 element

julia> get_cids(cas_number="50-78-2")
4-element Vector{Int64}:
     2244
    67252
  3434975
 12280114
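When it is not known in advance whether a query matches one compound or several, one option is to call get_cids and decide afterwards how to handle the result. The sketch below uses a hard-coded vector (copied from the 2-nonenal output above) in place of a live query; pick_single is a hypothetical helper, not part of PubChemCrawler.

```julia
# Hypothetical helper: reduce a vector of CIDs to one CID, or `nothing` if empty.
function pick_single(cids::AbstractVector{<:Integer})
    isempty(cids) && return nothing
    if length(cids) > 1
        @warn "multiple CIDs matched; using the first" cids
    end
    return first(cids)
end

# Stand-in for `get_cids(name="2-nonenal")`.
cids = [5283335, 17166, 5354833]
cid = pick_single(cids)
```

Whether taking the first match is appropriate depends on the application; the warning is there so that ambiguous names do not pass silently.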


PubChemCrawler.query_substructure_pug — Function

cids = query_substructure_pug(;cid=nothing, smiles=nothing, smarts=nothing,        # specifier for the substructure to search for
                               maxhits=200_000, poll_interval=10)

Retrieve a list of compounds containing a substructure specified via its cid, the SMILES string, or a SMARTS string.

Example

julia> using PubChemCrawler

julia> cids = query_substructure_pug(smarts="[r13]Br")   # query brominated 13-atom rings
66-element Vector{Int64}:
  54533707
 153064026
 152829033
...

PUG searches can take a while to run (they poll for completion), but conversely they allow more complex, long-running searches to succeed. See also query_substructure.
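The general shape of a polling loop can be sketched as follows. This is an illustration of the idea only, not the package's implementation: poll_until_done, max_polls, and fake_check are hypothetical, and the stand-in job simply finishes on the third check.

```julia
# Hypothetical polling helper; `check` returns `nothing` while the job is still running.
function poll_until_done(check; poll_interval=10, max_polls=60)
    for _ in 1:max_polls
        result = check()
        result === nothing || return result
        sleep(poll_interval)   # wait before asking again
    end
    error("job did not finish after $max_polls polls")
end

# Stand-in job that reports completion on the third check.
attempts = Ref(0)
fake_check() = (attempts[] += 1) >= 3 ? [54533707, 153064026] : nothing

cids = poll_until_done(fake_check; poll_interval=0.1)
```

A longer poll_interval means fewer requests against the limits described at the top of this page, at the cost of noticing completion later.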


PubChemCrawler.query_substructure — Function

msg = query_substructure(;cid=nothing, smiles=nothing, smarts=nothing,           # specifier for the substructure to search for
                          properties="MolecularFormula,MolecularWeight,XLogP,",  # properties to retrieve
                          output="CSV")                                          # output format

Perform a substructure search of the entire database. You can specify the target via its cid, the SMILES string, or a SMARTS string. Specify the properties you want to retrieve as a comma-separated list from among the choices in http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest, "Compound Property Tables". Requesting more properties takes more time.

The output is a Vector{UInt8}. For output="CSV", a good choice to generate a manipulable result is DataFrame(CSV.File(msg)) from the DataFrames and CSV packages, respectively. Alternatively String(msg) will convert it to a string, which you can write to a file.
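The byte-vector handling can be tried without a network connection. The sketch below builds a small stand-in payload (values copied from the aspirin example above) and splits it by hand with Base functions only; for real results, CSV.File as described above is the more robust choice.

```julia
# Stand-in for the bytes a CSV query might return.
msg = Vector{UInt8}("CID,MolecularFormula,XLogP\n2244,C9H8O4,1.2\n")

# String(msg) may take ownership of the buffer and leave `msg` empty,
# so copy first if the bytes are needed again later.
text = String(copy(msg))

# Split into lines, then into comma-separated fields.
rows = split(chomp(text), '\n')
header = split(rows[1], ',')
fields = split(rows[2], ',')
record = Dict(zip(header, fields))

cid = parse(Int, record["CID"])
```

To save the result instead, write(filename, text) stores the string in a file.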

Example

julia> using PubChemCrawler, CSV, DataFrames

julia> cid = get_cid(name="estriol")
5756

julia> df = CSV.File(query_substructure(;cid)) |> DataFrame      # on Julia 1.0, use `(;cid=cid)`
11607×4 DataFrame
│ Row  │ CID       │ MolecularFormula │ MolecularWeight │ XLogP    │
│      │ Int64     │ String           │ Float64         │ Float64? │
├──────┼───────────┼──────────────────┼─────────────────┼──────────┤
│ 1    │ 5756      │ C18H24O3         │ 288.4           │ 2.5      │
│ 2    │ 5281904   │ C24H32O9         │ 464.5           │ 1.1      │
│ 3    │ 27125     │ C18H24O4         │ 304.4           │ 1.5      │
...

This queries for compounds that contain estriol as a substructure, i.e., derivatives of estriol.

Info

For complex queries that risk timing out, consider query_substructure_pug in combination with get_for_cids.


PubChemCrawler.get_for_cids — Function

msg = get_for_cids(cids; properties|xrefs|cids_type|record_type, output="CSV")

Retrieve the given properties, xrefs, CIDs, or records, respectively, for a list of compounds specified by their cids. The documentation for these traits can be found at http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest; this URL will be referred to as PUGREST below.

properties include structural features like the molecular formula, number of undefined stereocenters, and so on. Specify these as a comma-separated list from among the choices in PUGREST under "Compound Property Tables".
xrefs ("cross-references") include identifiers used by other databases, e.g., the CAS (Registry) number, PubMedID, and so on. The supported values for xrefs are available at PUGREST under "XRefs".
cids_type is used to retrieve CIDs for compounds related to those specified in cids; see PUGREST under "SIDS / CIDS / AIDS".
record_type is used to retrieve data files and to specify options for these files, e.g., 2d or 3d SDF files. See PUGREST under "Full-record Retrieval".

output specifies the output format. Not all options are applicable to all queries; for example, "CSV" is appropriate for properties queries but "SDF" might be used for a record_type query. See PUGREST, "Output".

Examples

julia> using PubChemCrawler, CSV, DataFrames, JSON3

julia> cids = [get_cid(name="cyclic guanosine monophosphate"), get_cid(name="aspirin")]
2-element Array{Int64,1}:
 135398570
      2244

julia> CSV.File(get_for_cids(cids; properties="MolecularFormula,XLogP", output="CSV")) |> DataFrame
2×3 DataFrame
 Row │ CID        MolecularFormula  XLogP
     │ Int64      String            Float64
─────┼──────────────────────────────────────
   1 │ 135398570  C10H12N5O7P          -3.4
   2 │      2244  C9H8O4                1.2

julia> open("/tmp/aspirin_3d.sdf", "w") do io    # save the 3d SDF file for aspirin (CID 2244)
           write(io, get_for_cids(2244; record_type="3d", output="SDF"))
       end
4055

julia> dct = JSON3.read(get_for_cids(cids; xrefs="RN,", output="JSON"));   # get the Registry Number(s) (CAS)

julia> dct[:InformationList][:Information]
2-element JSON3.Array{JSON3.Object,Array{UInt8,1},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}:
 {
   "CID": 135398570,
    "RN": [
            "40732-48-7",
            "7665-99-8"
          ]
}
 {
   "CID": 2244,
    "RN": [
            "11126-35-5",
            "156865-15-5",
            "50-78-2",
            "52080-78-1",
            "921943-73-9",
            "98201-60-6",
            "99512-66-0"
          ]
}


PubChemCrawler.pug — Function

pug(args...; silent = true, escape_args = true, return_text = true, status_exception = false, kwargs...)

Generate a PUG endpoint and call it. The details about PUG endpoints are described here: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest

Keyword arguments:

escape_args = true, URL-encode each argument before generating the endpoint.

Setting this to false is useful when copy-pasting an existing PUG endpoint, e.g. from documentation.

silent = false, print the PUG URL that is called.
return_text = true, call String on the output to return a string rather than a byte vector.
status_exception = false, tell HTTP.jl not to throw an exception on return codes >= 300.

Other keyword arguments are passed on to HTTP.request.
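To illustrate why escape_args matters, here is a rough sketch of how an endpoint could be assembled from its arguments. It is not the package's implementation: sketch_pug_url and percent_encode are hypothetical names, and the base URL is taken from the Info lines in the examples below.

```julia
const PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Percent-encode every byte outside the RFC 3986 "unreserved" set.
function percent_encode(s::AbstractString)
    io = IOBuffer()
    for b in codeunits(s)
        c = Char(b)
        if b < 0x80 && (isletter(c) || isdigit(c) || c in "-._~")
            print(io, c)
        else
            print(io, '%', uppercase(string(b, base=16, pad=2)))
        end
    end
    return String(take!(io))
end

# Hypothetical stand-in for endpoint generation: join the arguments with "/".
function sketch_pug_url(args...; escape_args=true)
    segments = [escape_args ? percent_encode(string(a)) : string(a) for a in args]
    return PUG_BASE * "/" * join(segments, "/")
end

url_plain  = sketch_pug_url(:compound, :name, "ethanol", :cids, :txt)
url_pasted = sketch_pug_url("compound/cid/2244", :cids, :txt; escape_args=false)
url_broken = sketch_pug_url("compound/cid/2244", :cids, :txt)   # "/" becomes %2F
```

With escaping enabled, the slashes inside a pasted path fragment are encoded and the endpoint no longer has the intended structure, which is why disabling it is convenient for copy-pasted endpoints.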

Examples:

julia> pug(:compound, :name, "ethanol", :cids, :txt, silent = false, return_text = true)
[ Info: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ethanol/cids/txt
"702"

julia> pug("compound/cid/2244", :cids, :txt, escape_args = false, silent = false, return_text = true)
[ Info: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/cids/txt
"2244"

julia> pug(:compound, :smiles, "C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O", :cids, :txt, return_text = true)
"5793"

julia> pug(:compound, :cid, 708, :txt, return_text = true, status_exception = false)
"Status: 400
Code: PUGREST.BadRequest
Message: Invalid output format
Detail: Full-record output format must be one of ASNT/B, XML, JSON(P), SDF, or PNG"


PubChemCrawler.get_synonyms — Function

synonyms = get_synonyms(name="glucose")
synonyms = get_synonyms(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")
synonyms = get_synonyms(cid=5793)

Return a list of substance or compound synonyms.


Utilities

PubChemCrawler.parse_formula — Function

atomcounts = parse_formula(str::AbstractString)

Parse str as a chemical formula and return a list of atom=>multiplicity pairs.

Example

julia> parse_formula("C2CaH2O6")
4-element Vector{Pair{String, Int64}}:
  "C" => 2
 "Ca" => 1
  "H" => 2
  "O" => 6
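A vector of pairs like this converts directly into a Dict for lookups. The sketch below starts from a literal copy of the output above rather than calling parse_formula, so it runs without the package.

```julia
# Literal copy of the result shown above for "C2CaH2O6".
atomcounts = ["C" => 2, "Ca" => 1, "H" => 2, "O" => 6]

# Look up multiplicities by element symbol.
counts = Dict(atomcounts)
n_oxygen = counts["O"]
n_nitrogen = get(counts, "N", 0)   # elements that are absent count as 0

# Total number of atoms, and number of non-hydrogen atoms.
n_atoms = sum(last, atomcounts)
n_heavy = sum(n for (el, n) in atomcounts if el != "H")
```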


PubChemCrawler.atomregex — Function

rex = atomregex(chemicalsymbol)

Create a regular expression for detecting how many atoms of type chemicalsymbol are in a molecular formula.

Examples

The formula for calcium bicarbonate is Ca(HCO3)2, i.e., C2CaH2O6.

julia> match(atomregex("C"), "C2CaH2O6")
RegexMatch("C2", 1="2")

julia> match(atomregex("Ca"), "C2CaH2O6")
RegexMatch("Ca", 1="")

julia> match(atomregex("H"), "C2CaH2O6")
RegexMatch("H2", 1="2")

julia> match(atomregex("O"), "C2CaH2O6")
RegexMatch("O6", 1="6")

Note that the regex for "C" does not match "Ca".
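A regular expression with this behavior can be sketched using a negative lookahead. The pattern below is an assumption that reproduces the matches shown above; it is not necessarily the pattern atomregex builds. sketch_atomregex and atom_count are hypothetical names.

```julia
# Symbol, not followed by a lowercase letter, then an optional count.
sketch_atomregex(sym::AbstractString) = Regex(sym * "(?![a-z])(\\d*)")

# Turn the first match into a count; an empty capture means a single atom.
function atom_count(sym::AbstractString, formula::AbstractString)
    m = match(sketch_atomregex(sym), formula)
    m === nothing && return 0
    numstr = m.captures[1]
    return isempty(numstr) ? 1 : parse(Int, numstr)
end

formula = "C2CaH2O6"
counts = [sym => atom_count(sym, formula) for sym in ("C", "Ca", "H", "O", "N")]
```

The lookahead is what keeps "C" from matching the "C" in "Ca". This sketch reads only the first occurrence of a symbol, which suffices for formulas that list each element once.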
