PubChemCrawler

PubChemCrawler makes it easier to search the PubChem database from Julia. You can use it to access information about particular compounds or query substructures.

The package supports only a subset of the available functionality, but it is fairly straightforward to expand to other types of query. Pull requests are welcome! If you do want to make improvements to this package, this tutorial might help you get started.

Before you start: be aware of search limits

PubChem places significant limits on requests:

  • No more than 5 requests per second
  • No more than 400 requests per minute
  • No more than 300 seconds of running time per minute
  • Requests made via REST time out after 30s. The PUG XML interface does not have this limitation. For substructure searches, query_substructure_pug is recommended.
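
One way to respect the per-second limit is to space requests out on the client side. The following is a minimal sketch, not part of PubChemCrawler: `Throttle` and `throttled` are hypothetical names, and the 0.2 s interval is simply one fifth of a second. It addresses only the 5-requests-per-second limit, not the per-minute limits.

```julia
# Hypothetical helper (not part of PubChemCrawler): enforce a minimum
# spacing between successive calls.
mutable struct Throttle
    min_interval::Float64   # minimum number of seconds between calls
    last_call::Float64      # time() of the most recent call
end
Throttle(min_interval::Real) = Throttle(Float64(min_interval), -Inf)

function throttled(f, t::Throttle)
    wait_time = t.min_interval - (time() - t.last_call)
    wait_time > 0 && sleep(wait_time)   # pause only if the last call was too recent
    t.last_call = time()
    return f()
end

t = Throttle(0.2)   # at most 5 calls per second
squares = [throttled(() -> i^2, t) for i in 1:3]   # stand-in work, no network access
```

With the package loaded, the same wrapper could surround a real request, for example `throttled(() -> get_cid(name="aspirin"), t)`.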

Getting started

Some queries make use of the CID, the Compound IDentifier, which you can obtain in a variety of ways. Let's get the CID for aspirin:

julia> cid = get_cid(name="aspirin")
2244

julia> cid = get_cid(smiles="CC(=O)OC1=CC=CC=C1C(=O)O")   # use the SMILES string
2244

You can then retrieve individual properties:

julia> smiles = chomp(String(get_for_cids(2244, properties="CanonicalSMILES", output="TXT")))
"CC(=O)OC1=CC=CC=C1C(=O)O"

or a list of properties:

julia> using CSV, DataFrames

julia> df = CSV.File(get_for_cids(2244; properties="MolecularFormula,MolecularWeight,XLogP,IsomericSMILES", output="CSV")) |> DataFrame
1×5 DataFrame
│ Row │ CID   │ MolecularFormula │ MolecularWeight │ XLogP   │ IsomericSMILES           │
│     │ Int64 │ String           │ Float64         │ Float64 │ String                   │
├─────┼───────┼──────────────────┼─────────────────┼─────────┼──────────────────────────┤
│ 1   │ 2244  │ C9H8O4           │ 180.16          │ 1.2     │ CC(=O)OC1=CC=CC=C1C(=O)O │

You can also pass a vector of CIDs to query properties for several compounds at once.

You can also download structure data and save it to a file. This saves a 3D conformer for aspirin:

julia> open("/tmp/aspirin.sdf", "w") do io
           write(io, get_for_cids(2244, output="SDF", record_type="3d"))
       end
3637

Finally, you can perform substructure searches. Let's retrieve up to 10 bicyclic compounds using a SMARTS search:

julia> cids = query_substructure_pug(smarts = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]", maxhits = 10)
┌ Warning: maxhits was hit, results are partial
└ @ PubChemCrawler ~/.julia/dev/PubChemCrawler/src/pugxml.jl:164
10-element Vector{Int64}:
 135398658
   5280795
      5430
      5143
  54675779
   5280961
   5280804
   5280793
   5280343
   3034034

Note that Julia (not this package) requires that each $ character in the SMARTS string be escaped, because $ triggers string interpolation in ordinary string literals.
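
Julia's `raw"..."` string literals perform no interpolation, so they offer a way to write a SMARTS pattern without the backslashes. A short sketch showing that the two spellings give the same string (the variable names are only for illustration):

```julia
# Ordinary string literal: each `$` must be escaped.
escaped = "[\$([*R2]([*R])([*R])([*R]))].[\$([*R2]([*R])([*R])([*R]))]"

# Raw string literal: `$` is taken literally.
unescaped = raw"[$([*R2]([*R])([*R])([*R]))].[$([*R2]([*R])([*R])([*R]))]"

escaped == unescaped   # true
```

Either form can be passed as the `smarts` keyword.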

API

Queries

PubChemCrawler.get_cid (Function)
cid = get_cid(name="glucose")
cid = get_cid(smiles="C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O")

Return the PubChem compound identification number for the specified compound.

PubChemCrawler.query_substructure_pug (Function)
cids = query_substructure_pug(;cid=nothing, smiles=nothing, smarts=nothing,        # specifier for the substructure to search for
                               maxhits=200_000, poll_interval=10)

Retrieve a list of compounds containing a substructure specified via its cid, the SMILES string, or a SMARTS string.

Example

julia> using PubChemCrawler

julia> cids = query_substructure_pug(smarts="[r13]Br")   # query brominated 13-atom rings
66-element Vector{Int64}:
  54533707
 153064026
 152829033
...

PUG searches can take a while to run (they poll for completion), but conversely they allow more complex, long-running searches to succeed. See also query_substructure.

PubChemCrawler.query_substructure (Function)
msg = query_substructure(;cid=nothing, smiles=nothing, smarts=nothing,           # specifier for the substructure to search for
                          properties="MolecularFormula,MolecularWeight,XLogP,",  # properties to retrieve
                          output="CSV")                                          # output format

Perform a substructure search of the entire database. You can specify the target via its cid, the SMILES string, or a SMARTS string. Specify the properties you want to retrieve as a comma-separated list from among the choices in http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest, "Compound Property Tables". Requesting more properties takes more time.

The output is a Vector{UInt8}. For output="CSV", a good choice to generate a manipulable result is DataFrame(CSV.File(msg)) from the DataFrames and CSV packages, respectively. Alternatively String(msg) will convert it to a string, which you can write to a file.
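
As a small illustration of handling the raw bytes, the sketch below uses a hand-written stand-in for `msg`, so no request is made. One detail worth knowing: calling `String(v)` on a `Vector{UInt8}` can take ownership of the data and leave `v` empty, so copy first if the bytes are needed again.

```julia
# Stand-in for the bytes returned by a query; the content is illustrative only.
msg = Vector{UInt8}("CID,MolecularFormula\n2244,C9H8O4\n")

text = String(copy(msg))   # copy so that `msg` remains usable afterwards

path = joinpath(mktempdir(), "result.csv")
write(path, msg)           # the bytes can also be written to a file directly

read(path, String) == text
```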

Example

julia> using PubChemCrawler, CSV, DataFrames

julia> cid = get_cid(name="estriol")
5756

julia> df = CSV.File(query_substructure(;cid)) |> DataFrame      # on Julia versions before 1.5, use `(;cid=cid)`
11607×4 DataFrame
│ Row  │ CID       │ MolecularFormula │ MolecularWeight │ XLogP    │
│      │ Int64     │ String           │ Float64         │ Float64? │
├──────┼───────────┼──────────────────┼─────────────────┼──────────┤
│ 1    │ 5756      │ C18H24O3         │ 288.4           │ 2.5      │
│ 2    │ 5281904   │ C24H32O9         │ 464.5           │ 1.1      │
│ 3    │ 27125     │ C18H24O4         │ 304.4           │ 1.5      │
...

This queries for derivatives of estriol, i.e., compounds that contain estriol as a substructure.

Info

For complex queries that risk timing out, consider query_substructure_pug in combination with get_for_cids.
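
A sketch of that combination: obtain the CIDs first, then request properties in batches so that no single REST call becomes large. `fetch_in_batches` and the batch size of 100 are hypothetical and not part of the package; the fetch function is passed in as an argument so the helper can be tried without network access.

```julia
# Hypothetical helper: call `fetchfn` on successive chunks of `cids`
# and collect the results.
function fetch_in_batches(fetchfn, cids::AbstractVector{<:Integer}; batchsize::Integer=100)
    return [fetchfn(collect(chunk)) for chunk in Iterators.partition(cids, batchsize)]
end

# Intended use with the package (requires network access):
#   cids = query_substructure_pug(smarts="[r13]Br")
#   msgs = fetch_in_batches(cids; batchsize=100) do chunk
#       get_for_cids(chunk; properties="MolecularFormula,XLogP", output="CSV")
#   end

# Offline demonstration with a stand-in fetch function:
sizes = fetch_in_batches(length, collect(1:250); batchsize=100)   # [100, 100, 50]
```

Each batch would come back as its own CSV message with its own header row, so the results need to be parsed separately and then concatenated.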

PubChemCrawler.get_for_cids (Function)
msg = get_for_cids(cids; properties|xrefs|cids_type|record_type, output="CSV")

Retrieve the given properties, xrefs, CIDs, or records, respectively, for a list of compounds specified by their cids. The documentation for these traits can be found at http://pubchemdocs.ncbi.nlm.nih.gov/pug-rest; this URL will be referred to as PUGREST below.

  • properties include structural features like the molecular formula, number of undefined stereocenters, and so on. Specify these as a comma-separated list from among the choices in PUGREST under "Compound Property Tables".
  • xrefs ("cross-references") include identifiers used by other databases, e.g., the CAS (Registry) number, PubMedID, and so on. The supported values for xrefs are available at PUGREST under "XRefs".
  • cids_type is used to retrieve CIDs for compounds related to those specified in cids; see PUGREST under "SIDS / CIDS / AIDS".
  • record_type is used to retrieve data files and to specify options for these files, e.g., 2d or 3d SDF files. See PUGREST under "Full-record Retrieval".

output specifies the output format. Not all options are applicable to all queries; for example, "CSV" is appropriate for properties queries but "SDF" might be used for a record_type query. See PUGREST, "Output".

Examples

julia> using PubChemCrawler, CSV, DataFrames, JSON3

julia> cids = [get_cid(name="cyclic guanosine monophosphate"), get_cid(name="aspirin")]
2-element Array{Int64,1}:
 135398570
      2244

julia> CSV.File(get_for_cids(cids; properties="MolecularFormula,XLogP", output="CSV")) |> DataFrame
2×3 DataFrame
 Row │ CID        MolecularFormula  XLogP
     │ Int64      String            Float64
─────┼──────────────────────────────────────
   1 │ 135398570  C10H12N5O7P          -3.4
   2 │      2244  C9H8O4                1.2

julia> open("/tmp/aspirin_3d.sdf", "w") do io    # save the 3d SDF file for aspirin (CID 2244)
           write(io, get_for_cids(2244; record_type="3d", output="SDF"))
       end
4055

julia> dct = JSON3.read(get_for_cids(cids; xrefs="RN,", output="JSON"));   # get the Registry Number(s) (CAS)

julia> dct[:InformationList][:Information]
2-element JSON3.Array{JSON3.Object,Array{UInt8,1},SubArray{UInt64,1,Array{UInt64,1},Tuple{UnitRange{Int64}},true}}:
 {
   "CID": 135398570,
    "RN": [
            "40732-48-7",
            "7665-99-8"
          ]
}
 {
   "CID": 2244,
    "RN": [
            "11126-35-5",
            "156865-15-5",
            "50-78-2",
            "52080-78-1",
            "921943-73-9",
            "98201-60-6",
            "99512-66-0"
          ]
}

Utilities

PubChemCrawler.parse_formula (Function)
atomcounts = parse_formula(str::AbstractString)

Parse str as a chemical formula and return a list of atom=>multiplicity pairs.

Example

julia> parse_formula("C2CaH2O6")
4-element Array{Pair{String,Int64},1}:
  "C" => 2
 "Ca" => 1
  "H" => 2
  "O" => 6
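
The returned pairs work well with ordinary collection functions. The sketch below copies the result shown above into a literal (so it runs without the package) and estimates a molar mass; the `atomic_mass` table is an assumption supplied for illustration and is not provided by PubChemCrawler.

```julia
# Stand-in for parse_formula("C2CaH2O6"), copied from the example output.
atomcounts = ["C" => 2, "Ca" => 1, "H" => 2, "O" => 6]

# Approximate standard atomic masses in g/mol (assumed values for this sketch).
atomic_mass = Dict("H" => 1.008, "C" => 12.011, "O" => 15.999, "Ca" => 40.078)

counts = Dict(atomcounts)        # look up a count by element symbol
natoms = sum(last, atomcounts)   # total number of atoms: 11
mass = sum(atomic_mass[sym] * n for (sym, n) in atomcounts)   # about 162.11
```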

PubChemCrawler.atomregex (Function)
rex = atomregex(chemicalsymbol)

Create a regular expression for detecting how many atoms of type chemicalsymbol are in a molecular formula.

Examples

The formula for calcium bicarbonate is Ca(HCO3)2, i.e., C2CaH2O6.

julia> match(atomregex("C"), "C2CaH2O6")
RegexMatch("C2", 1="2")

julia> match(atomregex("Ca"), "C2CaH2O6")
RegexMatch("Ca", 1="")

julia> match(atomregex("H"), "C2CaH2O6")
RegexMatch("H2", 1="2")

julia> match(atomregex("O"), "C2CaH2O6")
RegexMatch("O6", 1="6")

Note that the regex for "C" does not match "Ca".
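
To turn such a match into a number, note that the capture is empty when the symbol appears exactly once. The sketch below is a hypothetical helper, not part of the package. So that it runs on its own, it uses hand-written stand-in regexes that behave like the examples above; `atomregex` may build its pattern differently.

```julia
# Hypothetical helper: convert a match (or `nothing`) into an atom count.
function count_from_match(m::Union{RegexMatch,Nothing})
    m === nothing && return 0            # symbol not present in the formula
    num = m.captures[1]
    return isempty(num) ? 1 : parse(Int, num)   # empty capture means one atom
end

# Stand-in patterns for this sketch; the lookahead keeps "C" from matching "Ca".
rex_C  = r"C(\d*)(?![a-z])"
rex_Ca = r"Ca(\d*)(?![a-z])"
rex_N  = r"N(\d*)(?![a-z])"

count_from_match(match(rex_C,  "C2CaH2O6"))   # 2
count_from_match(match(rex_Ca, "C2CaH2O6"))   # 1
count_from_match(match(rex_N,  "C2CaH2O6"))   # 0
```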
