PubMed

Utility functions to parse and store PubMed searches via BioServices.EUtils

Import Module

using BioMedQuery.PubMed

This module provides utility functions to parse, store and export queries to PubMed via the NCBI EUtils and its julia interface BioServices.EUtils. For many purposes you may interact with the higher level pipelines in [BioMedQuery.Processes]. Here, some of the lower level functions are discussed in case you need to assemble different pipelines.

Basics of searching PubMed

We are often interested in searching PubMed for all articles related to a search term, and possibly restricted by other search criteria. To do so we use BioServices.EUtils. A basic example of how we may use the functions esearch and efetch to accomplish such task is illustrated below.

using BioServices.EUtils
using XMLDict
using LightXML

search_term = "obstructive sleep apnea[MeSH Major Topic]"

#esearch
esearch_response = esearch(db="pubmed", term = search_term,
retstart = 0, retmax = 20, tool ="BioJulia")

#convert xml to dictionary
esearch_dict = parse_xml(String(esearch_response.body))

#convert id's to a array of numbers
ids = [parse(Int64, id_node) for id_node in esearch_dict["IdList"]["Id"]]

#efetch
efetch_response = efetch(db = "pubmed", tool = "BioJulia", retmode = "xml", rettype = "null", id = ids)

#convert xml to xml node tree
efetch_doc = root(parse_string(String(efetch_response.body)))

Handling XML responses

Many APIs return responses in XML form.

To parse an XML to a Julia dictionary we can use the XMLDict package

using XMLDict
dict = parse_xml(String(response.body))  

You can save directly the XML String to file

xdoc = parse_string(esearch)
save_file(xdoc, "./file.xml")

Save eseach/efetch responses

Save PMIDs to MySQL

If we are only interseted in saving a list of PMIDs associated with a query, we can do so as follows

dbname = "entrez_test"
host = "127.0.0.1";
user = "root"
pwd = ""

#Collect PMIDs from esearch result
ids = Array{Int64,1}()
for id_node in esearch_dict["IdList"]["Id"]
    push!(ids, parse(Int64, id_node))
end

# Initialize or connect to database
const conn = DBUtils.init_mysql_database(host, user, pwd, dbname)

# Create `article` table to store pmids
PubMed.create_pmid_table!(conn)

#Save pmids
PubMed.save_pmids!(conn, ids)

#query the article table to explore list of pmids
all_pmids = BioMedQuery.PubMed.all_pmids(conn)

Export efetch response as EndNote citation file

We can export the information returned by efetch as and EndNote/BibTex library file

citation = PubMed.CitationOutput("endnote", "./citations_temp.endnote", true)
nsucceses = PubMed.save_efetch!(citation, efetch_doc, verbose)

Save efetch response to MySQL database

Save the information returned by efetch to a MySQL database

dbname = "efetch_test"
host = "127.0.0.1";
user = "root"
pwd = ""

# Save results of efetch to database and cleanup intermediate CSV files
const conn = DBUtils.init_mysql_database(host, user, pwd, dbname)
PubMed.create_tables!(conn)
PubMed.save_efetch!(conn, efetch_doc, false, true) # verbose = false, drop_csv = true

Save efetch response to SQLite database

Save the information returned by efetch to a MySQL database

db_path = "./test_db.db"

const conn = SQLite.DB(db_path)
PubMed.create_tables!(conn)
PubMed.save_efetch!(conn, efetch_doc)

Return efetch response as a dictionary of DataFrames

The information returned by efetch can also be returned as dataframes. The dataframes match the format of the tables that are created for the sql saving functions (schema image below). These dataframes can also easily be saved to csv files.

    dfs = PubMed.parse(efetch_doc)

    PubMed.dfs_to_csv(dfs, "my/path", "my_file_prefix_")

Exploring output databases

The following schema has been used to store the results. If you are interested in having this module store additional fields, feel free to open an issue

alt

We can also explore the tables using BioMedQuery.DBUtils, e,g

tables = ["author_ref", "mesh_desc",
"mesh_qual", "mesh_heading"]

for t in tables
    query_str = "SELECT * FROM "*t*" LIMIT 10;"
    q = DBUtils.db_query(db, query_str)
    println(q)
end

Index

Structs and Functions

abstracts(db; local_medline=false)

Return all abstracts related to PMIDs in the database. If local_medline flag is set to true, it is assumed that db contains basic table with only PMIDs and all other info is available in a (same host) medline database

source
abstracts_by_year(db, pub_year; local_medline=false)

Return all abstracts of article published in the given year. If local_medline flag is set to true, it is assumed that db contains article table with only PMIDs and all other info is available in a (same host) medline database

source
add_mysql_keys!(conn)

Adds indices/keys to MySQL PubMed tables.

source
all_pmids(db)

Return all PMIDs stored in the basic table of the input database

source
citations_bibtex(article::Dict{String,DataFrame}, verbose=false)

Transforms a Dictionary of pubmed dataframes into text corresponding to its bibtex citation

source
citations_endnote(article::Dict{String,DataFrame}, verbose=false)

Transforms a Dictionary of pubmed dataframes into text corresponding to its endnote citation

source
create_pmid_table!(conn; tablename="article")

Creates a table, using either MySQL of SQLite, to store PMIDs from Entrez related searches. All tables are empty at this point

source
create_tables!(conn)

Create and initialize tables to save results from an Entrez/PubMed search or a medline file load. Caution, all related tables are dropped if they exist

source
db_insert!(conn, articles::Dict{String,DataFrame}, csv_path=pwd(), csv_prefix="<current date>_PubMed_"; verbose=false, drop_csvs=false)

Writes dictionary of dataframes to a MySQL database. Tables must already exist (see PubMed.create_tables!). CSVs that are created during writing can be saved (default) or removed.

source
db_insert!(conn, csv_path=pwd(), csv_prefix="<current date>_PubMed_"; verbose=false, drop_csvs=false)

Writes CSVs from PubMed parsing to a MySQL database. Tables must already exist (see PubMed.create_tables!). CSVs can optionally be removed after being written to DB.

source
db_insert!(conn, articles::Dict{String,DataFrame}, csv_path=pwd(), csv_prefix="<current date>_PubMed_"; verbose=false, drop_csvs=false)

Writes dictionary of dataframes to a SQLite database. Tables must already exist (see PubMed.create_tables!). CSVs that are created during writing can be saved (default) or removed.

source
dfs_to_csv(dfs::Dict, path::String, [file_prefix::String])

Takes output of toDataFrames and writes to CSV files at the provided path and with the file prefix.

source
drop_mysql_keys!(conn)

Removes keys/indices from MySQL PubMed tables.

source
get_article_mesh(db, pmid)

Get the all mesh-descriptors associated with a given article

source
get_article_mesh_by_concept(db, pmid, umls_concepts...; local_medline)

Get the all mesh-descriptors associated with a given article

Arguments:

  • query_string: "" - assumes full set of results were saved by BioMedQuery directly from XML
source
parse_articles(xml)

Parses a PubMedArticleSet that matches the NCBI-XML format

source
save_efetch!(output::CitationOutput, efetch_dict, verbose=false)

Save the results of a Entrez efetch to a bibliography file, with format and file path given by output::CitationOutput

source
 pubmed_save_efetch(efetch_dict, conn)

Save the results (dictionary) of an entrez-pubmed fetch to the input database.

source
 save_pmids!(conn, pmids::Vector{Int64}, verbose::Bool=false)

Save a list of PMIDS into input database.

Arguments:

  • conn: Database connection (MySQL or SQLite)
  • pmids: Array of PMIDs
  • verbose: Boolean to turn on extra print statements
source
all_mesh(db)

Return all MeSH stored in the mesh_desc table of the input database

source
dict_to_array(dict::Dict)

Given a dictionary, returns a tuple of arrays with the keys and values.

source
parse_MedlineDate(ml_dt::String)

Parses the contents of the MedlineDate element and returns a tuple of the year and month.

source
parse_author

Takes xml for author, and returns parsed elements

source
parse_month(mon::String)

Parses the string month (month or season) and returns an integer with the first month in range.

source
parse_orcid(raw_orc::String)

Takes a string containing an ORC ID (url, 16 digit string) and returns a formatted ID (0000-1111-2222-3333).

source
parse_year(yr::String)

Parses the string year and returns an integer with the first year in range.

source
remove_csvs(dfs, path, file_prefix)

Removes all of the CSV files associated with a dictionary of dataframes

source
remove_csvs(paths::Vector)

Removes all of the CSV files associated with an array of paths

source
strip_newline(val::String)

Replaces new line characters with spaces.

source