Processes/Workflows

This module provides common processes/workflows for the BioMedQuery utilities. For instance, searching PubMed requires calling the NCBI e-utils in a particular order, and the results are often saved to a database afterwards. This module contains pre-assembled functions that perform all the necessary steps. To see sample scripts that use these processes, refer to the following section.
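As a rough illustration of the "particular order" mentioned above, the sketch below builds the two request URLs involved in a PubMed search: ESearch to obtain the matching PMIDs, then EFetch to retrieve their records. It only constructs strings and sends no request. The helper names (`percent_encode`, `esearch_url`, `efetch_url`) are hypothetical and are not part of BioMedQuery; the PMIDs in the usage line are placeholders.

```julia
# Sketch of the two-step E-utilities sequence: ESearch first, then EFetch.
# Only URL strings are built here; nothing is sent over the network.

const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Bytes that may appear unescaped in a query value (RFC 3986 "unreserved" set).
const UNRESERVED = Set(codeunits(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~"))

# Hypothetical helper: percent-encode a string for use as a query value.
function percent_encode(s::AbstractString)
    parts = String[]
    for b in codeunits(s)
        if b in UNRESERVED
            push!(parts, string(Char(b)))
        else
            push!(parts, "%" * uppercase(string(b, base = 16, pad = 2)))
        end
    end
    return join(parts)
end

# Step 1: ESearch returns the PMIDs that match a search term.
function esearch_url(term::AbstractString; retmax::Integer = 20)
    encoded = percent_encode(term)
    return "$(EUTILS_BASE)/esearch.fcgi?db=pubmed&term=$(encoded)&retmax=$(retmax)"
end

# Step 2: EFetch retrieves the XML records for the PMIDs found in step 1.
function efetch_url(pmids::AbstractVector{<:Integer})
    ids = join(pmids, ",")
    return "$(EUTILS_BASE)/efetch.fcgi?db=pubmed&id=$(ids)&retmode=xml"
end

println(esearch_url("asthma[MeSH Terms]", retmax = 5))
println(efetch_url([12345, 67890]))  # placeholder PMIDs
```

The functions in this module wrap this sequence, together with parsing and the optional database save, so the URLs rarely need to be built by hand.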

Import

using BioMedQuery.Processes

Functions

export_citation(pmid::Int64, citation_type, output_file, verbose)

Export the citation for the PubMed article identified by the given pmid to an output file.

Arguments

  • citation_type::String: Currently supported types include "endnote" and "bibtex"
source
export_citation(pmids::Vector{Int64}, citation_type, output_file, verbose)

Export the citations for the collection of PubMed articles identified by the given pmids to an output file.

Arguments

  • citation_type::String: Currently supported types include "endnote" and "bibtex"
source
load_medline(db_con, output_dir; start_file=1, end_file=972, year=2019, test=false)

Given a MySQL connection and, optionally, the start and end files, fetches the medline files, parses the XML, and loads the result into a MySQL DB (assumes the tables already exist). The raw (xml.gz) and parsed (csv) files are stored in output_dir.

Arguments

  • db_con : a MySQL connection to a db (tables must already be created - see PubMed.create_tables!)
  • output_dir : root directory where the raw and parsed files should be stored
  • start_file : the medline file number at which loading starts (default is 1)
  • end_file : the medline file number at which loading ends (default is 972, the last file in the 2019 baseline)
  • year : the year of the medline baseline (default is 2019)
  • test : if true, a sample file will be downloaded, parsed, and loaded instead of the baseline files
source
map_mesh_to_umls_async!(db, c::Credentials; timeout, append_results, verbose)

Build (using async UMLS-API calls) and store in the given database a map from MESH descriptors to UMLS Semantic Concepts. For large queries this function will be faster than its synchronous counterpart.

Arguments

  • db: Database. Must contain TABLE:mesh_descriptor. For each of the descriptors in that table, search and insert the associated semantic concepts into a new (cleared) TABLE:mesh2umls
  • user : UMLS username
  • psswd : UMLS Password
  • append_results::Bool : If false, a NEW and EMPTY mesh2umls database table is created
  • batch_size: Number of
source
map_mesh_to_umls_async(mesh_df, user, psswd; timeout, append_results, verbose)

Build (using async UMLS-API calls) and return a map from MESH descriptors to UMLS Semantic Concepts. For large queries this function will be faster than its synchronous counterpart.

Arguments

  • mesh_df: DataFrame containing MeshDescriptors. This is the dataframe with the key `meshdesc` that is returned from pubmed_search_and_parse.
  • user : UMLS username
  • psswd : UMLS Password
source
pubmed_search_and_parse(email, search_term::String, article_max, verbose=false)

Search PubMed and parse the results into a dictionary of DataFrames. The dataframes have the same names and fields as the PubMed database schema (e.g. df_dict["basic"] returns a dataframe with the basic article info).

Arguments

  • email : valid email address (otherwise PubMed may block you)
  • search_term : search string to submit to PubMed, e.g. (asthma[MeSH Terms]) AND ("2001/01/29"[Date - Publication] : "2010"[Date - Publication]). See http://www.ncbi.nlm.nih.gov/pubmed/advanced for help constructing the string
  • article_max : maximum number of articles to return
  • verbose : if true, the NCBI XML response files are saved to the current directory
source
pubmed_search_and_save!(email, search_term::String, article_max, conn, verbose=false)

Search PubMed and save the results into a database connection. The database is expected to exist and to have the appropriate PubMed-related tables. You can create such tables using PubMed.create_tables!(conn)

Arguments

  • email : valid email address (otherwise PubMed may block you)
  • search_term : search string to submit to PubMed, e.g. (asthma[MeSH Terms]) AND ("2001/01/29"[Date - Publication] : "2010"[Date - Publication]). See http://www.ncbi.nlm.nih.gov/pubmed/advanced for help constructing the string
  • article_max : maximum number of articles to return
  • conn : database connection
  • verbose : if true, the NCBI XML response files are saved to the current directory
source
umls_semantic_occurrences(db, umls_semantic_type)

Return a sparse matrix indicating the presence of MESH descriptors associated with a given umls semantic type in all articles of the input database

Output

  • des_ind_dict: Dictionary matching row number to descriptor names
  • disease_occurances : Sparse matrix. Each column is the feature vector of one article and each row corresponds to a MESH descriptor, so there are as many columns as articles. The occurrence/absence of a descriptor is labeled as 1/0
source
umls_semantic_occurrences(dfs, mesh2umls_df, umls_semantic_type)

Return a sparse matrix indicating the presence of MESH descriptors associated with a given umls semantic type in all articles of the input database

Output

  • des_ind_dict: Dictionary matching row number to descriptor names
  • disease_occurances : Sparse matrix. Each column is the feature vector of one article and each row corresponds to a MESH descriptor, so there are as many columns as articles. The occurrence/absence of a descriptor is labeled as 1/0
source
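To make the output layout of umls_semantic_occurrences concrete, here is a minimal sketch that builds a matrix of the same shape (rows are descriptors, columns are articles, entries are 1/0) from toy data. It uses only the SparseArrays standard library. The function `occurrence_matrix` and the input data are hypothetical, and this is not necessarily how the library computes its result.

```julia
using SparseArrays

# Hypothetical input: for each article, the MESH descriptors of the
# semantic type of interest that were found in that article.
article_descriptors = [
    ["Asthma", "Pneumonia"],   # article 1
    ["Asthma"],                # article 2
    String[],                  # article 3 (no matching descriptors)
]

function occurrence_matrix(article_descriptors)
    # Assign one row index to every distinct descriptor (sorted for stability).
    descriptors = sort(unique(reduce(vcat, article_descriptors)))
    row_of = Dict(name => i for (i, name) in enumerate(descriptors))

    # Collect the (row, column) coordinates of every occurrence.
    rows, cols = Int[], Int[]
    for (j, names) in enumerate(article_descriptors)
        for name in unique(names)
            push!(rows, row_of[name])
            push!(cols, j)
        end
    end

    # One row per descriptor, one column per article, 1 where it occurs.
    occurrences = sparse(rows, cols, ones(Int, length(rows)),
                         length(descriptors), length(article_descriptors))

    # Map row number back to descriptor name, as in des_ind_dict.
    des_ind_dict = Dict(i => name for (name, i) in row_of)
    return des_ind_dict, occurrences
end

des_ind_dict, occurrences = occurrence_matrix(article_descriptors)
println(des_ind_dict)
println(Matrix(occurrences))
```

A sparse representation fits this data because most descriptors are absent from most articles, so only the nonzero entries need to be stored.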
close_cons(ftp_con)

Closes the connection and cleans up.

source
get_file_name(fnum::Int, year::Int = 2018, test = false)

Returns the medline file name given the file number and year.

source
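For orientation, a mapping from file number and year to a file name could look like the sketch below. It assumes the NLM baseline naming pattern `pubmed<YY>n<NNNN>.xml.gz` (two-digit year, zero-padded four-digit file number). The function `baseline_file_name` is hypothetical; the library's own get_file_name may behave differently, in particular when test is true.

```julia
# Sketch only: assumes baseline files are named pubmed<YY>n<NNNN>.xml.gz.
# This is an illustration, not the library's get_file_name.
function baseline_file_name(fnum::Integer, year::Integer = 2018)
    yy = lpad(year % 100, 2, '0')   # two-digit year, e.g. 2019 -> "19"
    nnnn = lpad(fnum, 4, '0')       # zero-padded file number, e.g. 1 -> "0001"
    return "pubmed$(yy)n$(nnnn).xml.gz"
end

println(baseline_file_name(1, 2019))
println(baseline_file_name(972, 2019))
```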
get_ftp_con(test = false)

Gets an FTP connection.

source
get_ml_file(fname::String, conn::ConnContext, output_dir)

Retrieves the file named fname and puts it in medline/raw_files. Returns the HTTP response.

source
init_medline(output_dir, test=false)

Sets up the environment (folders), connects to the medline FTP server, and returns the connection.

source
parse_ml_file(fname::String, output_dir::String)

Parses the medline XML file into a dictionary of dataframes. Saves the resulting CSV files to medline/parsed_files.

source