This module provides common processes/workflows that use the BioMedQuery utilities. For instance, searching PubMed requires calling the NCBI e-utils in a particular order, and after the search the results are often saved to a database. This module contains pre-assembled functions that perform all the necessary steps. To see sample scripts that use these processes, refer to the following section.
## Import
```julia
using BioMedQuery.Processes
```
## Index

* BioMedQuery.Processes.close_cons
* BioMedQuery.Processes.export_citation
* BioMedQuery.Processes.export_citation
* BioMedQuery.Processes.get_file_name
* BioMedQuery.Processes.get_ftp_con
* BioMedQuery.Processes.get_ml_file
* BioMedQuery.Processes.init_medline
* BioMedQuery.Processes.load_medline!
* BioMedQuery.Processes.map_mesh_to_umls_async
* BioMedQuery.Processes.map_mesh_to_umls_async!
* BioMedQuery.Processes.parse_ml_file
* BioMedQuery.Processes.pubmed_search_and_parse
* BioMedQuery.Processes.pubmed_search_and_save!
* BioMedQuery.Processes.umls_semantic_occurrences
* BioMedQuery.Processes.umls_semantic_occurrences
## Functions
BioMedQuery.Processes.export_citation — Function

`export_citation(pmid::Int64, citation_type, output_file, verbose)`

Export, to an output file, the citation for the PubMed article identified by the given pmid.

Arguments

* `citation_type::String` : At the moment supported types include: "endnote" and "bibtex"
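For example, a minimal sketch (the PMID, output file, and format are illustrative):

```julia
using BioMedQuery.Processes

# Export the citation for one (illustrative) PMID as a BibTeX entry
export_citation(11748933, "bibtex", "citation.bib", true)
```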
BioMedQuery.Processes.export_citation — Function

`export_citation(pmids::Vector{Int64}, citation_type, output_file, verbose)`

Export, to an output file, the citations for the collection of PubMed articles identified by the given pmids.

Arguments

* `citation_type::String` : At the moment supported types include: "endnote" and "bibtex"
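The vector method works the same way; a minimal sketch with illustrative PMIDs:

```julia
using BioMedQuery.Processes

# Export citations for several (illustrative) PMIDs into a single EndNote file
pmids = [11748933, 11700088]
export_citation(pmids, "endnote", "citations.enw", true)
```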
BioMedQuery.Processes.load_medline! — Method

`load_medline!(db_con, output_dir; start_file=1, end_file=972, year=2019, test=false)`

Given a MySQL connection and, optionally, the start and end files, fetches the MEDLINE files, parses the XML, and loads the result into a MySQL database (assumes the tables already exist). The raw (xml.gz) and parsed (csv) files are stored in output_dir.

Arguments

* `db_con` : A MySQL connection to a database (tables must already be created - see `PubMed.create_tables!`)
* `output_dir` : root directory where the raw and parsed files should be stored
* `start_file` : which MEDLINE file the loading should start at
* `end_file` : which MEDLINE file the loading should end at (default is the last file in the 2019 baseline)
* `year` : which MEDLINE year to load (current is 2019)
* `test` : if true, a sample file will be downloaded, parsed, and loaded instead of the baseline files
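A minimal sketch based on the signature above; the connection settings, database name, and output directory are placeholders, the tables are assumed to be created with `PubMed.create_tables!`, and the `MySQL.connect` form follows the older MySQL.jl API used in the BioMedQuery examples:

```julia
using MySQL
using BioMedQuery.Processes
using BioMedQuery.PubMed

# Placeholder connection settings - replace host, user, password, and database name
db_con = MySQL.connect("localhost", "root", "mysql_password", db = "medline_db")
PubMed.create_tables!(db_con)

# Trial run: download, parse, and load only the first two baseline files
load_medline!(db_con, "./medline_output"; start_file = 1, end_file = 2)

MySQL.disconnect(db_con)
```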
BioMedQuery.Processes.map_mesh_to_umls_async! — Function

`map_mesh_to_umls_async!(db, c::Credentials; timeout, append_results, verbose)`

Build (using async UMLS-API calls) and store in the given database a map from MeSH descriptors to UMLS semantic concepts. For large queries this function is faster than its synchronous counterpart.

Arguments

* `db` : Database. Must contain TABLE:mesh_descriptor. For each of the descriptors in that table, search and insert the associated semantic concepts into a new (cleared) TABLE:mesh2umls
* `user` : UMLS username
* `psswd` : UMLS password
* `append_results::Bool` : If false, a NEW and EMPTY mesh2umls database table is created
* `batch_size` : Number of
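A minimal sketch with placeholder MySQL and UMLS credentials; the call below uses the username/password form from the argument list (some versions instead take a UMLS `Credentials` object, as in the signature line):

```julia
using MySQL
using BioMedQuery.Processes

# Placeholder connection to a database that already contains a mesh_descriptor table
db = MySQL.connect("localhost", "root", "mysql_password", db = "pubmed_db")

# Placeholder UMLS account - replace with your own credentials
umls_user = "umls_username"
umls_psswd = "umls_password"

map_mesh_to_umls_async!(db, umls_user, umls_psswd; append_results = false, verbose = true)
```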
BioMedQuery.Processes.map_mesh_to_umls_async — Function

`map_mesh_to_umls_async(mesh_df, user, psswd; timeout, append_results, verbose)`

Build (using async UMLS-API calls) and return a map from MeSH descriptors to UMLS semantic concepts. For large queries this function is faster than its synchronous counterpart.

Arguments

* `mesh_df` : DataFrame containing MeSH descriptors. This is the dataframe with the key `mesh_desc` that is returned from `pubmed_search_and_parse`.
* `user` : UMLS username
* `psswd` : UMLS password
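A minimal sketch chaining this with `pubmed_search_and_parse` (the email, search term, and UMLS credentials are placeholders; the `"mesh_desc"` key follows the pubmed schema naming):

```julia
using BioMedQuery.Processes

# Fetch a small set of articles (placeholder email, illustrative search term)
dfs = pubmed_search_and_parse("your_email@example.com", "asthma[MeSH Terms]", 10)

# Map the returned MeSH descriptors to UMLS semantic types (placeholder credentials)
mesh2umls_df = map_mesh_to_umls_async(dfs["mesh_desc"], "umls_username", "umls_password")
```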
BioMedQuery.Processes.pubmed_search_and_parse — Function

`pubmed_search_and_parse(email, search_term::String, article_max, verbose=false)`

Search PubMed and parse the results into a dictionary of DataFrames. The dataframes have the same names and fields as the pubmed database schema (e.g. df_dict["basic"] returns a dataframe with the basic article info).

Arguments

* `email` : valid email address (otherwise PubMed may block you)
* `search_term` : search string to submit to PubMed, e.g. `(asthma[MeSH Terms]) AND ("2001/01/29"[Date - Publication] : "2010"[Date - Publication])`; see http://www.ncbi.nlm.nih.gov/pubmed/advanced for help constructing the string
* `article_max` : maximum number of articles to return
* `verbose` : if true, the NCBI xml response files are saved to the current directory
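A minimal sketch (the email address and search term are illustrative):

```julia
using BioMedQuery.Processes

email = "your_email@example.com"   # placeholder - use a real address
search_term = """(asthma[MeSH Terms]) AND ("2001/01/29"[Date - Publication] : "2010"[Date - Publication])"""
max_articles = 5

dfs = pubmed_search_and_parse(email, search_term, max_articles)
dfs["basic"]   # basic article information as a DataFrame
```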
BioMedQuery.Processes.pubmed_search_and_save! — Function

`pubmed_search_and_save!(email, search_term::String, article_max, conn, verbose=false)`

Search PubMed and save the results into a database. The database is expected to exist and to have the appropriate pubmed-related tables. You can create such tables using `PubMed.create_tables!(conn)`.

Arguments

* `email` : valid email address (otherwise PubMed may block you)
* `search_term` : search string to submit to PubMed, e.g. `(asthma[MeSH Terms]) AND ("2001/01/29"[Date - Publication] : "2010"[Date - Publication])`; see http://www.ncbi.nlm.nih.gov/pubmed/advanced for help constructing the string
* `article_max` : maximum number of articles to return
* `conn` : database connection
* `verbose` : if true, the NCBI xml response files are saved to the current directory
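A minimal sketch with a placeholder MySQL connection; the tables are created first, following the `PubMed.create_tables!` reference above:

```julia
using MySQL
using BioMedQuery.Processes
using BioMedQuery.PubMed

# Placeholder connection settings - replace with your own
conn = MySQL.connect("localhost", "root", "mysql_password", db = "pubmed_db")
PubMed.create_tables!(conn)

# Search PubMed and store the (illustrative) results directly in the database
pubmed_search_and_save!("your_email@example.com", "asthma[MeSH Terms]", 5, conn)

MySQL.disconnect(conn)
```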
BioMedQuery.Processes.umls_semantic_occurrences — Function

`umls_semantic_occurrences(db, umls_semantic_type)`

Return a sparse matrix indicating the presence of MeSH descriptors associated with a given UMLS semantic type in all articles of the input database.

Output

* `des_ind_dict` : Dictionary matching row number to descriptor names
* `disease_occurances` : Sparse matrix. The columns correspond to a feature vector, where each row is a MeSH descriptor. There are as many columns as articles. The occurrence/absence of a descriptor is labeled as 1/0
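A minimal sketch, assuming the database already holds articles plus the mesh2umls map built above (connection settings are placeholders; "Disease or Syndrome" is an illustrative UMLS semantic type):

```julia
using MySQL
using BioMedQuery.Processes

# Placeholder connection to a populated pubmed database
db = MySQL.connect("localhost", "root", "mysql_password", db = "pubmed_db")

des_ind_dict, disease_occurances = umls_semantic_occurrences(db, "Disease or Syndrome")
```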
BioMedQuery.Processes.umls_semantic_occurrences — Function

`umls_semantic_occurrences(dfs, mesh2umls_df, umls_semantic_type)`

Return a sparse matrix indicating the presence of MeSH descriptors associated with a given UMLS semantic type in all articles of the input dataframes.

Output

* `des_ind_dict` : Dictionary matching row number to descriptor names
* `disease_occurances` : Sparse matrix. The columns correspond to a feature vector, where each row is a MeSH descriptor. There are as many columns as articles. The occurrence/absence of a descriptor is labeled as 1/0
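The dataframe method fits the in-memory workflow; a minimal sketch reusing the placeholder calls from the examples above:

```julia
using BioMedQuery.Processes

# Search PubMed and map MeSH descriptors to UMLS types (placeholder email and credentials)
dfs = pubmed_search_and_parse("your_email@example.com", "asthma[MeSH Terms]", 10)
mesh2umls_df = map_mesh_to_umls_async(dfs["mesh_desc"], "umls_username", "umls_password")

des_ind_dict, disease_occurances = umls_semantic_occurrences(dfs, mesh2umls_df, "Disease or Syndrome")
```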
BioMedQuery.Processes.close_cons — Method

`close_cons(ftp_con)`

Closes the FTP connection and cleans up.
BioMedQuery.Processes.get_file_name — Function

`get_file_name(fnum::Int, year::Int = 2018, test = false)`

Returns the MEDLINE file name given the file number and year.
BioMedQuery.Processes.get_ftp_con — Function

`get_ftp_con(test = false)`

Get an FTP connection.
BioMedQuery.Processes.get_ml_file — Function

`get_ml_file(fname::String, conn::ConnContext, output_dir)`

Retrieves the file with the given fname and puts it in medline/raw_files. Returns the HTTP response.
BioMedQuery.Processes.init_medline — Function

`init_medline(output_dir, test=false)`

Sets up the environment (folders), connects to the MEDLINE FTP server, and returns the connection.
BioMedQuery.Processes.parse_ml_file — Method

`parse_ml_file(fname::String, output_dir::String)`

Parses the MEDLINE XML file into a dictionary of dataframes and saves the resulting CSV files to medline/parsed_files.
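Taken together, these low-level helpers can be chained by hand when only a few files are needed; a minimal sketch (file number, year, and output directory are illustrative):

```julia
using BioMedQuery.Processes

output_dir = "./medline_output"               # illustrative root directory

conn = init_medline(output_dir)               # create the folders and open the FTP connection
fname = get_file_name(1, 2018)                # name of the first baseline file for the given year
resp = get_ml_file(fname, conn, output_dir)   # download the raw xml.gz into medline/raw_files
dfs = parse_ml_file(fname, output_dir)        # parse the XML into dataframes and write CSVs
close_cons(conn)                              # close the FTP connection
```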