The MEDLINE loader process in BioMedQuery loads the MEDLINE baseline files into a MySQL database and saves the raw (xml.gz) and parsed (csv) files to a medline directory that it creates inside the output directory you provide.
WARNING: There are 900+ medline files each with approximately 30,000 articles. This process will take hours to run for the full baseline load.
The baseline files can be found here.
The database and tables must already exist before the medline files are loaded. The process is set up for parallel processing; to take advantage of this, add worker processes before loading the BioMedQuery package.
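As a minimal sketch of adding workers ahead of the package load, the snippet below assumes Julia's standard Distributed library is the intended mechanism. The worker count of 2 is an arbitrary choice for illustration, and the package load is left commented out so the snippet runs without BioMedQuery installed.

```julia
using Distributed

# Start two local worker processes (the count is an illustrative assumption;
# pick a number that suits the machine).
worker_ids = addprocs(2)
println("Worker processes available: ", nworkers())

# Load the package only after the workers have been added, as described above.
# using BioMedQuery
```

Adding the workers first matters because the loader can only distribute work to processes that exist when it starts.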
BioMedQuery has utility functions to create the database and tables. Note: creating the tables using this function will drop any tables that already exist in the target database.
using BioMedQuery

# Initialize the test_db database, then create the PubMed tables in it.
const conn = BioMedQuery.DBUtils.init_mysql_database("127.0.0.1", "root", "", "test_db", overwrite=true);
BioMedQuery.PubMed.create_tables!(conn);
Load a Test File
As the full medline load is a large operation, it is recommended that a test run be completed first.
@time BioMedQuery.Processes.load_medline!(conn, pwd(), test=true)
[ Info: ======Setting up folders and creating FTP Connection======
┌ Warning: FTP error during package test
└ @ BioMedQuery.Processes ~/build/bcbi/BioMedQuery.jl/src/Processes/medline_load.jl:153
[ Info: Getting files from Medline
Getting file: pubmedsample19n0001.xml.gz
[ Info: Parsing files into CSV
Parsing file: pubmedsample19n0001.xml.gz
[ Info: Loading CSVs into MySQL
Loading file: pubmedsample19n0001.xml.gz
warning: failed parsing String on row=15, col=2, error=INVALID: OK, QUOTED, DELIMITED, INVALID_DELIMITER
warning: failed parsing String on row=35, col=2, error=INVALID: OK, QUOTED, DELIMITED, INVALID_DELIMITER
warning: failed parsing String on row=58, col=2, error=INVALID: OK, QUOTED, DELIMITED, INVALID_DELIMITER
warning: failed parsing String on row=40, col=4, error=INVALID: OK, QUOTED, DELIMITED, INVALID_DELIMITER
warning: failed parsing String on row=41, col=4, error=INVALID: OK, QUOTED, DELIMITED, INVALID_DELIMITER
[ Info: All files processed - closing FTP connection
140.594272 seconds (34.36 M allocations: 1.455 GiB, 1.05% gc time)
Review the output of this run in MySQL to make sure that it ran as expected. Additionally, the sample raw and parsed files should be in the new medline directory in the current directory.
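One way to check the files on disk is to list everything under the medline directory. The sketch below is only an illustration: list_files is a hypothetical helper, not part of BioMedQuery, and it makes no assumption about how the loader names its subdirectories.

```julia
# Hypothetical helper (not part of BioMedQuery): return every file found under
# `root`, as sorted paths relative to `root`.
function list_files(root::AbstractString)
    found = String[]
    for (dir, _, files) in walkdir(root)
        for f in files
            push!(found, relpath(joinpath(dir, f), root))
        end
    end
    return sort(found)
end

# The test run above used pwd() as the output location.
medline_dir = joinpath(pwd(), "medline")
if isdir(medline_dir)
    foreach(println, list_files(medline_dir))
else
    println("No medline directory found in ", pwd())
end
```

After a successful test run, the listing should include the sample xml.gz file and the CSV files parsed from it.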
Performing a Full Load
To run a full load, use the same code as above, but do not pass the test argument. It is also possible to break up the load by specifying which files to start and stop at: simply pass start_file=n and end_file=p. Currently the default end_file reflects the last file of the 2019 baseline.
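To illustrate breaking the load into pieces, the sketch below computes consecutive start and stop file numbers for a given batch size. file_batches is a hypothetical helper, the range 1 to 10 is a placeholder rather than the real number of baseline files, and the keyword names start_file and end_file are assumed from the description above. The load call itself is commented out because it needs a live MySQL connection.

```julia
# Hypothetical helper (not part of BioMedQuery): split the file numbers
# first_file..last_file into consecutive (start, stop) batches.
function file_batches(first_file::Int, last_file::Int, batch_size::Int)
    return [(s, min(s + batch_size - 1, last_file)) for s in first_file:batch_size:last_file]
end

# Placeholder range of 10 files in batches of 4: (1, 4), (5, 8), (9, 10).
for (n, p) in file_batches(1, 10, 4)
    println("start_file=", n, ", end_file=", p)
    # Assumed call shape, based on the test run shown earlier:
    # BioMedQuery.Processes.load_medline!(conn, pwd(), start_file=n, end_file=p)
end
```

Loading in batches makes it easier to resume after a failure, since only the interrupted batch needs to be rerun.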
After loading, it is recommended that you add indexes to the tables. The add_mysql_keys! function can be used to add a standard set of indexes.
This page was generated using Literate.jl.