Using UMLS Concepts with MeSH
The Medical Subject Headings (MeSH) terms returned from a PubMed search can be analyzed further by mapping them to Unified Medical Language System (UMLS) concepts and by filtering them by concept.
For both mapping MeSH to UMLS Concepts and filtering MeSH by concept, the following backends are supported:
- MySQL
- SQLite
- DataFrames
Set Up
using SQLite
using MySQL
using BioMedQuery.DBUtils
using BioMedQuery.Processes
using BioServices.UMLS
using BioMedQuery.PubMed
using DataFrames
Credentials are read from environment variables (e.g., set in your ~/.julia/config/startup.jl, or in .juliarc.jl on Julia versions before 0.7)
umls_user = ENV["UMLS_USER"];
umls_pswd = ENV["UMLS_PSSWD"];
email = ""; # Only needed if you want to contact NCBI with inquiries
search_term = """(obesity[MeSH Major Topic]) AND ("2010"[Date - Publication] : "2012"[Date - Publication])""";
umls_concept = "Disease or Syndrome";
max_articles = 5;
results_dir = ".";
verbose = true;
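Indexing `ENV` directly throws a `KeyError` when a variable is missing. As a minimal sketch, the credentials could instead be read through a small helper that reports which variable is unset. `require_env` is a hypothetical name introduced here for illustration; it is not part of BioMedQuery or BioServices.

```julia
# Hypothetical helper (not part of BioMedQuery): fetch a required environment
# variable, failing with a readable message when it is unset or empty.
function require_env(name::AbstractString)
    value = get(ENV, name, "")
    isempty(value) && error("Environment variable $name is not set")
    return value
end

# Usage with the variable names from this page:
# umls_user = require_env("UMLS_USER")
# umls_pswd = require_env("UMLS_PSSWD")
```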
MySQL
Map Medical Subject Headings (MeSH) to UMLS
This example demonstrates the typical workflow to populate a MESH2UMLS database table relating all concepts associated with all MeSH terms in the input database.
Note: this example reuses the MySQL DB from the PubMed Search and Save example.
Create MySQL DB connection
host = "127.0.0.1";
mysql_usr = "root";
mysql_pswd = "";
dbname = "pubmed_obesity_2010_2012";
db_mysql = MySQL.connect(host, mysql_usr, mysql_pswd, db = dbname);
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to database--------
Saving 5 articles to database
Finished searching, total number of articles: 5
Map MeSH to UMLS
@time map_mesh_to_umls_async!(db_mysql, umls_user, umls_pswd; append_results=false, timeout=3);
----------Matching MESH to UMLS-----------
["Adult", "Aged", "Aged, 80 and over", "Analysis of Variance", "Body Weight", "C-Reactive Protein", "Child", "Cross-Sectional Studies", "Fatigue", "Female", "Fibromyalgia", "Germany", "Health Status", "Humans", "Japan", "Male", "Middle Aged", "Nutrition Surveys", "Obesity", "Pain", "Pain Measurement", "Physical Fitness", "Prognosis", "Quality of Life", "Surveys and Questionnaires", "Reference Values", "Risk Factors", "ROC Curve", "Severity of Illness Index", "Sports", "Television", "Thyrotropin", "Biomarkers", "Weight Gain", "Exercise", "Body Mass Index", "Incidence", "Prevalence", "Logistic Models", "Odds Ratio", "Case-Control Studies", "Age Distribution", "Sex Distribution", "Sleep Apnea, Obstructive", "Metabolic Syndrome", "Overweight", "Waist Circumference", "Young Adult", "Obesity, Abdominal", "Republic of Korea", "Sedentary Behavior", "Pediatric Obesity"]
[ Info: UTS: Requesting new TGT
[ Info: Descriptor 25 out of 52: Surveys and Questionnaires
[ Info: Descriptor 26 out of 52: Reference Values
[ Info: Descriptor 23 out of 52: Prognosis
[ Info: Descriptor 19 out of 52: Obesity
[ Info: Descriptor 35 out of 52: Exercise
[ Info: Descriptor 33 out of 52: Biomarkers
[ Info: Descriptor 1 out of 52: Adult
[ Info: Descriptor 31 out of 52: Television
[ Info: Descriptor 2 out of 52: Aged
[ Info: Descriptor 22 out of 52: Physical Fitness
[ Info: Descriptor 24 out of 52: Quality of Life
[ Info: Descriptor 44 out of 52: Sleep Apnea, Obstructive
[ Info: Descriptor 32 out of 52: Thyrotropin
[ Info: Descriptor 39 out of 52: Logistic Models
[ Info: Descriptor 28 out of 52: ROC Curve
[ Info: Descriptor 17 out of 52: Middle Aged
[ Info: Descriptor 50 out of 52: Republic of Korea
[ Info: Descriptor 38 out of 52: Prevalence
[ Info: Descriptor 4 out of 52: Analysis of Variance
[ Info: Descriptor 10 out of 52: Female
[ Info: Descriptor 11 out of 52: Fibromyalgia
[ Info: Descriptor 36 out of 52: Body Mass Index
[ Info: Descriptor 3 out of 52: Aged, 80 and over
[ Info: Descriptor 41 out of 52: Case-Control Studies
[ Info: Descriptor 46 out of 52: Overweight
[ Info: Descriptor 7 out of 52: Child
[ Info: Descriptor 27 out of 52: Risk Factors
[ Info: Descriptor 34 out of 52: Weight Gain
[ Info: Descriptor 9 out of 52: Fatigue
[ Info: Descriptor 48 out of 52: Young Adult
[ Info: Descriptor 8 out of 52: Cross-Sectional Studies
[ Info: Descriptor 29 out of 52: Severity of Illness Index
[ Info: Descriptor 15 out of 52: Japan
[ Info: Descriptor 37 out of 52: Incidence
[ Info: Descriptor 14 out of 52: Humans
[ Info: Descriptor 12 out of 52: Germany
[ Info: Descriptor 20 out of 52: Pain
[ Info: Descriptor 5 out of 52: Body Weight
[ Info: Descriptor 6 out of 52: C-Reactive Protein
[ Info: Descriptor 47 out of 52: Waist Circumference
[ Info: Descriptor 42 out of 52: Age Distribution
[ Info: Descriptor 43 out of 52: Sex Distribution
[ Info: Descriptor 51 out of 52: Sedentary Behavior
[ Info: Descriptor 16 out of 52: Male
[ Info: Descriptor 40 out of 52: Odds Ratio
[ Info: Descriptor 21 out of 52: Pain Measurement
[ Info: Descriptor 18 out of 52: Nutrition Surveys
[ Info: Descriptor 45 out of 52: Metabolic Syndrome
[ Info: Descriptor 49 out of 52: Obesity, Abdominal
[ Info: Descriptor 13 out of 52: Health Status
[ Info: Descriptor 30 out of 52: Sports
[ Info: Descriptor 52 out of 52: Pediatric Obesity
[ Info: Descriptor 51 out of 52: Sedentary Behavior
8.832177 seconds (7.90 M allocations: 394.041 MiB, 2.30% gc time)
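The descriptor numbers in the log above appear out of order, which is what one would expect when lookups run concurrently and finish at different times. The sketch below reproduces that effect with Base's `asyncmap` and a stand-in lookup function. It only illustrates the general pattern and makes no claim about how `map_mesh_to_umls_async!` is implemented; `fake_lookup` is hypothetical and makes no network calls.

```julia
# Stand-in for a remote UMLS lookup (hypothetical; no network access here).
# The random sleep imitates variable response latency.
function fake_lookup(term::String)
    sleep(0.01 * rand())
    return uppercase(term)
end

terms = ["Adult", "Aged", "Obesity", "Pain", "Sports"]
finished = String[]            # completion order, filled in by the tasks

results = asyncmap(terms; ntasks = 3) do term
    concept = fake_lookup(term)
    push!(finished, term)      # record when this lookup completed
    concept
end

# `results` follows the order of `terms`; `finished` may not.
```

Printing `finished` after a few runs typically shows a different ordering each time, while `results` stays aligned with `terms`.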
Explore the output table
db_query(db_mysql, "SELECT * FROM mesh2umls")
Row | mesh | umls |
---|---|---|
 | String | String |
1 | Adult | Age Group |
2 | Age Distribution | Quantitative Concept |
3 | Aged | Organism Attribute |
4 | Aged, 80 and over | Age Group |
5 | Analysis of Variance | Quantitative Concept |
6 | Biomarkers | Clinical Attribute |
7 | Body Mass Index | Diagnostic Procedure |
8 | Body Weight | Organism Attribute |
9 | C-Reactive Protein | Amino Acid, Peptide, or Protein |
10 | C-Reactive Protein | Immunologic Factor |
11 | Case-Control Studies | Research Activity |
12 | Child | Age Group |
13 | Cross-Sectional Studies | Research Activity |
14 | Exercise | Daily or Recreational Activity |
15 | Fatigue | Sign or Symptom |
16 | Female | Population Group |
17 | Fibromyalgia | Disease or Syndrome |
18 | Germany | Geographic Area |
19 | Health Status | Qualitative Concept |
20 | Humans | Human |
21 | Incidence | Quantitative Concept |
22 | Japan | Geographic Area |
23 | Logistic Models | Intellectual Product |
24 | Logistic Models | Quantitative Concept |
⋮ | ⋮ | ⋮ |
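The table shows that one MeSH term can map to several UMLS concepts (for example, C-Reactive Protein and Logistic Models each appear twice). As a rough sketch in plain Julia, using a handful of rows copied from the table rather than a live database, the rows can be grouped by term and filtered by concept like this:

```julia
# A few (mesh, umls) rows copied from the table above.
rows = [
    ("Adult", "Age Group"),
    ("C-Reactive Protein", "Amino Acid, Peptide, or Protein"),
    ("C-Reactive Protein", "Immunologic Factor"),
    ("Fibromyalgia", "Disease or Syndrome"),
    ("Logistic Models", "Intellectual Product"),
    ("Logistic Models", "Quantitative Concept"),
]

# Collect every concept listed for each MeSH term.
concepts_by_mesh = Dict{String,Vector{String}}()
for (mesh, umls) in rows
    push!(get!(concepts_by_mesh, mesh, String[]), umls)
end

# Terms carrying more than one concept.
multi = sort([m for (m, c) in concepts_by_mesh if length(c) > 1])

# Terms matching a single concept of interest.
selected = [m for (m, u) in rows if u == "Disease or Syndrome"]
```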
Filtering MeSH terms by UMLS concept
Getting the descriptor to index dictionary and the occurrence matrix
@time labels2ind, occur = umls_semantic_occurrences(db_mysql, umls_concept);
Filter mesh query string : SELECT mesh FROM mesh2umls WHERE umls IN ('Disease or Syndrome')
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
1.101153 seconds (1.94 M allocations: 97.543 MiB, 3.96% gc time)
Descriptor to Index Dictionary
labels2ind
Dict{String,Int64} with 5 entries:
"Obesity" => 1
"Pediatric Obesity" => 2
"Sleep Apnea, Obstructive" => 3
"Metabolic Syndrome" => 4
"Fibromyalgia" => 5
Output Data Matrix
Matrix(occur)
5×5 Array{Float64,2}:
1.0 1.0 0.0 1.0 0.0
0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0
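To make the relationship between the dictionary and the matrix concrete, here is a minimal sketch that builds both from toy data using only the `SparseArrays` standard library. The input lists and the `_demo` names are hypothetical. The layout (one row per descriptor, one column per article) is an assumption, since this page does not state which axis is which.

```julia
using SparseArrays

# Toy input (hypothetical): the filtered descriptors attached to each article.
article_descriptors = [
    ["Obesity", "Metabolic Syndrome"],
    ["Pediatric Obesity"],
    ["Obesity", "Fibromyalgia"],
]

# Assign each distinct descriptor an index in order of first appearance.
labels2ind_demo = Dict{String,Int}()
for descriptors in article_descriptors, d in descriptors
    get!(labels2ind_demo, d, length(labels2ind_demo) + 1)
end

# Assumed layout: one row per descriptor, one column per article.
occur_demo = spzeros(length(labels2ind_demo), length(article_descriptors))
for (j, descriptors) in enumerate(article_descriptors), d in descriptors
    occur_demo[labels2ind_demo[d], j] = 1.0
end

# Number of articles in which each descriptor occurs.
article_counts = vec(sum(occur_demo, dims = 2))
```

Storing the matrix sparsely keeps memory use proportional to the number of nonzero entries, which matters once the article count grows well beyond this example.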
SQLite
This example demonstrates the typical workflow to populate a MESH2UMLS database table relating all concepts associated with all MeSH terms in the input database.
Note: this example reuses the SQLite DB from the PubMed Search and Save example.
Create SQLite DB connection
db_path = "$(results_dir)/pubmed_obesity_2010_2012.db";
db_sqlite = SQLite.DB(db_path);
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to database--------
Saving 5 articles to database
Finished searching, total number of articles: 5
Map MeSH to UMLS
@time map_mesh_to_umls_async!(db_sqlite, umls_user, umls_pswd; append_results=false, timeout=3);
----------Matching MESH to UMLS-----------
Union{Missing, String}["Reference Values", "Republic of Korea", "ROC Curve", "Fatigue", "Obesity", "Risk Factors", "Logistic Models", "Severity of Illness Index", "Male", "Case-Control Studies", "Analysis of Variance", "Sedentary Behavior", "Prevalence", "Quality of Life", "Odds Ratio", "Exercise", "Body Mass Index", "Aged", "Child", "Sex Distribution", "Adult", "Germany", "Sports", "Thyrotropin", "Pediatric Obesity", "Humans", "Japan", "Cross-Sectional Studies", "Weight Gain", "Middle Aged", "Surveys and Questionnaires", "Health Status", "Young Adult", "Incidence", "Prognosis", "Body Weight", "Pain Measurement", "Waist Circumference", "Metabolic Syndrome", "Pain", "Nutrition Surveys", "Fibromyalgia", "Sleep Apnea, Obstructive", "Television", "Age Distribution", "Overweight", "Physical Fitness", "Female", "Biomarkers", "Obesity, Abdominal", "C-Reactive Protein", "Aged, 80 and over"]
[ Info: UTS: Reading TGT from file
[ Info: Descriptor 3 out of 52: ROC Curve
[ Info: Descriptor 4 out of 52: Fatigue
[ Info: Descriptor 1 out of 52: Reference Values
[ Info: Descriptor 8 out of 52: Severity of Illness Index
[ Info: Descriptor 7 out of 52: Logistic Models
[ Info: Descriptor 17 out of 52: Body Mass Index
[ Info: Descriptor 13 out of 52: Prevalence
[ Info: Descriptor 21 out of 52: Adult
[ Info: Descriptor 11 out of 52: Analysis of Variance
[ Info: Descriptor 16 out of 52: Exercise
[ Info: Descriptor 14 out of 52: Quality of Life
[ Info: Descriptor 22 out of 52: Germany
[ Info: Descriptor 10 out of 52: Case-Control Studies
[ Info: Descriptor 9 out of 52: Male
[ Info: Descriptor 5 out of 52: Obesity
[ Info: Descriptor 19 out of 52: Child
[ Info: Descriptor 6 out of 52: Risk Factors
[ Info: Descriptor 25 out of 52: Pediatric Obesity
[ Info: Descriptor 18 out of 52: Aged
[ Info: Descriptor 15 out of 52: Odds Ratio
[ Info: Descriptor 29 out of 52: Weight Gain
[ Info: Descriptor 23 out of 52: Sports
[ Info: Descriptor 26 out of 52: Humans
[ Info: Descriptor 32 out of 52: Health Status
[ Info: Descriptor 33 out of 52: Young Adult
[ Info: Descriptor 2 out of 52: Republic of Korea
[ Info: Descriptor 44 out of 52: Television
[ Info: Descriptor 27 out of 52: Japan
[ Info: Descriptor 28 out of 52: Cross-Sectional Studies
[ Info: Descriptor 31 out of 52: Surveys and Questionnaires
[ Info: Descriptor 20 out of 52: Sex Distribution
[ Info: Descriptor 36 out of 52: Body Weight
[ Info: Descriptor 46 out of 52: Overweight
[ Info: Descriptor 39 out of 52: Metabolic Syndrome
[ Info: Descriptor 38 out of 52: Waist Circumference
[ Info: Descriptor 30 out of 52: Middle Aged
[ Info: Descriptor 42 out of 52: Fibromyalgia
[ Info: Descriptor 41 out of 52: Nutrition Surveys
[ Info: Descriptor 50 out of 52: Obesity, Abdominal
[ Info: Descriptor 51 out of 52: C-Reactive Protein
[ Info: Descriptor 24 out of 52: Thyrotropin
[ Info: Descriptor 34 out of 52: Incidence
[ Info: Descriptor 49 out of 52: Biomarkers
[ Info: Descriptor 12 out of 52: Sedentary Behavior
[ Info: Descriptor 35 out of 52: Prognosis
[ Info: Descriptor 37 out of 52: Pain Measurement
[ Info: Descriptor 43 out of 52: Sleep Apnea, Obstructive
[ Info: Descriptor 48 out of 52: Female
[ Info: Descriptor 45 out of 52: Age Distribution
[ Info: Descriptor 47 out of 52: Physical Fitness
[ Info: Descriptor 40 out of 52: Pain
[ Info: Descriptor 52 out of 52: Aged, 80 and over
[ Info: Descriptor 51 out of 52: C-Reactive Protein
3.476114 seconds (2.13 M allocations: 104.232 MiB, 1.18% gc time)
Explore the output table
db_query(db_sqlite, "SELECT * FROM mesh2umls;")
Row | mesh | umls |
---|---|---|
 | String⍰ | String⍰ |
1 | Logistic Models | Intellectual Product |
2 | Logistic Models | Quantitative Concept |
3 | ROC Curve | Quantitative Concept |
4 | Severity of Illness Index | Quantitative Concept |
5 | Prevalence | Quantitative Concept |
6 | Fatigue | Sign or Symptom |
7 | Reference Values | Quantitative Concept |
8 | Male | Organism Attribute |
9 | Case-Control Studies | Research Activity |
10 | Body Mass Index | Diagnostic Procedure |
11 | Germany | Geographic Area |
12 | Adult | Age Group |
13 | Exercise | Daily or Recreational Activity |
14 | Obesity | Disease or Syndrome |
15 | Risk Factors | Finding |
16 | Aged | Organism Attribute |
17 | Young Adult | Age Group |
18 | Analysis of Variance | Quantitative Concept |
19 | Quality of Life | Idea or Concept |
20 | Television | Manufactured Object |
21 | Humans | Human |
22 | Sports | Daily or Recreational Activity |
23 | Weight Gain | Finding |
24 | Child | Age Group |
⋮ | ⋮ | ⋮ |
Filtering MeSH terms by UMLS concept
Getting the descriptor to index dictionary and occurrence matrix
@time labels2ind, occur = umls_semantic_occurrences(db_sqlite, umls_concept);
Filter mesh query string : SELECT mesh FROM mesh2umls WHERE umls IN ('Disease or Syndrome')
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
0.732982 seconds (1.28 M allocations: 64.875 MiB, 4.39% gc time)
Descriptor to Index Dictionary
labels2ind
Dict{String,Int64} with 5 entries:
"Obesity" => 1
"Pediatric Obesity" => 2
"Sleep Apnea, Obstructive" => 3
"Metabolic Syndrome" => 4
"Fibromyalgia" => 5
Output Data Matrix
Matrix(occur)
5×5 Array{Float64,2}:
1.0 1.0 0.0 1.0 0.0
0.0 0.0 1.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0
1.0 0.0 0.0 0.0 0.0
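A `Dict` has no guaranteed display order, so when labelling the matrix it can help to turn the descriptor-to-index dictionary into a vector ordered by index. The following sketch does that with the dictionary values shown above; `ind2label` is a name introduced here for illustration.

```julia
# Dictionary copied from the output above.
labels2ind = Dict(
    "Obesity" => 1,
    "Pediatric Obesity" => 2,
    "Sleep Apnea, Obstructive" => 3,
    "Metabolic Syndrome" => 4,
    "Fibromyalgia" => 5,
)

# Place each label at its index to get an index-ordered label vector.
ind2label = Vector{String}(undef, length(labels2ind))
for (label, ind) in labels2ind
    ind2label[ind] = label
end
```

`ind2label[i]` then names the descriptor stored at index `i`.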
DataFrames
This example demonstrates the typical workflow to create a MeSH to UMLS map as a DataFrame relating all concepts associated with all MeSH terms in the input dataframe.
Get the articles (same as the example in PubMed Search and Parse)
dfs = Processes.pubmed_search_and_parse(email, search_term, max_articles, verbose)
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to dataframes--------
Dict{String,DataFrames.DataFrame} with 8 entries:
"basic" => 5×13 DataFrames.DataFrame. Omitted printing of 9 col…
"mesh_desc" => 52×2 DataFrames.DataFrame…
"mesh_qual" => 9×2 DataFrames.DataFrame…
"pub_type" => 10×3 DataFrames.DataFrame…
"abstract_full" => 5×2 DataFrames.DataFrame. Omitted printing of 1 colu…
"author_ref" => 35×8 DataFrames.DataFrame. Omitted printing of 3 col…
"mesh_heading" => 78×5 DataFrames.DataFrame…
"abstract_structured" => 4×4 DataFrames.DataFrame. Omitted printing of 1 colu…
Map MeSH to UMLS and explore the output table
@time res = map_mesh_to_umls_async(dfs["mesh_desc"], umls_user, umls_pswd)
Row | descriptor | concept |
---|---|---|
 | String | String |
1 | Adult | Age Group |
2 | Age Distribution | Quantitative Concept |
3 | Aged | Organism Attribute |
4 | Aged, 80 and over | Age Group |
5 | Analysis of Variance | Quantitative Concept |
6 | Biomarkers | Clinical Attribute |
7 | Body Mass Index | Diagnostic Procedure |
8 | Body Weight | Organism Attribute |
9 | C-Reactive Protein | Amino Acid, Peptide, or Protein |
10 | C-Reactive Protein | Immunologic Factor |
11 | Case-Control Studies | Research Activity |
12 | Child | Age Group |
13 | Cross-Sectional Studies | Research Activity |
14 | Exercise | Daily or Recreational Activity |
15 | Fatigue | Sign or Symptom |
16 | Female | Population Group |
17 | Fibromyalgia | Disease or Syndrome |
18 | Germany | Geographic Area |
19 | Health Status | Qualitative Concept |
20 | Humans | Human |
21 | Incidence | Quantitative Concept |
22 | Japan | Geographic Area |
23 | Logistic Models | Intellectual Product |
24 | Logistic Models | Quantitative Concept |
⋮ | ⋮ | ⋮ |
Getting the descriptor to index dictionary and occurrence matrix
@time labels2ind, occur = umls_semantic_occurrences(dfs, res, umls_concept);
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
0.741944 seconds (1.20 M allocations: 57.093 MiB, 6.51% gc time)
Descriptor to Index Dictionary
labels2ind
Dict{String,Int64} with 5 entries:
"Obesity" => 1
"Pediatric Obesity" => 2
"Sleep Apnea, Obstructive" => 3
"Metabolic Syndrome" => 4
"Fibromyalgia" => 5
Output Data Matrix
Matrix(occur)
5×5 Array{Float64,2}:
0.0 1.0 0.0 1.0 1.0
0.0 0.0 1.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0
1.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 1.0
This page was generated using Literate.jl.