MeSH/UMLS Map and Filtering

Using UMLS Concepts with MeSH

The Medical Subject Headings (MeSH) terms returned by a PubMed search can be analyzed further by mapping them to Unified Medical Language System (UMLS) concepts and by filtering the MeSH terms by concept.

Both operations, mapping MeSH to UMLS concepts and filtering MeSH by concept, support the following backends: MySQL, SQLite, and DataFrames.

Set Up

using SQLite
using MySQL
using BioMedQuery.DBUtils
using BioMedQuery.Processes
using BioServices.UMLS
using BioMedQuery.PubMed
using DataFrames

Credentials are read from environment variables (e.g., set in your .juliarc.jl)

umls_user = ENV["UMLS_USER"];
umls_pswd = ENV["UMLS_PSSWD"];
email = ""; # Only needed if you want to contact NCBI with inquiries
search_term = """(obesity[MeSH Major Topic]) AND ("2010"[Date - Publication] : "2012"[Date - Publication])""";
umls_concept = "Disease or Syndrome";
max_articles = 5;
results_dir = ".";
verbose = true;
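
Indexing ENV directly throws a KeyError when a variable is unset. A minimal sketch of a more defensive lookup follows; the helper require_env and the variable DEMO_UMLS_USER are hypothetical and not part of BioMedQuery.

```julia
# Hypothetical helper (not part of BioMedQuery): fetch a required
# environment variable and fail with a readable message if it is unset.
function require_env(name::AbstractString)
    value = get(ENV, name, "")
    isempty(value) && error("Environment variable $name is not set")
    return value
end

# Demonstration with a throwaway variable so the sketch runs anywhere.
ENV["DEMO_UMLS_USER"] = "demo_user"
demo_user = require_env("DEMO_UMLS_USER")
```

In a real session the same helper could wrap the lookups of UMLS_USER and UMLS_PSSWD.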

MySQL

Map Medical Subject Headings (MeSH) to UMLS

This example demonstrates the typical workflow for populating a mesh2umls database table that relates every MeSH term in the input database to its associated UMLS concepts.

Note: this example reuses the MySQL DB from the PubMed Search and Save example.

Create MySQL DB connection

host = "127.0.0.1";
mysql_usr = "root";
mysql_pswd = "";
dbname = "pubmed_obesity_2010_2012";


db_mysql = MySQL.connect(host, mysql_usr, mysql_pswd, db = dbname);
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to database--------
Saving 5 articles to database
Finished searching, total number of articles: 5

Map MeSH to UMLS

@time map_mesh_to_umls_async!(db_mysql, umls_user, umls_pswd; append_results=false, timeout=3);
----------Matching MESH to UMLS-----------
["Adult", "Aged", "Aged, 80 and over", "Analysis of Variance", "Body Weight", "C-Reactive Protein", "Child", "Cross-Sectional Studies", "Fatigue", "Female", "Fibromyalgia", "Germany", "Health Status", "Humans", "Japan", "Male", "Middle Aged", "Nutrition Surveys", "Obesity", "Pain", "Pain Measurement", "Physical Fitness", "Prognosis", "Quality of Life", "Surveys and Questionnaires", "Reference Values", "Risk Factors", "ROC Curve", "Severity of Illness Index", "Sports", "Television", "Thyrotropin", "Biomarkers", "Weight Gain", "Exercise", "Body Mass Index", "Incidence", "Prevalence", "Logistic Models", "Odds Ratio", "Case-Control Studies", "Age Distribution", "Sex Distribution", "Sleep Apnea, Obstructive", "Metabolic Syndrome", "Overweight", "Waist Circumference", "Young Adult", "Obesity, Abdominal", "Republic of Korea", "Sedentary Behavior", "Pediatric Obesity"]
[ Info: UTS: Requesting new TGT
[ Info: Descriptor 25 out of 52: Surveys and Questionnaires
[ Info: Descriptor 26 out of 52: Reference Values
[ Info: Descriptor 23 out of 52: Prognosis
[ Info: Descriptor 19 out of 52: Obesity
[ Info: Descriptor 35 out of 52: Exercise
[ Info: Descriptor 33 out of 52: Biomarkers
[ Info: Descriptor 1 out of 52: Adult
[ Info: Descriptor 31 out of 52: Television
[ Info: Descriptor 2 out of 52: Aged
[ Info: Descriptor 22 out of 52: Physical Fitness
[ Info: Descriptor 24 out of 52: Quality of Life
[ Info: Descriptor 44 out of 52: Sleep Apnea, Obstructive
[ Info: Descriptor 32 out of 52: Thyrotropin
[ Info: Descriptor 39 out of 52: Logistic Models
[ Info: Descriptor 28 out of 52: ROC Curve
[ Info: Descriptor 17 out of 52: Middle Aged
[ Info: Descriptor 50 out of 52: Republic of Korea
[ Info: Descriptor 38 out of 52: Prevalence
[ Info: Descriptor 4 out of 52: Analysis of Variance
[ Info: Descriptor 10 out of 52: Female
[ Info: Descriptor 11 out of 52: Fibromyalgia
[ Info: Descriptor 36 out of 52: Body Mass Index
[ Info: Descriptor 3 out of 52: Aged, 80 and over
[ Info: Descriptor 41 out of 52: Case-Control Studies
[ Info: Descriptor 46 out of 52: Overweight
[ Info: Descriptor 7 out of 52: Child
[ Info: Descriptor 27 out of 52: Risk Factors
[ Info: Descriptor 34 out of 52: Weight Gain
[ Info: Descriptor 9 out of 52: Fatigue
[ Info: Descriptor 48 out of 52: Young Adult
[ Info: Descriptor 8 out of 52: Cross-Sectional Studies
[ Info: Descriptor 29 out of 52: Severity of Illness Index
[ Info: Descriptor 15 out of 52: Japan
[ Info: Descriptor 37 out of 52: Incidence
[ Info: Descriptor 14 out of 52: Humans
[ Info: Descriptor 12 out of 52: Germany
[ Info: Descriptor 20 out of 52: Pain
[ Info: Descriptor 5 out of 52: Body Weight
[ Info: Descriptor 6 out of 52: C-Reactive Protein
[ Info: Descriptor 47 out of 52: Waist Circumference
[ Info: Descriptor 42 out of 52: Age Distribution
[ Info: Descriptor 43 out of 52: Sex Distribution
[ Info: Descriptor 51 out of 52: Sedentary Behavior
[ Info: Descriptor 16 out of 52: Male
[ Info: Descriptor 40 out of 52: Odds Ratio
[ Info: Descriptor 21 out of 52: Pain Measurement
[ Info: Descriptor 18 out of 52: Nutrition Surveys
[ Info: Descriptor 45 out of 52: Metabolic Syndrome
[ Info: Descriptor 49 out of 52: Obesity, Abdominal
[ Info: Descriptor 13 out of 52: Health Status
[ Info: Descriptor 30 out of 52: Sports
[ Info: Descriptor 52 out of 52: Pediatric Obesity
[ Info: Descriptor 51 out of 52: Sedentary Behavior
  8.832177 seconds (7.90 M allocations: 394.041 MiB, 2.30% gc time)

Explore the output table

db_query(db_mysql, "SELECT * FROM mesh2umls")

56 rows × 2 columns

Row  mesh                     umls
     String                   String
1    Adult                    Age Group
2    Age Distribution         Quantitative Concept
3    Aged                     Organism Attribute
4    Aged, 80 and over        Age Group
5    Analysis of Variance     Quantitative Concept
6    Biomarkers               Clinical Attribute
7    Body Mass Index          Diagnostic Procedure
8    Body Weight              Organism Attribute
9    C-Reactive Protein       Amino Acid, Peptide, or Protein
10   C-Reactive Protein       Immunologic Factor
11   Case-Control Studies     Research Activity
12   Child                    Age Group
13   Cross-Sectional Studies  Research Activity
14   Exercise                 Daily or Recreational Activity
15   Fatigue                  Sign or Symptom
16   Female                   Population Group
17   Fibromyalgia             Disease or Syndrome
18   Germany                  Geographic Area
19   Health Status            Qualitative Concept
20   Humans                   Human
21   Incidence                Quantitative Concept
22   Japan                    Geographic Area
23   Logistic Models          Intellectual Product
24   Logistic Models          Quantitative Concept
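
The table has more rows than there are descriptors (56 versus 52) because a descriptor can carry more than one UMLS semantic type, as C-Reactive Protein and Logistic Models do above. A small sketch, using a few rows copied from the table and plain Julia collections instead of the database, groups the concepts per descriptor:

```julia
# A few (mesh, umls) rows copied from the table above.
rows = [
    ("C-Reactive Protein", "Amino Acid, Peptide, or Protein"),
    ("C-Reactive Protein", "Immunologic Factor"),
    ("Fibromyalgia", "Disease or Syndrome"),
    ("Logistic Models", "Intellectual Product"),
    ("Logistic Models", "Quantitative Concept"),
]

# Group semantic types by descriptor.
concepts = Dict{String,Vector{String}}()
for (mesh, umls) in rows
    push!(get!(concepts, mesh, String[]), umls)
end

# Descriptors that map to more than one semantic type.
multi = sort([mesh for (mesh, types) in concepts if length(types) > 1])
```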

Filtering MeSH terms by UMLS concept

Getting the descriptor-to-index dictionary and the occurrence matrix

@time labels2ind, occur = umls_semantic_occurrences(db_mysql, umls_concept);
Filter mesh query string : SELECT mesh FROM mesh2umls WHERE umls IN ('Disease or Syndrome')
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
  1.101153 seconds (1.94 M allocations: 97.543 MiB, 3.96% gc time)
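
The logged query selects the descriptors whose semantic type matches umls_concept. The same selection can be sketched in memory on (mesh, umls) pairs. The rows below are copied from the mapping table, and this illustrates the filter only; it is not the library's implementation.

```julia
umls_concept = "Disease or Syndrome"

# A few (mesh, umls) pairs copied from the mesh2umls table.
mesh_umls_pairs = [
    ("Adult", "Age Group"),
    ("Fatigue", "Sign or Symptom"),
    ("Fibromyalgia", "Disease or Syndrome"),
    ("Germany", "Geographic Area"),
]

# In-memory counterpart of:
#   SELECT mesh FROM mesh2umls WHERE umls IN ('Disease or Syndrome')
wanted = Set([umls_concept])
filtered = [mesh for (mesh, umls) in mesh_umls_pairs if umls in wanted]
```

Using a Set for the wanted types keeps the membership test the same if more than one semantic type is requested.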

Descriptor to Index Dictionary

labels2ind
Dict{String,Int64} with 5 entries:
  "Obesity"                  => 1
  "Pediatric Obesity"        => 2
  "Sleep Apnea, Obstructive" => 3
  "Metabolic Syndrome"       => 4
  "Fibromyalgia"             => 5
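
When reading the occurrence matrix it helps to go from an index back to its descriptor. A minimal sketch that inverts the dictionary shown above (the name ind2label is an illustrative choice, not a library identifier):

```julia
labels2ind = Dict(
    "Obesity" => 1,
    "Pediatric Obesity" => 2,
    "Sleep Apnea, Obstructive" => 3,
    "Metabolic Syndrome" => 4,
    "Fibromyalgia" => 5,
)

# Build a vector where position i holds the descriptor with index i.
ind2label = Vector{String}(undef, length(labels2ind))
for (label, i) in labels2ind
    ind2label[i] = label
end
```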

Output Data Matrix

Matrix(occur)
5×5 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0
 0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0
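
Each entry of the matrix flags whether a descriptor occurs in an article. The sketch below builds such a matrix from scratch, assuming that rows index descriptors and columns index articles (the square 5×5 output does not reveal the orientation). The article identifiers and their descriptor lists are made up for illustration.

```julia
using SparseArrays

labels2ind = Dict("Obesity" => 1, "Pediatric Obesity" => 2, "Fibromyalgia" => 3)

# Hypothetical articles and the filtered descriptors assigned to each.
article_mesh = [
    "article_a" => ["Obesity", "Fibromyalgia"],
    "article_b" => ["Pediatric Obesity"],
    "article_c" => ["Obesity"],
]

# Rows: descriptors, columns: articles (an assumption, see above).
occur = spzeros(length(labels2ind), length(article_mesh))
for (col, (_, descriptors)) in enumerate(article_mesh)
    for d in descriptors
        occur[labels2ind[d], col] = 1.0
    end
end

# Number of articles per descriptor, ordered by descriptor index.
counts = vec(sum(occur, dims=2))
```

A sparse matrix is a natural fit because most descriptor and article pairs do not co-occur; Matrix(occur) converts it to a dense array for display.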

SQLite

This example demonstrates the typical workflow for populating a mesh2umls database table that relates every MeSH term in the input database to its associated UMLS concepts.

Note: this example reuses the SQLite DB from the PubMed Search and Save example.

Create SQLite DB connection

db_path = "$(results_dir)/pubmed_obesity_2010_2012.db";
db_sqlite = SQLite.DB(db_path);
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to database--------
Saving 5 articles to database
Finished searching, total number of articles: 5

Map MeSH to UMLS

@time map_mesh_to_umls_async!(db_sqlite, umls_user, umls_pswd; append_results=false, timeout=3);
----------Matching MESH to UMLS-----------
Union{Missing, String}["Reference Values", "Republic of Korea", "ROC Curve", "Fatigue", "Obesity", "Risk Factors", "Logistic Models", "Severity of Illness Index", "Male", "Case-Control Studies", "Analysis of Variance", "Sedentary Behavior", "Prevalence", "Quality of Life", "Odds Ratio", "Exercise", "Body Mass Index", "Aged", "Child", "Sex Distribution", "Adult", "Germany", "Sports", "Thyrotropin", "Pediatric Obesity", "Humans", "Japan", "Cross-Sectional Studies", "Weight Gain", "Middle Aged", "Surveys and Questionnaires", "Health Status", "Young Adult", "Incidence", "Prognosis", "Body Weight", "Pain Measurement", "Waist Circumference", "Metabolic Syndrome", "Pain", "Nutrition Surveys", "Fibromyalgia", "Sleep Apnea, Obstructive", "Television", "Age Distribution", "Overweight", "Physical Fitness", "Female", "Biomarkers", "Obesity, Abdominal", "C-Reactive Protein", "Aged, 80 and over"]
[ Info: UTS: Reading TGT from file
[ Info: Descriptor 3 out of 52: ROC Curve
[ Info: Descriptor 4 out of 52: Fatigue
[ Info: Descriptor 1 out of 52: Reference Values
[ Info: Descriptor 8 out of 52: Severity of Illness Index
[ Info: Descriptor 7 out of 52: Logistic Models
[ Info: Descriptor 17 out of 52: Body Mass Index
[ Info: Descriptor 13 out of 52: Prevalence
[ Info: Descriptor 21 out of 52: Adult
[ Info: Descriptor 11 out of 52: Analysis of Variance
[ Info: Descriptor 16 out of 52: Exercise
[ Info: Descriptor 14 out of 52: Quality of Life
[ Info: Descriptor 22 out of 52: Germany
[ Info: Descriptor 10 out of 52: Case-Control Studies
[ Info: Descriptor 9 out of 52: Male
[ Info: Descriptor 5 out of 52: Obesity
[ Info: Descriptor 19 out of 52: Child
[ Info: Descriptor 6 out of 52: Risk Factors
[ Info: Descriptor 25 out of 52: Pediatric Obesity
[ Info: Descriptor 18 out of 52: Aged
[ Info: Descriptor 15 out of 52: Odds Ratio
[ Info: Descriptor 29 out of 52: Weight Gain
[ Info: Descriptor 23 out of 52: Sports
[ Info: Descriptor 26 out of 52: Humans
[ Info: Descriptor 32 out of 52: Health Status
[ Info: Descriptor 33 out of 52: Young Adult
[ Info: Descriptor 2 out of 52: Republic of Korea
[ Info: Descriptor 44 out of 52: Television
[ Info: Descriptor 27 out of 52: Japan
[ Info: Descriptor 28 out of 52: Cross-Sectional Studies
[ Info: Descriptor 31 out of 52: Surveys and Questionnaires
[ Info: Descriptor 20 out of 52: Sex Distribution
[ Info: Descriptor 36 out of 52: Body Weight
[ Info: Descriptor 46 out of 52: Overweight
[ Info: Descriptor 39 out of 52: Metabolic Syndrome
[ Info: Descriptor 38 out of 52: Waist Circumference
[ Info: Descriptor 30 out of 52: Middle Aged
[ Info: Descriptor 42 out of 52: Fibromyalgia
[ Info: Descriptor 41 out of 52: Nutrition Surveys
[ Info: Descriptor 50 out of 52: Obesity, Abdominal
[ Info: Descriptor 51 out of 52: C-Reactive Protein
[ Info: Descriptor 24 out of 52: Thyrotropin
[ Info: Descriptor 34 out of 52: Incidence
[ Info: Descriptor 49 out of 52: Biomarkers
[ Info: Descriptor 12 out of 52: Sedentary Behavior
[ Info: Descriptor 35 out of 52: Prognosis
[ Info: Descriptor 37 out of 52: Pain Measurement
[ Info: Descriptor 43 out of 52: Sleep Apnea, Obstructive
[ Info: Descriptor 48 out of 52: Female
[ Info: Descriptor 45 out of 52: Age Distribution
[ Info: Descriptor 47 out of 52: Physical Fitness
[ Info: Descriptor 40 out of 52: Pain
[ Info: Descriptor 52 out of 52: Aged, 80 and over
[ Info: Descriptor 51 out of 52: C-Reactive Protein
  3.476114 seconds (2.13 M allocations: 104.232 MiB, 1.18% gc time)

Explore the output table

db_query(db_sqlite, "SELECT * FROM mesh2umls;")

56 rows × 2 columns

Row  mesh                       umls
     String⍰                    String⍰
1    Logistic Models            Intellectual Product
2    Logistic Models            Quantitative Concept
3    ROC Curve                  Quantitative Concept
4    Severity of Illness Index  Quantitative Concept
5    Prevalence                 Quantitative Concept
6    Fatigue                    Sign or Symptom
7    Reference Values           Quantitative Concept
8    Male                       Organism Attribute
9    Case-Control Studies       Research Activity
10   Body Mass Index            Diagnostic Procedure
11   Germany                    Geographic Area
12   Adult                      Age Group
13   Exercise                   Daily or Recreational Activity
14   Obesity                    Disease or Syndrome
15   Risk Factors               Finding
16   Aged                       Organism Attribute
17   Young Adult                Age Group
18   Analysis of Variance       Quantitative Concept
19   Quality of Life            Idea or Concept
20   Television                 Manufactured Object
21   Humans                     Human
22   Sports                     Daily or Recreational Activity
23   Weight Gain                Finding
24   Child                      Age Group
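
With the SQLite backend the columns are typed Union{Missing, String} (displayed as String⍰), so downstream code should tolerate missing values. A brief sketch in plain Julia follows; the missing entry is inserted here for illustration and does not appear in the results above.

```julia
# A descriptor column typed like the SQLite output: values may be missing.
mesh = Union{Missing,String}["Obesity", missing, "Fatigue"]

# Drop missing entries; the result has element type String.
clean = collect(skipmissing(mesh))

# Or keep the length and substitute a placeholder.
filled = coalesce.(mesh, "unknown")
```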

Filtering MeSH terms by UMLS concept

Getting the descriptor-to-index dictionary and the occurrence matrix

@time labels2ind, occur = umls_semantic_occurrences(db_sqlite, umls_concept);
Filter mesh query string : SELECT mesh FROM mesh2umls WHERE umls IN ('Disease or Syndrome')
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
  0.732982 seconds (1.28 M allocations: 64.875 MiB, 4.39% gc time)

Descriptor to Index Dictionary

labels2ind
Dict{String,Int64} with 5 entries:
  "Obesity"                  => 1
  "Pediatric Obesity"        => 2
  "Sleep Apnea, Obstructive" => 3
  "Metabolic Syndrome"       => 4
  "Fibromyalgia"             => 5

Output Data Matrix

Matrix(occur)
5×5 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0
 0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0

DataFrames

This example demonstrates the typical workflow for creating a MeSH to UMLS map as a DataFrame that relates every MeSH term in the input DataFrame to its associated UMLS concepts.

Get the articles (same as the example in PubMed Search and Parse)

dfs = Processes.pubmed_search_and_parse(email, search_term, max_articles, verbose)
Getting 5 articles, starting at index 0
------ESearch--------
------EFetch--------
------Save to dataframes--------
Dict{String,DataFrames.DataFrame} with 8 entries:
  "basic"               => 5×13 DataFrames.DataFrame. Omitted printing of 9 col…
  "mesh_desc"           => 52×2 DataFrames.DataFrame…
  "mesh_qual"           => 9×2 DataFrames.DataFrame…
  "pub_type"            => 10×3 DataFrames.DataFrame…
  "abstract_full"       => 5×2 DataFrames.DataFrame. Omitted printing of 1 colu…
  "author_ref"          => 35×8 DataFrames.DataFrame. Omitted printing of 3 col…
  "mesh_heading"        => 78×5 DataFrames.DataFrame…
  "abstract_structured" => 4×4 DataFrames.DataFrame. Omitted printing of 1 colu…

Map MeSH to UMLS and explore the output table

@time res = map_mesh_to_umls_async(dfs["mesh_desc"], umls_user, umls_pswd)

56 rows × 2 columns

Row  descriptor               concept
     String                   String
1    Adult                    Age Group
2    Age Distribution         Quantitative Concept
3    Aged                     Organism Attribute
4    Aged, 80 and over        Age Group
5    Analysis of Variance     Quantitative Concept
6    Biomarkers               Clinical Attribute
7    Body Mass Index          Diagnostic Procedure
8    Body Weight              Organism Attribute
9    C-Reactive Protein       Amino Acid, Peptide, or Protein
10   C-Reactive Protein       Immunologic Factor
11   Case-Control Studies     Research Activity
12   Child                    Age Group
13   Cross-Sectional Studies  Research Activity
14   Exercise                 Daily or Recreational Activity
15   Fatigue                  Sign or Symptom
16   Female                   Population Group
17   Fibromyalgia             Disease or Syndrome
18   Germany                  Geographic Area
19   Health Status            Qualitative Concept
20   Humans                   Human
21   Incidence                Quantitative Concept
22   Japan                    Geographic Area
23   Logistic Models          Intellectual Product
24   Logistic Models          Quantitative Concept

Getting the descriptor-to-index dictionary and the occurrence matrix

@time labels2ind, occur = umls_semantic_occurrences(dfs, res, umls_concept);
-------------------------------------------------------------
Found 5 articles with valid descriptors
-------------------------------------------------------------
  0.741944 seconds (1.20 M allocations: 57.093 MiB, 6.51% gc time)

Descriptor to Index Dictionary

labels2ind
Dict{String,Int64} with 5 entries:
  "Obesity"                  => 1
  "Pediatric Obesity"        => 2
  "Sleep Apnea, Obstructive" => 3
  "Metabolic Syndrome"       => 4
  "Fibromyalgia"             => 5

Output Data Matrix

Matrix(occur)
5×5 Array{Float64,2}:
 0.0  1.0  0.0  1.0  1.0
 0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0
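
This matrix differs from the one printed for the database backends, although labels2ind is identical. Reversing the column order of the earlier matrix reproduces it, which would be consistent with the same articles appearing in the opposite order if columns index articles (an assumption; the output does not state the orientation). A quick check using the two printed matrices:

```julia
# Matrix printed in the MySQL and SQLite sections.
occur_db = [1.0 1.0 0.0 1.0 0.0;
            0.0 0.0 1.0 0.0 0.0;
            0.0 1.0 0.0 0.0 0.0;
            0.0 0.0 0.0 0.0 1.0;
            1.0 0.0 0.0 0.0 0.0]

# Matrix printed in the DataFrames section.
occur_df = [0.0 1.0 0.0 1.0 1.0;
            0.0 0.0 1.0 0.0 0.0;
            0.0 0.0 0.0 1.0 0.0;
            1.0 0.0 0.0 0.0 0.0;
            0.0 0.0 0.0 0.0 1.0]

# Same content, columns in reverse order.
same_up_to_order = occur_db[:, end:-1:1] == occur_df

# Per-row totals do not depend on column order.
row_totals = vec(sum(occur_df, dims=2))
```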

This page was generated using Literate.jl.