GSoC ’24: IPUMS.jl Small Project

gsoc

geospatial

census

A summary of my project for Google Summer of Code

Author

Michela Rocchetti

Published

August 26, 2024

Hello! 👋

Hi! I am Michela. I have a Master’s degree in Physics of Complex Systems and I currently work as a software engineer in Rome, where I am from. During my studies, I became interested in using modeling and AI methods to improve healthcare, and in how these tools can help us understand the way cultural and social backgrounds influence the health of individuals. I am also interested in the computational modeling of the brain and the human body and what it can tell us about certain pathological conditions.

With these motivations in mind, I heard about Google Summer of Code. Since I had studied Julia in some courses and the language is expanding rapidly, I decided to look for a project within the Julia ecosystem. That is how I found Jacob Zelko’s (@TheCedarPrince) project and started this experience.

If you want to learn more about me, you can connect with me here: LinkedIn, GitHub

Project Description

IPUMS is the “world’s largest available single database of census microdata”, providing survey and census data from around the world. It includes several projects that provide a wide variety of datasets. The information and data collected by IPUMS are useful for comparative research, as well as for the analysis of individuals in their life contexts. These data can be used to create a more comprehensive dataset that will facilitate research on the social determinants of health for different types of diseases, social communities, and geographical areas.

To learn more about IPUMS, visit the website

Tasks and Goals

The primary objectives of this proposal are to:

Develop a native Julia package to interact with the APIs available around the datasets IPUMS provides.
Provide useful utilities within this package for manipulating IPUMS datasets.
Compose this package with the wider Julia ecosystem to enable novel research in health, economics, and more.

To achieve this, the work was distributed as follows:

Expand some of the functionality developed in ipumsr for IPUMS NHGIS
- Create a link between the OpenAPI documentation and the functions used internally in IPUMS.jl: determine whether existing functions need updating, update them, and test them
- Develop functionality similar to the get_metadata_nhgis function in ipumsr
Update the IPUMS.jl documentation
- Set up and deploy DocumenterVitepress.jl
- Write a blog post on how IPUMS.jl can be composed within the ecosystem

How the work was done

The first task was to migrate the documentation from Documenter.jl to DocumenterVitepress.jl. This migration supports the significant refactoring underway across JuliaHealth, which aims to improve the discoverability and cohesion of the JuliaHealth ecosystem, particularly its documentation, and to create a more attractive entry point for new Julia users interested in health research. To accomplish this task, DocumenterVitepress was added as a dependency in the docs directory of the IPUMS.jl repository. The Documenter.jl make.jl file was then migrated into a DocumenterVitepress.jl make.jl file. In make.jl, the page structure of the website explaining the IPUMS.jl package was defined with the following pages:

1. Home: to explain the main purpose of the package
2. Workflows: to explain the working process
3. How to: to give general information
4. Tutorials: to show how to use IPUMS.jl
5. Examples: some examples of activities
6. Mission: to explain why the package is useful for the community
7. References: references used to write the pages
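A make.jl for DocumenterVitepress.jl might look roughly like the sketch below. It is not the file from the IPUMS.jl repository: the repository URL, the branch name, and the Markdown file names are assumptions chosen for illustration, and only the page titles come from the list above.

```julia
# docs/make.jl (sketch; repo URL, branch name and .md file names are assumptions)
using Documenter
using DocumenterVitepress
using IPUMS

makedocs(;
    sitename = "IPUMS.jl",
    modules = [IPUMS],
    # Use the Vitepress Markdown writer instead of Documenter's default HTML writer.
    format = DocumenterVitepress.MarkdownVitepress(;
        repo = "github.com/JuliaHealth/IPUMS.jl",  # assumed repository location
        devbranch = "main",
        devurl = "dev",
    ),
    # Page titles follow the structure described in the post; file names are hypothetical.
    pages = [
        "Home" => "index.md",
        "Workflows" => "workflows.md",
        "How to" => "howto.md",
        "Tutorials" => "tutorials.md",
        "Examples" => "examples.md",
        "Mission" => "mission.md",
        "References" => "references.md",
    ],
)

deploydocs(;
    repo = "github.com/JuliaHealth/IPUMS.jl",  # assumed repository location
    target = "build",
    devbranch = "main",
    push_preview = true,
)
```

Compared with a plain Documenter.jl setup, the main change is the format argument; the pages list keeps the same title-to-file form.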

This first task took some time, especially setting up GitHub and cloning the repository locally. At that point, my experience with GitHub was very limited, and I had to learn how to use the Git environment from scratch: for example, how to work with continuous integration (committing code to a shared repository), how to release and merge documentation, and how to test locally. The support of my mentors and the material I found online were really helpful.

The second task was to update the documentation of IPUMS.jl by working on the functions in the model folder of the IPUMS.jl repository. The main aim of this task was to provide a description of each function and its attributes, an example of a possible implementation and its result, and finally to show how to use it. The documentation to be updated covered several types of functions:

1. Data extract
2. Data set
3. Data table
4. Time series table
5. Error
6. Shapefile

Each of the macro-categories from 1 to 4 contains a set of functions, each with its own expected output and specific purpose. Information about what each function does, and the meaning of each input variable, was found on the IPUMS website, and references to it were included in the written documentation.
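As an illustration of this documentation pattern (description, attributes, example), a docstring could be laid out as in the sketch below. The type DataTableSketch is hypothetical and much simpler than the models in IPUMS.jl; its field names are borrowed from the kind of metadata IPUMS returns, not from the package source.

```julia
"""
    DataTableSketch(; name, description, universe)

Hypothetical, simplified stand-in for a data table model (not the IPUMS.jl type).

# Attributes

- `name::String`: table identifier, for example "NT1"
- `description::String`: short human-readable label
- `universe::String`: population the table covers

# Example

    table = DataTableSketch(name = "NT1", description = "Total Population", universe = "Persons")
"""
Base.@kwdef struct DataTableSketch
    name::String
    description::String
    universe::String
end

# Build one record using keyword arguments, as in the docstring example.
table = DataTableSketch(name = "NT1", description = "Total Population", universe = "Persons")
println(table.description)
```

With a docstring in this form, typing ?DataTableSketch in the Julia REPL shows the description, the attribute list, and the example together.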

How to work with IPUMS

After writing down the description of each function and its inputs, examples were formulated, starting from the IPUMS website. When you register at IPUMS, you are given an API key, which is used, among other things, to run pre-written code on the website. This code contains examples of these functions, and these examples were adapted by changing some input values and making them work in Julia. The latter was done by simply rewriting some structures, such as dictionaries, maps, or lists, in the Julia language. Here is a small guide on how to set up working with the API:

1. Create an IPUMS account
2. Log in to your account
3. Copy the API key, which can be obtained from the website
4. Use the key to run the code that is already available on the IPUMS Developer Portal, where you will also find information about the variables and packages
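Once the key has been copied, one way to keep it out of source files is to read it from an environment variable. The sketch below assumes a variable named IPUMS_API_KEY and a helper called ipums_headers; both names are made up for this example and are not part of IPUMS.jl.

```julia
# Hypothetical helper: IPUMS_API_KEY is a variable name chosen for this sketch.
# `env` defaults to the process environment but accepts any dictionary-like
# object, which makes the function easy to try out without touching ENV.
function ipums_headers(env = ENV)
    key = get(env, "IPUMS_API_KEY", "")
    isempty(key) && error("Set IPUMS_API_KEY to the key copied from your IPUMS account")
    # The key is sent in the Authorization header of each request.
    return Dict("Authorization" => key)
end

# Example with a stand-in environment instead of the real one.
headers = ipums_headers(Dict("IPUMS_API_KEY" => "example-key"))
println(headers["Authorization"])
```

The resulting dictionary has the same shape as the one passed as the second argument to IPUMSAPI in the testing example later in this post.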

Functions testing

A final task was to test the functions in the ‘api_IPUMSAPI.jl’ file. This file defines the functions to be tested along with other functions, and the most important ones are exported so that they are available throughout the framework. Some of the functions to be tested were the following:

metadata_nhgis_data_tables_get
metadata_nhgis_datasets_dataset_data_tables_data_table_get
metadata_nhgis_datasets_dataset_get
metadata_nhgis_datasets_get

Before working on the Julia files, I tested the original R functions in RStudio to understand how they behave.

Each function was then tested using the API key from the IPUMS registration, together with input examples taken from the documentation or from the IPUMS website. All functions ran successfully and gave the expected results, so it can be concluded that the translation from R to Julia was successful.

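using IPUMS
using OpenAPI

# API key obtained from your IPUMS account
api_key = "insert your key here"

# IPUMS API version used by every call below
version = "2"

# Pagination settings (defined here but not passed to the calls below)
page_number = 1
page_size = 2500

# Client that sends the key in the Authorization header
api = IPUMSAPI("https://api.ipums.org", Dict("Authorization" => api_key));

# All NHGIS data tables
res1 = metadata_nhgis_data_tables_get(api, version)

# Metadata for a single dataset
res2 = metadata_nhgis_datasets_dataset_get(api, "2022_ACS1", version);

# Metadata for one data table within a dataset
res3 = metadata_nhgis_datasets_dataset_data_tables_data_table_get(api, "2022_ACS1", "B01001", version);

# All NHGIS datasets
res4 = metadata_nhgis_datasets_get(api, version);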

An example of the output is:

. . .

{
  "name": "NT1",
  "nhgisCode": "AAA",
  "description": "Total Population",
  "universe": "Persons",
  "sequence": 1,
  "datasetName": "1790_cPop",
  "nVariables": [
    1
  ]
}

. . .
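A check on a record like this could also be automated with the Test standard library. The sketch below does not call the API: it assumes the record has already been parsed into a Dict, copies the values from the sample output, and uses a hypothetical helper named missing_fields. The list of expected fields is taken from that sample, not from an official schema.

```julia
using Test

# Stand-in for one parsed data table record; values copied from the sample output.
record = Dict(
    "name" => "NT1",
    "nhgisCode" => "AAA",
    "description" => "Total Population",
    "universe" => "Persons",
    "sequence" => 1,
    "datasetName" => "1790_cPop",
    "nVariables" => [1],
)

# Field names seen in the sample output (an assumption, not an official schema).
expected_fields = ["name", "nhgisCode", "description", "universe",
                   "sequence", "datasetName", "nVariables"]

# Hypothetical helper: list the expected fields that a record does not contain.
missing_fields(rec, fields = expected_fields) = [f for f in fields if !haskey(rec, f)]

@testset "data table record shape" begin
    @test isempty(missing_fields(record))
    @test record["sequence"] isa Integer
    @test record["datasetName"] == "1790_cPop"
end
```

Checking the shape of a record rather than its full contents keeps such a test stable when IPUMS adds new tables or datasets.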

Accomplished Goals and Future Development

The project was a 90-hour small project. During this time, the documentation was completed, the metadata functions were tested, and the documentation was migrated from Documenter.jl to DocumenterVitepress.jl. Some things took longer than I expected because of problems that came up along the way, so parts of the original plan were not completed. However, this time was useful for learning new things:

- I saw how to work with a package under development, how to work with large datasets, and how to write documentation.
- I had the opportunity to better understand how to work with Git and GitHub.
- I learned some new things about R, which was a completely unknown language to me.
- I deepened my knowledge of Julia, a language I had worked with during my time at university.
- I had the chance to work on a large open-source project, to be part of a large community, and to learn how to communicate with it efficiently.

A special thanks goes to my mentors, Jacob Zelko and Krishna Bhogaonker, for helping me through this process.

Future developments of this work could include deepening the work that my mentors and I have started, with the possibility of integrating this package with other machine learning packages in Julia and, from there, doing new analyses of the data in terms of social and geographical implications for health.

Citation

BibTeX citation:

@online{rocchetti2024,
  author = {Rocchetti, Michela},
  title = {GSoC ’24: {IPUMS.jl} {Small} {Project}},
  date = {2024-08-26},
  langid = {en}
}

For attribution, please cite this work as:

Rocchetti, Michela. 2024. “GSoC ’24: IPUMS.jl Small Project.” August 26, 2024.