
Global Demographic Data Analysis

Descriptive analysis of global life expectancy and under-5 mortality rates across 227 countries, five regions, and 21 subregions using Python, pandas, visualization, regional comparison, correlation analysis, and 2002–2022 change analysis.

Course: Introductory Case Studies

Data Source: U.S. Census IDB

Coverage: 227 Countries

Tools: Python / pandas

Project Overview

This project performs a descriptive statistical analysis of global demographic data. The goal is to compare life expectancy at birth and under-age-5 mortality rates across countries, sexes, regions, and subregions, and to study how these indicators changed between 2002 and 2022.

The project was completed for the Introductory Case Studies course at TU Dortmund. It focuses on exploratory and descriptive methods rather than predictive modelling, making it a strong foundation project for statistical data analysis and visualization.

Dataset

The dataset is an extract from the International Data Base of the U.S. Census Bureau. It includes demographic indicators for 227 countries for the years 2002 and 2022. The countries are grouped into five regions and 21 subregions.

The two main demographic indicators are life expectancy at birth and under-age-5 mortality rate. Both indicators are available for both sexes together and separately for males and females.

| Dataset Component | Description | Use in Analysis |
| --- | --- | --- |
| Country.Area.Name | Country or area name | Country-level comparison |
| Region | Five global regions | Regional averages and comparisons |
| Subregion | 21 geographical subregions | Homogeneity and heterogeneity analysis |
| Year | 2002 and 2022 | 20-year change analysis |
| Life Expectancy | At birth, by both sexes, males, and females | Health and longevity indicator |
| Under-5 Mortality | Mortality rate by both sexes, males, and females | Child health and development indicator |
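As a minimal sketch of how such an extract can be loaded and split by year with pandas: the toy CSV below stands in for the real file, its values are illustrative only, and the indicator column names `LE_Both` and `U5M_Both` are hypothetical (only `Country.Area.Name`, `Region`, `Subregion`, and `Year` come from the dataset description above).

```python
import io

import pandas as pd

# Toy CSV standing in for the IDB extract; values are illustrative, not real data.
# LE_Both and U5M_Both are hypothetical column names.
csv_text = """Country.Area.Name,Region,Subregion,Year,LE_Both,U5M_Both
Country A,Europe,Western Europe,2002,78.0,6.0
Country A,Europe,Western Europe,2022,82.0,4.0
Country B,Africa,Eastern Africa,2002,52.0,120.0
Country B,Africa,Eastern Africa,2022,64.0,60.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# One frame per year: 2022 for the cross-sectional tasks, both for the change task.
df_2022 = df[df["Year"] == 2022].reset_index(drop=True)
df_2002 = df[df["Year"] == 2002].reset_index(drop=True)

print(df_2022[["Country.Area.Name", "LE_Both", "U5M_Both"]])
```

With a real file, `pd.read_csv` would take the file path instead of the in-memory buffer.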

Research Tasks

The project was designed around five main descriptive-analysis tasks. The first four tasks used only the 2022 data. The final task compared 2002 and 2022 to study changes over a 20-year period.

  1. Describe frequency distributions of life expectancy and under-5 mortality variables.
  2. Compare differences between sexes and regions.
  3. Analyze homogeneity within subregions and heterogeneity between subregions.
  4. Investigate bivariate correlations between demographic indicators.
  5. Compare changes in life expectancy and under-5 mortality from 2002 to 2022.

Statistical Methods

The project used standard descriptive statistics and graphical methods. These methods were selected because the course objective was to practice exploratory and descriptive analysis with interpretable statistical measures.

| Method | Purpose | Interpretation |
| --- | --- | --- |
| Mean and Median | Measure central tendency | Summarize typical life expectancy or mortality values |
| Variance and Standard Deviation | Measure spread | Show how much countries differ within groups |
| Quantiles | Summarize distribution shape | Compare lower, middle, and upper parts of the distribution |
| Histograms | Display frequency distributions | Show skewness and concentration of values |
| Boxplots | Compare groups visually | Reveal median, spread, and outliers |
| Correlation Matrix | Analyze bivariate relationships | Measure linear associations between demographic variables |
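The numerical measures in the table map directly onto pandas `Series` methods. The sketch below uses a handful of made-up life expectancy values (not taken from the dataset) and a hypothetical series name, `LE_Both`.

```python
import pandas as pd

# Illustrative values only, not taken from the IDB extract.
le = pd.Series([55.0, 62.0, 70.0, 74.0, 78.0, 81.0, 84.0], name="LE_Both")

summary = {
    "mean": le.mean(),
    "median": le.median(),
    "variance": le.var(),  # sample variance (ddof=1), the pandas default
    "std": le.std(),       # sample standard deviation (ddof=1)
    "q25": le.quantile(0.25),
    "q75": le.quantile(0.75),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

`le.describe()` returns most of these in one call. The explicit dictionary just makes each measure visible.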

Frequency Distribution Analysis

The 2022 distribution showed clear global variation in life expectancy. The average life expectancy for both sexes was around 74.8 years. Monaco had the highest life expectancy, while Afghanistan had the lowest value in the dataset.

  • 74.8 years: average life expectancy for both sexes across regions in 2022.
  • 89.52 years: highest life expectancy, observed for Monaco.
  • 53.65 years: lowest life expectancy, observed for Afghanistan.
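The frequency distribution behind a histogram can also be tabulated directly. The following sketch bins a few made-up life expectancy values into 10-year classes with `pd.cut`. The values and bin edges are assumptions chosen for illustration, not the ones used in the project.

```python
import pandas as pd

# Illustrative life expectancy values in years (not real data).
le = pd.Series([54.0, 61.5, 66.0, 72.0, 73.5, 76.0, 79.0, 82.0, 88.0])

# Left-closed 10-year classes: [50, 60), [60, 70), [70, 80), [80, 90).
bins = [50, 60, 70, 80, 90]
classes = pd.cut(le, bins=bins, right=False)

# Absolute and relative frequencies per class, in class order.
counts = classes.value_counts().sort_index()
rel_freq = counts / counts.sum()

print(pd.DataFrame({"count": counts, "share": rel_freq.round(2)}))
```

The same bin edges can be passed to a plotting call such as `Series.plot.hist(bins=bins)`, so the table and the histogram show the same classes.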

Sex-Based Differences

Female life expectancy was generally higher than male life expectancy. Monaco had the highest female life expectancy, while Afghanistan had the lowest values for both males and females. The project used bar plots to compare male and female life expectancy across regions.
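A grouped bar plot of this kind can be sketched as follows. The data are toy values, the column names `LE_Male` and `LE_Female` are hypothetical, and the non-interactive `Agg` backend is selected only so the snippet runs without a display. This is not the original notebook code.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Toy 2022 rows; values are illustrative only.
df_2022 = pd.DataFrame({
    "Region": ["Europe", "Europe", "Africa", "Africa"],
    "LE_Male": [78.0, 80.0, 60.0, 64.0],
    "LE_Female": [83.0, 85.0, 64.0, 68.0],
})

# Mean life expectancy per region and sex, plus the female-minus-male gap.
by_region = df_2022.groupby("Region")[["LE_Male", "LE_Female"]].mean()
by_region["Gap"] = by_region["LE_Female"] - by_region["LE_Male"]

# One pair of bars (male, female) per region.
ax = by_region[["LE_Male", "LE_Female"]].plot(kind="bar", rot=0)
ax.set_ylabel("Life expectancy at birth (years)")
ax.set_title("Mean life expectancy by region and sex (toy data)")
plt.tight_layout()

print(by_region)
```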

Regional Comparison

The regional comparison showed strong differences between the five regions. Europe had the highest average life expectancy, while Africa had the lowest. The Americas were generally second-highest after Europe, while Asia and Oceania showed similar average levels.

| Region | Life Expectancy Pattern | Under-5 Mortality Pattern |
| --- | --- | --- |
| Europe | Highest average life expectancy | Lowest under-5 mortality |
| Americas | High average life expectancy | Low to moderate under-5 mortality |
| Asia | Middle range, close to Oceania | Second-highest under-5 mortality after Africa |
| Oceania | Middle range, similar to Asia | Moderate under-5 mortality |
| Africa | Lowest average life expectancy | Highest under-5 mortality |
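A regional ranking like the one in the table comes from a single `groupby` followed by a sort. The sketch below uses toy values and hypothetical column names (`LE_Both`, `U5M_Both`). It shows the mechanics only and does not reproduce the project's figures.

```python
import pandas as pd

# Toy 2022 rows; values are illustrative only.
df_2022 = pd.DataFrame({
    "Region": ["Europe", "Europe", "Americas", "Americas", "Africa", "Africa"],
    "LE_Both": [80.0, 82.0, 75.0, 77.0, 60.0, 64.0],
    "U5M_Both": [4.0, 6.0, 12.0, 16.0, 70.0, 50.0],
})

# Regional means, ordered from highest to lowest life expectancy.
regional = (
    df_2022.groupby("Region")[["LE_Both", "U5M_Both"]]
    .mean()
    .sort_values("LE_Both", ascending=False)
)

print(regional)
```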

Under-5 Mortality Analysis

Under-age-5 mortality was much more unequal across regions than life expectancy. Africa showed the highest mortality rate, followed by Asia. Europe and the Americas had the lowest mortality rates. The analysis also showed that male under-5 mortality was higher than female under-5 mortality on average.

  • 26.67 deaths per 1,000 live births: average under-5 mortality rate for both sexes in 2022.
  • 29.23 deaths per 1,000 live births: average under-5 mortality rate for males in 2022.
  • 24.01 deaths per 1,000 live births: average under-5 mortality rate for females in 2022.

Boxplots were used to compare mortality rates between males, females, and both sexes. The distributions showed many high-value outliers, especially for countries with much higher child mortality rates.
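The points a boxplot draws as outliers can be reproduced numerically. The sketch below applies the common 1.5 × IQR rule, which is matplotlib's default whisker setting, to a few made-up mortality values. Whether the project's plots used this default is an assumption.

```python
import pandas as pd

# Illustrative under-5 mortality rates (not real data); right-skewed on purpose.
u5m = pd.Series([3.0, 4.0, 5.0, 8.0, 12.0, 20.0, 35.0, 110.0], name="U5M_Both")

q1 = u5m.quantile(0.25)
q3 = u5m.quantile(0.75)
iqr = q3 - q1

# Values beyond the upper fence would appear as individual points above the whisker.
upper_fence = q3 + 1.5 * iqr
outliers = u5m[u5m > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper_fence}")
print(outliers.tolist())
```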

Subregion Homogeneity and Heterogeneity

One part of the project focused on Africa and compared its subregions. Boxplots were used to assess whether values were homogeneous within subregions and heterogeneous between different subregions.

The analysis found visible heterogeneity among African subregions. Northern Africa showed relatively high variation in life expectancy, while Southern Africa appeared less heterogeneous. For under-5 mortality, the variation was also large, with several outliers and different median levels across African subregions.

  • Box size and median position were used to assess homogeneity.
  • A small box indicated homogeneity within a subregion, and subregions with similar box sizes and median lines were interpreted as more homogeneous.
  • Large spreads and outliers indicated heterogeneity.
  • Africa showed strong subregional differences in both life expectancy and under-5 mortality.
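The visual reading of the boxplots can be backed by per-subregion spread measures. This sketch computes the median, interquartile range, and standard deviation per subregion on toy values. The numbers are invented to illustrate a wide versus a narrow group, and `LE_Both` is a hypothetical column name.

```python
import pandas as pd

# Toy life expectancy values for two subregions (illustrative only).
africa = pd.DataFrame({
    "Subregion": ["Northern Africa"] * 4 + ["Southern Africa"] * 4,
    "LE_Both": [60.0, 68.0, 74.0, 78.0, 62.0, 63.0, 64.0, 65.0],
})


def iqr(values: pd.Series) -> float:
    """Interquartile range: the height of the box in a boxplot."""
    return values.quantile(0.75) - values.quantile(0.25)


# Within-subregion spread (iqr, std) and level (median) side by side.
spread = africa.groupby("Subregion")["LE_Both"].agg(
    median="median", iqr=iqr, std="std"
)

print(spread)
```

A small `iqr` points to homogeneity within a subregion, while clearly different `median` values across rows point to heterogeneity between subregions.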

Bivariate Correlation Analysis

The project used correlation analysis to study linear relationships between life expectancy variables and under-5 mortality variables. A correlation matrix and heatmap were created for the 2022 data.

As expected, male, female, and both-sex life expectancy variables were strongly positively correlated with each other. Under-5 mortality variables were also strongly positively correlated with each other. Life expectancy and under-5 mortality showed a negative relationship, meaning that countries with higher life expectancy tended to have lower under-5 mortality.
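A Pearson correlation matrix of this kind is one call in pandas. The sketch below uses toy values and hypothetical column names. The resulting matrix is what a heatmap function (for example seaborn's `heatmap`) would take as input.

```python
import pandas as pd

# Toy 2022 values (illustrative only): life expectancy rises as mortality falls.
df_2022 = pd.DataFrame({
    "LE_Male": [60.0, 65.0, 70.0, 75.0, 80.0],
    "LE_Female": [64.0, 70.0, 75.0, 80.0, 85.0],
    "U5M_Both": [90.0, 60.0, 35.0, 15.0, 5.0],
})

# Pairwise Pearson correlation coefficients between all numeric columns.
corr = df_2022.corr(method="pearson")

print(corr.round(2))
```

Pearson's coefficient captures only linear association, so a strongly curved relationship can yield a weaker coefficient than the scatter plot suggests.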

Change from 2002 to 2022

The final part compared demographic indicators from 2002 and 2022. The project found a general improvement in global demographic outcomes over the 20-year period: life expectancy increased in many countries and under-5 mortality decreased.

  • Most countries and subregions showed an upward trend in life expectancy.
  • Under-5 mortality generally declined from 2002 to 2022.
  • Regional inequality remained visible, especially between Africa and Europe.
  • The results support the broader pattern of global health improvement with persistent regional disparities.
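One way to compute the 20-year change per country is to pivot the two years into columns and subtract. The sketch below uses toy values and hypothetical indicator names (`LE_Both`, `U5M_Both`). It is not the original notebook code.

```python
import pandas as pd

# Toy long-format data: one row per country and year (illustrative values).
df = pd.DataFrame({
    "Country.Area.Name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2002, 2022, 2002, 2022],
    "LE_Both": [78.0, 82.0, 52.0, 64.0],
    "U5M_Both": [6.0, 4.0, 120.0, 60.0],
})

# Wide format: one row per country, one column per (indicator, year) pair.
wide = df.pivot(
    index="Country.Area.Name", columns="Year", values=["LE_Both", "U5M_Both"]
)

# Positive LE_change and negative U5M_change both indicate improvement.
change = pd.DataFrame({
    "LE_change": wide["LE_Both"][2022] - wide["LE_Both"][2002],
    "U5M_change": wide["U5M_Both"][2022] - wide["U5M_Both"][2002],
})

print(change)
```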

Python Implementation

The analysis was implemented in Python using pandas for data handling and matplotlib/seaborn for visualization. The notebook/code included data loading, filtering by year, grouping by region and subregion, summary statistics, boxplots, histograms, and correlation heatmaps.

Implemented Steps

  • Loaded the CSV dataset with pandas.
  • Filtered the dataset for 2022 and compared it with 2002.
  • Computed means, medians, standard deviations, quantiles, and descriptive summaries.
  • Created histograms for life expectancy distributions.
  • Created bar charts for regional comparisons.
  • Created boxplots for mortality and subregion heterogeneity.
  • Computed and visualized correlation matrices.

Limitations

This project is an introductory descriptive analysis. It does not use inferential models or causal methods, so the results should be interpreted as descriptive patterns rather than causal explanations. Some parts of the original code contained exploratory trial-and-error cells and minor variable-name mistakes, but the final analysis still demonstrates the core descriptive workflow.

  • The analysis is descriptive and not causal.
  • Only two years, 2002 and 2022, were compared.
  • Country-level demographic indicators may hide within-country inequality.
  • Some results depend on regional grouping definitions.
  • The project is useful as a foundation-level exploratory data analysis project.

Outcome

This project strengthened my foundation in exploratory data analysis, descriptive statistics, data visualization, regional comparison, correlation analysis, and Python-based data workflows. It also helped connect statistical methods with real-world demographic and public-health indicators.

It is a useful portfolio project for showing solid basics in pandas, matplotlib, statistical summaries, and visual interpretation of global demographic data.

Tags: Descriptive Statistics, Demography, Life Expectancy, Under-5 Mortality, Python, pandas, matplotlib, Correlation Analysis, Data Visualization