Statistics Meets Linguistics

Project Overview

This project explores the intersection of statistics and linguistics in political media coverage. The main objective is to analyze how politicians are described in media narratives and whether personal attributes such as age, gender, and ethnicity are associated with the adjectives used to portray them.

The project focuses on adjective usage in descriptions of political figures from the Politico 28 Class of 2023 and related media text from the NOW Corpus. The analysis combines linguistic categorization with statistical testing and machine learning classification.

Research Motivation

Political media narratives influence how the public perceives political leaders. Adjectives such as “competent”, “emotional”, “defiant”, “energetic”, “far-right”, or “experienced” can frame a politician’s identity and shape public interpretation of leadership, credibility, ideology, and character.

Main research question: How do politicians’ personal attributes such as age, gender, and ethnicity influence the adjectives used to describe them in media reports?

Data Sources

The analysis is based on two main data sources. The first is Politico Europe’s Politico 28 Class of 2023, which profiles influential figures in European politics, policy, and culture. The second is the News on the Web (NOW) Corpus, a large corpus of online news text used to examine broader media language patterns.

Politico 28 Class of 2023: used for targeted political profiles and manually extracted descriptive adjectives.
NOW Corpus: used for broader media context and adjective usage in online news discourse.
Manual categorization: adjectives were mapped to tags such as competence, emotion, politics, character, age, and appearance.

Dataset

The dataset combines politician-level attributes with adjective-level linguistic information. Each row represents an adjective used to describe a politician, together with metadata such as gender, age, ethnicity, token frequency, and semantic tag category.

Variable	Description	Example
ADJ	Adjective used in media descriptions	defiant, energetic, experienced
Name	Name of the politician	Volodymyr Zelenskyy
Gender	Gender of the politician	male / female
Age	Age of the politician	44, 54, 67
Ethnicity	Ethnic or national background	Ukrainian, German, French
Tokens	Frequency of the adjective in the dataset	9 for “defiant”
Tag	Semantic category of the adjective	competence, emotion, politics, character
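The row layout described in the table can be sketched as a small Python record type. This is only an illustration: the original tooling is not stated, the class name `AdjectiveRow` and the helper `tokens_by_tag` are hypothetical, and the sample rows use placeholder politicians and assumed tag assignments rather than values from the real dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AdjectiveRow:
    """One adjective used to describe one politician (fields follow the table)."""
    adj: str
    name: str
    gender: str
    age: int
    ethnicity: str
    tokens: int
    tag: str


def tokens_by_tag(rows):
    """Sum token frequencies per semantic tag."""
    totals = Counter()
    for row in rows:
        totals[row.tag] += row.tokens
    return dict(totals)


# Placeholder rows: names, ages, and tag assignments are assumptions for illustration.
rows = [
    AdjectiveRow("defiant", "Politician A", "male", 44, "Ukrainian", 9, "character"),
    AdjectiveRow("energetic", "Politician B", "female", 54, "German", 4, "character"),
    AdjectiveRow("experienced", "Politician C", "male", 67, "French", 5, "competence"),
]

print(tokens_by_tag(rows))
```

Keeping one adjective per row makes it straightforward to aggregate by tag, gender, or age group later in the analysis.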

Problem Definition

The project investigates whether patterns in adjective usage are associated with demographic and political attributes. The analysis asks whether younger and older politicians receive different descriptions, whether gender is associated with competence or emotion-related adjectives, and whether ethnicity influences descriptive language patterns.

A secondary modelling task uses age and adjective tags to predict gender, not because gender prediction itself is the final goal, but because the model can reveal which linguistic and demographic variables carry predictive information.

Statistical Methods

The project combines exploratory data analysis with statistical testing and interpretable machine learning. The goal is not only prediction, but also interpretation of linguistic and demographic patterns in media descriptions.

Method	Purpose	Interpretation
Exploratory Data Analysis	Inspect distributions of gender, ethnicity, age, and adjective tags	Understand dataset structure and possible imbalance
Welch Two Sample t-Test	Compare group means without assuming equal variances	Test differences in adjective frequencies between groups
Density Plots	Visualize distributional differences	Useful for comparing descriptor frequency across groups
Decision Tree	Predict gender using age and adjective tags	Interpretable model for linguistic-demographic patterns
Confusion Matrix	Evaluate classification performance	Shows correct and incorrect gender predictions
MSE / RMSE / R²	Measure model prediction error and explanatory power	Summarize model quality and limitations

Exploratory Analysis

The dataset contains a diverse set of politicians with different ages, genders, and ethnic backgrounds. The age range spans roughly from the mid-40s to the early 70s, and the ethnicity distribution includes Ukrainian, German, French, American, Turkish, Estonian, and other backgrounds.

French ethnicity appeared most frequently in the analyzed politician dataset.
The gender distribution showed a male majority, but with substantial female representation.
Tags such as politics, character, age, emotion, competence, and status were used to classify adjectives.
Gender-based tag comparisons suggested different descriptor patterns for male and female politicians.
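A gender-by-tag comparison of this kind could be computed roughly as follows. The sketch assumes each record is a `(gender, tag, tokens)` tuple and reports, for each gender, the share of tokens falling under each tag; the function name `tag_share_by_gender` and the sample numbers are hypothetical and not taken from the project data.

```python
from collections import Counter, defaultdict


def tag_share_by_gender(records):
    """Return {gender: {tag: share of that gender's tokens}}."""
    totals = defaultdict(Counter)
    for gender, tag, tokens in records:
        totals[gender][tag] += tokens
    shares = {}
    for gender, counts in totals.items():
        total = sum(counts.values())
        shares[gender] = {tag: n / total for tag, n in counts.items()}
    return shares


# Made-up records for illustration only.
records = [
    ("male", "competence", 6),
    ("male", "emotion", 2),
    ("female", "competence", 3),
    ("female", "emotion", 3),
    ("female", "character", 2),
]

shares = tag_share_by_gender(records)
for gender, tag_shares in shares.items():
    print(gender, tag_shares)
```

Using shares rather than raw counts keeps the comparison meaningful when one gender has more rows than the other, which matters given the male majority noted above.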

Statistical Testing

Welch t-tests were used to compare adjective-frequency patterns across groups. One analysis suggested that adjectives tagged as “competence” appeared more frequently for male politicians than for female politicians in the dataset.

t = 3.70: Welch t-test statistic for competence descriptor frequency across gender.

p = 0.00057: statistically significant difference in competence descriptor frequency across gender.

p = 0.680: no significant age-group difference in emotion descriptor usage.

These findings suggest potential gender-related differences in how competence is linguistically attributed, while age did not clearly explain the use of emotion-related descriptors in this dataset.
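For readers unfamiliar with the test, the Welch statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed with the standard library alone. This is a minimal sketch, not the project's actual code: the function name `welch_t` and the two sample lists are hypothetical, and the p-value step is left out because it needs the t distribution (for example, `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on samples a and b."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical per-politician competence token counts for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike the pooled-variance t-test, this version does not assume the two groups share a variance, which is a safer default when group sizes and spreads differ.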

Decision Tree Model

A decision tree model was used to predict gender based on age and adjective tag variables. The goal was to produce an interpretable model that reveals which features contribute most to gender-related linguistic patterns.

The decision tree initially split politicians by age group, indicating that age carried predictive information in the model. Subsequent splits used linguistic tag features such as character, politics, and competence. This helped identify how demographic variables and adjective categories interacted in the dataset.
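How a tree chooses a first split such as an age threshold can be illustrated with Gini impurity, a common splitting criterion (it is, for instance, the default in scikit-learn's `DecisionTreeClassifier`). Whether the original model used Gini is not stated, so treat this as a sketch under that assumption; the helper names and the toy ages and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split of the form value <= threshold.

    Returns (None, parent_gini) if no candidate split lowers the impurity.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best


# Toy data: ages and gender labels are made up for illustration.
ages = [44, 47, 52, 58, 63, 70]
genders = ["female", "female", "female", "male", "male", "male"]

print(best_threshold(ages, genders))
```

A full tree repeats this search over every feature at every node, which is why a feature that yields the cleanest first split, here age, ends up at the root.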

Feature Interpretation

Age: the strongest feature in the final model.
Character tags: important for separating some gender groups.
Politics tags: captured ideological and role-related descriptions.
Competence tags: reflected ability, leadership, and performance-related descriptions.

Model Evaluation

The final decision tree model was evaluated using a confusion matrix and common classification metrics. The confusion matrix shows that the model classified many cases correctly but still produced false positives and false negatives, likely because of the small dataset and overlapping linguistic patterns across groups.

Precision = 0.85: high precision for the positive class in the final reported model.

Recall = 0.80: the model captured a large proportion of positive-class examples.

R² = 0.2849: moderate explanatory power, with room for improvement.
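Precision and recall both follow directly from the confusion matrix counts. The sketch below shows the arithmetic on made-up labels; it does not reproduce the project's confusion matrix, and the function names are hypothetical. Labels are coded 1 for the positive class and 0 otherwise.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Precision answers "of the cases predicted positive, how many were correct", while recall answers "of the truly positive cases, how many were found", so the two can diverge on imbalanced data.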

Additional Metrics

MSE: 0.1643.
RMSE: 0.4054.
Feature importance: age contributed the largest share in the reported model.
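MSE, RMSE, and R² are regression-style metrics, so applying them to a classifier presumably means treating the gender labels as 0/1 values; that coding is an assumption here, not something the write-up states. The sketch below shows the calculation on made-up labels and also checks that the reported RMSE is consistent with the reported MSE. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for numeric targets and predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    rmse = math.sqrt(mse)
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2


# Made-up 0/1 labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(regression_metrics(y_true, y_pred))

# Consistency check on the reported values: RMSE should be the square root of MSE.
print(math.sqrt(0.1643))  # roughly 0.4053, matching the reported 0.4054 up to rounding
```

One useful property: when predictions are hard 0/1 labels, the MSE equals the misclassification rate, so it carries the same information as accuracy. If the reported values were computed on predicted probabilities instead, that equivalence does not hold.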

Key Findings

The analysis suggests that media descriptions of politicians are not random. Some adjective categories appear to be associated with demographic and political attributes, especially gender and age. However, due to the limited dataset size, the results should be interpreted as exploratory rather than definitive.

Competence descriptors were more frequent for male politicians in the analyzed dataset.
Emotion descriptors did not show a statistically significant difference between age groups.
Decision tree results suggested that age and adjective tag categories contain useful predictive information.
The model achieved good precision (0.85) and recall (0.80), but explanatory power remained moderate (R² = 0.2849).
The project highlights how language can reflect or reinforce media framing and political stereotypes.

Limitations

The main limitation is dataset size. Politico 28 is a small and curated sample, so conclusions cannot be generalized to all political media coverage. Some adjective categories were manually assigned, which can introduce subjectivity. Additionally, token frequency may be influenced by media attention rather than only by personal characteristics.

Small sample size based on a curated group of political figures.
Manual adjective tagging may introduce classification bias.
NOW Corpus context may vary by source, country, and publication style.
Decision tree models are interpretable but sensitive to small data changes.
Results should be treated as exploratory evidence, not causal proof of media bias.

Future Work

Future work could expand the dataset, include more political figures, add more news sources, and improve linguistic feature extraction using NLP methods. More advanced models could be compared with the decision tree while preserving interpretability.

Use larger corpora and more political profiles.
Automate adjective extraction with NLP pipelines.
Apply sentiment analysis and contextual embeddings.
Compare decision trees with random forests, logistic regression, and transformer-based models.
Validate findings across countries, media outlets, and time periods.

Outcome

This project strengthened my ability to combine statistical analysis with text-based data and linguistic interpretation. It demonstrates practical experience with data cleaning, feature categorization, exploratory visualization, hypothesis testing, decision tree modelling, and model evaluation in a social-science context.

It is a useful portfolio project because it connects statistics, machine learning, NLP-style feature engineering, and political media analysis in one interpretable workflow.
