Using the Dataset
A synthetic dataset and R analysis script to accompany every worked example in the course and textbook.
Overview
The dataset is synthetic — it was generated to match the descriptive statistics, correlation structure, and missing data pattern of the original UCA dissertation (Todd McCaffrey, ATU Letterkenny, 2025). It is not the real survey data, but it behaves statistically in the same way, so every result you reproduce here should closely mirror the dissertation findings.
Variables
| Column name | Type | Range | Description |
|---|---|---|---|
| participant_id | integer | 1 – 167 | Unique participant identifier |
| age | integer | 17 – 65 | Participant age in years. Mean ≈ 24.3, SD ≈ 6.1 |
| gender | integer | 1, 2, 3 | 1 = Female (58%), 2 = Male (39%), 3 = Non-binary / Other (3%) |
| moral_disengagement | numeric | 1.0 – 5.0 | Composite scale score. Bandura's Mechanisms of Moral Disengagement Scale (8 items). Higher = greater disengagement. M = 2.81, SD = 0.60 |
| cyber_aggression | numeric | 1.0 – 5.0 | Composite scale score. Cyber-Aggression Typology Questionnaire. Higher = greater aggression. M = 3.08, SD = 0.65 |
| ai_trust | numeric | 1.0 – 5.0 | Composite scale score. AI Trust Scale (5 items). Higher = greater trust in AI systems. M = 3.37, SD = 0.68 |
| ai_use | numeric | 1.0 – 5.0 | Composite scale score. AI Use Frequency Scale (4 items). Higher = more frequent AI use. M = 3.19, SD = 0.71 |
Download
Two files. Save both to the same folder on your machine, then set your R working directory to that folder before loading the CSV.
R Setup
If you haven't used R before, install R and RStudio first. Then install the required packages; you only need to do this once.
```r
# Install required packages (run once)
install.packages(c(
  "mice",    # multiple imputation
  "psych",   # descriptive stats, alpha
  "ggplot2", # visualisation
  "dplyr"    # data wrangling
))
```
```r
# Load packages at the start of every session
library(mice)
library(psych)
library(ggplot2)
library(dplyr)
```
```r
# Point R at the folder containing the CSV
setwd("/path/to/your/folder")
```
Load & Explore
First steps: load the data and get a feel for it.
```r
# Load the dataset
df <- read.csv("uca_synthetic.csv")

# First look
head(df)  # first 6 rows
str(df)   # structure: variable types
dim(df)   # rows × columns: should be 167 × 7
```
Descriptive Statistics
```r
# Full descriptives: mean, SD, median, skew, kurtosis
describe(df[, c("moral_disengagement", "cyber_aggression", "ai_trust", "ai_use")])

# Check missing values
colSums(is.na(df))

# Visualise missing data pattern
md.pattern(df[, c("moral_disengagement", "cyber_aggression", "ai_trust", "ai_use")])
```
You should see means close to: MD = 2.81, CA = 3.08, AIT = 3.37, AIU = 3.19. The missing data pattern will show which combinations of variables have missing values simultaneously.
Multiple Imputation
Before running any analysis, handle the missing data properly using multiple imputation with the mice package. This creates m = 5 complete datasets; any analysis you run is then pooled across all five using Rubin's Rules.
1. Create imputed datasets. The `mice()` function runs the imputation. `method = "pmm"` is predictive mean matching: it replaces missing values with plausible observed values from similar participants.
2. Run your analysis on each dataset. Use `with(imp, ...)` to apply a model to all 5 imputed datasets automatically.
3. Pool the results. `pool()` combines the 5 sets of estimates using Rubin's Rules, producing a single set of coefficients with correctly inflated standard errors.
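To see what the pooling step actually computes, here is a minimal base-R sketch of Rubin's Rules. The coefficient and standard-error values below are made up for illustration; they stand in for the per-imputation estimates that `pool()` extracts from the five fitted models.

```r
# Minimal sketch of Rubin's Rules (what pool() does internally).
# est and se are hypothetical estimates/SEs for one predictor
# from m = 5 regressions, one per imputed dataset.
est <- c(0.52, 0.49, 0.55, 0.50, 0.53)
se  <- c(0.08, 0.09, 0.08, 0.09, 0.08)
m   <- length(est)

q_bar <- mean(est)          # pooled estimate: mean of the m estimates
w     <- mean(se^2)         # within-imputation variance
b     <- var(est)           # between-imputation variance
t_var <- w + (1 + 1/m) * b  # total variance: correctly inflated by b

pooled_se <- sqrt(t_var)
cat("Pooled estimate:", round(q_bar, 3),
    " Pooled SE:", round(pooled_se, 3), "\n")
```

Note that the pooled SE is always at least as large as the average within-imputation SE: the between-imputation term carries the extra uncertainty introduced by the missing data.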
```r
# Step 1: Create 5 imputed datasets
set.seed(42)  # for reproducibility
imp <- mice(
  df[, c("moral_disengagement", "cyber_aggression",
         "ai_trust", "ai_use", "age", "gender")],
  m = 5,             # number of imputed datasets
  method = "pmm",    # predictive mean matching
  printFlag = FALSE  # suppress iteration output
)

# Check imputation looks reasonable
densityplot(imp)  # imputed values should overlap observed

# Get one complete dataset for exploration
df_complete <- complete(imp, 1)
```
densityplot() shows the distribution of imputed values (magenta) overlaid on observed values (blue) for each variable. They should look similar; if the imputed values fall in a completely different range, something is wrong with the imputation model.
Hierarchical Regression
The key analysis. Three blocks entered sequentially. The critical question is whether Block 3 (AI factors) adds significant incremental variance above Block 2.
```r
# Run each block on all 5 imputed datasets
# Block 1: Demographics only
fit1 <- with(imp, lm(cyber_aggression ~ age + gender))

# Block 2: + Moral Disengagement
fit2 <- with(imp, lm(cyber_aggression ~ age + gender + moral_disengagement))

# Block 3: + AI Trust and AI Use
fit3 <- with(imp, lm(cyber_aggression ~ age + gender + moral_disengagement +
                       ai_trust + ai_use))

# Pool results using Rubin's Rules
summary(pool(fit1))
summary(pool(fit2))
summary(pool(fit3))
```
```r
# R² and ΔR² using one complete dataset
m1 <- lm(cyber_aggression ~ age + gender, data = df_complete)
m2 <- lm(cyber_aggression ~ age + gender + moral_disengagement,
         data = df_complete)
m3 <- lm(cyber_aggression ~ age + gender + moral_disengagement +
           ai_trust + ai_use, data = df_complete)

cat("Block 1 R²:", round(summary(m1)$r.squared, 3), "\n")
cat("Block 2 R²:", round(summary(m2)$r.squared, 3),
    " ΔR²:", round(summary(m2)$r.squared - summary(m1)$r.squared, 3), "\n")
cat("Block 3 R²:", round(summary(m3)$r.squared, 3),
    " ΔR²:", round(summary(m3)$r.squared - summary(m2)$r.squared, 3), "\n")

# F-change test: does Block 3 add significantly?
anova(m2, m3)
```
t-test
Compare two groups on cyber-aggression. Here: male vs. female participants.
```r
# Independent samples t-test: gender × cyber-aggression
# Filter to male (2) and female (1) only
df_gender <- df_complete[df_complete$gender %in% c(1, 2), ]
t_result <- t.test(cyber_aggression ~ gender, data = df_gender)
print(t_result)

# Cohen's d (effect size)
male   <- df_gender$cyber_aggression[df_gender$gender == 2]
female <- df_gender$cyber_aggression[df_gender$gender == 1]
cohens_d <- (mean(male) - mean(female)) /
  sqrt((sd(male)^2 + sd(female)^2) / 2)
cat("Cohen's d =", round(cohens_d, 3), "\n")
```
Correlation
```r
# Correlation matrix with p-values
corr.test(df_complete[, c("moral_disengagement", "cyber_aggression",
                          "ai_trust", "ai_use")])

# Scatter plot: MD vs CA with regression line
ggplot(df_complete, aes(x = moral_disengagement, y = cyber_aggression)) +
  geom_point(alpha = 0.5, colour = "#0dcfb2") +
  geom_smooth(method = "lm", colour = "#f59e0b") +
  labs(x = "Moral Disengagement", y = "Cyber-Aggression",
       title = "Moral Disengagement → Cyber-Aggression") +
  theme_minimal()
```
Expected: r(MD, CA) ≈ .61 — a large positive correlation. r(AIT, CA) ≈ .15, r(AIU, CA) ≈ .19 — small, and likely non-significant at n = 167 after accounting for other variables.
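Why can a correlation that looks respectable in the matrix fade once other variables are accounted for? The sketch below illustrates the mechanism with simulated toy data (base R only; the coefficients and variable names are illustrative, not the dataset's actual values): when a shared predictor drives both variables, the partial correlation controlling for it shrinks toward zero.

```r
# Toy illustration: a zero-order correlation shrinking once a shared
# driver is controlled. Effect sizes here are made up for the demo.
set.seed(1)
n   <- 167
md  <- rnorm(n)             # shared driver (think: moral disengagement)
ca  <- 0.6 * md + rnorm(n)  # outcome (think: cyber-aggression)
ait <- 0.3 * md + rnorm(n)  # second predictor (think: AI trust)

r_ca_ait <- cor(ca, ait)    # zero-order correlation (inflated by md)
r_ca_md  <- cor(ca, md)
r_ait_md <- cor(ait, md)

# First-order partial correlation: ca–ait controlling for md
r_partial <- (r_ca_ait - r_ca_md * r_ait_md) /
  sqrt((1 - r_ca_md^2) * (1 - r_ait_md^2))

cat("Zero-order r:", round(r_ca_ait, 3),
    " Partial r (md controlled):", round(r_partial, 3), "\n")
```

The psych package also offers partial.r() for the same computation directly on a data frame, if you prefer not to apply the formula by hand.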
Full Script
The complete annotated R script runs all of the above in sequence. Download it and open it in RStudio; it is designed to be read top to bottom alongside the textbook.