Using the Dataset
A synthetic dataset and R analysis script to accompany every worked example in the course and textbook.
Overview
The dataset is synthetic — it was generated to match the descriptive statistics, correlation structure, and missing data pattern of the original UCA dissertation (Todd McCaffrey, ATU Letterkenny, 2025). It is not the real survey data, but it behaves statistically in the same way, so every result you reproduce here should closely mirror the dissertation findings.
Variables
| Column name | Type | Range | Description |
|---|---|---|---|
| participant_id | integer | 1 – 167 | Unique participant identifier |
| age | integer | 17 – 65 | Participant age in years. Mean ≈ 24.3, SD ≈ 6.1 |
| gender | integer | 1, 2, 3 | 1 = Female (58%), 2 = Male (39%), 3 = Non-binary / Other (3%) |
| moral_disengagement | numeric | 1.0 – 5.0 | Composite scale score. Bandura's Mechanisms of Moral Disengagement Scale (8 items). Higher = greater disengagement. M = 2.81, SD = 0.60 |
| cyber_aggression | numeric | 1.0 – 5.0 | Composite scale score. Cyber-Aggression Typology Questionnaire. Higher = greater aggression. M = 3.08, SD = 0.65 |
| ai_trust | numeric | 1.0 – 5.0 | Composite scale score. AI Trust Scale (5 items). Higher = greater trust in AI systems. M = 3.37, SD = 0.68 |
| ai_use | numeric | 1.0 – 5.0 | Composite scale score. AI Use Frequency Scale (4 items). Higher = more frequent AI use. M = 3.19, SD = 0.71 |
Download
Two files. Save both to the same folder on your machine, then set your R working directory to that folder before loading the CSV.
R Setup
If you haven't used R before, install R and RStudio first. Then install the required packages; you only need to do this once.
```r
# Install required packages (run once)
install.packages(c(
  "mice",    # multiple imputation
  "psych",   # descriptive stats, alpha
  "ggplot2", # visualisation
  "dplyr"    # data wrangling
))
```
```r
# Load packages at the start of every session
library(mice)
library(psych)
library(ggplot2)
library(dplyr)
```
```r
# Point R at the folder containing the CSV
setwd("/path/to/your/folder")
```
Load & Explore
First steps: load the data and get a feel for it.
```r
# Load the dataset
df <- read.csv("uca_synthetic.csv")

# First look
head(df)  # first 6 rows
str(df)   # structure: variable types
dim(df)   # rows × columns: should be 167 × 7
```
Descriptive Statistics
```r
# Full descriptives: mean, SD, median, skew, kurtosis
describe(df[, c("moral_disengagement", "cyber_aggression", "ai_trust", "ai_use")])

# Check missing values
colSums(is.na(df))

# Visualise missing data pattern
md.pattern(df[, c("moral_disengagement", "cyber_aggression", "ai_trust", "ai_use")])
```
You should see means close to: MD = 2.81, CA = 3.08, AIT = 3.37, AIU = 3.19. The missing data pattern will show which combinations of variables have missing values simultaneously.
Multiple Imputation
Before running any analysis, handle the missing data properly using multiple imputation with the mice package. This creates m = 5 complete datasets; any analysis you run is then pooled across all five using Rubin's Rules.
1. Create imputed datasets. The `mice()` function runs the imputation. `method = "pmm"` is predictive mean matching: it replaces missing values with plausible observed values from similar participants.
2. Run your analysis on each dataset. Use `with(imp, ...)` to apply a model to all 5 imputed datasets automatically.
3. Pool the results. `pool()` combines the 5 sets of estimates using Rubin's Rules, producing a single set of coefficients with correctly inflated standard errors.
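To see what the pooling step actually computes, here is a minimal base-R sketch of Rubin's Rules. The coefficient and standard-error values below are made up for illustration; they stand in for the per-imputation estimates that `pool()` extracts from the five fitted models.

```r
# Minimal sketch of Rubin's Rules (what pool() does internally).
# est and se are hypothetical estimates/SEs for one predictor
# from m = 5 regressions, one per imputed dataset.
est <- c(0.52, 0.49, 0.55, 0.50, 0.53)
se  <- c(0.08, 0.09, 0.08, 0.09, 0.08)
m   <- length(est)

q_bar <- mean(est)          # pooled estimate: mean of the m estimates
w     <- mean(se^2)         # within-imputation variance
b     <- var(est)           # between-imputation variance
t_var <- w + (1 + 1/m) * b  # total variance: correctly inflated by b

pooled_se <- sqrt(t_var)
cat("Pooled estimate:", round(q_bar, 3),
    " Pooled SE:", round(pooled_se, 3), "\n")
```

Note that the pooled SE is always at least as large as the average within-imputation SE: the between-imputation term carries the extra uncertainty introduced by the missing data.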
```r
# Step 1: Create 5 imputed datasets
set.seed(42)  # for reproducibility
imp <- mice(
  df[, c("moral_disengagement", "cyber_aggression",
         "ai_trust", "ai_use", "age", "gender")],
  m = 5,             # number of imputed datasets
  method = "pmm",    # predictive mean matching
  printFlag = FALSE  # suppress iteration output
)

# Check imputation looks reasonable
densityplot(imp)  # imputed values should overlap observed

# Get one complete dataset for exploration
df_complete <- complete(imp, 1)
```
densityplot() shows the distribution of imputed values (magenta) overlaid on observed values (blue) for each variable. They should look similar; if the imputed values fall in a completely different range, something is wrong with the imputation model.
Hierarchical Regression
The key analysis. Three blocks entered sequentially. The critical question is whether Block 3 (AI factors) adds significant incremental variance above Block 2.
```r
# Run each block on all 5 imputed datasets
# Block 1: Demographics only
fit1 <- with(imp, lm(cyber_aggression ~ age + gender))

# Block 2: + Moral Disengagement
fit2 <- with(imp, lm(cyber_aggression ~ age + gender + moral_disengagement))

# Block 3: + AI Trust and AI Use
fit3 <- with(imp, lm(cyber_aggression ~ age + gender + moral_disengagement +
                       ai_trust + ai_use))

# Pool results using Rubin's Rules
summary(pool(fit1))
summary(pool(fit2))
summary(pool(fit3))
```
```r
# R² and ΔR² using one complete dataset
m1 <- lm(cyber_aggression ~ age + gender, data = df_complete)
m2 <- lm(cyber_aggression ~ age + gender + moral_disengagement,
         data = df_complete)
m3 <- lm(cyber_aggression ~ age + gender + moral_disengagement +
           ai_trust + ai_use, data = df_complete)

cat("Block 1 R²:", round(summary(m1)$r.squared, 3), "\n")
cat("Block 2 R²:", round(summary(m2)$r.squared, 3),
    " ΔR²:", round(summary(m2)$r.squared - summary(m1)$r.squared, 3), "\n")
cat("Block 3 R²:", round(summary(m3)$r.squared, 3),
    " ΔR²:", round(summary(m3)$r.squared - summary(m2)$r.squared, 3), "\n")

# F-change test: does Block 3 add significantly?
anova(m2, m3)
```
t-test
Compare two groups on cyber-aggression. Here: male vs. female participants.
```r
# Independent samples t-test: gender × cyber-aggression
# Filter to male (2) and female (1) only
df_gender <- df_complete[df_complete$gender %in% c(1, 2), ]
t_result <- t.test(cyber_aggression ~ gender, data = df_gender)
print(t_result)

# Cohen's d (effect size)
male   <- df_gender$cyber_aggression[df_gender$gender == 2]
female <- df_gender$cyber_aggression[df_gender$gender == 1]
cohens_d <- (mean(male) - mean(female)) /
  sqrt((sd(male)^2 + sd(female)^2) / 2)
cat("Cohen's d =", round(cohens_d, 3), "\n")
```
Correlation
```r
# Correlation matrix with p-values
corr.test(df_complete[, c("moral_disengagement", "cyber_aggression",
                          "ai_trust", "ai_use")])

# Scatter plot: MD vs CA with regression line
ggplot(df_complete, aes(x = moral_disengagement, y = cyber_aggression)) +
  geom_point(alpha = 0.5, colour = "#0dcfb2") +
  geom_smooth(method = "lm", colour = "#f59e0b") +
  labs(x = "Moral Disengagement", y = "Cyber-Aggression",
       title = "Moral Disengagement → Cyber-Aggression") +
  theme_minimal()
```
Expected: r(MD, CA) ≈ .61 — a large positive correlation. r(AIT, CA) ≈ .15, r(AIU, CA) ≈ .19 — small, and likely non-significant at n = 167 after accounting for other variables.
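Why can a correlation that looks respectable in the matrix fade once other variables are accounted for? The sketch below illustrates the mechanism with simulated toy data (base R only; the coefficients and variable names are illustrative, not the dataset's actual values): when a shared predictor drives both variables, the partial correlation controlling for it shrinks toward zero.

```r
# Toy illustration: a zero-order correlation shrinking once a shared
# driver is controlled. Effect sizes here are made up for the demo.
set.seed(1)
n   <- 167
md  <- rnorm(n)             # shared driver (think: moral disengagement)
ca  <- 0.6 * md + rnorm(n)  # outcome (think: cyber-aggression)
ait <- 0.3 * md + rnorm(n)  # second predictor (think: AI trust)

r_ca_ait <- cor(ca, ait)    # zero-order correlation (inflated by md)
r_ca_md  <- cor(ca, md)
r_ait_md <- cor(ait, md)

# First-order partial correlation: ca–ait controlling for md
r_partial <- (r_ca_ait - r_ca_md * r_ait_md) /
  sqrt((1 - r_ca_md^2) * (1 - r_ait_md^2))

cat("Zero-order r:", round(r_ca_ait, 3),
    " Partial r (md controlled):", round(r_partial, 3), "\n")
```

The psych package also offers partial.r() for the same computation directly on a data frame, if you prefer not to apply the formula by hand.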
Full Script
The complete annotated R script runs all of the above in sequence. Download it and open it in RStudio; it is designed to be read top to bottom alongside the textbook.