---
title: "Factors and numeric coding in AMCP"
author: "Ken Kelley"
date: "`r format(Sys.Date(), '%B %Y')`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Factors and numeric coding in AMCP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
library(AMCP)
```

## Why AMCP codes grouping variables as numbers

The data sets in AMCP are entered to match the printed tables in
Maxwell, Delaney, and Kelley, *Designing Experiments and Analyzing
Data: A Model Comparison Perspective* (4th ed.). The package exists so
that readers can reproduce the book's worked examples **exactly**. The
data sets therefore keep the book's coding rather than being
"modernized." In particular, grouping variables are usually stored as
integer codes (for example `1`/`2`, or `0`/`1`) rather than as
`factor`s with text labels.

That is a feature, not an oversight: it keeps each data set identical to
its table in the book. But for your own analysis you will often want human-readable
labels, and for any variable with **three or more** groups you must
tell R that the codes are categories and not a quantity. This vignette
shows how to add labels safely, without changing the canonical data and
without breaking the book's results.

The golden rule:

> Convert a *copy* of the column to a factor in your own workspace.
> Never let bare integer codes stand in for a grouping variable with
> three or more groups in a model, and always confirm which code means
> which condition from the data set's help page and the book.

## The two-line recipe

```r
my_data <- chapter_X_table_Y                 # take a copy
my_data$g <- factor(my_data$g,
                    levels = c(1, 2),        # the codes, in the order you want
                    labels = c("Label 1", "Label 2"))
```

`levels` lists the numeric codes (controlling the reference level and
the order R uses); `labels` gives the text shown in output. If you omit
`labels`, the factor simply uses the codes as its labels
(`"1"`, `"2"`, ...), which is already enough to make a variable
categorical.
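
As a minimal sketch of those two arguments, the toy vector below (made-up
codes and labels, not an AMCP data set) shows that the first entry of
`levels` becomes the reference level and that omitting `labels` leaves
the codes as labels. It also shows a trap worth knowing about: any code
that is missing from `levels` silently becomes `NA`.

```r
# Toy codes for illustration only; not taken from any AMCP data set.
codes <- c(1, 2, 2, 1, 2)

# No labels: the codes themselves become the labels.
f_plain <- factor(codes)
levels(f_plain)

# Hypothetical labels; listing 2 first makes it the reference level.
f_labeled <- factor(codes,
                    levels = c(2, 1),
                    labels = c("Treated", "Control"))
levels(f_labeled)
table(f_labeled)

# A code left out of `levels` is silently turned into NA.
f_partial <- factor(codes, levels = 1, labels = "Control")
sum(is.na(f_partial))
```

Checking `sum(is.na(...))` before and after a conversion is a cheap way
to catch a code that was left out of `levels` by mistake.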

## A worked example with known labels

A case where the coding is documented is the teacher-expectancy
("Pygmalion") data in `chapter_9_exercise_15`. Here `Treatment` is
coded `0` for the control children and `1` for the children who were
(at random) identified to their teachers as likely "intellectual
bloomers."

```{r}
data(chapter_9_exercise_15)
pyg <- chapter_9_exercise_15            # work on a copy
table(pyg$Treatment)                    # 246 control, 64 "bloomer"

pyg$group <- factor(pyg$Treatment,
                    levels = c(0, 1),
                    labels = c("Control", "Bloomer"))
table(pyg$group)
```

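Crucially, labeling does **not** change the analysis. For a two-group
comparison the integer-coded fit and the labeled-factor fit give an
identical omnibus *F* test, because a two-level factor enters the model
as a single indicator column, which is just a linear recoding of the
integer code: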

```{r}
f_numeric <- anova(lm(IQGain ~ Treatment, data = pyg))
f_factor  <- anova(lm(IQGain ~ group,     data = pyg))

c(F_numeric = f_numeric[["F value"]][1],
  F_factor  = f_factor[["F value"]][1])
```

This "does my relabeling reproduce the original?" check is the safe way
to confirm that a tidy-up has not silently altered a result. (DMAR
ships this same data set, already labeled in exactly this way, as the
`pygmalion` data set; see `vignette("pygmalion", package = "DMAR")`.)
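
One way to make that check routine is a small helper that compares the
omnibus *F* from two fits. The sketch below uses simulated data and a
hypothetical helper name, `same_omnibus_F()`; neither is part of AMCP.

```r
# Hypothetical helper (not part of AMCP): TRUE when two fitted lm objects
# have the same first-row F statistic, up to floating-point tolerance.
same_omnibus_F <- function(fit_a, fit_b) {
  f_a <- anova(fit_a)[["F value"]][1]
  f_b <- anova(fit_b)[["F value"]][1]
  isTRUE(all.equal(f_a, f_b))
}

# Simulated two-group data, for illustration only.
set.seed(1)
toy2 <- data.frame(code = rep(c(0, 1), each = 10))
toy2$y <- 5 + 2 * toy2$code + rnorm(20)
toy2$group <- factor(toy2$code,
                     levels = c(0, 1),
                     labels = c("Control", "Treated"))

same_omnibus_F(lm(y ~ code,  data = toy2),
               lm(y ~ group, data = toy2))
```

Comparing with `all.equal()` rather than `==` avoids a false alarm from
floating-point rounding between the two fits.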

## With three or more groups, the coding *matters*

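For two groups the numeric and factor versions agree, because both fit
a single contrast between the two groups. With three or more groups
they do **not**: R treats a bare numeric column as a single
quantitative predictor and fits one slope across the codes, whereas a
*k*-level factor needs *k* - 1 parameters. Consider `chapter_4_table_1`,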
where `cond` is a four-group treatment code (`1`, `2`, `3`, `4`) and
`bloodpr` is the response.

```{r}
data(chapter_4_table_1)
d4 <- chapter_4_table_1
sort(unique(d4$cond))

# WRONG for a grouping variable: 'cond' enters as one numeric slope
# (1 numerator df), testing only a linear trend across the codes.
anova(lm(bloodpr ~ cond, data = d4))

# RIGHT: factor(cond) enters as four groups (3 numerator df), the
# one-way ANOVA the book intends.
anova(lm(bloodpr ~ factor(cond), data = d4))
```

The two tables differ in numerator degrees of freedom (1 versus 3) and
in their *F* statistics. Whenever a column is a group label, wrap it in `factor()`
(or convert a copy, as above) before modeling. If you are unsure
whether a numeric column is a group label or a real measurement, the
help page and the corresponding table or exercise in the book are
authoritative.
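
The difference is easiest to see with made-up data in which the middle
group stands apart. In the sketch below (toy numbers, not from the
book), the group means are 10, 20, and 10, so there is no linear trend
across the codes `1`, `2`, `3` at all, yet the groups clearly differ.

```r
# Toy three-group data: means of 10, 20, 10 for codes 1, 2, 3.
toy3 <- data.frame(
  g = rep(1:3, each = 3),
  y = c(9, 10, 11, 19, 20, 21, 9, 10, 11)
)

a_numeric <- anova(lm(y ~ g,         data = toy3))  # one slope: 1 numerator df
a_factor  <- anova(lm(y ~ factor(g), data = toy3))  # three groups: 2 numerator df

a_numeric[["Df"]][1]
a_factor[["Df"]][1]

# The numeric fit sees essentially no effect; the factor fit sees a large one.
c(F_numeric = a_numeric[["F value"]][1],
  F_factor  = a_factor[["F value"]][1])
```

Here the slope-only fit reports an *F* of essentially zero, while the
one-way ANOVA reports *F* = 100 on 2 and 6 degrees of freedom. Real data
are rarely this extreme, but the same mechanism applies whenever the
group means do not fall on a straight line in the codes.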

## Finding the categorical variables in a data set

AMCP does not impose a single coding scheme across the book: some
examples use `1`/`2`, others use `0`/`1`, and the meaning of each code
follows the example it comes from. A quick way to spot columns that are
*probably* categorical is to look for numeric columns with only a few
distinct whole-number values:

```{r}
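likely_factors <- function(df) {
  # TRUE for a numeric column with 2 to 6 distinct whole-number values
  looks_categorical <- function(x) {
    if (!is.numeric(x)) return(FALSE)
    u <- unique(x[!is.na(x)])
    # length(u) >= 2 keeps all-NA and constant columns from being flagged
    length(u) >= 2 && length(u) <= 6 && all(u == floor(u))
  }
  names(df)[vapply(df, looks_categorical, logical(1))]
}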

likely_factors(chapter_7_table_16)   # e.g., Sex, Education
likely_factors(chapter_16_table_1)   # e.g., Gender, Trainee, Condition
```

This is a heuristic, not a guarantee (a genuine count variable can also
have only a few whole-number values, and a categorical variable can have
many levels). Always confirm a variable's role, and the meaning of its
codes, from its help
page (for example `?chapter_7_table_16`) and from the book before
assigning labels. Because the codes carry meaning that this package
cannot infer for you, AMCP intentionally does **not** guess label text
on your behalf.
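
Once the codes have been confirmed, a small guard can keep a typo in
your own codebook from silently producing `NA`s. The helper below,
`label_codes()`, is a hypothetical sketch and not an AMCP function; the
data frame and the label text are made up for illustration.

```r
# Hypothetical helper (not part of AMCP): convert a code vector to a labeled
# factor, stopping if any observed code is missing from the codebook.
# `codebook` is a named character vector: names are codes, values are labels.
label_codes <- function(x, codebook) {
  codes <- as.numeric(names(codebook))
  unmatched <- setdiff(unique(x[!is.na(x)]), codes)
  if (length(unmatched) > 0) {
    stop("codes with no label: ", paste(unmatched, collapse = ", "))
  }
  factor(x, levels = codes, labels = unname(codebook))
}

# Made-up data and labels, for illustration only.
toy_codes <- data.frame(g = c(1, 2, 2, 1), score = c(12, 15, 11, 14))
toy_labeled <- toy_codes                      # work on a copy
toy_labeled$g <- label_codes(toy_codes$g,
                             c("1" = "Group A", "2" = "Group B"))

table(toy_labeled$g)
is.numeric(toy_codes$g)   # the original column is untouched
```

Failing loudly on an unlabeled code is a design choice: plain
`factor()` would convert such a code to `NA` without any warning.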

## Summary

* The canonical AMCP data keep the book's numeric coding so the
  examples reproduce exactly; nothing here changes that.
* Add labels on a **copy**: `factor(x, levels = , labels = )`.
* For two groups, numeric and factor fits agree; for three or more
  groups you must use `factor()` so the codes are treated as
  categories.
* Confirm which code means which condition from the help page and the
  book before labeling.
* Verify a relabeling by checking that it reproduces the original test
  statistic.