Factors and numeric coding in AMCP

Why AMCP codes grouping variables as numbers

The data sets in AMCP are entered to match the printed tables in Maxwell, Delaney, and Kelley, Designing Experiments and Analyzing Data: A Model Comparison Perspective (4th ed.). The package exists so that readers can reproduce the book’s worked examples exactly; the data therefore keep the book’s coding rather than being “modernized.” In particular, grouping variables are usually stored as integer codes (for example 1/2, or 0/1) rather than as factors with text labels.

That is a feature, not an oversight: it keeps each data set identical to the corresponding table in the book. But for your own analysis you will often want human-readable labels, and for any variable with three or more groups you must tell R that the codes are categories and not a quantity. This vignette shows how to add labels safely, without changing the canonical data and without breaking the book’s results.

The golden rule:

Convert a copy of the column to a factor in your own workspace. Never let integer codes stand in for categories in a model with three or more groups, and always confirm which code means which condition from the data set’s help page and the book.

The two-step recipe

my_data <- chapter_X_table_Y                 # take a copy
my_data$g <- factor(my_data$g,
                    levels = c(1, 2),        # the codes, in the order you want
                    labels = c("Label 1", "Label 2"))

levels lists the numeric codes (controlling the reference level and the order R uses); labels gives the text shown in output. If you omit labels, the factor simply uses the codes as its labels ("1", "2", …), which is already enough to make a variable categorical.
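
To see what levels and labels each control, the following sketch uses a made-up vector of codes (the names codes, f_default, f_labeled, and f_missing, and the label text, are illustrative and not part of AMCP). It also shows a useful safety property: any code not listed in levels becomes NA, so a mistyped code is easy to detect.

```r
# Hypothetical codes for illustration; not taken from any AMCP data set.
codes <- c(2, 1, 2, 1, 1)

# No 'levels' or 'labels': the levels are the sorted codes, stored as text.
f_default <- factor(codes)
levels(f_default)
#> [1] "1" "2"

# 'levels' sets the order (the first level is the reference level under
# R's default treatment contrasts); 'labels' sets the displayed text.
f_labeled <- factor(codes, levels = c(2, 1),
                    labels = c("Treatment", "Control"))
levels(f_labeled)
#> [1] "Treatment" "Control"

# A code that is missing from 'levels' becomes NA, not a silent new group.
f_missing <- factor(c(1, 2, 3), levels = c(1, 2))
anyNA(f_missing)
#> [1] TRUE
```

Running anyNA() on a freshly labeled column is therefore a cheap way to confirm that every code was accounted for.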

A worked example with known labels

A case where the coding is documented is the teacher-expectancy (“Pygmalion”) data in chapter_9_exercise_15. Here Treatment is coded 0 for the control children and 1 for the children who were (at random) identified to their teachers as likely “intellectual bloomers.”

library(AMCP)                           # the data sets live in this package
data(chapter_9_exercise_15)
pyg <- chapter_9_exercise_15            # work on a copy
table(pyg$Treatment)                    # 246 control, 64 "bloomer"
#> 
#>   0   1 
#> 246  64

pyg$group <- factor(pyg$Treatment,
                    levels = c(0, 1),
                    labels = c("Control", "Bloomer"))
table(pyg$group)
#> 
#> Control Bloomer 
#>     246      64

Crucially, labeling does not change the analysis. For a two-group comparison the integer-coded fit and the labeled-factor fit give the identical omnibus test:

f_numeric <- anova(lm(IQGain ~ Treatment, data = pyg))
f_factor  <- anova(lm(IQGain ~ group,     data = pyg))

c(F_numeric = f_numeric[["F value"]][1],
  F_factor  = f_factor[["F value"]][1])
#> F_numeric  F_factor 
#>  3.763069  3.763069

This “does my relabeling reproduce the original?” check is the safe way to confirm that a tidy-up has not silently altered a result. (DMAR ships this same data set, already labeled in exactly this way, as the pygmalion data set; see vignette("pygmalion", package = "DMAR").)
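
The same check can be wrapped in a small helper. The sketch below is an illustration under stated assumptions, not part of AMCP or DMAR: check_relabel is a hypothetical function name, the label "Treated" is invented, and the data are simulated so the block runs without either package. It first cross-tabulates the codes against the new labels, then compares the two F statistics with all.equal().

```r
# Simulated two-group data (hypothetical; stands in for an AMCP data set).
set.seed(1)
sim <- data.frame(code = rep(c(0, 1), times = c(12, 8)),
                  y    = rnorm(20))
sim$group <- factor(sim$code, levels = c(0, 1),
                    labels = c("Control", "Treated"))

# Each code should map to exactly one label, so the off-diagonal
# cells of this table should all be 0.
table(code = sim$code, label = sim$group)

# Hypothetical helper: TRUE when the numeric-coded and factor-coded fits
# give the same omnibus F. Only meaningful for a two-group variable.
check_relabel <- function(data, response, code, label) {
  f_of <- function(predictor) {
    fit <- lm(reformulate(predictor, response = response), data = data)
    anova(fit)[["F value"]][1]
  }
  isTRUE(all.equal(f_of(code), f_of(label)))
}

check_relabel(sim, "y", "code", "group")
#> [1] TRUE
```

Using all.equal() rather than == allows for floating-point differences between two fits that are mathematically the same.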

With three or more groups, the coding matters

For two groups the numeric and factor versions agree, because a single slope across two codes fits the same two group means as a single indicator variable. With three or more groups they do not, because R treats a bare numeric column as a single quantitative predictor. Consider chapter_4_table_1, where cond is a four-group treatment code (1, 2, 3, 4) and bloodpr is the response.

data(chapter_4_table_1)
d4 <- chapter_4_table_1
sort(unique(d4$cond))
#> [1] 1 2 3 4

# WRONG for a grouping variable: 'cond' enters as one numeric slope
# (1 numerator df), testing only a linear trend across the codes.
anova(lm(bloodpr ~ cond, data = d4))
#> Analysis of Variance Table
#> 
#> Response: bloodpr
#>           Df  Sum Sq Mean Sq F value  Pr(>F)  
#> cond       1  240.87 240.868  3.7003 0.07036 .
#> Residuals 18 1171.68  65.093                  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# RIGHT: factor(cond) enters as four groups (3 numerator df), the
# one-way ANOVA the book intends.
anova(lm(bloodpr ~ factor(cond), data = d4))
#> Analysis of Variance Table
#> 
#> Response: bloodpr
#>              Df  Sum Sq Mean Sq F value Pr(>F)
#> factor(cond)  3  334.55 111.517  1.6552 0.2165
#> Residuals    16 1078.00  67.375

The two tables differ in both degrees of freedom (1 versus 3 numerator df) and F statistic (3.70 versus 1.66), so they test different hypotheses. Whenever a column is a group label, wrap it in factor() (or convert a copy, as above) before modeling. If you are unsure whether a numeric column is a group label or a real measurement, the help page and the corresponding table or exercise in the book are authoritative.
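
One way to see where the difference in degrees of freedom comes from is to compare the design matrices R builds. The sketch below uses made-up codes rather than AMCP data (the vectors two and four are illustrative): a numeric column always contributes one slope column, whereas a factor with k levels contributes k - 1 indicator columns.

```r
# Made-up group codes for illustration; not AMCP data.
two  <- rep(1:2, each = 5)   # two groups
four <- rep(1:4, each = 5)   # four groups

# Number of columns in the design matrix, including the intercept.
ncol(model.matrix(~ two))            # intercept + 1 slope
#> [1] 2
ncol(model.matrix(~ factor(two)))    # intercept + 1 indicator
#> [1] 2
ncol(model.matrix(~ four))           # intercept + 1 slope
#> [1] 2
ncol(model.matrix(~ factor(four)))   # intercept + 3 indicators
#> [1] 4
```

The four-group numeric version has a single column available to describe differences among four means, so it can capture only a linear trend in the codes.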

Finding the categorical variables in a data set

AMCP does not impose a single coding scheme across the book: some examples use 1/2, others use 0/1, and the meaning of each code follows the example it comes from. A quick way to spot columns that are probably categorical is to look for numeric columns with only a few distinct whole-number values:

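likely_factors <- function(df) {
  # TRUE for a numeric column with 2 to 6 distinct whole-number values
  looks_categorical <- function(x) {
    if (!is.numeric(x)) return(FALSE)
    u <- unique(x[!is.na(x)])            # distinct non-missing values
    # Require at least 2 values so all-NA and constant columns are skipped
    length(u) >= 2 && length(u) <= 6 && all(u == floor(u))
  }
  names(df)[vapply(df, looks_categorical, logical(1))]
}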

likely_factors(chapter_7_table_16)   # e.g., Sex, Education
#> [1] "Sex"       "Education"
likely_factors(chapter_16_table_1)   # e.g., Trainee, Gender
#> [1] "Trainee" "Gender"

This is a heuristic, not a guarantee: a genuine count variable can also have only a few whole-number values, and a categorical variable can have more than six levels. Always confirm a variable’s role, and the meaning of its codes, from its help page (for example ?chapter_7_table_16) and from the book before assigning labels. Because the codes carry meaning that this package cannot infer for you, AMCP intentionally does not guess label text on your behalf.
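
Both failure modes are easy to reproduce. The sketch below applies a few-whole-values rule of thumb inline to a made-up data frame (toy, its column names, and the roles assigned to them are hypothetical, not AMCP data): a sibling count is flagged even though it is a quantity, and a 12-level site code is missed even though it is categorical.

```r
# Made-up data for illustration; the column roles are assumptions.
toy <- data.frame(
  group    = rep(1:2, times = 6),       # categorical, 2 codes
  siblings = rep(0:3, times = 3),       # a genuine count, 4 values
  site     = 1:12,                      # categorical, 12 codes
  score    = seq(10.5, 21.5, by = 1)    # a measurement
)

# Rule of thumb: numeric, with at most 6 distinct whole-number values.
few_whole <- vapply(toy, function(x) {
  u <- unique(x[!is.na(x)])
  is.numeric(x) && length(u) <= 6 && all(u == floor(u))
}, logical(1))

names(toy)[few_whole]
#> [1] "group"    "siblings"
```

Here siblings is a false positive and site is a false negative, which is why the flagged names are only a starting list to check against the documentation.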

Summary

  • The canonical AMCP data keep the book’s numeric coding so the examples reproduce exactly; nothing here changes that.
  • Add labels on a copy: factor(x, levels = , labels = ).
  • For two groups, numeric and factor fits agree; for three or more groups you must use factor() so the codes are treated as categories.
  • Confirm which code means which condition from the help page and the book before labeling.
  • Verify a relabeling by checking that it reproduces the original test statistic.