Research Methods
The Experimental Method
What Is an Experiment?
An experiment is a research method in which the researcher manipulates one variable (the independent variable) and measures its effect on another variable (the dependent variable) while holding all other variables constant. The experimental method is the only research design that can establish cause-and-effect relationships.
Types of Experiments
Laboratory experiments: conducted in a controlled, artificial environment (e.g., a psychology laboratory). The researcher has precise control over extraneous variables, allowing high internal validity. However, the artificial setting may reduce ecological validity (the findings may not generalise to real-life settings). Participants may exhibit demand characteristics (guessing the hypothesis and altering their behaviour accordingly).
Field experiments: conducted in a natural environment (e.g., a school, a workplace, a street) where the independent variable is still manipulated by the researcher. Participants are often unaware that they are participating in an experiment, reducing demand characteristics and increasing ecological validity. However, the researcher has less control over extraneous variables, potentially reducing internal validity.
Natural experiments: the independent variable is not manipulated by the researcher but varies naturally (e.g., comparing the mental health of people before and after a natural disaster, or studying children raised in different institutional settings). The researcher takes advantage of a naturally occurring situation. This method can study variables that would be unethical to manipulate (e.g., the effects of brain damage, institutional deprivation). However, the lack of manipulation means that causal conclusions are weaker (confounding variables cannot be controlled).
Quasi-experiments: similar to experiments but participants are not randomly assigned to conditions. Instead, participants are assigned based on existing characteristics (e.g., gender, age, diagnosis). Quasi-experiments are used when random allocation is impractical or unethical. However, the lack of randomisation means that pre-existing differences between groups may confound the results.
Variables
Independent Variable (IV)
The variable that the researcher deliberately manipulates or changes. In a well-designed experiment, the IV is the only systematic difference between conditions.
Dependent Variable (DV)
The variable that the researcher measures to see if it is affected by the IV. The DV must be operationalised (defined in terms of how it will be measured).
Operationalising Variables
Operationalisation is the process of defining a variable in precise, measurable terms. For example, "memory" might be operationalised as "the number of words correctly recalled from a list of words after a fixed delay."
Good operationalisation should be:
- Clear and precise
- Measurable and objective
- Replicable by other researchers
- Valid (measuring what it claims to measure)
Extraneous and Confounding Variables
Extraneous variables: any variable other than the IV that could affect the DV. If not controlled, extraneous variables become confounding variables.
Confounding variables: variables that change systematically with the IV, making it impossible to determine whether changes in the DV are caused by the IV or the confounding variable. A confounding variable threatens the internal validity of the experiment.
Types of extraneous variables:
- Participant variables: individual differences between participants (age, gender, intelligence, mood, prior experience)
- Situational variables: environmental factors (noise, temperature, lighting, time of day)
- Experimenter variables: characteristics of the researcher (age, gender, tone of voice, expectations) that may influence participant behaviour
Controlling Extraneous Variables
- Random allocation: assigning participants to conditions randomly ensures that participant variables are distributed evenly across conditions
- Standardised procedures: keeping all aspects of the procedure identical for all participants (instructions, timing, environment)
- Counterbalancing: in repeated measures designs, alternating the order of conditions across participants to control for order effects (practice, fatigue, boredom)
- Experimenter standardisation: using a script for instructions; using double-blind procedures (neither the participant nor the experimenter knows which condition the participant is in)
- Matching: pairing participants in different conditions based on relevant characteristics (e.g., age, IQ) to control for participant variables
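Two of these controls, random allocation and counterbalancing, are procedural enough to sketch in code. The following is an illustrative Python sketch (the function names are our own, not from any standard library); it assumes a simple two-condition design.

```python
import random

def random_allocation(participants, conditions=("A", "B"), seed=None):
    """Shuffle participants, then deal them round-robin into conditions,
    so participant variables are distributed across conditions by chance."""
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)
    groups = {c: [] for c in conditions}
    for i, p in enumerate(pool):
        groups[conditions[i % len(conditions)]].append(p)
    return groups

def counterbalance(conditions=("A", "B")):
    """ABBA counterbalancing for a two-condition repeated measures design:
    half the participants complete A then B, the other half B then A,
    so order effects cancel out across the sample."""
    return [list(conditions), list(reversed(conditions))]
```

Random allocation controls participant variables between groups; counterbalancing controls order effects within participants, which is why the two techniques pair with different experimental designs.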
Experimental Designs
Independent Groups Design
Different participants are assigned to each condition of the experiment.
Advantages: no order effects (each participant experiences only one condition); less chance of demand characteristics (participants cannot compare conditions).
Disadvantages: requires more participants; participant variables may differ between groups, threatening internal validity (mitigated by random allocation).
Repeated Measures Design
Each participant participates in all conditions of the experiment.
Advantages: requires fewer participants; participant variables are controlled (each participant serves as their own control), increasing internal validity.
Disadvantages: order effects (practice effects, fatigue effects, boredom) may confound the results (mitigated by counterbalancing); demand characteristics may increase as participants experience all conditions.
Matched Pairs Design
Participants are paired based on relevant characteristics (e.g., age, gender, IQ), and one member of each pair is assigned to each condition.
Advantages: controls for participant variables without the order effects of repeated measures; reduces individual differences between conditions.
Disadvantages: time-consuming and difficult to match participants effectively; matching is only as good as the variables chosen for matching; if a participant drops out, their pair's data may be unusable.
Sampling Techniques
Random Sampling
Every member of the target population has an equal chance of being selected. This can be achieved using a random number generator or drawing names from a hat. Provides the most representative sample and minimises sampling bias, but may be impractical for large populations and requires access to the full population.
Systematic Sampling
Every nth person from a list of the target population is selected (e.g., every 10th name from a school register). Simpler than random sampling but may produce a biased sample if the list has a periodic pattern.
Stratified Sampling
The population is divided into subgroups (strata) based on relevant characteristics (e.g., age, gender, ethnicity), and participants are randomly selected from each stratum in proportion to their representation in the population. Ensures the sample is representative of the population on key variables, but requires knowledge of the population's composition and is time-consuming.
Opportunity Sampling
Participants are selected from whoever is readily available (e.g., approaching people in a shopping centre). Quick, convenient, and inexpensive, but produces a biased sample that may not represent the target population.
Volunteer (Self-Selected) Sampling
Participants volunteer in response to an advertisement or invitation. Convenient for the researcher but produces a biased sample (volunteers may differ from non-volunteers in motivation, personality, or social desirability). This is the most common sampling method in psychology research.
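The first three sampling techniques above are mechanical enough to express in code. This is an illustrative Python sketch (function names are our own); note that the stratified quotas are rounded, so totals can drift slightly from n for awkward proportions.

```python
import random
from collections import defaultdict

def random_sample(population, n, seed=None):
    """Simple random sampling: every member has an equal chance of selection."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, k, start=0):
    """Systematic sampling: select every kth member of the sampling frame."""
    return population[start::k]

def stratified_sample(population, stratum_of, n, seed=None):
    """Stratified sampling: randomly sample within each stratum,
    in proportion to the stratum's share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        quota = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample
```

Opportunity and volunteer sampling have no selection algorithm at all, which is precisely why they are prone to bias: the sample is determined by availability or self-selection rather than by a defined procedure.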
Ethical Issues
Key Ethical Principles
Informed consent: participants should be given sufficient information about the study to make an informed decision about whether to participate. When full disclosure would invalidate the study (e.g., deception studies), participants should give presumptive consent (consent based on general information about the nature of the study) and be fully debriefed afterwards.
Deception: misleading participants about the true purpose or nature of the study. Deception should only be used when there is no alternative, when the scientific value of the study justifies it, and when participants are debriefed as soon as possible.
Right to withdraw: participants should be informed that they can withdraw from the study at any time without penalty. This is especially important in studies involving deception, stress, or discomfort.
Protection from harm: participants should not be exposed to physical or psychological harm that is greater than they would encounter in their daily lives. If harm is possible, the researcher must take steps to minimise it and provide appropriate support.
Confidentiality: participants' personal information and data should be kept confidential. Data should be anonymised (identifying information removed) before publication.
Debriefing: after the study, participants should be fully informed about the true purpose, given the opportunity to ask questions, and provided with support if they experienced distress. Debriefing should restore the participant to the state they were in before the study.
The British Psychological Society (BPS) Code of Ethics
The BPS provides ethical guidelines for psychological research. Researchers are expected to follow these guidelines and obtain ethical approval from an institutional ethics committee before conducting research.
Data Analysis
Measures of Central Tendency
- Mean: the arithmetic average. Calculated by summing all values and dividing by the number of values. Uses all data points but is affected by extreme scores (outliers).
- Median: the middle value when data are arranged in order. Unaffected by outliers but does not use all data points.
- Mode: the most frequently occurring value. Useful for categorical data but may not be unique (bimodal, multimodal distributions).
Measures of Dispersion
- Range: the difference between the highest and lowest values. Simple but heavily affected by outliers.
- Standard deviation: a measure of the average distance of each data point from the mean. More informative than the range because it uses all data points, though it is most meaningful when the data are roughly normally distributed.
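Python's standard library computes all of these directly, which makes the effect of an outlier easy to see. The scores below are hypothetical, chosen so that one extreme value (45) inflates the mean and range while leaving the median and mode untouched.

```python
import statistics

# hypothetical memory-test scores; 45 is an outlier
scores = [12, 15, 15, 16, 18, 19, 45]

mean = statistics.mean(scores)          # pulled upwards by the outlier
median = statistics.median(scores)      # resistant to the outlier
mode = statistics.mode(scores)          # most frequent value
data_range = max(scores) - min(scores)  # heavily affected by the outlier
sd = statistics.stdev(scores)           # sample SD (n - 1 denominator)
```

Here the mean (20.0) exceeds every score except the outlier, while the median (16) still sits in the middle of the typical scores, illustrating why the median is preferred for skewed data.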
Types of Data
- Nominal: categories with no intrinsic order (e.g., gender, eye colour). Mode is the appropriate measure of central tendency.
- Ordinal: categories with a meaningful order but without equal intervals between values (e.g., Likert scales, rankings). Median is the appropriate measure.
- Interval: numerical data with equal intervals but no true zero (e.g., temperature in Celsius). Mean and standard deviation are appropriate.
- Ratio: numerical data with equal intervals and a true zero (e.g., height, weight, reaction time). Mean and standard deviation are appropriate.
Correlations
A correlation is a statistical technique for measuring the strength and direction of the relationship between two variables. A correlation coefficient (r) ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no relationship.
Important: correlation does not imply causation. A correlation between two variables does not mean that one causes the other; a third variable (a confounding variable) may be responsible for the observed relationship.
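Pearson's r can be computed from its definition (covariance of the two variables divided by the product of their standard deviations). A minimal pure-Python sketch, assuming paired interval/ratio data of equal length:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient: the covariance of x and y
    divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

The coefficient describes association only; a strong r between, say, revision hours and exam scores could still be produced by a third variable such as motivation.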
Statistical Tests
The choice of statistical test depends on:
- Whether the design is independent groups, repeated measures, or correlational
- Whether the data are nominal, ordinal, or interval/ratio
- Whether the data are normally distributed (parametric) or not (non-parametric)
| Test | Design | Data type | Purpose |
|---|---|---|---|
| Mann-Whitney U | Independent groups | Ordinal | Test for difference between two groups |
| Wilcoxon signed-ranks | Repeated measures | Ordinal | Test for difference between two conditions |
| Spearman's rho | Correlational | Ordinal | Test for correlation |
| Chi-squared | Independent groups | Nominal | Test for association between two variables |
| Unrelated t-test | Independent groups | Interval/ratio (normal) | Test for difference between two groups |
| Related t-test | Repeated measures | Interval/ratio (normal) | Test for difference between two conditions |
| Pearson's r | Correlational | Interval/ratio (normal) | Test for correlation |
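The decision rules behind the table can be written as a small lookup function. This is an illustrative sketch of the table above, simplified to two-condition studies (it treats chi-squared as the independent-groups test for nominal data, as the table does):

```python
def choose_test(design, data_type, parametric=False):
    """Return the appropriate statistical test for a two-condition study,
    following the standard design / data-type decision rules."""
    if data_type == "nominal":
        return "Chi-squared"
    if data_type == "interval/ratio" and parametric:
        return {"independent groups": "Unrelated t-test",
                "repeated measures": "Related t-test",
                "correlational": "Pearson's r"}[design]
    # ordinal data, or interval/ratio data failing parametric assumptions
    return {"independent groups": "Mann-Whitney U",
            "repeated measures": "Wilcoxon signed-ranks",
            "correlational": "Spearman's rho"}[design]
```

Note how the non-parametric tests serve double duty: they are used both for ordinal data and for interval/ratio data that are not normally distributed.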
Significance and Probability
A result is considered statistically significant if the probability (p) of obtaining the observed result (or a more extreme one) under the null hypothesis is less than the chosen significance level (typically 0.05, meaning there is less than a 5% probability that the result occurred by chance).
Null hypothesis (H0): there is no significant difference or correlation. Alternative hypothesis (H1): there is a significant difference or correlation (directional or non-directional).
Type I error: rejecting the null hypothesis when it is true (false positive). The probability of a Type I error equals the significance level (e.g., 0.05).
Type II error: failing to reject the null hypothesis when it is false (false negative). The probability of a Type II error is influenced by sample size, effect size, and the significance level.
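The logic of a p-value can be made concrete with the simplest exact test: a two-tailed sign test, where the null hypothesis says each outcome is a fair 50/50 coin flip. This is an illustrative sketch (the function name is our own):

```python
from math import comb

def sign_test_p(successes, n):
    """Two-tailed exact p-value under H0 that each of the n outcomes
    is an independent 50/50 coin flip: the probability of a result
    at least as extreme as the one observed."""
    k = max(successes, n - successes)          # the more extreme tail
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)
```

Nine successes out of ten gives p ≈ 0.021, below the conventional 0.05 level, so H0 would be rejected; six out of ten gives p ≈ 0.75, entirely consistent with chance. Rejecting H0 at p < 0.05 still carries a 5% risk of a Type I error.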
Reliability and Validity
Reliability
Reliability is the consistency of a measurement. A reliable measure produces the same results under the same conditions on different occasions.
Types of reliability:
- Test-retest reliability: administering the same test to the same participants on two occasions and calculating the correlation between scores. A high correlation indicates good test-retest reliability.
- Inter-rater reliability: the degree of agreement between two or more independent raters or observers. Measured using a correlation coefficient or Cohen's kappa.
- Internal reliability: the consistency of items within a test (do all items measure the same construct?). Assessed using split-half reliability or Cronbach's alpha.
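Cohen's kappa, mentioned above for inter-rater reliability, corrects raw percentage agreement for the agreement two raters would reach by chance alone. A minimal pure-Python sketch, assuming two equal-length lists of categorical ratings (and raters who do not agree purely by chance on every item, which would make the denominator zero):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters,
    corrected for the agreement expected by chance."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why it is preferred over raw percentage agreement for categorical ratings.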
Improving reliability:
- Standardised procedures (instructions, equipment, timing)
- Training observers to ensure consistent rating
- Pilot studies to identify and correct problems
- Larger sample sizes (reduce the influence of random variation)
Validity
Validity is the extent to which a test or measurement accurately measures what it claims to measure.
Types of validity:
- Internal validity: the extent to which a study measures what it intends to measure, free from the influence of confounding variables. Threatened by extraneous variables, demand characteristics, and investigator effects.
- External validity: the extent to which findings can be generalised beyond the specific study.
- Ecological validity: the extent to which findings generalise to real-life settings (a type of external validity).
- Population validity: the extent to which findings generalise to other populations.
- Temporal validity: the extent to which findings hold over time.
- Face validity: whether a test appears to measure what it claims to measure (superficial assessment).
- Concurrent validity: whether the results of a new test correlate with an established measure of the same construct.
- Construct validity: whether a test measures the theoretical construct it is designed to measure.
Improving validity:
- Controlling extraneous variables (increases internal validity)
- Using naturalistic or field settings (increases ecological validity)
- Using representative sampling (increases population validity)
- Operationalising variables carefully
- Using multiple measures (triangulation)
Common Pitfalls
- Confusing the independent variable with the dependent variable. The IV is manipulated; the DV is measured.
- Confusing reliability with validity. A measure can be reliable without being valid (consistently measuring the wrong thing), and valid measures should also be reliable.
- Stating that a correlation proves causation. Correlation indicates association, not causation; a third variable may explain the relationship.
- Confusing a quasi-experiment with a true experiment. In a quasi-experiment, participants are not randomly allocated to conditions, so causal conclusions are weaker.
- Choosing the wrong statistical test. The choice depends on the experimental design, the type of data, and whether the data are normally distributed.
- Confusing Type I and Type II errors. Type I = false positive (finding an effect that does not exist); Type II = false negative (failing to find an effect that does exist).
Practice Problems
Problem 1: Experimental Design Evaluation
A researcher wants to investigate whether listening to classical music improves exam performance. They recruit a sample of A-level students and randomly assign half to listen to Mozart for a set period before a maths test and half to sit in silence for the same period before the same test. Identify the experimental design, IV, DV, and evaluate the strengths and limitations.
Design: Independent groups design (different participants in each condition).
IV: whether participants listen to classical music or sit in silence before the test (two levels: music vs. silence).
DV: exam performance, operationalised as the score on the maths test.
Strengths of this design:
- No order effects: each participant takes the test only once, so there is no risk of practice or fatigue effects
- Random allocation distributes participant variables (prior mathematical ability, intelligence, motivation) evenly between conditions, controlling for individual differences
- Demand characteristics are reduced: participants in each condition are not aware of the other condition, so they cannot compare experiences
Limitations:
- Requires more participants than a repeated measures design, which may be impractical
- Individual differences between groups may still affect results despite random allocation (e.g., if, by chance, the music group contains more mathematically able students)
- Participants in the music condition may guess the hypothesis (that music improves performance) and alter their effort accordingly (demand characteristics)
Control of extraneous variables: the researcher should standardise the test (same questions, same time limit), the environment (same room, time of day), and the instructions. A double-blind procedure would further reduce bias, though it may be impractical here.
Problem 2: Statistical Test Selection
For each of the following scenarios, identify the appropriate statistical test and justify your choice:
(a) A researcher wants to know if there is a relationship between hours of revision and exam scores. (b) A researcher compares anxiety scores (measured on a 10-point scale) before and after a relaxation intervention for the same participants. (c) A researcher investigates whether there is an association between gender (male/female) and career choice (science/arts/humanities). (d) A researcher compares reaction times (in milliseconds) between a group of gamers and a group of non-gamers.
(a) Pearson's r (or Spearman's rho if data are not normally distributed). This is a correlational design (relationship between two variables). The data are interval/ratio (hours of revision and exam scores). If both variables are normally distributed, Pearson's r is appropriate; otherwise, Spearman's rho.
(b) Wilcoxon signed-ranks test. This is a repeated measures design (same participants before and after). The data are ordinal (a 10-point rating scale is ordinal, not interval). The Wilcoxon test is the appropriate non-parametric test for this design and data type.
(c) Chi-squared test. This is a test of association between two nominal variables (gender: male/female; career choice: science/arts/humanities). Both variables are categorical (nominal data). Chi-squared tests for association between nominal variables in an independent groups design.
(d) Unrelated t-test (or Mann-Whitney U if data are not normally distributed). This is an independent groups design (gamers vs. non-gamers). The data are interval/ratio (reaction time in milliseconds). If the data are normally distributed and variances are equal, the unrelated t-test is appropriate; otherwise, Mann-Whitney U.
Problem 3: Ethics Scenario Analysis
A researcher plans to conduct a study on conformity by having confederates give obviously wrong answers in a classroom setting, recording whether the participant conforms. The participant will not be told the true purpose of the study until afterwards. Evaluate the ethical issues.
Deception: the participant is deceived about the true purpose (they believe they are in a genuine group task) and about the identity of the confederates (they believe they are real participants). Deception is significant but may be justified if there is no feasible way to study conformity without it (as Asch argued). The researcher must provide a thorough debriefing after the study.
Informed consent: the participant cannot give fully informed consent because they do not know the true nature of the study. Presumptive consent (consent based on general information) should be obtained. The researcher should explain that the study involves group decision-making and that some aspects will be explained afterwards.
Right to withdraw: the participant should be informed that they can leave at any time. However, in a classroom setting, the participant may feel social pressure to stay, making the right to withdraw difficult to exercise. The researcher should make clear that withdrawal carries no penalty and should monitor for signs of distress.
Protection from harm: the participant may experience mild stress, embarrassment, or self-doubt upon realising they conformed to an obviously wrong answer. The researcher should minimise distress by debriefing immediately, explaining that conformity is a normal and widespread response, and providing reassurance.
Debriefing: this is essential. The participant must be fully informed about the purpose of the study, the use of confederates, and the nature of conformity. They should be given the opportunity to withdraw their data and to ask questions. The debriefing should be conducted sensitively to restore the participant's confidence.
Ethical approval: the study should be submitted to an institutional ethics committee before being conducted.
Problem 4: Reliability and Validity
A psychologist develops a new questionnaire to measure social anxiety. Describe how they could assess the reliability and validity of the questionnaire.
Assessing reliability:
- Test-retest reliability: administer the questionnaire to the same group of participants on two occasions (e.g., two weeks apart) and calculate the correlation between the two sets of scores. A high positive correlation (conventionally around +.80 or above) indicates good test-retest reliability, meaning the questionnaire produces consistent results over time.
- Inter-rater reliability: if the questionnaire involves any subjective scoring (e.g., open-ended items rated by judges), have two independent raters score the responses and calculate the agreement between them (using Cohen's kappa or a correlation coefficient).
- Internal reliability: use split-half reliability (divide the questionnaire into two halves and correlate the scores on each half) or Cronbach's alpha to assess whether all items are measuring the same construct. A Cronbach's alpha of about 0.7 or above is conventionally taken to indicate good internal reliability.
Assessing validity:
- Face validity: ask a panel of experts (psychologists, mental health professionals) to review the questionnaire and assess whether it appears to measure social anxiety.
- Concurrent validity: administer the new questionnaire alongside an established, validated measure of social anxiety (e.g., the Liebowitz Social Anxiety Scale) to the same participants. A high correlation between the two measures supports concurrent validity.
- Construct validity: test whether the questionnaire produces results consistent with theoretical predictions about social anxiety. For example, if the questionnaire measures social anxiety, it should correlate with other measures of anxiety but not with unrelated constructs (e.g., intelligence).
- Ecological validity: assess whether scores on the questionnaire predict real-world social anxiety behaviours (e.g., observed social interactions, avoidance of social situations).
Problem 5: Interpreting Results
A researcher conducts a Mann-Whitney U test to compare stress scores between two groups (intervention vs. control). The calculated U value is smaller than the critical value for the two sample sizes at the 0.05 significance level (two-tailed). Explain what this means.
The calculated U value is less than the critical value. In the Mann-Whitney U test, when the calculated value is equal to or less than the critical value, the result is statistically significant.
This means:
- The null hypothesis is rejected: there is a statistically significant difference in stress scores between the intervention group and the control group (p < 0.05).
- The probability is less than 0.05: the probability of obtaining this result (or a more extreme one) if there were no real difference between the groups is less than 5%. In other words, the observed difference is unlikely to be due to chance alone.
- The alternative hypothesis is accepted: the intervention had a statistically significant effect on stress scores.
- Practical significance: the researcher should also report the effect size and the descriptive statistics (mean ranks or medians) to indicate the magnitude and direction of the difference. Statistical significance does not imply practical importance: a large sample can produce a statistically significant result even for a very small, practically meaningless effect.
- Limitations: as a non-parametric test, the Mann-Whitney U test does not assume a normal distribution but has less statistical power than the parametric equivalent (the unrelated t-test). The study should report the effect size (e.g., rank-biserial correlation) to quantify the strength of the difference.
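The U statistic itself has a simple definition that can be sketched directly: for each pair of scores (one from each group), count a win for group A if its score is higher, half a win for a tie, then take the smaller of the two groups' totals. An illustrative pure-Python sketch (the function name is our own):

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for the Mann-Whitney test: count, over all
    (a, b) pairs, how often a beats b (ties count as half), then
    return the smaller of the two groups' totals. The smaller U is
    compared against the critical value; U at or below the critical
    value is significant."""
    u_a = sum((a > b) + 0.5 * (a == b)
              for a in group_a for b in group_b)
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)
```

A U of 0 means the two groups do not overlap at all (every score in one group beats every score in the other), which is why smaller U values indicate larger differences.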