Course Handout: Research Methods and Psychometrics
Copyright Notice: This material was
written and published in Wales by Derek J. Smith (Chartered Engineer). It forms
part of a multi-file e-learning resource and, subject only to acknowledging Derek
J. Smith's rights under international copyright law to be identified as author,
may be freely downloaded and printed off in single complete copies solely for
the purposes of private study and/or review. Commercial exploitation rights are
reserved. The remote hyperlinks have been selected for the academic appropriacy
of their contents; they were free of offensive and litigious content when
selected, and are periodically checked to ensure they have remained so. Copyright
© 2010, High Tower Consultants Limited.

First published online 14:00 BST 19th June 2006, Copyright Derek J.
Smith (Chartered Engineer). This version
[HT.1 – transfer of copyright] dated 18:00 14th January 2010
1  Introduction
This glossary is an alphabetically
sorted series of short cross-indexed definitions, cumulatively explaining how
the scientific method in general and research statistics in particular are
typically applied to psychological research. The cross-indexing has been done
in such a way that if the individual entries were to be loaded into a semantic
network they would produce a navigable encyclopaedia on the chosen subject.
There are also half a dozen Key Concept
definitions which are so pivotal to the entire study area that we need to deal
with them right now:
Key Concept – "Causation": One of the
fundamental notions of science is that of "causation",
the idea that some things just happen whilst others are contingent upon prior
events taking place or specified antecedent conditions being reached. We know
the former class of events as "random" or "chaotic", and the
latter as "regular" or "ordered". Science is thus about finding causation where previously we suspected
chaos.
Key Concept – "The Causal Line": What makes the
study of causation really challenging is the fact that when we inspect
processes more closely they often turn out to involve a succession of lesser
cause-effect events. Event A causes State B, which triggers Event C, and so on.
This sequencing of causes and effects is known as a "causal line", and causal lines have been formally
defined as "temporal series of events so related that, given some of them,
something can be inferred about the others whatever may be happening
elsewhere" (Russell, 1948, p459). The
principal scientific skill is accordingly that of unravelling causal lines, and
the payoff is that the resulting predictability gives us some semblance of
control over our world.
Key Concept – Variable: A variable is a
quality or attribute, physical or conceptual, which may exist in two or more
discrete states or at two or more intensities. It is thus a dimension of observation, and observation, as we
shall be explaining in Section 2, is another fundamental building block of the scientific method.
Key Concept – "Discrete" vs
"Continuous" Variables: "Variables are tricky things"
(Coolican, 1990, p12). To start with, they move in different ways, some jerkily
and some smoothly. Variables
which advance in integral steps are known as "discrete" variables. Examples: A runner’s
position in a race can be first, second, third, etc. but not fractional values
thereof. Discrete variables thus have a limit to their arithmetical precision,
that is to say, there is no point in measuring them to more places of decimals
than the steps themselves allow. On the other hand, there are many variables
which have no theoretical limit to their arithmetical precision, and can be
measured to as many places of decimals as you like. These are known as "continuous" variables. Examples:
Time, distance, and mass, or complexes thereof, such as velocity,
acceleration, force, and pressure.
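To make the distinction concrete, here is a minimal sketch in Python (the race figures are invented for illustration):

```python
# Hypothetical race data: finishing position is discrete, finishing time
# is continuous.
positions = [1, 2, 3, 4]                       # integral steps only; "1.5th place" is meaningless
times_seconds = [9.58, 9.63, 9.711, 9.7204]    # precision limited only by the stopwatch

# Discrete: every value sits exactly on an integer step.
print(all(isinstance(p, int) for p in positions))

# Continuous: values may differ by arbitrarily small fractions.
print(times_seconds[1] - times_seconds[0])
```

Note that measuring a runner's position "to three decimal places" would add nothing, whereas a finer stopwatch always adds genuine precision to the times.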
Key Concept – Independent Variable (IV): [Sometimes
"predictor variable".] Some variables also act upon other variables
within causal lines, causing the
latter to vary in turn. These are known as "independent" variables,
and IVs are important because they help us to understand particular causal
lines. Examples: (1) Gender is a two-state discrete
variable capable of differentially
determining an organism’s behaviour through a complex causal line which
includes a number of anatomical, physiological, and psychological variables. (2)
Temperature is a continuous variable which will directly influence many
physical processes.
Key Concept – Dependent Variable (DV): [Sometimes
"criterion variable".] Variables which are acted upon by IVs in a
causal line are known as "dependent" variables [because what they do
"depends on" what the IV does]. In the experimental method,
DVs are the variables which are monitored (i.e. observed and measured), whilst
IVs are those which are manipulated in the hope that by varying a cause you
will become better able to understand its effects.
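As an illustrative sketch (all names and numbers invented), an experiment can be thought of as a mapping from IV levels to observed DV values:

```python
# Sketch of an experiment: the IV (hours of exercise per week) is
# manipulated across groups; the DV (daily calorific intake, kcal) is
# observed. All figures are invented for illustration.
observations = {
    0: [2600, 2550, 2700],     # DV readings for the no-exercise group
    5: [2400, 2350, 2500],
    10: [2100, 2200, 2050],
}

# Summarise the DV at each IV level to look for a causal trend.
for hours, intakes in sorted(observations.items()):
    mean_intake = sum(intakes) / len(intakes)
    print(f"IV = {hours} h/week -> mean DV = {mean_intake:.0f} kcal")
```

The manipulated quantity appears only as the dictionary keys; everything monitored and measured sits in the value lists.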
For more on how variables fit into
the topics of philosophy of science and research design, start with the Section
2 entries for hypothesis testing, inference, and inferential testing, and follow the links from there.
2  The Glossary Entries
Action Research: A research philosophy originally developed by Lewin (1946) in order
to avoid "research that produces nothing but books" (p35), and by
Corey (1949, 1953) to support the cyclical improvement of educational
initiatives [more
history], but now of proven utility in any similar large institution,
including social services [example],
healthcare [example],
and IT [example].
Alternatively, "a deliberate, solution-oriented investigation that is
[characterised] by spiraling cycles of problem identification, systematic
data collection, reflection, analysis, data-driven action taken, and,
finally, problem redefinition" (Beverley, 1993/2004 online).
For further details, see Wilson and Streatfield (2004 online).
[See also participatory action research.] ANOVA: See
analysis of variance. Analysis of Variance (ANOVA): [See firstly variance and tests for the
difference of more than two means.] The ANOVAs are a class of parametric
statistical procedures capable of processing more than two columns of
group-difference data in one row [the "one way" analysis of
variance] or two or more columns of such data in two or more rows [the "two
(or more) way" analysis of variance]. Analysis of Variance, "One Way": [See firstly analysis
of variance.] The simple, or "one way", ANOVA is the
statistical analysis of choice for data which sit naturally in one-row tables
of three or more cells. The statistical procedure computes variance between
cells as well as in total. The statistical algorithm itself need not concern
us, but the statistic it produces, known as an F-value, is a
valuable index of (1) the amount of variance attributable to
differences between the cells, and (2) random, or "residual",
variance. Significance tables may then be used to convert the
F-value, given its degrees of freedom, into an equivalent p-value.
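In practice the computation is delegated to statistical software. A minimal sketch using SciPy's f_oneway routine (the calorie figures are invented) might look like this:

```python
# One-way ANOVA on three exercise groups (invented kcal/day figures).
from scipy.stats import f_oneway

none_grp     = [2600, 2550, 2700, 2480, 2620]   # no exercise
moderate_grp = [2400, 2350, 2500, 2300, 2450]
heavy_grp    = [2100, 2200, 2050, 2150, 2250]

f_value, p_value = f_oneway(none_grp, moderate_grp, heavy_grp)
print(f"F = {f_value:.2f}, p = {p_value:.5f}")

# A small p-value means between-cell variance is large relative to
# residual variance.
if p_value < 0.05:
    print("significant at the 5% level")
```

The library handles both the F-value and its conversion to a p-value, so significance tables need not be consulted by hand.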
Example: In testing the general hypothesis that exercise
depresses the appetite, we might record the daily calorific intake of five
groups, graded by the amount of exercise taken. If we then arranged for each
group to contain around a dozen subjects, we could tabulate their calorific
intakes (in kilocalories) into five horizontally aligned cells, numbered 1 to
5. Analysis of Variance, "Two (or More) Way": [See firstly analysis
of variance.] Two (or more) way ANOVAs are capable of simultaneously
coping with two (or more) independent variables. The popular
shorthand description for such procedures reflects the number of conditions
on each variable as an integer. Thus a 2x2 (pronounced
"twobytwo") ANOVA has two independent variables, each of whose
effects on the dependent variable are sampled under two conditions. Observations of
the dependent variable are accumulated under the appropriate cell heading, and
the analysis carried out once enough observations have been made. A 2x2x3
(pronounced "two-by-two-by-three") ANOVA has three independent
variables, two sampled under two conditions, and the third sampled under
three conditions. Example: In testing the general
hypothesis that exercise depresses the appetite in women but increases it in
men, we might record the daily calorific intake of five groups of men and
five groups of women, both graded by the amount of exercise taken. If we then
arranged for each group to contain around a dozen subjects, we could
tabulate their calorific intakes (in kilocalories) into five horizontally
aligned cells for the men, themselves vertically aligned over five further
horizontally aligned cells for the women. Since the resulting table would
then be five cells wide by two cells deep, it would be a 5x2 (pronounced
"fivebytwo") ANOVA. Artefact: Same
thing as artifact. Artifact:
[Optionally artefact.] Generally, "a thing made by art, an
artificial product" (OED), and thus, in the present context, a research
conclusion which turns out upon critical methodological examination to have
arisen thanks to a bias or confound of some sort, rather than
because of the causal relationship under test. A serious source of
research error. Lewin (1977) gives the following advice on handling
artifacts: "Highly contrived laboratory situations magnify artifacts.
The researcher should ask, 'How else can I study this topic?' [and one good
way] to reduce confounding variables is by imaginative and creative research
design" (p110). Another way is to increase "experimental realism".
Moreover, specific types of artifacts go with specific types of research, so
it is down to the analytical skills of the author(s) concerned to spot them
in advance (i.e. before your peer-reviewer does it for you, to your
cost). Pretesting may itself be the cause of an artifact, and can be
better managed by adopting the Solomon four group design. Attention Bias: This is a type of bias in which the act of observation
itself becomes an important (if not the most important) independent
variable. The Hawthorne effect is a good example of what can then
happen. Awareness of the Hypothesis: This is one of the eight types of confounding
identified by Lewin (1977). It reflects the possibility that a research
participant's understanding of what a given piece of research is for might,
consciously or unconsciously, influence the behaviour being measured. Between-Subjects Design: See independent groups design. Between-Subjects Variance: [See firstly variance.] This is variance
arising from chance or bias in the groups participating in an independent
groups design. Bias: [See
firstly measurement error.] In
everyday usage, a bias is a deviation from an intended path (OED). In
scientific research, it is the systematic
deviation of measurements from their true value (as opposed to random deviation, which is due to statistical "noise").
Unfortunately, bias can arise for a large number of reasons, and originate at
all stages of the research cycle. It therefore has no single definition,
method of detection, or standard remedy. The following subtypes of bias are
therefore dealt with separately: attention bias, centripetal bias,
confounding, cultural bias, demand characteristics, expectancy
bias, measurement bias, recall bias, sampling bias, volunteer
bias, and withdrawal bias. [For additional discussion, see Palmer
(1996/2004
online).] Binomial
Sign Test: This is a single sample inferential statistic
for nonparametric data at the nominal or ordinal level.
By "single sample" is meant any design which is attempting to judge
whether a sample population is, or is not, drawn from a reference population,
whose distribution is already a matter of record and thus need not be resampled.
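The single-sample logic can be sketched with an exact binomial test; the following uses SciPy's binomtest, with an invented reference proportion and sample:

```python
# Single-sample binomial test sketch (invented data): the reference
# population is on record as answering "yes" 40% of the time; does our
# sample depart from that?
from scipy.stats import binomtest

n_sampled = 50
n_yes = 31                          # observed "yes" answers in the sample

result = binomtest(n_yes, n=n_sampled, p=0.40)
print(f"p = {result.pvalue:.4f}")   # a small p-value suggests the sample was
                                    # not drawn from the reference population
```

Here the reference distribution (40% "yes") is already a matter of record, so only the sample itself needs collecting.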
If the sample population is representative of the reference
population, then the sample distribution and the reference distribution will,
save for sampling noise, be the same. The test can also be used as a
two-sample inferential statistic for nonparametric data in a related
design. Blind: This is the technical
term for the giving or taking of measurements without knowledge of the true
purpose of the research, and possibly under the deliberate influence of a cover
story. [See now double blind and single blind.] Bonferroni's
Correction (for Multiple Comparisons): [See firstly confidence levels and Type 1 error.] The
Bonferroni correction is an adjustment to the confidence level required when
a single scientific hypothesis is being investigated using multiple inferential
statistics. The risk on such occasions lies in the fact that the p-value,
the probability of a Type 1 error, increases with every new
statistical procedure, in much the same way that the odds of throwing a 6 go
up when you are allowed to keep throwing your die. The solution proposed by
the Italian statistician Carlo Emilio Bonferroni (1892-1960) was to divide the
required significance level by the number of comparisons being made, giving a single adjusted p-value. This was done by a statistical
algorithm, and a simple online procedure at the SISA website will nowadays do
this calculation for you [click here to use the algorithm and here to see the user instructions]. Box-and-Whisker
Plot: The box-and-whisker plot (or "boxplot" for short) is an
easily drawn graphical aid, designed to display both the central tendency and the dispersion
of a given distribution. The
graphic is produced by inspecting said distribution, and identifying five
values. The first two are the lower
extreme and the upper extreme,
and the difference between these values gives us the range of the distribution. The next important value is the median. The median is important
because it locates the midpoint of the distribution on a linear scale for the
variable in question. It divides the range into upper and lower halves, and
the two half ranges are then further subdivided by locating the two quartiles, the lower quartile and the upper
quartile. A rectangle is now drawn above the scale, such that it begins
at the lower quartile, ends at the upper quartile, and is vertically divided into
two at the median. This is the "box" element of the graphic. The
"whiskers" are now added to the box by adding horizontal lines from
the lower and upper limits of the box to the lower and upper ends of the box.
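The five values behind the graphic can be computed directly; a minimal sketch using Python's standard statistics module (scores invented):

```python
# Five-number summary underlying a box-and-whisker plot (invented scores).
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 12, 13, 15]

lower_extreme, upper_extreme = min(scores), max(scores)
median = statistics.median(scores)
q1, _, q3 = statistics.quantiles(scores, n=4)   # lower and upper quartiles

print(f"whiskers: {lower_extreme} .. {upper_extreme} (range {upper_extreme - lower_extreme})")
print(f"box     : Q1 = {q1}, median = {median}, Q3 = {q3}")
```

The "box" runs from Q1 to Q3 and is divided at the median; the "whiskers" run from the box out to the two extremes.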
Box Plot: See box-and-whisker
plot. Briefing: [See firstly ethics
and deception.] This is the stage in the standard research procedure
at which participants are told what the research is about (either truthfully,
or – ethical deception having been approved – as part of a cover story),
and given the opportunity to withdraw their consent to take part, as required
by the codes of practice on ethical research laid down by the various
institutions involved. In all research philosophies, it is wise to regard the
briefing as a potential source of demand
characteristics, and in the experimental philosophy it may also need to
be regarded as a treatment as
well, even if there is no element of deception involved. [See now debriefing.]
Burt, Sir Cyril
(1883-1971): [Selected Internet biography] British intelligence theorist, initially
acclaimed for his contribution toward the g-factor theory of
intelligence (e.g. Burt, 1917). Burt’s academic reputation suffered after
Hearnshaw (1979) exposed a number of inconsistencies in his data handling,
and he was subsequently adjudged by the British Psychological Society to have
falsified his results. More recent papers have defended Burt, but the
official ruling remains in effect nonetheless. Bystander Apathy: This is the name
given to an unwillingness to get involved on the part of persons close to an
apparent ongoing emergency. It is ignoring one's duty in favour of a
"quiet life". It is "bad Samaritanism". This phenomenon
was investigated by a classic social psychological study, Piliavin, Rodin,
and Piliavin (1969). Causal Line: See the entry for
this topic in Section 1 above. Cause and Effect: See the entry for this topic in the companion Rational
Argument Glossary. Central Tendency: [See firstly distribution.] This is
a measure of where the centre of a given distribution lies with reference to
its lower extreme and upper extreme. Among the graphical displays
of central tendency we have the box-and-whisker plot, and among the
computed measures we have the mean, the median, and the mode.
[Compare Dispersion.] Centrifugal Bias: This is a type of bias in which the research centre itself –
say a failing hospital or a non-prestige university – is avoided by
individuals who can get in somewhere better, thus rendering a sample
of those who are left subtly unrepresentative of the population at
large. [Compare centripetal bias.] Centripetal Bias: This is a type of bias in which the research centre itself –
say a specialist hospital or a prestige university – attracts individuals
with particular strengths and attributes, thus rendering a sample
thereof subtly unrepresentative of the population at large. [Compare centrifugal
bias.] Chi-Squared Test: This is the most common method of statistical analysis for
frequency data. Given an array of actual cell frequencies, the
statistical procedure computes a null
hypothesis expected distribution, and then tests whether the
actual-expected difference is big enough to have occurred by chance. [Full
tutorial] Citing Previous Research: This is making due reference to the literature when deriving and
justifying one's research argument.
[Now see criticising previous research.] Clever Hans:
This is a classic example of a procedural confounding bias, in which a
circus horse – Hans – would answer simple arithmetic questions by tapping so many
times with his hoof. Upon closer inspection, however, it turned out that Hans
was not numerate at all  merely sensitive to his trainer's body language.
The secret was that Hans had learned to start tapping when given one type of
behavioural cue, and would stop when given another [full details]. Research which does not thoroughly avoid
confounds of this sort at the early planning stage is likely to be deeply
flawed. Clinical Effectiveness: This is the general concept of value for money –
i.e. demonstrable benefit – in clinical treatment of any kind. Specifically,
a series of initiatives during the early 1990s to maximise value for money in
the British NHS by raising consciousness of costofoutcome amongst clinical
professionals, and which therefore inspired (and still inspires) a large
number of efficacy studies to prove things one way or the other. The
search for maximum clinical effectiveness in the UK is overseen by the National Institute for
Clinical Excellence (NICE). Clinical Judgement: This is one of the two
basic types of clinical assessment (the other being the use of formally
standardised psychometric tests). To reach a clinical judgement
requires a combination of observation, ad hoc diagnostic tests,
and prior professional experience. Cluster Analysis: This is one of the four recognised types of multivariate method
(the others being principal components
analysis, factor analysis, and
discriminant analysis). Cluster Sampling: This is one of the standard optional methods of sampling. The
method relies on sampling selected clusters of potential subjects within the
target population, rather than the population
as a whole (thus saving time and expense, but at the risk of introducing some
kind of sampling bias). Coefficient Alpha: Same as Cronbach’s
alpha. Cohort Study:
This is a type of longitudinal study in which the group(s) being
studied are monitored over a suitable period of time. Concurrent Validity: [See firstly validity.] Data collection
instruments such as questionnaires, test batteries, or psychometric
tests may be said to have concurrent validity to the extent to which
their findings correlate with other tests – criterion tests – of the
same construct. Concurrent validity can be assessed by an appropriate
statistical technique, and expressed as a correlation coefficient. Unfortunately, since suitable criterion
tests are in fact surprisingly rare, this sort of validity can actually be
difficult to quantify. Indeed, Kline (2000b) warns that "almost the only
field where accepted tests exist such that high correlations with them
indicate validity is intelligence […] In most other fields confusion
reigns" (p20). Kline also warns that the reliability of the selected criterion test also needs to be taken
into account, because if you select a test with low reliability to validate a
new test against, then it may be the validating test which is misbehaving,
not the new one. Confidence Level: [See firstly hypothesis testing and inferential statistics.]
This is an expression of the probability of accepting a hypothesis
without committing a Type 1 error, conventionally expressed by a
p-value. The usual confidence boundaries in psychological research are
"not significant" (a p-value greater than 5%),
"significant" (a p-value between 1% and 5%), "very
significant" (a p-value between 0.1% and 1%), and "highly
significant" (a p-value less than 0.1%). The conventional shorthand for
expressing these levels of significance in social science research is to add
the code "p > 0.05", "p < 0.05", "p <
0.01", and "p < 0.001", respectively. Confound:
In the current context, "to confound" is to fail to detect a confounding
bias prior to carrying out a piece of research, with the end result that cause-and-effect
interpretation of the results becomes unsafe. "A confound" is the confounding
variable doing the damage. Confounding Bias: This is a type of bias in which one or more initially
unrecognised confounding variables turn out to have affected the
obtained results, thus rendering cause-and-effect interpretation
unsafe. Lewin (1977) identifies the following eight major sources of
confounding: awareness of the hypothesis, demand characteristics,
enlightenment effects, evaluation apprehension, experimenter
expectancy, reactance, and role expectations (two types). Confounding Variable: [See firstly variable.] This is an independent
variable NOT formally designed into a piece of research, and which, by
not being controlled, is likely to pervert the course of hypothesis
testing, perhaps by encouraging a Type 1 error. Consent: See
informed consent. Consistency:
See internal consistency. Construct:
See hypothetical construct. Construct Validity: [See firstly hypothetical
construct and validity.] Data collection instruments such as questionnaires,
test batteries, or psychometric tests may be said to have
construct validity to the extent to which they are based upon wellaccepted
psychological constructs. This is important because many psychological
constructs – e.g. telepathy – are not universally accepted. Anastasi’s (1988)
examples of established theoretical constructs include scholastic aptitude,
comprehension, verbal fluency, neuroticism, and anxiety. The notion of
construct validity derives initially from Cronbach and Meehl (1955/2005 online),
who warned that specific high correlations "may constitute either
favourable or unfavourable evidence […] depending on the theory surrounding
the construct". Example: A mindreading test which
was in other respects reliable and valid would have dubious construct
validity because the construct of mind reading was itself less than
universally accepted. Assessing: Cronbach and Meehl
further argue that the ideal assessment of construct validity would be to
have some form of "construct validity coefficient", "a
statement of the proportion of the test score variance that is attributable
to the construct variable" (p7). Unfortunately, while this is
conceptually straightforward enough (you simply have to decide whether the
suggested psychological construct actually exists, or is something else under
a new name), it is difficult to do in practice. In fact, Kline (2000a) sees
little alternative to having to put together a package of hypothesis testing
supplementary to the headline hypothesis, and that will seriously complicate
the research design and dramatically lengthen the validation process. Early
planning is therefore called for. Content Analysis: This is a method of obtaining quantitative scores for various
variables within (usually written) language. [For further details, see
the corresponding entry in our Psycholinguistics
Glossary.] Content Validity: [See firstly validity.] Data collection instruments such as questionnaires,
test batteries, or psychometric tests may be said to have
content validity to the extent that they sample "the class of situations
or subject matter about which conclusions are to be drawn" (French and
Michael, 1968, p164). Example: A mathematics test
which contained only spelling questions, or one lacking a section covering
division, would have impaired content validity. Solution: Careful
planning and analysis of the literature, followed by more detailed
hypothesising and/or perhaps a multifactorial design with a view to
quantifying the true spread of construct complexity (e.g. reading skill,
driving skill, etc.). Another technique might be to resort to field experts
to examine the proposed test content and to quantify and report some measure
of their approval. Continuous Variable: [See firstly variable and the Section 1 entry for "discrete" vs "continuous"
variables.] This is one of the two subclasses of interval/ratio
data (the other being discrete variable). It follows that measurements
of continuous variables are always approximations, and thus have an element
of measurement error irretrievably built in. [Compare discrete
variable.] Control Group: [See firstly group.] This is a subset of a research sample
selected NOT to receive a particular treatment, thus providing a helpful
baseline or comparison measure for the dependent variable under investigation.
Alternatively, it is the "point of comparison with the group of subjects
who receive the experimental manipulation" (Bryman and Cramer, 1997,
p5). Alternatively, "the function of a control group is to provide
an observation that cannot be attributed to the variable being
manipulated" (Sarbin and Coe, 1975, p11). Historically, one of the first
recorded controlled trials was James Lind's discovery in 1753 that eating
citrus fruits could cure the condition known as "scurvy" in the mariners
of that time. The online James
Lind Library details this and a number of other pioneer uses of control
groups. Correlation:
To "correlate", of variables, is to vary in the same
direction and proportion at the same time, possibly as the result of a cause-and-effect
relationship but perhaps coincidentally. The detection of correlations
is an important practical aspect of establishing a causal line, and thus the fundamental principle of the correlational philosophy of science. Correlation Coefficient: A "correlation coefficient" is a
mathematical index produced by one of the many correlational statistical techniques (such as the Pearson
product moment correlation or the Spearman rank correlation) and
indicating the extent of the relationship between two potentially related
sets of measures, and therefore, to the extent that they have been properly operationalised, of the proposed
underlying variables. The coefficient ranges from -1 (a perfect negative
correlation) to +1 (a perfect positive correlation). A strong
positive coefficient (usually accepted as 0.7 or above) indicates that one
variable typically increases as the other increases, whilst a strong negative
coefficient (usually accepted as -0.7 or below) indicates that one variable
typically decreases as the other increases. A coefficient of zero indicates
no relationship at all. Correlational Method: See correlational philosophy of science. Correlational Philosophy of Science: [Alternatively "correlational method" or
"correlational psychology".] This is one of the two alternative
approaches to quantitative research (the other being experimental
psychology), as proposed by Cronbach (1957). The method effectively plots
naturally occurring observations of one variable against another, searching
for the correlations "presented by nature" (Cronbach, 1957,
p10), that is to say, for "already existing variation" rather than
for that introduced by the experimental manipulation of an independent variable. The value of
this approach stems from its ability to supplement the experimental approach
in areas which "man has not learned to control or can never hope to
control." The problem with correlations, however, is that they are not
necessarily causal. Indeed, if we do not know the precise causal line,
it is easy for regular co-occurrence to be misinterpreted. Errors of this
sort are known as the "cum hoc fallacy" [Rational
Argument Glossary]. Cronbach points out with some justification that
correlational psychologists search out variables the experimentalists prefer
to ignore. Correlational Psychology: See correlational philosophy of science. Correlational
Statistical Techniques: Mathematical
algorithms such as the Pearson
product moment correlation or the Spearman rank correlation, intended to produce correlation
coefficients. Cost-Effectiveness: This is what health service and education managers
have to consider when financing intervention projects. It is a matter
of a project's effectiveness relative to its cost, the point
being that many highly effective treatments are nonetheless insupportable
financially. It is relevant here because studies of cost-effectiveness are
commonplace in healthcare, clinical psychology, health psychology, and
educational psychology. Counterbalancing: This is an aspect of research design intended to minimise order
effects in experimental manipulations. Participants are exposed to the
required experimental conditions in different sequences, so that overall the effects of practice or
fatigue are presumed to cancel out. Cover Story:
[See firstly briefing and deception.] This is a false statement
as to the purpose of a given study. It must have been approved by the ethics panel concerned, and must be
covered in the debriefing session. Criterion-Referenced: This is one of the two basic philosophies of
behavioural or psychological assessment (the other being norm-referenced),
specifically, one in which the criteria of "goodness-badness" at
the test are publicly recorded in advance as a set of specific and
objectively assessable behavioural indicators. Example:
One of the most accessible examples of a criterion-referenced assessment is
the (UK) driving test, where you pass when you are judged good enough against
a tick-list of demonstrable abilities. Criterion Test: See concurrent validity. Criterion Validity: [See firstly validity.]
This is the extent to which a test correlates "with one or more external
variables considered to provide a
direct measure of the characteristic or behaviour in question" (French
and Michael, 1968, p167). In most respects, the same as predictive validity
[for a discussion of the exceptions, see Anastasi, 1990, Chapter 6]. Criticising Previous Research: [See firstly citing
previous research.] Cross-Validation: [See firstly validation.] This is the independent
determination of the validity of a test, using "a different sample of
persons from that on which the items were selected" (Anastasi, 1990,
p226). Debriefing:
[See firstly ethics.] This is an important aspect of ethicality in
research, and part of the briefing-debriefing aspect of research procedure.
In its simplest form it is a short recapitulation of what subjects have done
and why they have done it. Debriefing is especially important where the
research involved any intentional deception. Deception:
This is the deliberate concealment of the true purpose of a piece of
research, often assisted by a cover story delivered at the briefing.
Studies which involve deception must always expect to be challenged by the ethics
committee involved, and therefore demand deep initial reflection and
analysis. As far as undergraduate research is concerned, deceptions are only ethical if
the deception is necessary to avoid demand characteristics, reduce experimenter
effects, or otherwise control confounding. Degrees of Freedom (df): This is "the number
of components [of a statistic] which are free to vary" (Bryman and
Cramer, 1997, p122). Despite being a mathematically complex concept, degrees
of freedom are usually simple to determine and use. For example, the degrees
of freedom for a variable sampled at n
different intensities is simply (n - 1). Demand Characteristics: [See firstly bias.] This term was coined by Orne
(1962), and is one of the eight types of confounding identified by
Lewin (1977). It reflects the possibility (nay certainty) that subtle
environmental factors will interact with the motivational state of human
subjects during the research experience to render the observed behaviour
non-natural in some important respect, the demonstrable fact being that "the
setting may well evoke other behaviour you did not intend to evoke"
(Lewin, 1977, p103). The point is that the confounding variable is
provided by the experimental setup itself, which may include the behaviour
or appearance of the experimenter(s) personally. Dependent Variable (DV): See the entry in Section 1 hereto. Descriptive Statistics: [See firstly statistics.] The phrase
"descriptive statistics" refers to a portfolio of mathematical procedures
designed to present research data in summary form without being part of
hypothesis testing. The most common descriptive statistics are mean [
= average], median, mode, range, and standard
deviation, and the most common graphical displays are the bar chart,
the box-and-whisker chart, the histogram, and the pie chart.
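These summary measures can be sketched with Python's standard-library statistics module (the scores below are invented purely for illustration):

```python
# Illustration of the common descriptive statistics, using Python's
# standard-library statistics module. The scores are invented data.
import statistics

scores = [43, 51, 58, 58, 62, 67, 71, 76, 84, 93]

print(statistics.mean(scores))      # mean (= average): 66.3
print(statistics.median(scores))    # median (midway score): 64.5
print(statistics.mode(scores))      # mode (most common score): 58
print(max(scores) - min(scores))    # range (highest minus lowest): 50
print(statistics.stdev(scores))     # sample standard deviation: approx. 15.19
```

Note that none of these figures tests a hypothesis; they merely summarise the distribution.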
[Compare inferential statistics.] Design: See
research design. Developmental Delay: The phrase
"developmental delay" refers to the failure of a developing
organism to reach/achieve/display some physical, cognitive, or behavioural developmental
norm at the expected chronological age. Developmental Norm: [See firstly norm.] This is an age-related
expectation of mental or physical ability informed by past experience or
research with the population in question, and therefore vitally
important in the detection of developmental delay, and therefore part of the standardisation
exercise prior to the marketing of major psychometric test
packages. Diagnostic Tests and Screening Procedures: These are measurements and measurement packages
designed to assist during the assessment phase of patient management. The
ability of a given test to detect someone who needs to be detected is known
as its sensitivity. The ability to exclude people who need to be
excluded is known as its specificity. The positive predictive value
of a test is a measure of how many of those who have been detected as
positive actually are positive, and its negative predictive value is a
measure of how many of those who have been detected as negative actually are
negative. Clinicians need to be aware of all four of these factors, and
recognise that the qualities are to a large extent mutually exclusive. That
is to say, a good test of one condition might be a bad test of something
else. [There is actually a good mathematical reason for this, as summarised in the entry for the ROC curve.] Difference Testing: See testing for the difference of two means
and testing for the difference of more than two means. Differential Validity: See incremental and
differential validity. Discrete Variable: [See firstly variable
and the Section 1 entry for "discrete" vs "continuous"
variables.] Discrete variables are one of the two subclasses of interval/ratio
data (the other being continuous variable). Discriminant Analysis: This is one of the four types of multivariate
method. Discriminatory Power: This is an important
aspect of undertaking item analysis during the development of a
psychometric test. Dispersion: This is a measure of how tightly clustered a
distribution is around its mean. [Compare central tendency.] Double Blind:
[See firstly blind.] A double blind study is one in which BOTH
experimenters and participants are naive as to the true purpose of the
research. It might be necessary to organise things this way if experimenter
effects or other factors might bias the results. [Compare single blind.] DV: See
dependent variable. EBP: See
evidencebased practice. Effectiveness: This is a measure of the likely actual benefit arising from a given
remediation programme (that is to say, under average conditions of use)
(compare efficacy). (After Hayward, Jadad, McKibbon, and Marks, 1998.) Efficacy: This
is a measure of the theoretically maximum benefit arising from a given
remediation programme (that is to say, under ideal conditions of use)
(compare effectiveness). (After Hayward, Jadad, McKibbon, and Marks,
1998.) Efficacy Study: [See firstly efficacy.] Eigenvalue:
[See firstly principal components analysis.] Empirical Data: These are data obtained by actual observation rather than by
conjecture; data from the evidence of the senses. Enlightenment Effects: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that prior exposure to the study
area might influence performance under test. Error: See measurement error. Ethics:
This is the code of practice imposed upon researchers by their professional
body and/or employer. See the British Psychological Society Code of Conduct. Ethics Committee: This is a formally constituted panel to which research proposals need to be
submitted for approval on ethical grounds (and hence a major defence against
legal action should the case arise). Evaluation Apprehension: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that naturally apprehensive or
secretive personalities will not be performing normally on the behaviour
under test. Lewin suggests, amongst other things, that experimenters need to
watch out for comments such as "I better watch what I say in front of
you" (op. cit., p105). Evidence-Based Practice (EBP): Evidence-based practice is properly informed
professional decision making. It is "the conscientious, explicit, and
judicious use of current best evidence in making decisions about the care of
individual patients" (Sackett et al, 1996). "It is a systematic
approach to integrating current scientific evidence" (source) [alternative
definitions]. EBP is, however, only as good as the available evidence, and
that is usually less than conclusive. The philosophy therefore requires that
practitioners are sensitive to levels of evidence. Moreover, even
where the evidence base is sound, it is being constantly extended (hourly,
indeed, in the fastest moving branches of science). [See the story of James
Lind in the entry for control group.] Experiment: See
true experiment. Experimental Methods: [See firstly research types and designs.] This
is a class of research design intended to approximate to the ideal of the
true experiment, and therefore characterised by structured observation of the
effects of one or more deliberately manipulated independent variables
on a single dependent variable, while the effects of (ideally all)
other possible causation is tightly controlled. [Now see the separate
entries for field experiment, natural experiment, quasi-experiment,
true experiment.] Experimental Psychology: This is one of the two basic types of scientific
psychology identified by Cronbach (1957) (the other being correlational
psychology). The method is based upon the scientist changing this or that
condition "in order to observe their consequences" (p10). The
experimenter is thus "interested only in the variation he himself
creates", unlike the correlator, who is interested in the variation which
is already there. Experimenter Bias Effects: Same thing as experimenter effects. Experimenter Effects: The phrase "experimenter effects" refers
to the ability of experimenters, through carelessness and lack of attention to detail, to bias their research, for
example, by failing to prevent demand characteristics, the Hawthorne
effect, etc. Experimenter Expectancy: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that experimenters themselves can
subtly influence their participants' behaviour. [See Pygmalion effect.] Ex Post Facto Research:
This is one of the recognised subtypes of the experimental method. External Reliability: [See firstly reliability.] This is one of
the two forms of reliability (the other being internal reliability).
"The degree of consistency of a measure over time" (Bryman
and Cramer, 1997, p63). Face Validity: [See firstly validity.] A test may be said to have face
validity if, upon simple inspection, it appears to the subject to
measure "what it claims to measure" (Kline, 2000b, p18). Kline
warns that this can sometimes be a good thing (it may motivate subjects to
perform well), and sometimes a bad thing (the target measure may be so
obvious as to promote deliberate misperformance). Factor Analysis: This is one of the two main factor analytical methods (the
other being principal components analysis). The method requires the
accumulation of scores on a number of simultaneous variables for each subject, followed by the computation of multiple correlations. Factor Analytical Methods: These are one of the most powerful correlational
methods of scientific research, and the method of choice when
investigating multiple causation. There are two specific statistical
procedures under this heading, namely factor analysis proper, and principal
components analysis. Enthusiasts for factor analytical methods are
quick to point out that science cannot advance by hypothesis testing alone. Factor Loading: See factor analytical methods in general and loadings
in particular. False Negative: [See firstly diagnostic tests and screening procedures.] This
is AN INCORRECT diagnostic judgement that an entity DOES NOT fall within a
target category. [See now negative predictive value.] False Positive: [See firstly diagnostic tests and screening procedures.] This
is AN INCORRECT diagnostic judgement that an entity DOES fall within a target
category. [See now positive predictive value.] Falsification: See principle of falsification. Fatigue Effect: This is a class of confounding which might be encountered with a
prolonged or physically demanding research procedure, and in which
performance on the later items will be tailing off. Fatigue effects may be
controlled for to a certain extent by going for a more sophisticated design,
perhaps with counterbalancing of trials. Ferguson's Delta: [See firstly discriminatory power.]
This is an index of discriminatory power devised by Ferguson (1949). Fisher, Sir Ronald: Sir Ronald Aylmer Fisher (1890-1962) was the statistician who devised the
logic of the null hypothesis during hypothesis testing. His
book "Statistical Methods for Research Workers" (Fisher, 1925/2004 online) has
been described as "probably the most influential book on statistics of
the 20th century" (source). Frequency Data: This is a subtype of nominal data. Gaussian Distribution: See normal curve. Gosset, William: William
Sealy Gosset (1876-1937) was the Guinness brewery quality assurance chemist who, under the pseudonym "Student", popularised Student's t-test
(Student, 1908) as a practical method of comparing the strength and
composition of small samples [fuller
story]. Hawthorne Effect: [See firstly bias and confounding.] This is the name
given to the phenomenon whereby the mere act of observing a behaviour can
change it. The effect was first formally documented by Mayo (1933, 1945),
following field research at the Western Electric Hawthorne Works, Chicago,
between 1927 and 1932, in which the main driver of plant productivity turned
out to be the presence of the researchers, rather than anything to do with
the working conditions [fuller
story]. The Hawthorne effect is an excellent example of attention bias
in action. Homoscedasticity: See homogeneity of variance. Hypothesis:
"A hypothesis states the relationship between two (or more) variables
[and] takes a form such as 'if variable A is high, then variable B will be
low'" (Lewin, 1979, p37). Hypothesis Testing: [See firstly hypotheses.] This is the act of
putting one's theoretical beliefs to objective and peer-replicable test. Hypothesis
testing will normally be supported by inferential statistics.
Mathematically, there are a number of ways to go about this, but the most
popular method in the social sciences was devised by Fisher, Sir Ronald,
and is so structured as to involve an attempt to disprove the null
hypothesis and simultaneously to provide some estimate of confidence
level in the form of a p-value. Hypothesis testing is also the backbone of the hypothetico-deductive method, on which nothing less
than the scientific method itself is based [not everyone agrees
totally with this – see Cattell's (1952) comments in the entry for factor analytical methods]. Hypothetical Construct: A hypothetical construct (or "construct", for short) is a presumed internal quality of a system, beyond direct observation, whose presumed operation accords with available empirical data. Alternatively, "a construct is some postulated attribute of people, assumed to be reflected in test performance" (Cronbach and
Meehl, 1955/2005
online). Constructs are therefore part of a theory, and may, in
turn, map onto one or more variables,
each of which may be operationalised
as observable measures in a number of different ways. Examples:
stress, memory. [See now construct
validity.] Impression: In the context of this glossary, an "impression" is a statement of best
clinical judgment, an attempt at medical diagnosis which allows for an
element of residual uncertainty. It reflects how a patient "looks"
(or, more formally, "presents"), rather than "what they have
got". [Further
details] Independent t-Test: Same as unrelated t-test. Independent Variable (IV): See the entry in Section 1 hereto. Inductive Reasoning: See the entry for this topic in our Rational
Argument Glossary. Inference:
See the entry for this topic in our Rational
Argument Glossary. Inferential Statistics: [See firstly statistics.] This is a class of
mathematical procedures designed to establish the likelihood of a causal
relationship existing between blocks of empirical observations. Two major
subclasses of inferential statistic are recognised, namely correlational
methods and tests for the difference of group means.
[Compare descriptive statistics.] Interaction:
See analysis of variance. Internal Consistency: [See firstly reliability and validity.]
This is one of the two considerations of a test's reliability (the other
being test-retest reliability). A form of validation of a multiple-item test, characterised by the fact that "the criterion is none other
than the total score on the test itself" (Anastasi, 1990, p55). Internal Reliability: [See firstly reliability.] This is one of
the two forms of reliability (the other being external reliability).
The extent to which a scale "is measuring a single idea" (Bryman
and Cramer, 1997, p63). Interval Data: [See firstly interval.] This is a collection of observations
of an interval-based variable. Intervention: This is the act of remediation itself, that is to say, the treatment
which is actually delivered to the person in need (the alternative being to
do nothing and let nature take its course). Intervention Study: This is the research programme needed to establish
the efficacy and/or effectiveness of a remediation programme.
The simplest research design is to divide a group of sufferers into matched
Experimental (E) and Control (C) groups. The E-Group then receives the remediation for a given period of time whilst the C-Group receives an equally complex but medically/educationally neutral dummy treatment (itself a major ethical problem). Improvements over time (if any) are measured, and - if bias and confounding have been properly controlled (a massive design problem), and if the measuring instruments are valid and reliable (another massive problem) - any changes can only have resulted from the treatment (although it still might not prove cost-effective). (See main text for examples.) IV: See
independent variable. Kendall Coefficient of Agreement (u): This is a test to detect an underlying logical
pattern in a series of repeated paired comparisons, such as might be obtained
if a group of judges was only ever presented with two items at a time out of
a sample, rather than ranking the entire sample. Kendall Coefficient of Concordance (W): This is a test for the correlation of more
than two variables, for ordinal data. Kuder-Richardson Reliability Coefficient: [See firstly reliability in general and consistency in particular.] This is a measure of inter-item consistency devised by Kuder
and Richardson (1937). Longitudinal Study: This is an intervention study where the
effects are reassessed not just at the end of the initial intervention period
but at intervals over many years. Longitudinal studies are therefore the
preferred method of evaluating educational initiatives and the like, where
the final results need to work their way in real time through the normal
developmental lifecycle. Cohen and Manion (1989, p71) distinguish four
subtypes of longitudinal study, namely the cohort study, the cross-sectional
study, the ethogenic study, and the trend study. MANOVA: See
analysis of variance. Measurement:
"Measurement is the numerical estimation of the ratio of a magnitude of
an attribute to a unit of the same attribute" (Michell, 1997, p383).
Alternatively, it is a quality or quantity arrived at by observation of an operationalised measure of a variable
in known circumstances. Measurement Error: "Discrepancies between the observed value of your
measurement and the 'true' value" (Fife-Schaw, 1995, p45). Median:
This is the score midway between the lowest and highest score in a distribution. Mode: This
is the most common score (or band of scores) in a distribution. Mortality:
In the context of research, this is a type of bias in which subjects
do not complete the study period. This might be literal mortality (as in
medical research, for example) or figurative (as in student dropout). Multivariate Methods: "A collection of techniques appropriate for
the situation in which the random variation in several variables has
to be studied simultaneously" (Armitage and Berry, 1987, p326). The main
multivariate methods are principal components analysis, factor
analysis, discriminant analysis, and cluster analysis. Naturalistic Observation: This is one of the three basic approaches to
scientific research identified by Underwood (1966), and characterised by
"the recording of behaviour as it occurs in a more or less naturalistic
setting with no attempt to intervene" (p4; bold emphasis added). Negative Correlation: See correlation coefficient. Negative Predictive Value (NPV): [See firstly diagnostic tests and screening
procedures.] A test's NPV is a measure of how good that test is at
detecting true negatives when all its decision negatives are considered. It
is calculated by substituting empirical observations into the formula NPV = TN / (TN + FN). When NPV is high it indicates that the false negative problem is
under control. Noise:
Within the context of measurement theory, this is the same thing as random
error. Non-Parametric Statistics: [See firstly inferential statistics.] Non-parametric tests are mathematically "less powerful" than their parametric equivalents. This means that "given exactly the same data, a parametric test is more likely to lead to significant results than a non-parametric test" (Snodgrass, 1977, p357). The most commonly used non-parametric statistics are the binomial sign test (for single sample designs), the Mann-Whitney U test (for two-group unrelated designs), the binomial sign test or Wilcoxon signed ranks test (for two-group related designs), and the Kruskal-Wallis test (for
three or more group unrelated designs). [Compare parametric statistics.] Norm: This term refers to the performance of the standardisation
sample on a given test (Anastasi, 1990), and thus that which gives
meaning to the test scores of subsequent samples. [See now developmental norms.] Normal Distribution: [Alternatively "Gaussian distribution" or
"the bell curve".] See separate dedicated handout and exercises. NPV: See negative
predictive value. Null Hypothesis: [See firstly hypothesis testing.] This is a deliberate
statement of the contrary of what you really suspect to be the case. Thus, if
one's true hypothesis is that <TALL MEN LIVE LONGER>, then the
null hypothesis is simply that <TALL MEN DO NOT LIVE LONGER>. Although this may seem an unnecessary complication in putting an argument across (because our mental problem space is of limited capacity and extra words,
especially negatives, take up that space), it is useful for technical reasons
when carrying out inferential testing. This is because group
difference statistics, by their mathematical nature, make the initial
presumption that two samples are from the same population until proved
otherwise. They then report when the means of the two samples move apart far
enough for that null presumption eventually to be dismissed. Observation: "The
essence of studying anything [is] the observation of changes in variables"
(Coolican, 1990, p15). "Observation exists at the beginning and again
at the end of the process: at the beginning, to determine more definitely and
precisely the nature of the difficulty to be dealt with; at the end, to test
the value of [the action taken]. Between those two termini of observation, we
find the more distinctively mental aspects of the entire thought
cycle: (i) inference, the suggestion of an explanation or solution; and
(ii) reasoning, the development [of] the suggestion. Reasoning
requires some experimental observation to confirm it, while experiment can be
economically and fruitfully conducted only on the basis of an idea that has
been tentatively developed by reasoning. [.....] The disciplined, or
logically trained, mind - the aim of the educative process - is the mind able
to judge how far each of these steps needs to be carried out in any
particular situation. No cast iron rules can be laid down. Each case has to be
dealt with as it arises [.....]. The trained mind is the one that best grasps
the degree of observation, forming of ideas, reasoning, and experimental
testing required in any special case, and that profits the most, in future
thinking, by mistakes made in the past. What is important is that the mind
should be sensitive to problems and skilled in methods of attack and
solution." (Ray, 1967, p157; italics original.) One-Tailed Test: [See firstly tests for the difference of two means.] This is a directional application of one of the two-group inferential statistics, namely the t-test, the Wilcoxon, or the Mann-Whitney. It is so called because the statistical procedure only has to deal with
separation of the two distributions down one or other of the asymptotes (or
"tails"), but not both. Operationalise, To: This is the act of assigning a particular physical
dimension as a measure of a particular research variable. Example: One might
conceptualise the hypothetical
construct stress as including autonomic changes, one of which might be
adrenalinerelated, and then operationalise
a measure of stress as serum adrenalin, heartbeats per minute, temperature of
thumb, salivary cortisol, irritable outbursts per hour, or anything you like,
providing you can defend the construct
validity of your eventual findings. Opportunity Sampling: [See firstly sampling.] This is the act of
selecting a research sample according to who is available to take part in it,
rather than according to more precisely derived criteria. Paired Comparison: See Kendall coefficient of agreement. Parallel Form: In order to avoid practice effects which
might otherwise prevent using the same assessment twice on the same subjects,
many psychometric packages offer two (or more) item sets, matched for
difficulty. These are known as the parallel forms of the test. PCA: See
principal components analysis. Pearson, Karl: Karl
Pearson (1857-1936) was the statistician who devised the Pearson product moment correlation and the chi-squared test. Piliavin, Rodin, and Piliavin (1969): This is the class-defining study into bystander
apathy. Placebo Group: This is a group in an intervention
study given a dummy treatment,
and (usually) kept unaware of that fact. Population:
This is all the members of a uniquely definable group of people or things.
[Compare sample.] Positive Correlation: See correlation coefficient. Positive Predictive Value (PPV): A test's PPV is a measure of how good that test is
at detecting true positives when all its decision positives are considered.
It is calculated by substituting empirical observations into the formula PPV = TP / (TP + FP). When PPV is high it indicates that the false positive problem is
under control. Practice Effect: This is a class of confounding which might be encountered
when the measure in question is itself a learnable mental or physical skill. Predictive Validity: [See firstly validity and criterion validity.] Data
collection instruments such as questionnaires, test batteries,
or psychometric tests may be said to have predictive validity to the extent that they have demonstrated
the ability to detect the people they want to find. Predictive validity is
therefore a major requirement in healthcare (where tests are used to
select/reject patients for treatment) and education (where tests are used to
select/reject students). Establishing an instrument's predictive validity
requires prolonged field data collection and analysis, but provides a very
important statistic to be able to quote. There is an enormous science of predictive value for diagnostic tests within medical decision making, but we shall avoid it for the moment. Predictor Variable: This is an optional name for independent variable. Premiss: This
is an optional spelling of premise. Principal Components Analysis (PCA): [See firstly factor analytical methods.]
This is a factor analytical method of screening a large number of
simultaneous [i.e. multivariate] measures for those which - because they vary together all or most of the time - may be better regarded as the outcome of a
broader underlying factor. What we want to end up with is new and better
variables, such that each "has the highest possible variance and so
represents better than any other linear combination of the [original
variables] the general differences between individuals" (Armitage and
Berry, 1987, p327). Principle of Falsification: Popper's (1959) assertion that the scientific
method is ultimately based on our ability to prove an assertion is false (by
finding a counterexample to it), but NOT to prove one is true. pValue:
[See firstly confidence level.] Academic journals usually adopt the standard
mathematical shorthand here, reporting probability as p [hence
"p-values"]. The values of p can run from zero (totally improbable)
to 1.0 (certain), and are usually seen as two places of decimals in between.
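These reporting conventions can be shown in a two-line Python sketch, reusing the illustrative value p=0.57 from this entry:

```python
# Sketch of the p-value reporting conventions described above.
# p=0.57 is the illustrative value used in this glossary entry.
p = 0.57                  # a probability between 0 (improbable) and 1.0 (certain)
percent = round(p * 100)  # multiply by 100 to convert to a percentage

print(percent)            # 57 -> "likely to happen 57% of the time"
```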
The probability can be converted to a percentage by multiplying by 100. Thus an event with probability p=0.57 is likely to happen 57% of the time. Probability: This is a mathematically expressed measure of how likely something is to happen. [See now p-value.] Professional Opinion: See levels of evidence. Pygmalion Effect: This is one of the classical examples of an expectancy bias,
first studied in schoolteachers by Rosenthal and Jacobson (1968). Qualitative Research: This is research in which the critical variable
is a quality [compare quantitative research]. Quantitative Research: This is research in which the critical variable
is a quantity [compare qualitative research]. Quasi Experiment: [See firstly experimental methods.] This is a form of
experiment, developed originally in educational research, in which it is not
possible to allocate subjects to the various IV conditions. Random Error:
[See firstly measurement.] This is the inherent inaccuracy of any
scale of measurement. Randomised Controlled Trial (RCT): The RCT is a robust and well-tried research design and a key element in delivering evidence-based practice in healthcare (or, indeed, any other profession). It is
"randomised", because it does not preselect subjects who are in
some way likely to fit "the treatment" being evaluated. Instead,
participants are drawn at random from the largest practicable pool. It is
"controlled" in the sense that it includes control groups
who do NOT receive the treatment in question but who IN EVERY OTHER RESPECT
are treated identically. This is to insure against making what are known as Type
I errors should a variable other than the treatment be surreptitiously at
work. What you want to see is an improvement in the treated group but no
change in the controls. RCTs are also expected to be blind or double-blind where necessary, to avoid bias and confounding generally, to do their best to design out practice effects, order effects, fatigue effects, and ceiling effects, and to maximise (or at least quantify) the many subtypes of research validity and reliability. Range:
[See firstly distribution.] The range of a distribution is the difference between that distribution's lower extreme and its upper extreme. Example: If the lowest value in a distribution is 43 and its highest value is 93, then subtracting the former from the latter gives us a range of 50. [To see how the range can be used in descriptive statistics, see box-and-whisker plot.] RCT: See randomised
controlled trial. Reactance:
This is one of the eight types of confounding identified by Lewin
(1977). It reflects the possibility that stubbornness on the part of subjects
will cause responses deliberately opposite to that which might otherwise have
been made. Receiver Operator Characteristics (ROC): [See firstly diagnostic tests and screening
procedures.] If you adjust the cutoff to a lower value, then it makes
the test more sensitive, but only at the cost of having to put up with
more false positives. If you set it higher then it makes the test more
specific, but only at the cost of having to put up with more false
negatives. Reflective Practice: This is a state of perpetual critical appraisal in
professionals which attempts to exclude those irksome errors of omission by enhancing
the vision of exactly what full clinical autonomy actually involves.
Reflective practitioners are seen as preventers who constantly question their
means of prevention, as assessors who constantly question their methods of
assessment, as interveners who constantly question their proposed point of
intervention, and so on. Reliability:
This is a measure of how well a given test reflects "'true' differences
in the characteristics under consideration" (Anastasi, 1990, p109).
Reliability is usually considered under two subheadings, namely internal
consistency (e.g. Bryman and Cramer, 1997; Kline, 2000b), and its stability
over time (or "test-retest reliability") (e.g. French and
Michael, 1968; Kline, 2000b). Bryman and Cramer (1997) distinguish internal
reliability from external reliability, and Armitage and Berry
(1987) discuss the whole area of diagnostic tests and screening procedures
as a reliability issue. It may also be appropriate, depending upon the
particular research setup, to assess scorer reliability. There
are three "off-the-peg" tests of reliability, namely split-half reliability, test-retest reliability, and item consistency,
but there is usually also scope for some projectspecific hypothesis
testing using the full range of inferential statistics. Repeated Measures Design: [See firstly research design.] This is a class of experimental designs in which the same subjects provide scores under two or more conditions, those scores then being compared using one of the available tests for the difference of means. Example (1): In testing
the hypothesis that men are better than women at mathematics, you have no
choice but to test different groups of subjects (because men cannot be women
at the same time). Example (2): In testing the hypothesis that sober men are
better at mathematics than drunk ones, you could (a) test the same subjects
on different occasions (a repeated measures design), or (b) test
different groups (an independent samples design). By and large,
repeated measures designs are more powerful than independent samples designs,
because there is less within between subjects variance. Replicability: This is the ease with which a piece of
research can be precisely repeated using only the original write-up to go by.
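As a computing-flavoured aside (not part of the original handout): in simulation work, replicability begins with documenting the random seed, so that a reader can regenerate exactly the same pseudo-random "data" from the write-up alone. A minimal Python sketch:

```python
import random

# Replicability sketch (illustrative only): fixing and documenting
# the seed lets anyone regenerate exactly the same pseudo-random
# "sample" from the write-up alone.
random.seed(2006)
sample_a = [random.gauss(100, 15) for _ in range(5)]

random.seed(2006)  # re-run with the same documented seed...
sample_b = [random.gauss(100, 15) for _ in range(5)]

print(sample_a == sample_b)  # True
```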
Given the requirements of the principle
of falsification that good science is ultimately all about failing to
find counterexamples of a test proposition, it follows that any author's
research should be as precisely replicable as possible. [See now citing
previous research.] ROC: See
receiver operator characteristics. Role Expectations: Role expectations account for two of the eight types of confounding
identified by Lewin (1977). With "good subject" role expectations,
the risk is that participants try to do what they think they ought to do, whilst in the
"bad subject" role they set out to respond anything but
normally. Rosenthal and Jacobson (1968): [See firstly bias and Pygmalion effect.]
This is the classic study of expectancy bias. Teachers were given
fabricated cover stories leading them to believe that some of their pupils
were likely to be "academic spurters" during the coming year. Such
students, who in fact were chosen at random from the available classroom population,
showed an increase of 12 IQ points during the research year compared to 8
points in their "less gifted" classmates. [Further
discussion.] Sample:
This is the subset of the target population of objects or participants
which is selected for research investigation. There are many practical
procedures available for selecting one's sample, each with its own advantages
and disadvantages. Sampling Bias: [See firstly bias.] This is bias arising from logically
flawed or carelessly executed sampling, resulting in a sample
which does not fairly represent the population in question. Scorer Reliability: This is a measure of how consistently a particular
scorer will score the same raw data on different
occasions. Selection Bias: Same thing as sampling bias. Sensitivity:
[See firstly diagnostic tests and screening procedures.] This is a
mathematically derived index of how good a test is at detecting true
positives, that is to say, of how good that test is at detecting positives
in a population of condition positives. It is calculated by substituting
empirical observations into the formula TP / (TP+FN). High sensitivity is
called for in tests where false negatives are either expensive or
downright dangerous. False negatives in medicine result in missed
opportunities for treatment, and in education they result in delayed or lost
opportunities for personal development. In practice, however, highly
sensitive tests often give high numbers of false positives, so in
isolation they are less than perfect measures. [See now receiver operator
characteristics.] Sign Test: See
binomial sign test. Single Blind:
[See firstly blind.] A single blind study is one in which EITHER
the experimenter(s) OR the participants are naive as to the true purpose of
the research. [Compare double blind.] Single Case Research: This is an experimental variant of the case
study method used within correlational research (Cohen and Manion,
1989). Solomon Four Group Design: This is an experimental design introduced by
Solomon (1949) which uses three control groups to avoid the
possibility that any pretest might itself affect the dependent variable.
The traditional experimental group (Lewin, 1979) is given the
pretest, the main treatment, and the posttest in the normal way, and
the traditional control group is given the pretest and posttest but given
only a control treatment. There is then a second control group which is not
pretested, but does get the main treatment and the posttest, and a third
control group which just gets the posttest. Differences between the
experimental group and the second control group, or between the first and
third control groups, "must be caused by pretesting" (Lewin, 1979,
p107). Spearman, Charles: Charles
Edward Spearman (1863-1945) was the statistician who devised factor
analysis, and who then famously applied this method to the analysis of
the factors of human intelligence. He also devised Spearman's
rank correlation coefficient. Specificity:
[See firstly diagnostic tests and screening procedures.] A test's
specificity is a measure of how good that test is at detecting negatives in a
population of condition negatives. It is calculated by substituting empirical
observations into the formula TN / (TN + FP). High specificity is called for
in tests where false positives are either expensive or downright dangerous.
False positives in medicine result in inappropriate treatment or unnecessary
referral, and in education they result at best in a harder than necessary
student experience and at worst in course failure. Split-Half Reliability: [See firstly reliability]. This is a
statistical procedure by which a random half of the test items is correlated
with the other half. It is often a more useful measure than test-retest reliability, because the
data come from a single test session, so there will be no practice, illness
progression, recovery, mood, or similar effects. Standard Score: [See firstly standard
deviation.] This is a score which somehow indicates its relative position
within a distribution, typically one which expresses "the individual's distance
from the mean in terms of the standard deviation of the distribution"
(Anastasi, 1990, p84). This usually requires that raw scores are
mathematically converted in some way, and that the distribution approximates
to normal. Standardisation: This is the process of calibrating a measure to a given population
so that the future performance of other samples can be norm-referenced.
Stanine:
[Abbreviation of Standard NINE point scale.] This is a measurement system in
which the distribution in question is divided into nine sub-ranges. If
the distribution is normal, then the stanines take up 4%, 7%, 12%, 17%, 20%, 17%,
12%, 7%, and 4% respectively. Stanine #1 is then classified as
"poor", #2 and #3 are grouped together as "below
average", #4, #5, and #6 are "average", #7 and #8 are
"above average", and #9 is "superior". (After Durost,
1968.) Structured Interview: See interview. "Student": See Gosset, William. Student's t-Test: See t-test. Syllogism:
See this entry in our Rational
Argument Glossary. TestRetest Reliability: [See firstly reliability]. This is one of
the two aspects of reliability identified by French and Michael (1968)
[the other being internal consistency].
A "timeassociated reliability", which can readily be quantified
using a correlation coefficient
derived from correlating results from the same test administered twice to the
same subjects, at least three months apart. The square of the correlation
coefficient gives the "degree of agreement". [See now the advantage
of having parallel forms of the
test in question.] Tests for the Difference of Two Means: This is one of the two basic types of tests for
the difference of means (the other being tests for the difference of more
than two means). Theory: A
theory is a body of empirically verified observation, plus a particular
interpretation. It is thus an attempt to make sense of a number of confirmed
hypotheses by drawing them together into a more meaningful whole (Lewin,
1979). Thurstone, Louis: Louis
Leon Thurstone (1887-1955) was a psychometrician who used factor
analysis techniques in the study of intelligence and its internal
structures. True Experiment: [See firstly experimental methods.] This is the
"ideal" form of experiment, one in which the researcher has the
power, money, ethical approval, and ability to manipulate all the necessary independent
variables. True Negative: [See firstly diagnostic tests and screening procedures.] This
is A CORRECT diagnostic judgement that an entity DOES NOT fall within a
target category. [See now negative predictive value.] True Positive: [See firstly diagnostic tests and screening procedures.] This
is A CORRECT diagnostic judgement that an entity DOES fall within a target
category. [See now positive predictive value.] True Score/Value: [See firstly measurement.] This is the (in-fact-unattainable)
ideal of a score which has no measurement error. Two-Tailed Test: [See firstly tests for the difference of two means.] A
non-directional application of one of the two-group inferential statistics,
namely the t-test, the Wilcoxon, or the Mann-Whitney. So
called because the statistical procedure has to deal with separation of the
two distributions down BOTH of the asymptotes (or "tails").
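As an illustration (the figures are invented, and the Python is mine, not the handout's): for a symmetric null distribution, the two-tailed p-value is simply double the one-tailed probability, because extreme outcomes are counted in both tails:

```python
from statistics import NormalDist

# Illustrative sketch (invented values): for a symmetric null
# distribution, the two-tailed p-value counts extreme outcomes
# in BOTH tails, so it is double the one-tailed probability.
z = 1.96  # an observed standardised test statistic
one_tailed = 1 - NormalDist().cdf(z)  # upper tail only
two_tailed = 2 * one_tailed           # both tails
print(round(one_tailed, 3), round(two_tailed, 3))  # 0.025 0.05
```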
[Compare one-tailed test.] Type 1 Error:
[See firstly hypothesis testing.] This is a class of very bad science
in which the null hypothesis is rejected when in fact it is true, thus
causing the research hypothesis to be accepted when in fact it should be
rejected. In simple situations, one way to reduce the likelihood of Type 1
error is to adopt a more stringent significance level (the 0.01 level, say,
rather than the 0.05 level). [Compare Type 2 error, and see also confounding.] Types of Research: Underwood (1966) identifies three basic types,
namely naturalistic observation, the correlational method, and the
experimental method. Cronbach (1957), however, disagrees that there is such a
thing as an observational method, seeing observation as a type of
measurement, not as a basic type of research. He recognises only correlational
psychology and experimental psychology. Unrelated tTest: This is one of the two possible types of ttest (the other
being the related ttest). The test of choice when a research
design delivers two columns of scores from comparison groups (that is to
say, a betweengroups design). [For an etutorial on how to use SPSS to carry
out your unrelated ttests, click here.
Note the suggested format for the final writeup.] Validity:
This is the issue of whether a test is measuring what you think it is
measuring, and thus whether the piece of research in question is valuable
science or not. Unfortunately, there are a large number of ways in which
research can be invalid. To start with, there may be poor conceptualisation (i.e. an iffy
psychological construct), poor build (i.e. iffy items: ambiguous, unclear,
or incomplete), or poor administration (bias). There are many aspects to this
major issue of scientific quality, including construct validity, content
validity, criterion validity, face validity, and predictive
validity. Referring specifically to psychometric tests, rather than
to scientific conclusions in general, French and Michael (1968) consider only
content validity, criterion validity, and construct validity.
Kline (2000b), on the other hand, emphasises face validity, concurrent
validity, predictive validity, content validity, incremental and differential
validity, and construct validity. Variable:
See the entry in Section 1 hereto. 
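Before closing, the confusion-matrix formulas given under Sensitivity and Specificity above lend themselves to a short worked calculation (the counts below are invented for illustration):

```python
# Worked illustration of the formulas under "Sensitivity" and
# "Specificity" above; all counts are invented.
TP, FN = 90, 10  # condition positives: correctly detected vs missed
TN, FP = 80, 20  # condition negatives: correctly cleared vs false alarms

sensitivity = TP / (TP + FN)  # proportion of condition positives detected
specificity = TN / (TN + FP)  # proportion of condition negatives cleared

print(sensitivity, specificity)  # 0.9 0.8
```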
3  References
See
the Master References List
[Home]