Course Handout: Research Methods and Psychometrics
Copyright Notice: This material was
written and published in Wales by Derek J. Smith (Chartered Engineer). It forms
part of a multi-file e-learning resource and, subject only to acknowledging Derek
J. Smith's rights under international copyright law to be identified as author,
may be freely downloaded and printed off in single complete copies solely for
the purposes of private study and/or review. Commercial exploitation rights are
reserved. The remote hyperlinks have been selected for the academic appropriacy
of their contents; they were free of offensive and litigious content when
selected, and are periodically checked to ensure they have remained so. Copyright
© 2010, High Tower Consultants Limited.

First published online 14:00 BST 19th June 2006, Copyright Derek J.
Smith (Chartered Engineer). This version
[HT.1 – transfer of copyright] dated 18:00 14th January 2010
1  Introduction
This glossary is an alphabetically
sorted series of short cross-indexed definitions, cumulatively explaining how
the scientific method in general and research statistics in particular are
typically applied to psychological research. The cross-indexing has been done
in such a way that if the individual entries were to be loaded into a semantic
network they would produce a navigable encyclopaedia on the chosen subject.
There are also half a dozen Key Concept
definitions which are so pivotal to the entire study area that we need to deal
with them right now:
Key Concept – "Causation": One of the
fundamental notions of science is that of "causation",
the idea that some things just happen whilst others are contingent upon prior
events taking place or specified antecedent conditions being reached. We know
the former class of events as "random" or "chaotic", and the
latter as "regular" or "ordered". Science is thus about finding causation where previously we suspected
chaos.
Key Concept – "The Causal Line": What makes the
study of causation really challenging is the fact that when we inspect
processes more closely they often turn out to involve a succession of lesser
cause-effect events. Event A causes State B, which triggers Event C, and so on.
This sequencing of causes and effects is known as a "causal line", and causal lines have been formally
defined as "temporal series of events so related that, given some of them,
something can be inferred about the others whatever may be happening
elsewhere" (Russell, 1948, p459). The
principal scientific skill is accordingly that of unravelling causal lines, and
the payoff is that the resulting predictability gives us some semblance of
control over our world.
Key Concept – Variable: A variable is a
quality or attribute, physical or conceptual, which may exist in two or more
discrete states or at two or more intensities. It is thus a dimension of observation, and observation, as we
shall be explaining in Section 2, is another fundamental building block of the scientific method.
Key Concept – "Discrete" vs
"Continuous" Variables: "Variables are tricky things"
(Coolican, 1990, p12). To start with, they move in different ways, some jerkily
and some smoothly. Variables
which advance in integral steps are known as "discrete" variables. Examples: A runner’s
position in a race can be first, second, third, etc. but not fractional values
thereof. Discrete variables thus have a limit to their arithmetical precision,
that is to say, there is no point in measuring them to more places of decimals
than the steps themselves allow. On the other hand, there are many variables
which have no theoretical limit to their arithmetical precision, and can be
measured to as many places of decimals as you like. These are known as "continuous" variables. Examples:
Time, distance, and mass, or complexes thereof, such as velocity,
acceleration, force, and pressure.
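To make the distinction concrete, here is a minimal sketch in Python (the race figures are invented for illustration):

```python
# Hypothetical race data: finishing position is discrete, finishing time
# is continuous.
positions = [1, 2, 3, 4]                       # integral steps only; "1.5th place" is meaningless
times_seconds = [9.58, 9.63, 9.711, 9.7204]    # precision limited only by the stopwatch

# Discrete: every value sits exactly on an integer step.
print(all(isinstance(p, int) for p in positions))

# Continuous: values may differ by arbitrarily small fractions.
print(times_seconds[1] - times_seconds[0])
```

Note that measuring a runner's position "to three decimal places" would add nothing, whereas a finer stopwatch always adds genuine precision to the times.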
Key Concept – Independent Variable (IV): [Sometimes
"predictor variable".] Some variables also act upon other variables
within causal lines, causing the
latter to vary in turn. These are known as "independent" variables,
and IVs are important because they help us to understand particular causal
lines. Examples: (1) Gender is a two-state discrete
variable capable of differentially
determining an organism’s behaviour through a complex causal line which
includes a number of anatomical, physiological, and psychological variables. (2)
Temperature is a continuous variable which will directly influence many
physical processes.
Key Concept – Dependent Variable (DV): [Sometimes
"criterion variable".] Variables which are acted upon by IVs in a
causal line are known as "dependent" variables [because what they do
"depends on" what the IV does]. In the experimental method,
DVs are the variables which are monitored (i.e. observed and measured), whilst
IVs are those which are manipulated in the hope that by varying a cause you
will become better able to understand its effects.
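As an illustrative sketch (all names and numbers invented), an experiment can be thought of as a mapping from IV levels to observed DV values:

```python
# Sketch of an experiment: the IV (hours of exercise per week) is
# manipulated across groups; the DV (daily calorific intake, kcal) is
# observed. All figures are invented for illustration.
observations = {
    0: [2600, 2550, 2700],     # DV readings for the no-exercise group
    5: [2400, 2350, 2500],
    10: [2100, 2200, 2050],
}

# Summarise the DV at each IV level to look for a causal trend.
for hours, intakes in sorted(observations.items()):
    mean_intake = sum(intakes) / len(intakes)
    print(f"IV = {hours} h/week -> mean DV = {mean_intake:.0f} kcal")
```

The manipulated quantity appears only as the dictionary keys; everything monitored and measured sits in the value lists.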
For more on how variables fit into
the topics of philosophy of science and research design, start with the Section
2 entries for hypothesis testing, inference, and inferential testing, and follow the links from there.
2  The Glossary Entries
Action Research: A research philosophy originally developed by Lewin (1946) in order
to avoid "research that produces nothing but books" (p35), and by
Corey (1949, 1953) to support the cyclical improvement of educational
initiatives [more
history], but now of proven utility in any similar large institution,
including social services [example],
healthcare [example],
and IT [example].
Alternatively, "a deliberate, solution-oriented investigation that is
[characterised] by spiraling cycles of problem identification, systematic
data collection, reflection, analysis, data-driven action taken, and,
finally, problem redefinition" (Beverley, 1993/2004 online).
For further details, see Wilson and Streatfield (2004 online).
[See also participatory action research.] ANOVA: See
analysis of variance. Analysis of Variance (ANOVA): [See firstly variance and tests for the
difference of more than two means.] The ANOVAs are a class of parametric
statistical procedures capable of processing more than two columns of
group-difference data in one row [the "one way" analysis of
variance] or two or more columns of such data in two or more rows [the "two
(or more) way" analysis of variance]. Analysis of Variance, "One Way": [See firstly analysis
of variance.] The simple, or "one way", ANOVA is the
statistical analysis of choice for data which sit naturally in one-row tables
of three or more cells. The statistical procedure computes variance between
cells as well as in total. The statistical algorithm itself need not concern
us, but the statistic it produces, known as an F-value, is a
valuable index of (1) the amount of variance attributable to
differences between the cells, and (2) random, or "residual",
variance. Significance tables may then be used to convert the
F-value, given its degrees of freedom, into an equivalent p-value.
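In practice the computation is delegated to statistical software. A minimal sketch using SciPy's f_oneway routine (the calorie figures are invented) might look like this:

```python
# One-way ANOVA on three exercise groups (invented kcal/day figures).
from scipy.stats import f_oneway

none_grp     = [2600, 2550, 2700, 2480, 2620]   # no exercise
moderate_grp = [2400, 2350, 2500, 2300, 2450]
heavy_grp    = [2100, 2200, 2050, 2150, 2250]

f_value, p_value = f_oneway(none_grp, moderate_grp, heavy_grp)
print(f"F = {f_value:.2f}, p = {p_value:.5f}")

# A small p-value means between-cell variance is large relative to
# residual variance.
if p_value < 0.05:
    print("significant at the 5% level")
```

The library handles both the F-value and its conversion to a p-value, so significance tables need not be consulted by hand.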
Example: In testing the general hypothesis that exercise
depresses the appetite, we might record the daily calorific intake of five
groups, graded by the amount of exercise taken. If we then arranged for each
group to contain around a dozen subjects, we could tabulate their calorific
intakes (in kilocalories) into five horizontally aligned cells, numbered 1 to
5. Analysis of Variance, "Two (or More) Way": [See firstly analysis
of variance.] Two (or more) way ANOVAs are capable of simultaneously
coping with two (or more) independent variables. The popular
shorthand description for such procedures reflects the number of conditions
on each variable as an integer. Thus a 2x2 (pronounced
"twobytwo") ANOVA has two independent variables, each of whose
effects on the dependent variable are sampled under two conditions. Observations of
the dependent variable are accumulated under the appropriate cell heading, and
the analysis carried out once enough observations have been made. A 2x2x3
(pronounced "two-by-two-by-three") ANOVA has three independent
variables, two sampled under two conditions, and the third sampled under
three conditions. Example: In testing the general
hypothesis that exercise depresses the appetite in women but increases it in
men, we might record the daily calorific intake of five groups of men and
five groups of women, both graded by the amount of exercise taken. If we then
arranged for each group to contain around a dozen subjects, we could
tabulate their calorific intakes (in kilocalories) into five horizontally
aligned cells for the men, themselves vertically aligned over five further
horizontally aligned cells for the women. Since the resulting table would
then be five cells wide by two cells deep, it would be a 5x2 (pronounced
"fivebytwo") ANOVA. Artefact: Same
thing as artifact. Artifact:
[Optionally artefact.] Generally, "a thing made by art, an
artificial product" (OED), and thus, in the present context, a research
conclusion which turns out upon critical methodological examination to have
arisen thanks to a bias or confound of some sort, rather than
because of the causal relationship under test. A serious source of
research error. Lewin (1977) gives the following advice on handling
artifacts: "Highly contrived laboratory situations magnify artifacts.
The researcher should ask, 'How else can I study this topic?' [and one good
way] to reduce confounding variables is by imaginative and creative research
design" (p110). Another way is to increase "experimental realism".
Moreover, specific types of artifacts go with specific types of research, so
it is down to the analytical skills of the author(s) concerned to spot them
in advance (i.e. before your peer-reviewer does it for you, to your
cost). Pretesting may itself be the cause of an artifact, and can be
better managed by adopting the Solomon four group design. Attention Bias: This is a type of bias in which the act of observation
itself becomes an important (if not the most important) independent
variable. The Hawthorne effect is a good example of what can then
happen. Awareness of the Hypothesis: This is one of the eight types of confounding
identified by Lewin (1977). It reflects the possibility that a research
participant's understanding of what a given piece of research is for might,
consciously or unconsciously, influence the behaviour being measured. Between-Subjects Design: See independent groups design. Between-Subjects Variance: [See firstly variance.] This is variance
arising from chance or bias in the groups participating in an independent
groups design. Bias: [See
firstly measurement error.] In
everyday usage, a bias is a deviation from an intended path (OED). In
scientific research, it is the systematic
deviation of measurements from their true value (as opposed to random deviation, which is due to statistical "noise").
Unfortunately, bias can arise for a large number of reasons, and originate at
all stages of the research cycle. It therefore has no single definition,
method of detection, or standard remedy. The following subtypes of bias are
therefore dealt with separately: attention bias, centripetal bias,
confounding, cultural bias, demand characteristics, expectancy
bias, measurement bias, recall bias, sampling bias, volunteer
bias, and withdrawal bias. [For additional discussion, see Palmer
(1996/2004
online).] Binomial
Sign Test: This is a single sample inferential statistic
for nonparametric data at the nominal or ordinal level.
By "single sample" is meant any design which is attempting to judge
whether a sample population is, or is not, drawn from a reference population,
whose distribution is already a matter of record and thus need not be resampled.
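The single-sample logic can be sketched with an exact binomial test; the following uses SciPy's binomtest, with an invented reference proportion and sample:

```python
# Single-sample binomial test sketch (invented data): the reference
# population is on record as answering "yes" 40% of the time; does our
# sample depart from that?
from scipy.stats import binomtest

n_sampled = 50
n_yes = 31                          # observed "yes" answers in the sample

result = binomtest(n_yes, n=n_sampled, p=0.40)
print(f"p = {result.pvalue:.4f}")   # a small p-value suggests the sample was
                                    # not drawn from the reference population
```

Here the reference distribution (40% "yes") is already a matter of record, so only the sample itself needs collecting.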
If the sample population is representative of the reference
population, then the sample distribution and the reference distribution will,
save for sampling noise, be the same. The test can also be used as a
two-sample inferential statistic for nonparametric data in a related
design. Blind: This is the technical
term for the giving or taking of measurements without knowledge of the true
purpose of the research, and possibly under the deliberate influence of a cover
story. [See now double blind and single blind.] Bonferroni's
Correction (for Multiple Comparisons): [See firstly confidence levels and Type 1 error.] The
Bonferroni correction is an adjustment to the confidence level required when
a single scientific hypothesis is being investigated using multiple inferential
statistics. The risk on such occasions lies in the fact that the p-value,
the probability of a Type 1 error, increases with every new
statistical procedure, in much the same way that the odds of throwing a 6 go
up when you are allowed to keep throwing your die. The solution proposed by
the Italian statistician Carlo Emilio Bonferroni (1892-1960) was to divide the
required significance level by the number of comparisons being made, giving a single adjusted p-value. This was done by a statistical
algorithm, and a simple online procedure at the SISA website will nowadays do
this calculation for you [click here to use the algorithm and here to see the user instructions]. Box-and-Whisker
Plot: The box-and-whisker plot (or "boxplot" for short) is an
easily drawn graphical aid, designed to display both the central tendency and the dispersion
of a given distribution. The
graphic is produced by inspecting said distribution, and identifying five
values. The first two are the lower
extreme and the upper extreme,
and the difference between these values gives us the range of the distribution. The next important value is the median. The median is important
because it locates the midpoint of the distribution on a linear scale for the
variable in question. It divides the range into upper and lower halves, and
the two half ranges are then further subdivided by locating the two quartiles, the lower quartile and the upper
quartile. A rectangle is now drawn above the scale, such that it begins
at the lower quartile, ends at the upper quartile, and is vertically divided into
two at the median. This is the "box" element of the graphic. The
"whiskers" are now added to the box by adding horizontal lines from
the lower and upper limits of the box to the lower and upper ends of the box.
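The five values behind the graphic can be computed directly; a minimal sketch using Python's standard statistics module (scores invented):

```python
# Five-number summary underlying a box-and-whisker plot (invented scores).
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 12, 13, 15]

lower_extreme, upper_extreme = min(scores), max(scores)
median = statistics.median(scores)
q1, _, q3 = statistics.quantiles(scores, n=4)   # lower and upper quartiles

print(f"whiskers: {lower_extreme} .. {upper_extreme} (range {upper_extreme - lower_extreme})")
print(f"box     : Q1 = {q1}, median = {median}, Q3 = {q3}")
```

The "box" runs from Q1 to Q3 and is divided at the median; the "whiskers" run from the box out to the two extremes.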
Box Plot: See box-and-whisker
plot. Briefing: [See firstly ethics
and deception.] This is the stage in the standard research procedure
at which participants are told what the research is about (either truthfully,
or – ethical deception having been approved – as part of a cover story),
and given the opportunity to withdraw their consent to take part, as required
by the codes of practice on ethical research laid down by the various
institutions involved. In all research philosophies, it is wise to regard the
briefing as a potential source of demand
characteristics, and in the experimental philosophy it may also need to
be regarded as a treatment as
well, even if there is no element of deception involved. [See now debriefing.]
Burt, Sir Cyril
(1883-1971): [Selected Internet biography] British intelligence theorist, initially
acclaimed for his contribution toward the g-factor theory of
intelligence (e.g. Burt, 1917). Burt’s academic reputation suffered after
Hearnshaw (1979) exposed a number of inconsistencies in his data handling,
and he was subsequently adjudged by the British Psychological Society to have
falsified his results. More recent papers have defended Burt, but the
official ruling remains in effect nonetheless. Bystander Apathy: This is the name
given to an unwillingness to get involved on the part of persons close to an
apparent ongoing emergency. It is ignoring one's duty in favour of a
"quiet life". It is "bad Samaritanism". This phenomenon
was investigated by a classic social psychological study, Piliavin, Rodin,
and Piliavin (1969). Causal Line: See the entry for
this topic in Section 1 above. Cause and Effect: See the entry for this topic in the companion Rational
Argument Glossary. Central Tendency: [See firstly distribution.] This is
a measure of where the centre of a given distribution lies with reference to
its lower extreme and upper extreme. Among the graphical displays
of central tendency we have the box-and-whisker plot, and among the
computed measures we have the mean, the median, and the mode.
[Compare Dispersion.] Centrifugal Bias: This is a type of bias in which the research centre itself –
say a failing hospital or a non-prestige university – is avoided by
individuals who can get in somewhere better, thus rendering a sample
of those who are left subtly unrepresentative of the population at
large. [Compare centripetal bias.] Centripetal Bias: This is a type of bias in which the research centre itself –
say a specialist hospital or a prestige university – attracts individuals
with particular strengths and attributes, thus rendering a sample
thereof subtly unrepresentative of the population at large. [Compare centrifugal
bias.] Chi-Squared Test: This is the most common method of statistical analysis for
frequency data. Given an array of actual cell frequencies, the
statistical procedure computes a null
hypothesis expected distribution, and then tests whether the
actual-expected difference is big enough to have occurred by chance. [Full
tutorial] Citing Previous Research: This is making due reference to the literature when deriving and
justifying one's research argument.
[Now see criticising previous research.] Clever Hans:
This is a classic example of a procedural confounding bias, in which a
circus horse – Hans – would answer simple arithmetic questions by tapping so many
times with his hoof. Upon closer inspection, however, it turned out that Hans
was not numerate at all  merely sensitive to his trainer's body language.
The secret was that Hans had learned to start tapping when given one type of
behavioural cue, and would stop when given another [full details]. Research which does not thoroughly avoid
confounds of this sort at the early planning stage is likely to be deeply
flawed. Clinical Effectiveness: This is the general concept of value for money –
i.e. demonstrable benefit – in clinical treatment of any kind. Specifically,
a series of initiatives during the early 1990s to maximise value for money in
the British NHS by raising consciousness of costofoutcome amongst clinical
professionals, and which therefore inspired (and still inspires) a large
number of efficacy studies to prove things one way or the other. The
search for maximum clinical effectiveness in the UK is overseen by the National Institute for
Clinical Excellence (NICE). Clinical Judgement: This is one of the two
basic types of clinical assessment (the other being the use of formally
standardised psychometric tests). To reach a clinical judgement
requires a combination of observation, ad hoc diagnostic tests,
and prior professional experience. Cluster Analysis: This is one of the four recognised types of multivariate method
(the others being principal components
analysis, factor analysis, and
discriminant analysis). Cluster Sampling: This is one of the standard optional methods of sampling. The
method relies on sampling selected clusters of potential subjects within the
target population, rather than the population
as a whole (thus saving time and expense, but at the risk of introducing some
kind of sampling bias). Coefficient Alpha: Same as Cronbach’s
alpha. Cohort Study:
This is a type of longitudinal study in which the group(s) being
studied are monitored over a suitable period of time. Concurrent Validity: [See firstly validity.] Data collection
instruments such as questionnaires, test batteries, or psychometric
tests may be said to have concurrent validity to the extent to which
their findings correlate with other tests – criterion tests – of the
same construct. Concurrent validity can be assessed by an appropriate
statistical technique, and expressed as a correlation coefficient. Unfortunately, since suitable criterion
tests are in fact surprisingly rare, this sort of validity can actually be
difficult to quantify. Indeed, Kline (2000b) warns that "almost the only
field where accepted tests exist such that high correlations with them
indicate validity is intelligence […] In most other fields confusion
reigns" (p20). Kline also warns that the reliability of the selected criterion test also needs to be taken
into account, because if you select a test with low reliability to validate a
new test against, then it may be the validating test which is misbehaving,
not the new one. Confidence Level: [See firstly hypothesis testing and inferential statistics.]
This is an expression of the probability of accepting a hypothesis
without committing a Type 1 error, conventionally expressed by a
p-value. The usual confidence boundaries in psychological research are
"not significant" (a p-value greater than 5%),
"significant" (a p-value between 1% and 5%), "very
significant" (a p-value between 0.1% and 1%), and "highly
significant" (a p-value less than 0.1%). The conventional shorthand for
expressing these levels of significance in social science research is to add
the code "p > 0.05", "p < 0.05", "p <
0.01", and "p < 0.001", respectively. Confound:
In the current context, "to confound" is to fail to detect a confounding
bias prior to carrying out a piece of research, with the end result that cause-and-effect
interpretation of the results becomes unsafe. "A confound" is the confounding
variable doing the damage. Confounding Bias: This is a type of bias in which one or more initially
unrecognised confounding variables turn out to have affected the
obtained results, thus rendering cause-and-effect interpretation
unsafe. Lewin (1977) identifies the following eight major sources of
confounding: awareness of the hypothesis, demand characteristics,
enlightenment effects, evaluation apprehension, experimenter
expectancy, reactance, and role expectations (two types). Confounding Variable: [See firstly variable.] This is an independent
variable NOT formally designed into a piece of research, and which, by
not being controlled, is likely to pervert the course of hypothesis
testing, perhaps by encouraging a Type 1 error. Consent: See
informed consent. Consistency:
See internal consistency. Construct:
See hypothetical construct. Construct Validity: [See firstly hypothetical
construct and validity.] Data collection instruments such as questionnaires,
test batteries, or psychometric tests may be said to have
construct validity to the extent to which they are based upon wellaccepted
psychological constructs. This is important because many psychological
constructs – e.g. telepathy – are not universally accepted. Anastasi’s (1988)
examples of established theoretical constructs include scholastic aptitude,
comprehension, verbal fluency, neuroticism, and anxiety. The notion of
construct validity derives initially from Cronbach and Meehl (1955/2005 online),
who warned that specific high correlations "may constitute either
favourable or unfavourable evidence […] depending on the theory surrounding
the construct". Example: A mindreading test which
was in other respects reliable and valid would have dubious construct
validity because the construct of mind reading was itself less than
universally accepted. Assessing: Cronbach and Meehl
further argue that the ideal assessment of construct validity would be to
have some form of "construct validity coefficient", "a
statement of the proportion of the test score variance that is attributable
to the construct variable" (p7). Unfortunately, while this is
conceptually straightforward enough (you simply have to decide whether the
suggested psychological construct actually exists, or is something else under
a new name), it is difficult to do in practice. In fact, Kline (2000a) sees
little alternative to having to put together a package of hypothesis testing
supplementary to the headline hypothesis, and that will seriously complicate
the research design and dramatically lengthen the validation process. Early
planning is therefore called for. Content Analysis: This is a method of obtaining quantitative scores for various
variables within (usually written) language. [For further details, see
the corresponding entry in our Psycholinguistics
Glossary.] Content Validity: [See firstly validity.] Data collection instruments such as questionnaires,
test batteries, or psychometric tests may be said to have
content validity to the extent that they sample "the class of situations
or subject matter about which conclusions are to be drawn" (French and
Michael, 1968, p164). Example: A mathematics test
which contained only spelling questions, or one lacking a section covering
division, would have impaired content validity. Solution: Careful
planning and analysis of the literature, followed by more detailed
hypothesising and/or perhaps a multifactorial design with a view to
quantifying the true spread of construct complexity (e.g. reading skill,
driving skill, etc.). Another technique might be to resort to field experts
to examine the proposed test content and to quantify and report some measure
of their approval. Continuous Variable: [See firstly variable and the Section 1 entry for "discrete" vs "continuous"
variables.] This is one of the two subclasses of interval/ratio
data (the other being discrete variable). It follows that measurements
of continuous variables are always approximations, and thus have an element
of measurement error irretrievably built in. [Compare discrete
variable.] Control Group: [See firstly group.] This is a subset of a research sample
selected NOT to receive a particular treatment, thus providing a helpful
baseline or comparison measure for the dependent variable under investigation.
Alternatively, it is the "point of comparison with the group of subjects
who receive the experimental manipulation" (Bryman and Cramer, 1997,
p5). Alternatively, "the function of a control group is to provide
an observation that cannot be attributed to the variable being
manipulated" (Sarbin and Coe, 1975, p11). Historically, one of the first
recorded controlled trials was James Lind's discovery in 1753 that eating
citrus fruits could cure the condition known as "scurvy" in the mariners
of that time. The online James
Lind Library details this and a number of other pioneer uses of control
groups. Correlation:
To "correlate", of variables, is to vary in the same
direction and proportion at the same time, possibly as the result of a cause-and-effect
relationship but perhaps coincidentally. The detection of correlations
is an important practical aspect of establishing a causal line, and thus the fundamental principle of the correlational philosophy of science. Correlation Coefficient: A "correlation coefficient" is a
mathematical index produced by one of the many correlational statistical techniques (such as the Pearson
product moment correlation or the Spearman rank correlation) and
indicating the extent of the relationship between two potentially related
sets of measures, and therefore, to the extent that they have been properly operationalised, of the proposed
underlying variables. The coefficient ranges from -1 (a perfect negative
correlation) to +1 (a perfect positive correlation). A strong
positive coefficient (usually accepted as 0.7 or above) indicates that one
variable typically increases as the other increases, whilst a strong negative
coefficient (usually accepted as -0.7 or below) indicates that one variable
typically decreases as the other increases. A coefficient of zero indicates
no relationship at all. Correlational Method: See correlational philosophy of science. Correlational Philosophy of Science: [Alternatively "correlational method" or
"correlational psychology".] This is one of the two alternative
approaches to quantitative research (the other being experimental
psychology), as proposed by Cronbach (1957). The method effectively plots
naturally occurring observations of one variable against another, searching
for the correlations "presented by nature" (Cronbach, 1957,
p10), that is to say, for "already existing variation" rather than
for that introduced by the experimental manipulation of an independent variable. The value of
this approach stems from its ability to supplement the experimental approach
in areas which "man has not learned to control or can never hope to
control." The problem with correlations, however, is that they are not
necessarily causal. Indeed, if we do not know the precise causal line,
it is easy for regular co-occurrence to be misinterpreted. Errors of this
sort are known as the "cum hoc fallacy" [Rational
Argument Glossary]. Cronbach points out with some justification that
correlational psychologists search out variables the experimentalists prefer
to ignore. Correlational Psychology: See correlational philosophy of science. Correlational
Statistical Techniques: Mathematical
algorithms such as the Pearson
product moment correlation or the Spearman rank correlation, intended to produce correlation
coefficients. Cost-Effectiveness: This is what health service and education managers
have to consider when financing intervention projects. It is a matter
of a project's effectiveness relative to its cost, the point
being that many highly effective treatments are nonetheless insupportable
financially. It is relevant here because studies of cost-effectiveness are
commonplace in healthcare, clinical psychology, health psychology, and
educational psychology. Counterbalancing: This is an aspect of research design intended to minimise order
effects in experimental manipulations. Participants are exposed to the
required experimental conditions in different sequences, so that overall the effects of practice or
fatigue are presumed to cancel out. Cover Story:
[See firstly briefing and deception.] This is a false statement
as to the purpose of a given study. It must have been approved by the ethics panel concerned, and must be
covered in the debriefing session. Criterion-Referenced: This is one of the two basic philosophies of
behavioural or psychological assessment (the other being norm-referenced),
specifically, one in which the criteria of "goodness-badness" at
the test are publicly recorded in advance as a set of specific and
objectively assessable behavioural indicators. Example:
One of the most accessible examples of a criterion-referenced assessment is
the (UK) driving test, where you pass when you are judged good enough against
a tick-list of demonstrable abilities. Criterion Test: See concurrent validity. Criterion Validity: [See firstly validity.]
This is the extent to which a test correlates "with one or more external
variables considered to provide a
direct measure of the characteristic or behaviour in question" (French
and Michael, 1968, p167). In most respects, the same as predictive validity
[for a discussion of the exceptions, see Anastasi, 1990, Chapter 6]. Criticising Previous Research: [See firstly citing
previous research.] Cross-Validation: [See firstly validation.] This is the independent
determination of the validity of a test, using "a different sample of
persons from that on which the items were selected" (Anastasi, 1990,
p226). Debriefing:
[See firstly ethics.] This is an important aspect of ethicality in
research, and part of the briefing-debriefing aspect of research procedure.
In its simplest form it is a short recapitulation of what subjects have done
and why they have done it. Debriefing is especially important where the
research involved any intentional deception. Deception:
This is the deliberate concealment of the true purpose of a piece of
research, often assisted by a cover story delivered at the briefing.
Studies which involve deception must always expect to be challenged by the ethics
committee involved, and therefore demand deep initial reflection and
analysis. As far as undergraduate research is concerned, deceptions are only ethical if
the deception is necessary to avoid demand characteristics, reduce experimenter
effects, or otherwise control confounding. Degrees of Freedom (df): This is "the number
of components [of a statistic] which are free to vary" (Bryman and
Cramer, 1997, p122). Despite being a mathematically complex concept, degrees
of freedom are usually simple to determine and use. For example, the degrees
of freedom for a variable sampled at n
different intensities is simply (n - 1). Demand Characteristics: [See firstly bias.] This term was coined by Orne
(1962), and is one of the eight types of confounding identified by
Lewin (1977). It reflects the possibility (nay certainty) that subtle
environmental factors will interact with the motivational state of human
subjects during the research experience to render the observed behaviour
non-natural in some important respect, the demonstrable fact being that "the
setting may well evoke other behaviour you did not intend to evoke"
(Lewin, 1977, p103). The point is that the confounding variable is
provided by the experimental setup itself, which may include the behaviour
or appearance of the experimenter(s) personally. Dependent Variable (DV): See the entry in Section 1 hereto. Descriptive Statistics: [See firstly statistics.] The phrase
"descriptive statistics" refers to a portfolio of mathematical procedures
designed to present research data in summary form without being part of
hypothesis testing. The most common descriptive statistics are mean [
= average], median, mode, range, and standard
deviation, and the most common graphical displays are the bar chart,
the box-and-whisker chart, the histogram, and the pie chart.
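These summary measures can be sketched with Python's standard-library statistics module (the scores below are invented purely for illustration):

```python
# Illustration of the common descriptive statistics, using Python's
# standard-library statistics module. The scores are invented data.
import statistics

scores = [43, 51, 58, 58, 62, 67, 71, 76, 84, 93]

print(statistics.mean(scores))      # mean (= average): 66.3
print(statistics.median(scores))    # median (midway score): 64.5
print(statistics.mode(scores))      # mode (most common score): 58
print(max(scores) - min(scores))    # range (highest minus lowest): 50
print(statistics.stdev(scores))     # sample standard deviation: approx. 15.19
```

Note that none of these figures tests a hypothesis; they merely summarise the distribution.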
[Compare inferential statistics.] Design: See
research design. Developmental Delay: The phrase
"developmental delay" refers to the failure of a developing
organism to reach/achieve/display some physical, cognitive, or behavioural developmental
norm at the expected chronological age. Developmental Norm: [See firstly norm.] This is an age-related
expectation of mental or physical ability informed by past experience or
research with the population in question, and therefore vitally
important in the detection of developmental delay, and therefore part of the standardisation
exercise prior to the marketing of major psychometric test
packages. Diagnostic Tests and Screening Procedures: These are measurements and measurement packages
designed to assist during the assessment phase of patient management. The
ability of a given test to detect someone who needs to be detected is known
as its sensitivity. The ability to exclude people who need to be
excluded is known as its specificity. The positive predictive value
of a test is a measure of how many of those who have been detected as
positive actually are positive, and its negative predictive value is a
measure of how many of those who have been detected as negative actually are
negative. Clinicians need to be aware of all four of these factors, and
recognise that the qualities are to a large extent mutually exclusive. That
is to say, a good test of one condition might be a bad test of something
else. [There is actually a good mathematical reason for this, as summarised in the entry for the ROC curve.] Difference Testing: See testing for the difference of two means
and testing for the difference of more than two means. Differential Validity: See incremental and
differential validity. Discrete Variable: [See firstly variable
and the Section 1 entry for "discrete" vs "continuous"
variables.] Discrete variables are one of the two subclasses of interval/ratio
data (the other being continuous variable). Discriminant Analysis: This is one of the four types of multivariate
method. Discriminatory Power: This is an important
aspect of undertaking item analysis during the development of a
psychometric test. Dispersion: This is a measure of how tightly clustered a
distribution is around its mean. [Compare central tendency.] Double Blind:
[See firstly blind.] A double blind study is one in which BOTH
experimenters and participants are naive as to the true purpose of the
research. It might be necessary to organise things this way if experimenter
effects or other factors might bias the results. [Compare single blind.] DV: See
dependent variable. EBP: See
evidencebased practice. Effectiveness: This is a measure of the likely actual benefit arising from a given
remediation programme (that is to say, under average conditions of use)
(compare efficacy). (After Hayward, Jadad, McKibbon, and Marks, 1998.) Efficacy: This
is a measure of the theoretically maximum benefit arising from a given
remediation programme (that is to say, under ideal conditions of use)
(compare effectiveness). (After Hayward, Jadad, McKibbon, and Marks,
1998.) Efficacy Study: [See firstly efficacy.] Eigenvalue:
[See firstly principal components analysis.] Empirical Data: These are data obtained by actual observation rather than by
conjecture; data from the evidence of the senses. Enlightenment Effects: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that prior exposure to the study
area might influence performance under test. Error: See measurement error. Ethics:
This is the code of practice imposed upon researchers by their professional
body and/or employer. See the British Psychological Society Code of Conduct. Ethics Committee: This is a formally constituted panel to which research proposals need to be
submitted for approval on ethical grounds (and hence a major defence against
legal action should the case arise). Evaluation Apprehension: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that naturally apprehensive or
secretive personalities will not be performing normally on the behaviour
under test. Lewin suggests, amongst other things, that experimenters need to
watch out for comments such as "I better watch what I say in front of
you" (op. cit., p105). Evidence-Based Practice (EBP): Evidence-based practice is properly informed
professional decision making. It is "the conscientious, explicit, and
judicious use of current best evidence in making decisions about the care of
individual patients" (Sackett et al, 1996). "It is a systematic
approach to integrating current scientific evidence" (source) [alternative
definitions]. EBP is, however, only as good as the available evidence, and
that is usually less than conclusive. The philosophy therefore requires that
practitioners are sensitive to levels of evidence. Moreover, even
where the evidence base is sound, it is being constantly extended (hourly,
indeed, in the fastest moving branches of science). [See the story of James
Lind in the entry for control group.] Experiment: See
true experiment. Experimental Methods: [See firstly research types and designs.] This
is a class of research design intended to approximate to the ideal of the
true experiment, and therefore characterised by structured observation of the
effects of one or more deliberately manipulated independent variables
on a single dependent variable, while the effects of (ideally all)
other possible causation is tightly controlled. [Now see the separate
entries for field experiment, natural experiment, quasi-experiment,
true experiment.] Experimental Psychology: This is one of the two basic types of scientific
psychology identified by Cronbach (1957) (the other being correlational
psychology). The method is based upon the scientist changing this or that
condition "in order to observe their consequences" (p10). The
experimenter is thus "interested only in the variation he himself
creates", unlike the correlator, who is interested in the variation which
is already there. Experimenter Bias Effects: Same thing as experimenter effects. Experimenter Effects: The phrase "experimenter effects" refers
to the ability of experimenters, through carelessness and lack of attention to detail, to bias their research, for
example, by failing to prevent demand characteristics, the Hawthorne
effect, etc. Experimenter Expectancy: This is one of the eight types of confounding
identified by Lewin (1977). The possibility that experimenters themselves can
subtly influence their participants' behaviour. [See Pygmalion effect.] Ex Post Facto Research:
This is one of the recognised subtypes of the experimental method. External Reliability: [See firstly reliability.] This is one of
the two forms of reliability (the other being internal reliability).
"The degree of consistency of a measure over time" (Bryman
and Cramer, 1997, p63). Face Validity: [See firstly validity.] A test may be said to have face
validity if, upon simple inspection, it appears to the subject to
measure "what it claims to measure" (Kline, 2000b, p18). Kline
warns that this can sometimes be a good thing (it may motivate subjects to
perform well), and sometimes a bad thing (the target measure may be so
obvious as to promote deliberate misperformance). Factor Analysis: This is one of the two main factor analytical methods (the
other being principal components analysis). The method requires the
accumulation of scores on a number of simultaneous variables for each subject, followed by the computation of multiple correlations. Factor Analytical Methods: These are one of the most powerful correlational
methods of scientific research, and the method of choice when
investigating multiple causation. There are two specific statistical
procedures under this heading, namely factor analysis proper, and principal
components analysis. Enthusiasts for factor analytical methods are
quick to point out that science cannot advance by hypothesis testing alone. Factor Loading: See factor analytical methods in general and loadings
in particular. False Negative: [See firstly diagnostic tests and screening procedures.] This
is AN INCORRECT diagnostic judgement that an entity DOES NOT fall within a
target category. [See now negative predictive value.] False Positive: [See firstly diagnostic tests and screening procedures.] This
is AN INCORRECT diagnostic judgement that an entity DOES fall within a target
category. [See now positive predictive value.] Falsification: See principle of falsification. Fatigue Effect: This is a class of confounding which might be encountered with a
prolonged or physically demanding research procedure, and in which
performance on the later items will be tailing off. Fatigue effects may be
controlled for to a certain extent by going for a more sophisticated design,
perhaps with counterbalancing of trials. Ferguson's Delta: [See firstly discriminatory power.]
This is an index of discriminatory power devised by Ferguson (1949). Fisher, Sir Ronald: Sir Ronald Aylmer Fisher (1890-1962) was the statistician who devised the
logic of the null hypothesis during hypothesis testing. His
book "Statistical Methods for Research Workers" (Fisher, 1925/2004 online) has
been described as "probably the most influential book on statistics of
the 20th century" (source). Frequency Data: This is a subtype of nominal data. Gaussian Distribution: See normal curve. Gosset, William: William
Sealy Gosset (1876-1937) was the Guinness brewery quality assurance chemist who, under the pseudonym "Student", popularised Student's t-test
(Student, 1908) as a practical method of comparing the strength and
composition of small samples [fuller
story]. Hawthorne Effect: [See firstly bias and confounding.] This is the name
given to the phenomenon whereby the mere act of observing a behaviour can
change it. The effect was first formally documented by Mayo (1933, 1945),
following field research at the Western Electric Hawthorne Works, Chicago,
between 1927 and 1932, in which the main driver of plant productivity turned
out to be the presence of the researchers, rather than anything to do with
the working conditions [fuller
story]. The Hawthorne effect is an excellent example of attention bias
in action. Homoscedasticity: See homogeneity of variance. Hypothesis:
"A hypothesis states the relationship between two (or more) variables
[and] takes a form such as 'if variable A is high, then variable B will be
low'" (Lewin, 1979, p37). Hypothesis Testing: [See firstly hypotheses.] This is the act of
putting one's theoretical beliefs to objective and peer-replicable test. Hypothesis
testing will normally be supported by inferential statistics.
Mathematically, there are a number of ways to go about this, but the most
popular method in the social sciences was devised by Fisher, Sir Ronald,
and is so structured as to involve an attempt to disprove the null
hypothesis and simultaneously to provide some estimate of confidence
level in the form of a p-value. Hypothesis testing is also the backbone of the hypothetico-deductive method, on which nothing less
than the scientific method itself is based [not everyone agrees
totally with this – see Cattell's (1952) comments in the entry for factor analytical methods]. Hypothetical Construct: A hypothetical construct (or "construct", for short) is a presumed internal quality of a system, beyond direct observation, whose presumed operation accords with available empirical data. Alternatively, "a construct is some postulated attribute of people, assumed to be reflected in test performance" (Cronbach and
Meehl, 1955/2005
online). Constructs are therefore part of a theory, and may, in
turn, map onto one or more variables,
each of which may be operationalised
as observable measures in a number of different ways. Examples:
stress, memory. [See now construct
validity.] Impression: In the context of this glossary, an "impression" is a statement of best
clinical judgment, an attempt at medical diagnosis which allows for an
element of residual uncertainty. It reflects how a patient "looks"
(or, more formally, "presents"), rather than "what they have
got". [Further
details] Independent t-Test: Same as unrelated t-test. Independent Variable (IV): See the entry in Section 1 hereto. Inductive Reasoning: See the entry for this topic in our Rational
Argument Glossary. Inference:
See the entry for this topic in our Rational
Argument Glossary. Inferential Statistics: [See firstly statistics.] This is a class of
mathematical procedures designed to establish the likelihood of a causal
relationship existing between blocks of empirical observations. Two major
subclasses of inferential statistic are recognised, namely correlational
methods and tests for the difference of group means.
[Compare descriptive statistics.] Interaction:
See analysis of variance. Internal Consistency: [See firstly reliability and validity.]
This is one of the two considerations of a test's reliability (the other
being test-retest reliability). A form of validation of a multiple-item test, characterised by the fact that "the criterion is none other
than the total score on the test itself" (Anastasi, 1990, p55). Internal Reliability: [See firstly reliability.] This is one of
the two forms of reliability (the other being external reliability).
The extent to which a scale "is measuring a single idea" (Bryman
and Cramer, 1997, p63). Interval Data: [See firstly interval.] This is a collection of observations
of an interval-based variable. Intervention: This is the act of remediation itself, that is to say, the treatment
which is actually delivered to the person in need (the alternative being to
do nothing and let nature take its course). Intervention Study: This is the research programme needed to establish
the efficacy and/or effectiveness of a remediation programme.
The simplest research design is to divide a group of sufferers into matched
Experimental (E) and Control (C) groups. The E-Group then receives the remediation for a given period of time whilst the C-Group receives an equally complex but medically/educationally neutral dummy treatment (itself a major ethical problem). Improvements over time (if any) are measured, and - if bias and confounding have been properly controlled (a massive design problem), and if the measuring instruments are valid and reliable (another massive problem) - any changes can only have resulted from the treatment (although it still might not prove cost-effective). (See main text for examples.) IV: See
independent variable. Kendall Coefficient of Agreement (u): This is a test to detect an underlying logical
pattern in a series of repeated paired comparisons, such as might be obtained
if a group of judges was only ever presented with two items at a time out of
a sample, rather than ranking the entire sample. Kendall Coefficient of Concordance (W): This is a test for the correlation of more
than two variables, for ordinal data. Kuder-Richardson Reliability Coefficient: [See firstly reliability in general and consistency in particular.] This is a measure of inter-item consistency devised by Kuder
and Richardson (1937). Longitudinal Study: This is an intervention study where the
effects are reassessed not just at the end of the initial intervention period
but at intervals over many years. Longitudinal studies are therefore the
preferred method of evaluating educational initiatives and the like, where
the final results need to work their way in real time through the normal
developmental lifecycle. Cohen and Manion (1989, p71) distinguish four
subtypes of longitudinal study, namely the cohort study, the cross-sectional
study, the ethogenic study, and the trend study. MANOVA: See
analysis of variance. Measurement:
"Measurement is the numerical estimation of the ratio of a magnitude of
an attribute to a unit of the same attribute" (Michell, 1997, p383).
Alternatively, it is a quality or quantity arrived at by observation of an operationalised measure of a variable
in known circumstances. Measurement Error: "Discrepancies between the observed value of your
measurement and the 'true' value" (Fife-Schaw, 1995, p45). Median:
This is the score midway between the lowest and highest score in a distribution. Mode: This
is the most common score (or band of scores) in a distribution. Mortality:
In the context of research, this is a type of bias in which subjects
do not complete the study period. This might be literal mortality (as in
medical research, for example) or figurative (as in student dropout). Multivariate Methods: "A collection of techniques appropriate for
the situation in which the random variation in several variables has
to be studied simultaneously" (Armitage and Berry, 1987, p326). The main
multivariate methods are principal components analysis, factor
analysis, discriminant analysis, and cluster analysis. Naturalistic Observation: This is one of the three basic approaches to
scientific research identified by Underwood (1966), and characterised by
"the recording of behaviour as it occurs in a more or less naturalistic
setting with no attempt to intervene" (p4; bold emphasis added). Negative Correlation: See correlation coefficient. Negative Predictive Value (NPV): [See firstly diagnostic tests and screening
procedures.] A test's NPV is a measure of how good that test is at
detecting true negatives when all its decision negatives are considered. It
is calculated by substituting empirical observations into the formula NPV = TN / (TN + FN). When NPV is high it indicates that the false negative problem is
under control. Noise:
Within the context of measurement theory, this is the same thing as random
error. Non-Parametric Statistics: [See firstly inferential statistics.] Non-parametric tests are mathematically "less powerful" than their parametric equivalents. This means that "given exactly the same data, a parametric test is more likely to lead to significant results than a non-parametric test" (Snodgrass, 1977, p357). The most commonly used non-parametric statistics are the binomial sign test (for single sample designs), the Mann-Whitney U test (for two-group unrelated designs), the binomial sign test or Wilcoxon signed ranks test (for two-group related designs), and the Kruskal-Wallis test (for
three or more group unrelated designs). [Compare parametric statistics.] Norm: This term refers to the performance of the standardisation
sample on a given test (Anastasi, 1990), and thus that which gives
meaning to the test scores of subsequent samples. [See now developmental norms.] Normal Distribution: [Alternatively "Gaussian distribution" or
"the bell curve".] See separate dedicated handout and exercises. NPV: See negative
predictive value. Null Hypothesis: [See firstly hypothesis testing.] This is a deliberate
statement of the contrary of what you really suspect to be the case. Thus, if
one's true hypothesis is that <TALL MEN LIVE LONGER>, then the
null hypothesis is simply that <TALL MEN DO NOT LIVE LONGER>. Although this may seem an unnecessary complication in putting an argument across (because our mental problem space is of limited capacity and extra words,
especially negatives, take up that space), it is useful for technical reasons
when carrying out inferential testing. This is because group
difference statistics, by their mathematical nature, make the initial
presumption that two samples are from the same population until proved
otherwise. They then report when the means of the two samples move apart far
enough for that null presumption eventually to be dismissed. Observation: "The
essence of studying anything [is] the observation of changes in variables"
(Coolican, 1990, p15). "Observation exists at the beginning and again
at the end of the process: at the beginning, to determine more definitely and
precisely the nature of the difficulty to be dealt with; at the end, to test
the value of [the action taken]. Between those two termini of observation, we
find the more distinctively mental aspects of the entire thought
cycle: (i) inference, the suggestion of an explanation or solution; and
(ii) reasoning, the development [of] the suggestion. Reasoning
requires some experimental observation to confirm it, while experiment can be
economically and fruitfully conducted only on the basis of an idea that has
been tentatively developed by reasoning. [.....] The disciplined, or
logically trained, mind - the aim of the educative process - is the mind able
to judge how far each of these steps needs to be carried out in any
particular situation. No cast iron rules can be laid down. Each case has to be
dealt with as it arises [.....]. The trained mind is the one that best grasps
the degree of observation, forming of ideas, reasoning, and experimental
testing required in any special case, and that profits the most, in future
thinking, by mistakes made in the past. What is important is that the mind
should be sensitive to problems and skilled in methods of attack and
solution." (Ray, 1967, p157; italics original.) One-Tailed Test: [See firstly tests for the difference of two means.] This is a directional application of one of the two-group inferential statistics, namely the t-test, the Wilcoxon, or the Mann-Whitney. It is so called because the statistical procedure only has to deal with
separation of the two distributions down one or other of the asymptotes (or
"tails"), but not both. Operationalise, To: This is the act of assigning a particular physical
dimension as a measure of a particular research variable. Example: One might
conceptualise the hypothetical
construct stress as including autonomic changes, one of which might be
adrenalinerelated, and then operationalise
a measure of stress as serum adrenalin, heartbeats per minute, temperature of
thumb, salivary cortisol, irritable outbursts per hour, or anything you like,
providing you can defend the construct
validity of your eventual findings. Opportunity Sampling: [See firstly sampling.] This is the act of
selecting a research sample according to who is available to take part in it,
rather than according to more precisely derived criteria. Paired Comparison: See Kendall coefficient of agreement. Parallel Form: In order to avoid practice effects which
might otherwise prevent using the same assessment twice on the same subjects,
many psychometric packages offer two (or more) item sets, matched for
difficulty. These are known as the parallel forms of the test. PCA: See
principal components analysis. Pearson, Karl: Karl
Pearson (1857-1936) was the statistician who devised the Pearson product moment correlation and the chi-squared test. Piliavin, Rodin, and Piliavin (1969): This is the class-defining study into bystander
apathy. Placebo Group: This is a group in an intervention
study given a dummy treatment,
and (usually) kept unaware of that fact. Population:
This is all the members of a uniquely definable group of people or things.
[Compare sample.] Positive Correlation: See correlation coefficient. Positive Predictive Value (PPV): A test's PPV is a measure of how good that test is
at detecting true positives when all its decision positives are considered.
It is calculated by substituting empirical observations into the formula PPV = TP / (TP + FP). When PPV is high it indicates that the false positive problem is
under control. Practice Effect: This is a class of confounding which might be encountered
when the measure in question is itself a learnable mental or physical skill. Predictive Validity: [See firstly validity and criterion validity.] Data
collection instruments such as questionnaires, test batteries,
or psychometric tests may be said to have predictive validity to the extent that they have demonstrated
the ability to detect the people they want to find. Predictive validity is
therefore a major requirement in healthcare (where tests are used to
select/reject patients for treatment) and education (where tests are used to
select/reject students). Establishing an instrument's predictive validity
requires prolonged field data collection and analysis, but provides a very
important statistic to be able to quote. There is an enormous science of predictive value for diagnostic tests within medical decision making, but we shall avoid it for the moment. Predictor Variable: This is an optional name for independent variable. Premiss: This
is an optional spelling of premise. Principal Components Analysis (PCA): [See firstly factor analytical methods.]
This is a factor analytical method of screening a large number of
simultaneous [i.e. multivariate] measures for those which - because they vary together all or most of the time - may be better regarded as the outcome of a
broader underlying factor. What we want to end up with is new and better
variables, such that each "has the highest possible variance and so
represents better than any other linear combination of the [original
variables] the general differences between individuals" (Armitage and
Berry, 1987, p327). Principle of Falsification: Popper's (1959) assertion that the scientific
method is ultimately based on our ability to prove an assertion is false (by
finding a counterexample to it), but NOT to prove one is true. pValue:
[See firstly confidence level.] Academic journals usually adopt the standard
mathematical shorthand here, reporting probability as p [hence
"p-values"]. The values of p can run from zero (totally improbable)
to 1.0 (certain), and are usually seen as two places of decimals in between.
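These reporting conventions can be shown in a two-line Python sketch, reusing the illustrative value p=0.57 from this entry:

```python
# Sketch of the p-value reporting conventions described above.
# p=0.57 is the illustrative value used in this glossary entry.
p = 0.57                  # a probability between 0 (improbable) and 1.0 (certain)
percent = round(p * 100)  # multiply by 100 to convert to a percentage

print(percent)            # 57 -> "likely to happen 57% of the time"
```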
The probability can be converted to a percentage by multiplying by 100. Thus an event with probability p=0.57 is likely to happen 57% of the time. Probability: This is a mathematically expressed measure of how likely something is to happen. [See now p-value.] Professional Opinion: See levels of evidence. Pygmalion Effect: This is one of the classical examples of an expectancy bias,
first studied in schoolteachers by Rosenthal and Jacobson (1968). Qualitative Research: This is research in which the critical variable
is a quality [compare quantitative research]. Quantitative Research: This is research in which the critical variable
is a quantity [compare qualitative research]. Quasi Experiment: [See firstly experimental methods.] This is a form of
experiment, developed originally in educational research, in which it is not
possible to allocate subjects to the various IV conditions. Random Error:
[See firstly measurement.] This is the inherent inaccuracy of any
scale of measurement. Randomised Controlled Trial (RCT): The RCT is a robust and well-tried research design and a key element in delivering evidence-based practice in healthcare (or, indeed, any other profession). It is
"randomised", because it does not preselect subjects who are in
some way likely to fit "the treatment" being evaluated. Instead,
participants are drawn at random from the largest practicable pool. It is
"controlled" in the sense that it includes control groups
who do NOT receive the treatment in question but who IN EVERY OTHER RESPECT
are treated identically. This is to insure against making what are known as Type
I errors should a variable other than the treatment be surreptitiously at
work. What you want to see is an improvement in the treated group but no
change in the controls. RCTs are also expected to be blind or double-blind where necessary, to avoid bias and confounding generally, to do their best to design out practice effects, order effects, fatigue effects, and ceiling effects, and to maximise (or at least quantify) the many subtypes of research validity and reliability. Range:
[See firstly distribution.] The range of a distribution is the difference between that distribution's lower extreme and its upper extreme. Example: If the lowest value in a distribution is 43 and its highest value is 93, then subtracting the former from the latter gives us a range of 50. [To see how the range can be used in descriptive statistics, see box-and-whisker plot.] RCT: See randomised
controlled trial. Reactance:
This is one of the eight types of confounding identified by Lewin
(1977). It reflects the possibility that stubbornness on the part of subjects
will cause responses deliberately opposite to that which might otherwise have
been made. Receiver Operator Characteristics (ROC): [See firstly diagnostic tests and screening
procedures.] If you adjust the cutoff to a lower value, then it makes
the test more sensitive, but only at the cost of having to put up with
more false positives. If you set it higher then it makes the test more
specific, but only at the cost of having to put up with more false
negatives. Reflective Practice: This is a state of perpetual critical appraisal in
professionals which attempts to exclude those irksome errors of omission by enhancing
the vision of exactly what full clinical autonomy actually involves.
Reflective practitioners are seen as preventers who constantly question their
means of prevention, as assessors who constantly question their methods of
assessment, as interveners who constantly question their proposed point of
intervention, and so on. Reliability:
This is a measure of how well a given test reflects "'true' differences
in the characteristics under consideration" (Anastasi, 1990, p109).
Reliability is usually considered under two subheadings, namely internal
consistency (e.g. Bryman and Cramer, 1997; Kline, 2000b), and its stability
over time (or "test-retest reliability") (e.g. French and
Michael, 1968; Kline, 2000b). Bryman and Cramer (1997) distinguish internal
reliability from external reliability, and Armitage and Berry
(1987) discuss the whole area of diagnostic tests and screening procedures
as a reliability issue. It may also be appropriate, depending upon the
particular research setup, to assess scorer reliability. There
are three "off-the-peg" tests of reliability, namely split-half reliability, test-retest reliability, and item consistency,
but there is usually also scope for some projectspecific hypothesis
testing using the full range of inferential statistics. Repeated Measures Design: [See firstly research design.] This is a class of experimental designs in which the same subjects provide scores under two or more conditions, those scores then being compared using one of the available tests for the difference of means. Example (1): In testing
the hypothesis that men are better than women at mathematics, you have no
choice but to test different groups of subjects (because men cannot be women
at the same time). Example (2): In testing the hypothesis that sober men are
better at mathematics than drunk ones, you could (a) test the same subjects
on different occasions (a repeated measures design), or (b) test
different groups (an independent samples design). By and large,
repeated measures designs are more powerful than independent samples designs,
because there is less within between subjects variance. Replicability: This is the ease with which a piece of
research can be precisely repeated using only the original write-up to go by.
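As a computing-flavoured aside (not part of the original handout): in simulation work, replicability begins with documenting the random seed, so that a reader can regenerate exactly the same pseudo-random "data" from the write-up alone. A minimal Python sketch:

```python
import random

# Replicability sketch (illustrative only): fixing and documenting
# the seed lets anyone regenerate exactly the same pseudo-random
# "sample" from the write-up alone.
random.seed(2006)
sample_a = [random.gauss(100, 15) for _ in range(5)]

random.seed(2006)  # re-run with the same documented seed...
sample_b = [random.gauss(100, 15) for _ in range(5)]

print(sample_a == sample_b)  # True
```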
Given the requirements of the principle
of falsification that good science is ultimately all about failing to
find counterexamples of a test proposition, it follows that any author's
research should be as precisely replicable as possible. [See now citing
previous research.] ROC: See
receiver operator characteristics. Role Expectations: Role expectations account for two of the eight types of confounding
identified by Lewin (1977). With "good subject" role expectations,
the risk is that participants try to do what they think they ought to do, whilst in the
"bad subject" role they set out to respond anything but
normally. Rosenthal and Jacobson (1968): [See firstly bias and Pygmalion effect.]
This is the classic study of expectancy bias. Teachers were given
fabricated cover stories leading them to believe that some of their pupils
were likely to be "academic spurters" during the coming year. Such
students, who in fact were chosen at random from the available classroom population,
showed an increase of 12 IQ points during the research year compared to 8
points in their "less gifted" classmates. [Further
discussion.] Sample:
This is the subset of the target population of objects or participants
which is selected for research investigation. There are many practical
procedures available for selecting one's sample, each with its own advantages
and disadvantages. Sampling Bias: [See firstly bias.] This is bias arising from logically
flawed or carelessly executed sampling, resulting in a sample
which does not fairly represent the population in question. Scorer Reliability: This is a measure of how consistently a particular
scorer will score the same raw data on different
occasions. Selection Bias: Same thing as sampling bias. Sensitivity:
[See firstly diagnostic tests and screening procedures.] This is a
mathematically derived index of how good a test is at detecting true
positives, that is to say, of how good that test is at detecting positives
in a population of condition positives. It is calculated by substituting
empirical observations into the formula TP / (TP+FN). High sensitivity is
called for in tests where false negatives are either expensive or
downright dangerous. False negatives in medicine result in missed
opportunities for treatment, and in education they result in delayed or lost
opportunities for personal development. In practice, however, highly
sensitive tests often give high numbers of false positives, so in
isolation they are less than perfect measures. [See now receiver operator
characteristics.] Sign Test: See
binomial sign test. Single Blind:
[See firstly blind.] A single blind study is one in which EITHER
the experimenter(s) OR the participants are naive as to the true purpose of
the research. [Compare double blind.] Single Case Research: This is an experimental variant of the case
study method used within correlational research (Cohen and Manion,
1989). Solomon Four Group Design: This is an experimental design introduced by
Solomon (1949) which uses three control groups to avoid the
possibility that any pretest might itself affect the dependent variable.
The traditional experimental group (Lewin, 1979) is given the
pretest, the main treatment, and the posttest in the normal way, and
the traditional control group is given the pretest and posttest but given
only a control treatment. There is then a second control group which is not
pretested, but does get the main treatment and the posttest, and a third
control group which just gets the posttest. Differences between the
experimental group and the second control group, or between the first and
third control groups, "must be caused by pretesting" (Lewin, 1979,
p107). Spearman, Charles: Charles
Edward Spearman (1863-1945) was the statistician who devised factor
analysis, and who then famously applied this method to the analysis of
the factors of human intelligence. He also devised Spearman's
rank correlation coefficient. Specificity:
[See firstly diagnostic tests and screening procedures.] A test's
specificity is a measure of how good that test is at detecting negatives in a
population of condition negatives. It is calculated by substituting empirical
observations into the formula TN / (TN + FP). High specificity is called for
in tests where false positives are either expensive or downright dangerous.
False positives in medicine result in inappropriate treatment or unnecessary
referral, and in education they result at best in a harder than necessary
student experience and at worst in course failure. Split-Half Reliability: [See firstly reliability]. This is a
statistical procedure by which a random half of the test items is correlated
with the other half. It is often a more useful measure than test-retest reliability, because the
data come from a single test session, so there will be no practice, illness
progression, recovery, mood, or similar effects. Standard Score: [See firstly standard
deviation.] This is a score which somehow indicates its relative position
within a distribution, typically one which expresses "the individual's distance
from the mean in terms of the standard deviation of the distribution"
(Anastasi, 1990, p84). This usually requires that raw scores are
mathematically converted in some way, and that the distribution approximates
to normal. Standardisation: This is the process of calibrating a measure to a given population
so that the future performance of other samples can be norm-referenced.
Stanine:
[Abbreviation of Standard NINE point scale.] This is a measurement system in
which the distribution in question is divided into nine sub-ranges. If
the distribution is normal, then the stanines take up 4%, 7%, 12%, 17%, 20%, 17%,
12%, 7%, and 4% respectively. Stanine #1 is then classified as
"poor", #2 and #3 are grouped together as "below
average", #4, #5, and #6 are "average", #7 and #8 are
"above average", and #9 is "superior". (After Durost,
1968.) Structured Interview: See interview. "Student": See Gosset, William. Student's t-Test: See t-test. Syllogism:
See this entry in our Rational
Argument Glossary. TestRetest Reliability: [See firstly reliability]. This is one of
the two aspects of reliability identified by French and Michael (1968)
[the other being internal consistency].
A "timeassociated reliability", which can readily be quantified
using a correlation coefficient
derived from correlating results from the same test administered twice to the
same subjects, at least three months apart. The square of the correlation
coefficient gives the "degree of agreement". [See now the advantage
of having parallel forms of the
test in question.] Tests for the Difference of Two Means: This is one of the two basic types of tests for
the difference of means (the other being tests for the difference of more
than two means). Theory: A
theory is a body of empirically verified observation, plus a particular
interpretation. It is thus an attempt to make sense of a number of confirmed
hypotheses by drawing them together into a more meaningful whole (Lewin,
1979). Thurstone, Louis: Louis
Leon Thurstone (1887-1955) was a psychometrician who used factor
analysis techniques in the study of intelligence and its internal
structures. True Experiment: [See firstly experimental methods.] This is the
"ideal" form of experiment, one in which the researcher has the
power, money, ethical approval, and ability to manipulate all the necessary independent
variables. True Negative: [See firstly diagnostic tests and screening procedures.] This
is A CORRECT diagnostic judgement that an entity DOES NOT fall within a
target category. [See now negative predictive value.] True Positive: [See firstly diagnostic tests and screening procedures.] This
is A CORRECT diagnostic judgement that an entity DOES fall within a target
category. [See now positive predictive value.] True Score/Value: [See firstly measurement.] This is the (in-fact-unattainable)
ideal of a score which has no measurement error. Two-Tailed Test: [See firstly tests for the difference of two means.] A
non-directional application of one of the two-group inferential statistics,
namely the t-test, the Wilcoxon, or the Mann-Whitney. So
called because the statistical procedure has to deal with separation of the
two distributions down BOTH of the asymptotes (or "tails").
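As an illustration (the figures are invented, and the Python is mine, not the handout's): for a symmetric null distribution, the two-tailed p-value is simply double the one-tailed probability, because extreme outcomes are counted in both tails:

```python
from statistics import NormalDist

# Illustrative sketch (invented values): for a symmetric null
# distribution, the two-tailed p-value counts extreme outcomes
# in BOTH tails, so it is double the one-tailed probability.
z = 1.96  # an observed standardised test statistic
one_tailed = 1 - NormalDist().cdf(z)  # upper tail only
two_tailed = 2 * one_tailed           # both tails
print(round(one_tailed, 3), round(two_tailed, 3))  # 0.025 0.05
```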
[Compare one-tailed test.] Type 1 Error:
[See firstly hypothesis testing.] This is a class of very bad science
in which the null hypothesis is rejected when in fact it is true, thus
causing the research hypothesis to be accepted when in fact it should be
rejected. In simple situations, one way to reduce the likelihood of Type 1
error is to adopt a more stringent significance level (the 0.01 level, say,
rather than the 0.05 level). [Compare Type 2 error, and see also confounding.] Types of Research: Underwood (1966) identifies three basic types,
namely naturalistic observation, the correlational method, and the
experimental method. Cronbach (1957), however, disagrees that there is such a
thing as an observational method, seeing observation as a type of
measurement, not as a basic type of research. He recognises only correlational
psychology and experimental psychology. Unrelated tTest: This is one of the two possible types of ttest (the other
being the related ttest). The test of choice when a research
design delivers two columns of scores from comparison groups (that is to
say, a betweengroups design). [For an etutorial on how to use SPSS to carry
out your unrelated ttests, click here.
Note the suggested format for the final writeup.] Validity:
This is the issue of whether a test is measuring what you think it is
measuring, and thus whether the piece of research in question is valuable
science or not. Unfortunately, there are a large number of ways in which
research can be invalid. To start with, there may be poor conceptualisation (i.e. an iffy
psychological construct), poor build (i.e. iffy items: ambiguous, unclear,
or incomplete), or poor administration (bias). There are many aspects to this
major issue of scientific quality, including construct validity, content
validity, criterion validity, face validity, and predictive
validity. Referring specifically to psychometric tests, rather than
to scientific conclusions in general, French and Michael (1968) consider only
content validity, criterion validity, and construct validity.
Kline (2000b), on the other hand, emphasises face validity, concurrent
validity, predictive validity, content validity, incremental and differential
validity, and construct validity. Variable:
See the entry in Section 1 hereto. 
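Before closing, the confusion-matrix formulas given under Sensitivity and Specificity above lend themselves to a short worked calculation (the counts below are invented for illustration):

```python
# Worked illustration of the formulas under "Sensitivity" and
# "Specificity" above; all counts are invented.
TP, FN = 90, 10  # condition positives: correctly detected vs missed
TN, FP = 80, 20  # condition negatives: correctly cleared vs false alarms

sensitivity = TP / (TP + FN)  # proportion of condition positives detected
specificity = TN / (TN + FP)  # proportion of condition negatives cleared

print(sensitivity, specificity)  # 0.9 0.8
```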
3  References
See
the Master References List
[Home]