
Glossary - Research Methods and Psychometrics

Copyright Notice: This material was written and published in Wales by Derek J. Smith (Chartered Engineer). It forms part of a multifile e-learning resource, and subject only to acknowledging Derek J. Smith's rights under international copyright law to be identified as author may be freely downloaded and printed off in single complete copies solely for the purposes of private study and/or review. Commercial exploitation rights are reserved. The remote hyperlinks have been selected for the academic appropriacy of their contents; they were free of offensive and litigious content when selected, and will be periodically checked to have remained so. Copyright © 2006-2018, Derek J. Smith.

First published [v1.0] 14:00 BST 19th June 2006. This version [2.0 - copyright] 09:00 BST 8th July 2018.

BUT UNDER CONSTANT EXTENSION, SO CHECK AGAIN SOON

1 - Introduction

This glossary is an alphabetically sorted series of short cross-indexed definitions, cumulatively explaining how the scientific method in general and research statistics in particular are typically applied to psychological research. The cross-indexing has been done in such a way that if the individual entries were to be loaded into a semantic network they would produce a navigable encyclopaedia on the chosen subject. There are also half a dozen Key Concept definitions which are so pivotal to the entire study area that we need to deal with them right now .....

Key Concept – "Causation": One of the fundamental notions of science is that of "causation", the idea that some things just happen whilst others are contingent upon prior events taking place or specified antecedent conditions being reached. We know the former class of events as "random" or "chaotic", and the latter as "regular" or "ordered". Science is thus about finding causation where previously we suspected chaos.

Key Concept – "The Causal Line": What makes the study of causation really challenging is the fact that when we inspect processes more closely they often turn out to involve a succession of lesser cause-effect events. Event A causes State B, which triggers Event C, and so on. This sequencing of causes and effects is known as a "causal line", and causal lines have been formally defined as "temporal series of events so related that, given some of them, something can be inferred about the others whatever may be happening elsewhere" (Russell, 1948, p459). The principal scientific skill is accordingly that of unravelling causal lines, and the pay-off is that the resulting predictability gives us some semblance of control over our world.

Key Concept - Variable: A variable is a quality or attribute, physical or conceptual, which may exist in two or more discrete states or at two or more intensities. It is thus a dimension of observation, and observation, as we shall be explaining in Section 2, is another fundamental building block of the scientific method.

Key Concept – "Discrete" vs "Continuous" Variables: "Variables are tricky things" (Coolican, 1990, p12). To start with, they move in different ways, some jerkily and some smoothly. Variables which advance in integral steps are known as "discrete" variables. Examples: A runner’s position in a race can be first, second, third, etc. but not fractional values thereof. Discrete variables thus have a limit to their arithmetical precision, that is to say, there is no point in measuring them to more places of decimals than the steps themselves allow. On the other hand, there are many variables which have no theoretical limit to their arithmetical precision, and can be measured to as many places of decimals as you like. These are known as "continuous" variables. Examples: Time, distance, and mass, or complexes thereof, such as velocity, acceleration, force, and pressure.

Key Concept - Independent Variable (IV): [Sometimes "predictor variable".] Some variables also act upon other variables within causal lines, causing the latter to vary in turn. These are known as "independent" variables, and IVs are important because they help us to understand particular causal lines. Examples: (1) Gender is a two-state discrete variable capable of differentially determining an organism’s behaviour through a complex causal line which includes a number of anatomical, physiological, and psychological variables. (2) Temperature is a continuous variable which will directly influence many physical processes.

Key Concept - Dependent Variable (DV): [Sometimes "criterion variable".] Variables which are acted upon by IVs in a causal line are known as "dependent" variables [because what they do "depends on" what the IV does]. In the experimental method, DVs are the variables which are monitored (i.e. observed and measured), whilst IVs are those which are manipulated in the hope that by varying a cause you will become better able to understand its effects.

For more on how variables fit into the topics of philosophy of science and research design, start with the Section 2 entries for hypothesis testing, inference, and inferential testing, and follow the links from there.

2 - The Glossary Entries

Action Research: A research philosophy originally developed by Lewin (1946) in order to avoid "research that produces nothing but books" (p35), and by Corey (1949, 1953) to help support the cyclical improvement of educational initiatives, but now of proven utility in any similar large institution, including social services, healthcare, and IT. Alternatively, "a deliberate, solution-oriented investigation that is [characterised] by spiraling cycles of problem identification, systematic data collection, reflection, analysis, data-driven action taken, and, finally, problem redefinition" (Beverley, 1993/2004 online). For further details, see Wilson and Streatfield (2004 online). [See also participatory action research.]

ANOVA: See analysis of variance.

Analysis of Variance (ANOVA): [See firstly variance and tests for the difference of more than two means.] The ANOVAs are a class of parametric statistical procedures capable of processing more than two columns of group-difference data in one row [the "one way" analysis of variance] or two or more columns of such data in two or more rows [the "two (or more) way" analysis of variance].

Analysis of Variance, "One Way": [See firstly analysis of variance.] The simple, or "one way", ANOVA is the statistical analysis of choice for data which sit naturally in one-row tables of three or more cells. The statistical procedure computes variance between cells as well as in total. The statistical algorithm itself need not concern us, but the statistic it produces - known as an F-value - is a valuable index because it is the ratio of (1) the variance attributable to differences between the cells to (2) the random, or "residual", variance. Significance tables may then be used to convert the F-value, read at the intersection of its two degrees of freedom, into an equivalent p-value. Example: In testing the general hypothesis that exercise depresses the appetite, we might record the daily calorific intake of five groups, graded by the amount of exercise taken. If we then arranged for each group to contain around a dozen subjects, we could tabulate their calorific intakes (in kilocalories) into five horizontally aligned cells, numbered 1 to 5.
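
For readers who would nevertheless like to see the arithmetic, the following is a minimal sketch of a one-way F-value, computed as the between-cells mean square divided by the residual mean square. The function name and the kilocalorie figures are hypothetical, and only three small groups are used (rather than the five groups of a dozen in the example above) to keep the sums easy to check by hand.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-cells sum of squares: how far each cell mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual sum of squares: scatter of scores around their own cell mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical daily intakes (kilocalories) for three exercise groups.
intakes = [
    [2500, 2600, 2550, 2450],
    [2400, 2350, 2450, 2400],
    [2200, 2250, 2150, 2200],
]
f_value, df_b, df_w = one_way_anova_f(intakes)
print(f_value, df_b, df_w)
```

The F-value and its two degrees of freedom are what would then be looked up in a significance table.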

Analysis of Variance, "Two (or More) Way": [See firstly analysis of variance.] Two- (or more-) way ANOVAs are capable of simultaneously coping with two (or more) independent variables. The popular shorthand description for such procedures reflects the number of conditions on each variable as an integer. Thus a 2x2 (pronounced "two-by-two") ANOVA has two independent variables, each of whose effects on the dependent variable are sampled under two conditions. Observations of the dependent variable are accumulated under the appropriate cell heading, and the analysis carried out once enough observations have been made. A 2x2x3 (pronounced "two-by-two-by-three") ANOVA has three independent variables, two sampled under two conditions, and the third sampled under three conditions. Example: In testing the general hypothesis that exercise depresses the appetite in women but increases it in men, we might record the daily calorific intake of five groups of men and five groups of women, both graded by the amount of exercise taken. If we then arranged for each group to contain around a dozen subjects, we could tabulate their calorific intakes (in kilocalories) into five horizontally aligned cells for the men, themselves vertically aligned over five further horizontally aligned cells for the women. Since the resulting table would then be five cells wide by two cells deep, it would be a 5x2 (pronounced "five-by-two") ANOVA.
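
The cell layout and its shorthand label can be sketched as follows. This is only an illustration of how the table is organised, not of the analysis itself, and the variable and function names are hypothetical.

```python
from itertools import product

exercise_levels = [1, 2, 3, 4, 5]   # columns: amount of exercise taken
sexes = ["men", "women"]            # rows

# One cell per combination of conditions; each cell will accumulate
# that group's observations of the dependent variable.
cells = {(sex, level): [] for sex, level in product(sexes, exercise_levels)}

def design_label(*condition_counts):
    """Return the shorthand label for a design, e.g. '5x2' or '2x2x3'."""
    return "x".join(str(n) for n in condition_counts)

label = design_label(len(exercise_levels), len(sexes))
print(label, len(cells))
```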

Artefact: Same thing as artifact.

Artifact: [Optionally artefact.] Generally, "a thing made by art, an artificial product" (OED), and thus, in the present context, a research conclusion which turns out upon critical methodological examination to have arisen thanks to a bias or confound of some sort, rather than because of the causal relationship under test. A serious source of research error. Lewin (1977) gives the following advice on handling artifacts: "Highly contrived laboratory situations magnify artifacts. The researcher should ask, 'How else can I study this topic?' [and one good way] to reduce confounding variables is by imaginative and creative research design" (p110). Another way is to increase "experimental realism". Moreover, specific types of artifacts go with specific types of research, so it is down to the analytical skills of the author(s) concerned to spot them in advance (i.e. before your peer-reviewer, to your cost, does it for you). Pre-testing may itself be the cause of an artifact, and can be better managed by adopting the Solomon four group design.

Attention Bias: This is a type of bias in which the act of observation itself becomes an important (if not the most important) independent variable. The Hawthorne effect is a good example of what can then happen.

Awareness of the Hypothesis: This is one of the eight types of confounding identified by Lewin (1977). It reflects the possibility that a research participant's understanding of what a given piece of research is for might, consciously or unconsciously, influence the behaviour being measured.

Between-Subjects Design: See independent groups design.

Between-Subjects Variance: [See firstly variance.] This is variance arising from chance or bias in the groups participating in an independent groups design.

Bias: [See firstly measurement error.] In everyday usage, a bias is a deviation from an intended path (OED). In scientific research, it is the systematic deviation of measurements from their true value (as opposed to random deviation, which is due to statistical "noise"). Unfortunately, bias can arise for a large number of reasons, and originate at all stages of the research cycle. It therefore has no single definition nor method of detection nor standard remedy. The following subtypes of bias are therefore dealt with separately: attention bias, centripetal bias, confounding, cultural bias, demand characteristics, expectancy bias, measurement bias, recall bias, sampling bias, volunteer bias, and withdrawal bias. [For additional discussion, see Palmer (1996/2004 online).]

Binomial Sign Test: This is a single sample inferential statistic for non-parametric data at the nominal or ordinal level. By "single sample" is meant any design which is attempting to judge whether a sample population is, or is not, drawn from a reference population, whose distribution is already a matter of record and thus need not be re-sampled. If the sample population is representative of the reference population, then the sample distribution and the reference distribution will, save for sampling noise, be the same. The test can also be used as a two-sample inferential statistic for non-parametric data in a related design.
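
As a minimal sketch of the related-design use, suppose each participant is scored simply as having improved (+) or deteriorated (-), with ties discarded. Under the null hypothesis the two signs are equally likely, so the p-value can be obtained from the binomial distribution. The function name and the counts below are hypothetical, and the two-tailed form is an assumption.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Two-tailed p-value, assuming + and - are equally likely under the null."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    # Probability, in one tail, of a split at least as lopsided as that observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome: 9 participants improved, 1 deteriorated.
p = sign_test_p(9, 1)
print(p)
```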

Blind: This is the technical term for the giving or taking of measurements without knowledge of the true purpose of the research, and possibly under the deliberate influence of a cover story. [See now double blind and single blind.]

Bonferroni's Correction (for Multiple Comparisons): [See firstly confidence levels and Type 1 error.] The Bonferroni correction is an adjustment to the confidence level required when a single scientific hypothesis is being investigated using multiple inferential statistics. The risk on such occasions lies in the fact that the overall probability of committing at least one Type 1 error increases with every new statistical procedure, in much the same way that the odds of throwing a 6 go up when you are allowed to keep throwing your die. The solution named after the Italian statistician Carlo Emilio Bonferroni (1892-1960) is to tighten the significance level required of each individual test, by dividing the overall level by the number of tests carried out (or, equivalently, by multiplying each p-value by the number of tests). A simple online procedure at the SISA website will nowadays do this calculation for you.
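
One common way of applying the correction is to divide the required significance level by the number of tests, or equivalently to multiply each p-value by that number. The following minimal sketch does both. The function name and the three p-values are hypothetical, and the code is not a description of what the SISA procedure does internally.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, per-test threshold, significance flags)."""
    m = len(p_values)
    # Multiply each p-value by the number of tests, capping at 1.
    adjusted = [min(1.0, p * m) for p in p_values]
    # Equivalent view: each raw p-value must beat alpha divided by m.
    threshold = alpha / m
    significant = [p <= threshold for p in p_values]
    return adjusted, threshold, significant

# Hypothetical p-values from three tests of the same hypothesis.
adjusted, threshold, significant = bonferroni([0.010, 0.020, 0.040])
print(adjusted, threshold, significant)
```

Note that two of the three results would have counted as significant at the uncorrected 5% level, but only one survives the correction.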

Box-and-Whisker Plot: The box-and-whisker plot (or "box-plot" for short) is an easily drawn graphical aid, designed to display both the central tendency and the dispersion of a given distribution. The graphic is produced by inspecting said distribution, and identifying five values. The first two are the lower extreme and the upper extreme, and the difference between these values gives us the range of the distribution. The next important value is the median. The median is important because it locates the midpoint of the distribution on a linear scale for the variable in question. It divides the range into upper and lower halves, and the two half ranges are then further subdivided by locating the two quartiles, the lower quartile and the upper quartile. A rectangle is now drawn above the scale, such that it begins at the lower quartile, ends at the upper quartile, and is vertically divided into two at the median. This is the "box" element of the graphic. The "whiskers" are now added to the box by drawing horizontal lines from the lower and upper limits of the box to the lower extreme and the upper extreme respectively.
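
The five values can be located along the following lines. This is a minimal sketch which assumes that each quartile is taken as the median of its half of the sorted data (excluding the overall median when the number of scores is odd). Other quartile conventions exist and give slightly different answers. The function name and scores are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return the five values needed to draw a box-and-whisker plot."""
    data = sorted(values)          # assumes at least two scores
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Skip the middle score itself when n is odd.
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]
    return {
        "lower_extreme": data[0],
        "lower_quartile": median(lower_half),
        "median": median(data),
        "upper_quartile": median(upper_half),
        "upper_extreme": data[-1],
    }

scores = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(scores)
value_range = summary["upper_extreme"] - summary["lower_extreme"]
print(summary, value_range)
```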

Box Plot: See box-and-whisker plot.

Briefing: [See firstly ethics and deception.] This is the stage in the standard research procedure at which participants are told what the research is about (either truthfully, or - ethical deception having been approved - as part of a cover story), and given the opportunity to withdraw their consent to take part, as required by the codes of practice on ethical research laid down by the various institutions involved. In all research philosophies, it is wise to regard the briefing as a potential source of demand characteristics, and in the experimental philosophy it may also need to be regarded as a treatment, even if there is no element of deception involved. [See now debriefing.]

Burt, Sir Cyril (1883-1971): British intelligence theorist, initially acclaimed for his contribution toward the g-factor theory of intelligence (e.g. Burt, 1917). Burt’s academic reputation suffered after Hearnshaw (1979) exposed a number of inconsistencies in his data handling, and he was subsequently adjudged by the British Psychological Society to have falsified his results. More recent papers have defended Burt, but the official ruling remains in effect nonetheless.

Bystander Apathy: This is the name given to an unwillingness to get involved on the part of persons close to an apparent ongoing emergency. It is ignoring one's duty in favour of a "quiet life". It is "bad Samaritanism". This phenomenon was investigated by a classic social psychological study, Piliavin, Rodin, and Piliavin (1969).

Causal Line: See the entry for this topic in Section 1 above.

Cause and Effect: See the entry for this topic in the companion Rational Argument Glossary.

Central Tendency: [See firstly distribution.] This is a measure of where the centre of a given distribution lies with reference to its lower extreme and upper extreme. Among the graphical displays of central tendency we have the box-and-whisker plot, and among the computed measures we have the mean, the median, and the mode. [Compare Dispersion.]

Centrifugal Bias: This is a type of bias in which the research centre itself - say a failing hospital or a non-prestige university - is avoided by individuals who can get in somewhere better, thus rendering a sample of those who are left subtly unrepresentative of the population at large. [Compare centripetal bias.]

Centripetal Bias: This is a type of bias in which the research centre itself - say a specialist hospital or a prestige university - attracts individuals with particular strengths and attributes, thus rendering a sample thereof subtly unrepresentative of the population at large. [Compare centrifugal bias.]

Chi-Squared Test: This is the most common method of statistical analysis for frequency data. Given an array of actual cell frequencies, the statistical procedure computes the distribution which would be expected under the null hypothesis, and then tests whether the actual-expected difference is too big to be plausibly attributed to chance.
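
For a table of frequencies arranged in rows and columns, the expected frequency of each cell is conventionally its row total multiplied by its column total, divided by the grand total. The following minimal sketch computes the expected frequencies and the resulting chi-squared statistic. The function name and the 2x2 frequencies are hypothetical, and converting the statistic to a p-value is left to significance tables.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected frequency under the null hypothesis of no association.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Sum of (actual - expected) squared over expected, across all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical frequencies: rows are two groups, columns are two outcomes.
observed = [[30, 10], [20, 40]]
statistic, df, expected = chi_squared(observed)
print(statistic, df, expected)
```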

Citing Previous Research: This is making due reference to the literature when deriving and justifying one's research argument. [Now see criticising previous research.]

Clever Hans: This is a classic example of a procedural confounding bias, in which a circus horse - Hans - would answer simple arithmetic questions by tapping so many times with his hoof. Upon closer inspection, however, it turned out that Hans was not numerate at all - merely sensitive to his trainer's body language. The secret was that Hans had learned to start tapping when given one type of behavioural cue, and would stop when given another. Research which does not thoroughly avoid confounds of this sort at the early planning stage is likely to be deeply flawed.

Clinical Effectiveness: This is the general concept of value for money - i.e. demonstrable benefit - in clinical treatment of any kind. Specifically, it refers to a series of initiatives during the early 1990s to maximise value for money in the British NHS by raising consciousness of cost-of-outcome amongst clinical professionals, initiatives which inspired (and still inspire) a large number of efficacy studies to prove things one way or the other. The search for maximum clinical effectiveness in the UK is overseen by the National Institute for Clinical Excellence (NICE).

Clinical Judgement: This is one of the two basic types of clinical assessment (the other being the use of formally standardised psychometric tests). To reach a clinical judgement requires a combination of observation, ad hoc diagnostic tests, and prior professional experience.

Cluster Analysis: This is one of the four recognised types of multivariate method (the others being principal components analysis, factor analysis, and discriminant analysis).

Cluster Sampling: This is one of the standard optional methods of sampling. The method relies on sampling selected clusters of potential subjects within the target population, rather than the population as a whole (thus saving time and expense, but at the risk of introducing some kind of sampling bias).
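
A minimal sketch of the idea follows, assuming the simplest variant in which whole clusters are drawn at random and every member of a chosen cluster is then included. The school names, member codes, and function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters whole clusters from the target population."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every member of a chosen cluster enters the sample.
    return {name: clusters[name] for name in chosen}

# Hypothetical target population, already grouped into clusters (schools).
schools = {
    "School A": ["a1", "a2", "a3"],
    "School B": ["b1", "b2"],
    "School C": ["c1", "c2", "c3", "c4"],
    "School D": ["d1", "d2", "d3"],
}
sample = cluster_sample(schools, 2, seed=1)
print(sample)
```

Because only two of the four schools are visited, the saving in time and expense is obvious. So too is the risk: if the chosen schools happen to be atypical, the sample inherits that bias.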

Coefficient Alpha: Same as Cronbach’s alpha.

Cohort Study: This is a type of longitudinal study in which the group(s) being studied are monitored over a suitable period of time.

Concurrent Validity: [See firstly validity.] Data collection instruments such as questionnaires, test batteries, or psychometric tests may be said to have concurrent validity to the extent to which their findings correlate with other tests – criterion tests - of the same construct. Concurrent validity can be assessed by an appropriate statistical technique, and expressed as a correlation coefficient. Unfortunately, since suitable criterion tests are in fact surprisingly rare, this sort of validity can actually be difficult to quantify. Indeed, Kline (2000b) warns that "almost the only field where accepted tests exist such that high correlations with them indicate validity is intelligence […..] In most other fields confusion reigns" (p20). Kline also warns that the reliability of the selected criterion test also needs to be taken into account, because if you select a test with low reliability to validate a new test against, then it may be the validating test which is misbehaving, not the new one.

Confidence Level: [See firstly hypothesis testing and inferential statistics.] This is an expression of the probability of accepting a hypothesis without committing a Type 1 error, conventionally expressed by a p-value. The usual confidence boundaries in psychological research are "not significant" (a p-value greater than 5%), "significant" (a p-value between 1% and 5%), "very significant" (a p-value between 0.1% and 1%), and "highly significant" (a p-value less than 0.1%). The conventional shorthand for expressing these levels of significance in social science research is to add the code "p > 0.05", "p < 0.05", "p < 0.01", and "p < 0.001", respectively.
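
The four conventional bands can be expressed as a simple lookup. This is a minimal sketch with a hypothetical function name. It assumes that a p-value falling exactly on a boundary is placed in the less significant band, a point on which the convention above is silent.

```python
def significance_label(p):
    """Map a p-value onto the conventional verbal label and shorthand code."""
    if p < 0.001:
        return "highly significant (p < 0.001)"
    if p < 0.01:
        return "very significant (p < 0.01)"
    if p < 0.05:
        return "significant (p < 0.05)"
    return "not significant (p > 0.05)"

for p in (0.2, 0.03, 0.004, 0.0002):
    print(p, significance_label(p))
```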

Confound: In the current context, "to confound" is to fail to detect a confounding bias prior to carrying out a piece of research, with the end result that cause-and-effect interpretation of the results becomes unsafe. "A confound" is the confounding variable doing the damage.

Confounding Bias: This is a type of bias in which one or more initially unrecognised confounding variables turn out to have affected the obtained results, thus rendering cause-and-effect interpretation unsafe. Lewin (1979) identifies the following eight major sources of confounding: awareness of the hypothesis, demand characteristics, enlightenment effects, evaluation apprehension, experimenter expectancy, reactance, and role expectations (two types).

Confounding Variable: [See firstly variable.] This is an independent variable NOT formally designed into a piece of research, and which, by not being controlled, is likely to pervert the course of hypothesis testing, perhaps by encouraging a Type 1 error.

Consent: See informed consent.

Consistency: See internal consistency.

Construct: See hypothetical construct.

Construct Validity: [See firstly hypothetical construct and validity.] Data collection instruments such as questionnaires, test batteries, or psychometric tests may be said to have construct validity to the extent to which they are based upon well-accepted psychological constructs. This is important because many psychological constructs - e.g. telepathy - are not universally accepted. Anastasi’s (1988) examples of established theoretical constructs include scholastic aptitude, comprehension, verbal fluency, neuroticism, and anxiety. The notion of construct validity derives initially from Cronbach and Meehl (1955/2005 online), who warned that specific high correlations "may constitute either favourable or unfavourable evidence [] depending on the theory surrounding the construct". Example: A mind-reading test which was in other respects reliable and valid would have dubious construct validity because the construct of mind reading was itself less than universally accepted. Assessing: Cronbach and Meehl further argue that the ideal assessment of construct validity would be to have some form of "construct validity coefficient", "a statement of the proportion of the test score variance that is attributable to the construct variable" (p7). Unfortunately, while this is conceptually straightforward enough (you simply have to decide whether the suggested psychological construct actually exists, or is something else under a new name), it is difficult to do in practice. In fact, Kline (2000a) sees little alternative to having to put together a package of hypothesis testing supplementary to the headline hypothesis, and that will seriously complicate the research design and dramatically lengthen the validation process. Early planning is therefore called for.

Content Analysis: This is a method of obtaining quantitative scores for various variables within (usually written) language. [For further details, see the corresponding entry in our Psycholinguistics Glossary.]

Content Validity: [See firstly validity.] Data collection instruments such as questionnaires, test batteries, or psychometric tests may be said to have content validity to the extent that they sample "the class of situations or subject matter about which conclusions are to be drawn" (French and Michael, 1968, p164). Example: A mathematics test which contained only spelling questions, or which lacked a section covering division, would have impaired content validity. Solution: Careful planning and analysis of the literature, followed by more detailed hypothesising and/or perhaps a multifactorial design with a view to quantifying the true spread of construct complexity (e.g. reading skill, driving skill, etc.). Another technique might be to ask field experts to examine the proposed test content, and then to quantify and report some measure of their approval.

Continuous Variable: [See firstly variable and the Section 1 entry for "discrete" vs "continuous" variables.] This is one of the two sub-classes of interval/ratio data (the other being discrete variable). Because a continuous variable has no theoretical limit to its arithmetical precision, it follows that measurements of continuous variables are always approximations, and thus have an element of measurement error irretrievably built in. [Compare discrete variable.]

Control Group: [See firstly group.] This is a subset of a research sample selected NOT to receive a particular treatment, thus providing a helpful baseline or comparison measure for the dependent variable under investigation. Alternatively, it is the "point of comparison with the group of subjects who receive the experimental manipulation" (Bryman and Cramer, 1997, p5). Alternatively, "the function of a control group is to provide an observation that cannot be attributed to the variable being manipulated" (Sarbin and Coe, 1975, p11). Historically, one of the first recorded controlled trials was James Lind's experiment of 1747, published in 1753, which showed that eating citrus fruits could cure the condition known as "scurvy" in the mariners of that time. The online James Lind Library details this and a number of other pioneer uses of control groups.

Correlation: To "correlate", of variables, is to vary together in a consistent direction (the same or the opposite) and proportion at the same time, possibly as the result of a cause-and-effect relationship but perhaps coincidentally. The detection of correlations is an important practical aspect of establishing a causal line, and is thus the fundamental principle of the correlational philosophy of science.

Correlation Coefficient: A "correlation coefficient" is a mathematical index produced by one of the many correlational statistical techniques (such as the Pearson product moment correlation or the Spearman rank correlation) and indicating the extent of the relationship between two potentially related sets of measures, and therefore, to the extent that they have been properly operationalised, of the proposed underlying variables. The coefficient ranges from -1 (a perfect negative correlation) to +1 (a perfect positive correlation). A strong positive coefficient (usually accepted as 0.7 or above) indicates that one variable typically increases as the other increases, whilst a strong negative coefficient (usually accepted as -0.7 or below) indicates that one variable typically decreases as the other increases. A coefficient of zero indicates no linear relationship at all.
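
As a minimal sketch of how such an index is arrived at, the Pearson product moment coefficient can be computed by dividing the summed cross-products of the deviations by the square root of the product of the two summed squared deviations. The function name and the data are hypothetical, chosen so that the two extreme cases fall out exactly.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product moment correlation between two paired sets of measures."""
    mx, my = mean(xs), mean(ys)
    # Cross-products of deviations from the two means.
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # increases in step with hours
falling = [10, 8, 6, 4, 2]    # decreases in step with hours

r_pos = pearson_r(hours, rising)
r_neg = pearson_r(hours, falling)
print(r_pos, r_neg)
```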

Correlational Method: See correlational philosophy of science.

Correlational Philosophy of Science: [Alternatively "correlational method" or "correlational psychology".] This is one of the two alternative approaches to quantitative research (the other being experimental psychology), as proposed by Cronbach (1957). The method effectively plots naturally occurring observations of one variable against another, searching for the correlations "presented by nature" (Cronbach, 1957, p10), that is to say, for "already existing variation" rather than for that introduced by the experimental manipulation of an independent variable. The value of this approach stems from its ability to supplement the experimental approach in areas which "man has not learned to control or can never hope to control." The problem with correlations, however, is that they are not necessarily causal. Indeed, if we do not know the precise causal line, it is easy for regular co-occurrence to be misinterpreted. Errors of this sort are known as the "cum hoc fallacy" [Rational Argument Glossary]. Cronbach points out with some justification that correlational psychologists search out variables the experimentalists prefer to ignore.

Correlational Psychology: See correlational philosophy of science.

Correlational Statistical Techniques: Mathematical algorithms such as the Pearson product moment correlation or the Spearman rank correlation, intended to produce correlation coefficients.

Cost-Effectiveness: This is what health service and education managers have to consider when financing intervention projects. It is a matter of a project's effectiveness relative to its cost, the point being that many highly effective treatments are nonetheless insupportable financially. Relevant here because studies of cost effectiveness are commonplace in healthcare, clinical psychology, health psychology, and educational psychology.

Counterbalancing: This is an aspect of research design intended to minimise order effects in experimental manipulations. Participants are exposed to the required experimental conditions in different sequences, so that overall the effects of practice or fatigue are presumed to cancel out.
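
One simple scheme, sketched below under the assumption of full counterbalancing, is to generate every possible sequence of the conditions and to allocate participants to those sequences in rotation. The function name, condition labels, and participant codes are hypothetical.

```python
from itertools import permutations

def counterbalanced_orders(conditions, participants):
    """Allocate participants in rotation to every possible order of conditions."""
    orders = list(permutations(conditions))
    return {p: orders[i % len(orders)] for i, p in enumerate(participants)}

assignment = counterbalanced_orders(["A", "B"], ["P1", "P2", "P3", "P4"])
print(assignment)
```

With two conditions there are only two sequences, so half the participants meet A first and half meet B first. The number of sequences grows rapidly with the number of conditions (six for three conditions, twenty-four for four), so a partial scheme may be more practical in larger designs.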

Cover Story: [See firstly briefing and deception.] This is a false statement as to the purpose of a given study. It must have been approved by the ethics panel concerned, and must be covered in the debriefing session.

Criterion-Referenced: This is one of the two basic philosophies of behavioural or psychological assessment (the other being norm-referenced), specifically, one in which the criteria of "goodness-badness" at the test are publicly recorded in advance as a set of specific and objectively assessable behavioural indicators. Example: One of the most accessible examples of a criterion-referenced assessment is the (UK) driving test, where you pass when you are judged good enough against a tick-list of demonstrable abilities.

Criterion Test: See concurrent validity.

Criterion Validity: [See firstly validity.] This is the extent to which a test correlates "with one or more external variables considered to provide a direct measure of the characteristic or behaviour in question" (French and Michael, 1968, p167). In most respects, the same as predictive validity [for a discussion of the exceptions, see Anastasi (1990, Chapter 6)].

Criticising Previous Research: [See firstly citing previous research.]

Cross-Validation: [See firstly validation.] This is the independent determination of the validity of a test, using "a different sample of persons from that on which the items were selected" (Anastasi, 1990, p226).

Debriefing: [See firstly ethics.] This is an important aspect of ethicality in research, and part of the briefing-debriefing aspect of research procedure. In its simplest form it is a short recapitulation of what subjects have done and why they have done it. Debriefing is especially important where the research involved any intentional deception.

Deception: This is the deliberate concealment of the true purpose of a piece of research, often assisted by a cover story delivered at the briefing. Studies which involve deception must always expect to be challenged by the ethics committee involved, and therefore demand deep initial reflection and analysis. As far as undergraduates are concerned, such studies are only ethical if the deception is necessary to avoid demand characteristics, reduce experimenter effects, or otherwise control confounding.

Degrees of Freedom (df): This is "the number of components [of a statistic] which are free to vary" (Bryman and Cramer, 1997, p122). Despite being a mathematically complex concept, degrees of freedom are usually simple to determine and use. For example, the degrees of freedom for a variable sampled at n different intensities is simply (n-1).

Demand Characteristics: [See firstly bias.] This term was coined by Orne (1962), and is one of the eight types of confounding identified by Lewin (1977). It reflects the possibility (nay certainty) that subtle environmental factors will interact with the motivational state of human subjects during the research experience to render the observed behaviour non-natural in some important respect, the demonstrable fact being that "the setting may well evoke other behaviour you did not intend to evoke" (Lewin, 1977, p103). The point is that the confounding variable is provided by the experimental set-up itself, which may include the behaviour or appearance of the experimenter(s) personally.

Dependent Variable (DV): See the entry in Section 1 hereto.

Descriptive Statistics: [See firstly statistics.] The phrase "descriptive statistics" refers to a portfolio of mathematical procedures designed to present research data in summary form without being part of hypothesis testing. The most common descriptive statistics are mean [ = average], median, mode, range, and standard deviation, and the most common graphical displays are the bar chart, the box-and-whisker chart, the histogram, and the pie chart. [Compare inferential statistics.]
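
The common descriptive statistics listed above can be sketched in a few lines of Python using only the standard library. The scores are hypothetical and chosen purely for illustration, and the standard deviation shown is the sample version (dividing by n-1), which is an assumption about which variant is wanted.

```python
import statistics

# Hypothetical test scores for seven participants (illustrative only).
scores = [43, 55, 55, 61, 70, 78, 93]

summary = {
    "mean": statistics.mean(scores),        # arithmetic average
    "median": statistics.median(scores),    # middle score once ranked
    "mode": statistics.mode(scores),        # most common score
    "range": max(scores) - min(scores),     # upper extreme minus lower extreme
    "sd": statistics.stdev(scores),         # sample standard deviation (n-1)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

None of these figures tests a hypothesis; they simply summarise the sample.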

Design: See research design.

Developmental Delay: The phrase "developmental delay" refers to the failure of a developing organism to reach/achieve/display some physical, cognitive, or behavioural developmental norm at the expected chronological age.

Developmental Norm: [See firstly norm.] This is an age-related expectation of mental or physical ability informed by past experience or research with the population in question. Developmental norms are therefore vitally important in the detection of developmental delay, and establishing them is part of the standardisation exercise prior to the marketing of major psychometric test packages.

Diagnostic Tests and Screening Procedures: These are measurements and measurement packages designed to assist during the assessment phase of patient management. The ability of a given test to detect someone who needs to be detected is known as its sensitivity. The ability to exclude people who need to be excluded is known as its specificity. The positive predictive value of a test is a measure of how many of those who have been detected as positive actually are positive, and its negative predictive value is a measure of how many of those who have been detected as negative actually are negative. Clinicians need to be aware of all four of these factors, and recognise that the qualities are to a large extent mutually exclusive. That is to say, a good test of one condition might be a bad test of something else. [There is actually a good, if highly mathematical, reason for this, as summarised in the entry for the ROC curve.]
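
As a minimal sketch, the four indices can be computed from the four cells of a two-by-two decision table. The function name and the counts are hypothetical. The sensitivity, PPV, and NPV formulae are those given in the separate entries of this glossary; the specificity formula, TN / (TN + FP), is the standard definition and is supplied here as an assumption because this section does not state it.

```python
def diagnostic_indices(tp, fp, tn, fn):
    """Return the four indices from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # detected among all condition positives
        "specificity": tn / (tn + fp),  # excluded among all condition negatives
        "ppv": tp / (tp + fp),          # truly positive among decision positives
        "npv": tn / (tn + fn),          # truly negative among decision negatives
    }

# Hypothetical screening of 1000 people, 100 of whom have the condition.
indices = diagnostic_indices(tp=90, fp=40, tn=860, fn=10)

for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the test looks strong on sensitivity and NPV but noticeably weaker on PPV, which is why all four figures need to be read together.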

Difference Testing: See testing for the difference of two means and testing for the difference of more than two means.

Differential Validity: See incremental and differential validity.

Discrete Variable: [See firstly variable and the Section 1 entry for "discrete" vs "continuous" variables.] Discrete variables are one of the two sub-classes of interval/ratio data (the other being continuous variables).

Discriminant Analysis: This is one of the four types of multivariate method.

Discriminatory Power: This is an important aspect of undertaking item analysis during the development of a psychometric test.

Dispersion: This is a measure of how tightly clustered a distribution is around its mean. [Compare central tendency.]

Double Blind: [See firstly blind.] A double blind study is one in which BOTH experimenters and participants are naive as to the true purpose of the research. It might be necessary to organise things this way if experimenter effects or other factors might bias the results. [Compare single blind.]

DV: See dependent variable.

EBP: See evidence-based practice.

Effectiveness: This is a measure of the likely actual benefit arising from a given remediation programme (that is to say, under average conditions of use) (compare efficacy). (After Hayward, Jadad, McKibbon, and Marks, 1998.)

Efficacy: This is a measure of the theoretically maximum benefit arising from a given remediation programme (that is to say, under ideal conditions of use) (compare effectiveness). (After Hayward, Jadad, McKibbon, and Marks, 1998.)

Efficacy Study: [See firstly efficacy.]

Eigenvalue: [See firstly principal components analysis.]

Empirical Data: These are data obtained by actual observation rather than by conjecture; data from the evidence of the senses.

Enlightenment Effects: This is one of the eight types of confounding identified by Lewin (1977). The possibility that prior exposure to the study area might influence performance under test.

Error: See measurement error.

Ethics: This is the code of practice imposed upon researchers by their professional body and/or employer. See the British Psychological Society Code of Conduct.

Ethics Committee: This is a formally constituted panel to which research proposals need to be submitted for approval on ethical grounds (and hence a major defence against legal action should the case arise).

Evaluation Apprehension: This is one of the eight types of confounding identified by Lewin (1977). The possibility that naturally apprehensive or secretive personalities will not be performing normally on the behaviour under test. Lewin suggests, amongst other things, that experimenters need to watch out for comments such as "I better watch what I say in front of you" (op. cit., p105).

Evidence-Based Practice (EBP): Evidence-based practice is properly informed professional decision making. It is "the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients" (Sackett et al, 1996). "It is a systematic approach to integrating current scientific evidence" (source) [alternative definitions]. EBP is, however, only as good as the available evidence, and that is usually less than conclusive. The philosophy therefore requires that practitioners are sensitive to levels of evidence. Moreover, even where the evidence base is sound, it is being constantly extended (hourly, indeed, in the fastest moving branches of science). [See the story of James Lind in the entry for control group.]

Experiment: See true experiment.

Experimental Methods: [See firstly research types and designs.] This is a class of research design intended to approximate to the ideal of the true experiment, and therefore characterised by structured observation of the effects of one or more deliberately manipulated independent variables on a single dependent variable, while the effects of (ideally all) other possible causes are tightly controlled. [Now see the separate entries for field experiment, natural experiment, quasi-experiment, true experiment.]

Experimental Psychology: This is one of the two basic types of scientific psychology identified by Cronbach (1957) (the other being correlational psychology). The method is based upon the scientist changing this or that condition "in order to observe their consequences" (p10). The experimenter is thus "interested only in the variation he himself creates", unlike the correlator, who is interested in the variation which is already there.

Experimenter Bias Effects: Same thing as experimenter effects.

Experimenter Effects: The phrase "experimenter effects" refers to the ability of experimenters by carelessness and lack of attention to detail to bias their research, for example, by failing to prevent demand characteristics, the Hawthorne effect, etc.

Experimenter Expectancy: This is one of the eight types of confounding identified by Lewin (1977). The possibility that experimenters themselves can subtly influence their participants' behaviour. [See Pygmalion effect.]

Ex Post Facto Research: This is one of the recognised subtypes of the experimental method.

External Reliability: [See firstly reliability.] This is one of the two forms of reliability (the other being internal reliability). "The degree of consistency of a measure over time" (Bryman and Cramer, 1997, p63).

Face Validity: [See firstly validity.] A test may be said to have face validity if, upon simple inspection, it appears to the subject to measure "what it claims to measure" (Kline, 2000b, p18). Kline warns that this can sometimes be a good thing (it may motivate subjects to perform well), and sometimes a bad thing (the target measure may be so obvious as to promote deliberate mis-performance).

Factor Analysis: This is one of the two main factor analytical methods (the other being principal components analysis). The method requires the accumulation of scores on a number of simultaneous variables for each subject, and then performing multiple correlations.

Factor Analytical Methods: These are among the most powerful correlational methods of scientific research, and the method of choice when investigating multiple causation. There are two specific statistical procedures under this heading, namely factor analysis proper, and principal components analysis. Enthusiasts for factor analytical methods are quick to point out that science cannot advance by hypothesis testing alone.

Factor Loading: See factor analytical methods in general and loadings in particular.

False Negative: [See firstly diagnostic tests and screening procedures.] This is AN INCORRECT diagnostic judgement that an entity DOES NOT fall within a target category. [See now negative predictive value.]

False Positive: [See firstly diagnostic tests and screening procedures.] This is AN INCORRECT diagnostic judgement that an entity DOES fall within a target category. [See now positive predictive value.]

Falsification: See principle of falsification.

Fatigue Effect: This is a class of confounding which might be encountered with a prolonged or physically demanding research procedure, and in which performance on the later items will be tailing off. Fatigue effects may be controlled for to a certain extent by going for a more sophisticated design, perhaps with counterbalancing of trials.

Ferguson's Delta: [See firstly discriminatory power.] This is an index of discriminatory power devised by Ferguson (1949).

Fisher, Sir Ronald: Sir Ronald Aylmer Fisher (1890-1962) was the statistician who devised the logic of the null hypothesis during hypothesis testing. His book "Statistical Methods for Research Workers" (Fisher, 1925/2004 online) has been described as "probably the most influential book on statistics of the 20th century" (source).

Frequency Data: This is a subtype of nominal data.

Gaussian Distribution: See normal curve.

Gosset, William: William Sealy Gosset (1876-1937) was the Guinness brewery quality assurance chemist who, under the pseudonym "Student", popularised Student's t-test (Student, 1908) as a practical method of comparing the strength and composition of small samples [fuller story].

Hawthorne Effect: [See firstly bias and confounding.] This is the name given to the phenomenon whereby the mere act of observing a behaviour can change it. The effect was first formally documented by Mayo (1933, 1945), following field research at the Western Electric Hawthorne Works, Chicago, between 1927 and 1932, in which the main driver of plant productivity turned out to be the presence of the researchers, rather than anything to do with the working conditions [fuller story]. The Hawthorne effect is an excellent example of attention bias in action.

Homoscedasticity: See homogeneity of variance.

Hypothesis: "A hypothesis states the relationship between two (or more) variables [and] takes a form such as 'if variable A is high, then variable B will be low'" (Lewin, 1979, p37).

Hypothesis Testing: [See firstly hypotheses.] This is the act of putting one's theoretical beliefs to objective and peer-replicable test. Hypothesis testing will normally be supported by inferential statistics. Mathematically, there are a number of ways to go about this, but the most popular method in the social sciences was devised by Fisher, Sir Ronald, and is so structured as to involve an attempt to disprove the null hypothesis and simultaneously to provide some estimate of confidence level in the form of a p-value. Hypothesis testing is also the backbone of the hypothetico-deductive method, on which nothing less than the scientific method itself is based [not everyone agrees totally with this – see Cattell’s (1952) comments in the entry for factor analytical methods].

Hypothetical Construct: A hypothetical construct (or "construct", for short) is a presumed internal quality of a system, beyond direct observation, whose presumed operation accords with available empirical data. Alternatively, "a construct is some postulated attribute of people, assumed to be reflected in test performance" (Cronbach and Meehl, 1955/2005 online). Constructs are therefore part of a theory, and may, in turn, map onto one or more variables, each of which may be operationalised as observable measures in a number of different ways. Examples: stress, memory. [See now construct validity.]

Impression: In the context of this glossary, an "impression" is a statement of best clinical judgment, an attempt at medical diagnosis which allows for an element of residual uncertainty. It reflects how a patient "looks" (or, more formally, "presents"), rather than "what they have got".

Independent t-Test: Same as unrelated t-test.

Independent Variable (IV): See the entry in Section 1 hereto.

Inductive Reasoning: See the entry for this topic in our Rational Argument Glossary.

Inference: See the entry for this topic in our Rational Argument Glossary.

Inferential Statistics: [See firstly statistics.] This is a class of mathematical procedures designed to establish the likelihood of a causal relationship existing between blocks of empirical observations. Two major subclasses of inferential statistic are recognised, namely correlational methods and tests for the difference of group means methods. [Compare descriptive statistics.]

Interaction: See analysis of variance.

Internal Consistency: [See firstly reliability and validity.] This is one of the two considerations of a test's reliability (the other being test-retest reliability). A form of validation of a multiple item test, characterised by the fact that "the criterion is none other than the total score on the test itself" (Anastasi, 1990, p55).

Internal Reliability: [See firstly reliability.] This is one of the two forms of reliability (the other being external reliability). The extent to which a scale "is measuring a single idea" (Bryman and Cramer, 1997, p63).

Interval Data: [See firstly interval.] This is a collection of observations of an interval-based variable.

Intervention: This is the act of remediation itself, that is to say, the treatment which is actually delivered to the person in need (the alternative being to do nothing and let nature take its course).

Intervention Study: This is the research programme needed to establish the efficacy and/or effectiveness of a remediation programme. The simplest research design is to divide a group of sufferers into matched Experimental (E) and Control (C) groups. The E-Group then receives the remediation for a given period of time whilst the C-Group receives an equally complex but medically/educationally neutral dummy treatment (itself a major ethical problem). Improvements over time (if any) are measured, and - if bias and confounding have been properly controlled (a massive design problem), and if the measuring instruments are valid and reliable (another massive problem) - any changes can only have resulted from the treatment (although it still might not prove cost-effective). (See main text for examples.)

IV: See independent variable.

Kendall Coefficient of Agreement (u): This is a test to detect an underlying logical pattern in a series of repeated paired comparisons, such as might be obtained if a group of judges was only ever presented with two items at a time out of a sample, rather than ranking the entire sample.

Kendall Coefficient of Concordance (W): This is a test for the correlation of more than two variables, for ordinal data.

Kuder-Richardson Reliability Coefficient: [See firstly reliability in general and consistency in particular.] This is a measure of inter-item consistency devised by Kuder and Richardson (1937).

Longitudinal Study: This is an intervention study where the effects are reassessed not just at the end of the initial intervention period but at intervals over many years. Longitudinal studies are therefore the preferred method of evaluating educational initiatives and the like, where the final results need to work their way in real time through the normal developmental life-cycle. Cohen and Manion (1989, p71) distinguish four subtypes of longitudinal study, namely the cohort study, the cross-sectional study, the ethogenic study, and the trend study.

MANOVA: See analysis of variance.

Measurement: "Measurement is the numerical estimation of the ratio of a magnitude of an attribute to a unit of the same attribute" (Michell, 1997, p383). Alternatively, it is a quality or quantity arrived at by observation of an operationalised measure of a variable in known circumstances.

Measurement Error: "Discrepancies between the observed value of your measurement and the 'true' value" (Fife-Schaw, 1995, p45).

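Median: This is the middle score in a distribution once the scores have been ranked in order, so that half the scores fall at or below it and half at or above it. (It is not, in general, the value midway between the lowest and highest scores.)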

Mode: This is the most common score (or band of scores) in a distribution.

Mortality: In the context of research, this is a type of bias in which subjects do not complete the study period. This might be literal mortality (as in medical research, for example) or figurative (as in student drop-out).

Multivariate Methods: "A collection of techniques appropriate for the situation in which the random variation in several variables has to be studied simultaneously" (Armitage and Berry, 1987, p326). The main multivariate methods are principal components analysis, factor analysis, discriminant analysis, and cluster analysis.

Naturalistic Observation: This is one of the three basic approaches to scientific research identified by Underwood (1966), and characterised by "the recording of behaviour as it occurs in a more or less naturalistic setting with no attempt to intervene" (p4; bold emphasis added).

Negative Correlation: See correlation coefficient.

Negative Predictive Value (NPV): [See firstly diagnostic tests and screening procedures.] A test's NPV is a measure of how good that test is at detecting true negatives when all its decision negatives are considered. It is calculated by substituting empirical observations into the formula NPV = TN / (TN + FN). When NPV is high it indicates that the false negative problem is under control.

Noise: Within the context of measurement theory, this is the same thing as random error.

Non-Parametric Statistics: [See firstly inferential statistics.] Non-parametric tests are mathematically "less powerful" than their parametric equivalents. This means that "given exactly the same data, a parametric test is more likely to lead to significant results than a non-parametric test" (Snodgrass, 1977, p357). The most commonly used non-parametric statistics are the binomial sign test (for single sample designs), the Mann Whitney U test (for two-group unrelated designs), the binomial sign test or the Wilcoxon signed ranks test (for two-group related designs), and the Kruskal Wallis test (for unrelated designs with three or more groups). [Compare parametric statistics.]

Norm: This term refers to the performance of the standardisation sample on a given test (Anastasi, 1990), and thus that which gives meaning to the test scores of subsequent samples. [See now developmental norms.]

Normal Distribution: [Alternatively "Gaussian distribution" or "the bell curve".] See separate dedicated handout and exercises.

NPV: See negative predictive value.

Null Hypothesis: [See firstly hypothesis testing.] This is a deliberate statement of the contrary of what you really suspect to be the case. Thus, if one's true hypothesis is that <TALL MEN LIVE LONGER>, then the null hypothesis is simply that <TALL MEN DO NOT LIVE LONGER>. Although this may seem an unnecessary complication in putting an argument across (because our mental problem space is of limited capacity and extra words, especially negatives, take up that space), it is useful for technical reasons when carrying out inferential testing. This is because group difference statistics, by their mathematical nature, make the initial presumption that two samples are from the same population until proved otherwise. They then report when the means of the two samples move apart far enough for that null presumption eventually to be dismissed.
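
The "same population until proved otherwise" presumption can be sketched with an exact permutation test. This is a swapped-in illustration rather than one of the group difference statistics named elsewhere in this glossary, and the ages at death are hypothetical. If the null presumption holds, the group labels are arbitrary, so we count how often a relabelling of the pooled scores separates the means by at least as much as the split we actually observed.

```python
from itertools import combinations
from statistics import mean

# Hypothetical ages at death (illustrative only).
tall = [82, 85, 88, 91]
short = [70, 73, 76, 79]

pooled = tall + short
observed = mean(tall) - mean(short)   # difference we actually saw

as_extreme = 0
total = 0
# Under the null presumption every way of labelling four scores "tall" is equally likely.
for idx in combinations(range(len(pooled)), len(tall)):
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if mean(group_a) - mean(group_b) >= observed:
        as_extreme += 1

p_value = as_extreme / total
print(f"{as_extreme} of {total} relabellings are at least as extreme; p = {p_value:.4f}")
```

A small proportion means the observed separation would be rare if both samples really came from one population, which is the point at which the null presumption is dismissed.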

Observation: "The essence of studying anything [is] the observation of changes in variables" (Coolican, 1990, p15).

"Observation exists at the beginning and again at the end of the process: at the beginning, to determine more definitely and precisely the nature of the difficulty to be dealt with; at the end, to test the value of [the action taken]. Between those two termini of observation, we find the more distinctively mental aspects of the entire thought cycle: (i) inference, the suggestion of an explanation or solution; and (ii) reasoning, the development [of] the suggestion. Reasoning requires some experimental observation to confirm it, while experiment can be economically and fruitfully conducted only on the basis of an idea that has been tentatively developed by reasoning. [.....] The disciplined, or logically trained, mind - the aim of the educative process - is the mind able to judge how far each of these steps needs to be carried out in any particular situation. No cast iron rules can be laid down. Each case has to be dealt with as it arises [.....]. The trained mind is the one that best grasps the degree of observation, forming of ideas, reasoning, and experimental testing required in any special case, and that profits the most, in future thinking, by mistakes made in the past. What is important is that the mind should be sensitive to problems and skilled in methods of attack and solution." (Ray, 1967, p157; italics original.)

One-Tailed Test: [See firstly tests for the difference of two means.] This is a directional application of one of the two-group inferential statistics, namely the t-test, the Wilcoxon, or the Mann-Whitney. It is so called, because the statistical procedure only has to deal with separation of the two distributions down one or other of the asymptotes (or "tails"), but not both.

Operationalise, To: This is the act of assigning a particular physical dimension as a measure of a particular research variable. Example: One might conceptualise the hypothetical construct stress as including autonomic changes, one of which might be adrenaline-related, and then operationalise a measure of stress as serum adrenaline, heartbeat per minute, temperature of thumb, salivary cortisol, irritable outbursts per hour, or anything you like, providing you can defend the construct validity of your eventual findings.

Opportunity Sampling: [See firstly sampling.] This is the act of selecting a research sample according to who is available to take part in it, rather than according to more precisely derived criteria.

Paired Comparison: See Kendall coefficient of agreement.

Parallel Form: In order to avoid practice effects which might otherwise prevent using the same assessment twice on the same subjects, many psychometric packages offer two (or more) item sets, matched for difficulty. These are known as the parallel forms of the test.

PCA: See principal components analysis.

Pearson, Karl: Karl Pearson (1857-1936) was the statistician who devised the Pearson product moment correlation and the chi-squared test.

Piliavin, Rodin, and Piliavin (1969): This is the class-defining study into bystander apathy.

Placebo Group: This is a group in an intervention study given a dummy treatment, and (usually) kept unaware of that fact.

Population: This is all the members of a uniquely definable group of people or things. [Compare sample.]

Positive Correlation: See correlation coefficient.

Positive Predictive Value (PPV): A test's PPV is a measure of how good that test is at detecting true positives when all its decision positives are considered. It is calculated by substituting empirical observations into the formula PPV = TP / (TP + FP). When PPV is high it indicates that the false positive problem is under control.

Practice Effect: This is a class of confounding which might be encountered when the measure in question is itself a learnable mental or physical skill.

Predictive Validity: [See firstly validity and criterion validity.] Data collection instruments such as questionnaires, test batteries, or psychometric tests may be said to have predictive validity to the extent to which they have demonstrated the ability to detect the people they want to find. Predictive validity is therefore a major requirement in healthcare (where tests are used to select/reject patients for treatment) and education (where tests are used to select/reject students). Establishing an instrument's predictive validity requires prolonged field data collection and analysis, but provides a very important statistic to be able to quote. There is an enormous science of predictive value for diagnostic tests within medical decision making, but this is best avoided for the moment.

Predictor Variable: This is an optional name for independent variable.

Premiss: This is an optional spelling of premise.

Principal Components Analysis (PCA): [See firstly factor analytical methods.] This is a factor analytical method of screening a large number of simultaneous [i.e. multivariate] measures for those which - because they vary together all or most of the time - may be better regarded as the outcome of a broader underlying factor. What we want to end up with is new and better variables, such that each "has the highest possible variance and so represents better than any other linear combination of the [original variables] the general differences between individuals" (Armitage and Berry, 1987, p327).

Principle of Falsification: Popper's (1959) assertion that the scientific method is ultimately based on our ability to prove an assertion is false (by finding a counterexample to it), but NOT to prove one is true.

p-Value: [See firstly confidence level.] Academic journals usually adopt the standard mathematical shorthand here, reporting probability as p [hence "p-values"]. The values of p can run from zero (totally improbable) to 1.0 (certain), and are usually reported to two decimal places. The probability can be converted to a percentage by multiplying by 100. Thus an event with a probability of p=0.57 is likely to happen 57% of the time.

Probability: This is a mathematically expressed measure of how likely something is to happen. [See now p-value.]

Professional Opinion: See levels of evidence.

Pygmalion Effect: This is one of the classical examples of an expectancy bias, first studied in schoolteachers by Rosenthal and Jacobson (1968).

Qualitative Research: This is research in which the critical variable is a quality [compare quantitative research].

Quantitative Research: This is research in which the critical variable is a quantity [compare qualitative research].

Quasi Experiment: [See firstly experimental methods.] This is a form of experiment, developed originally in educational research, in which it is not possible to allocate subjects to the various IV conditions.

Random Error: [See firstly measurement.] This is the inherent inaccuracy of any scale of measurement.

Randomised Controlled Trial (RCT): The RCT is a robust and well-tried research design and a key element in delivering evidence-based practice in healthcare (or, indeed, any other professionalism). It is "randomised", because it does not pre-select subjects who are in some way likely to fit "the treatment" being evaluated. Instead, participants are drawn at random from the largest practicable pool. It is "controlled" in the sense that it includes control groups who do NOT receive the treatment in question but who IN EVERY OTHER RESPECT are treated identically. This is to insure against making what are known as Type I errors should a variable other than the treatment be surreptitiously at work. What you want to see is an improvement in the treated group but no change in the controls. RCTs are also expected to be blind or double-blind where necessary, to avoid bias and confounding generally, to do their best to design out practice effects, order effects, fatigue effects, and ceiling effects, and to maximise (or at least quantify) the many subtypes of research validity and reliability.

Range: [See firstly distribution.] The range of a distribution is the difference between that distribution’s lower extreme and its upper extreme. Example: If the lowest value in a distribution is 43 and its highest value is 93, then subtracting the former from the latter gives us a range of 50. [To see how the range can be used in descriptive statistics, see box-and-whisker plot.]

RCT: See randomised controlled trial.

Reactance: This is one of the eight types of confounding identified by Lewin (1977). It reflects the possibility that stubbornness on the part of subjects will cause responses deliberately opposite to that which might otherwise have been made.

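Receiver Operator Characteristics (ROC): [See firstly diagnostic tests and screening procedures.] Most diagnostic tests produce a score which is compared against a cut-off value, and the ROC describes how sensitivity and specificity trade off against each other as that cut-off is moved. If you adjust the cut-off to a lower value (assuming that higher scores indicate the condition), then it makes the test more sensitive, but only at the cost of having to put up with more false positives. If you set it higher then it makes the test more specific, but only at the cost of having to put up with more false negatives.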
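
The trade-off can be sketched by moving a cut-off across two small sets of hypothetical scores. The sketch assumes that higher scores indicate the condition and that a score at or above the cut-off counts as a positive decision; both are illustrative assumptions, as are the scores and the function name.

```python
# Hypothetical test scores (illustrative only).
positives = [4, 6, 7, 8, 9]   # people who really have the condition
negatives = [1, 2, 3, 5, 6]   # people who really do not

def rates_at(cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called positive."""
    tp = sum(score >= cutoff for score in positives)
    fn = len(positives) - tp
    fp = sum(score >= cutoff for score in negatives)
    tn = len(negatives) - fp
    return tp / (tp + fn), tn / (tn + fp)

roc_points = {cutoff: rates_at(cutoff) for cutoff in (3, 5, 7)}

for cutoff, (sensitivity, specificity) in roc_points.items():
    print(f"cut-off {cutoff}: sensitivity {sensitivity:.1f}, specificity {specificity:.1f}")
```

Plotting sensitivity against (1 - specificity) for every possible cut-off gives the ROC curve itself.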

Reflective Practice: This is a state of perpetual critical appraisal in professionals which attempts to exclude those irksome errors of omission by enhancing the vision of exactly what full clinical autonomy actually involves. Reflective practitioners are seen as preventers who constantly question their means of prevention, as assessors who constantly question their methods of assessment, as interveners who constantly question their proposed point of intervention, and so on.

Reliability: This is a measure of how well a given test reflects "'true' differences in the characteristics under consideration" (Anastasi, 1990, p109). Reliability is usually considered under two subheadings, namely internal consistency (e.g. Bryman and Cramer, 1997; Kline, 2000b), and its stability over time (or "test-retest reliability") (e.g. French and Michael, 1968; Kline, 2000b). Bryman and Cramer (1997) distinguish internal reliability from external reliability, and Armitage and Berry (1987) discuss the whole area of diagnostic tests and screening procedures as a reliability issue. It may also be appropriate, depending upon the particular research set-up, to assess scorer reliability. There are three "off-the-peg" tests of reliability, namely split-half reliability, test-retest reliability, and item consistency, but there is usually also scope for some project-specific hypothesis testing using the full range of inferential statistics.

Repeated Measures Design: [See firstly research design]. This is a class of experimental designs in which the same subjects are tested under two or more conditions, and the resulting sets of scores are compared using one of the available tests for the difference of means. Example (1): In testing the hypothesis that men are better than women at mathematics, you have no choice but to test different groups of subjects (because men cannot be women at the same time), so a repeated measures design is not available. Example (2): In testing the hypothesis that sober men are better at mathematics than drunk ones, you could (a) test the same subjects on different occasions (a repeated measures design), or (b) test different groups (an independent samples design). By and large, repeated measures designs are more powerful than independent samples designs, because each subject is compared only with himself, which removes between-subjects variance from the comparison.
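
The power advantage can be illustrated with a small sketch using hypothetical mathematics scores for four men tested once sober and once drunk. The raw scores vary widely from one man to the next, but the within-subject differences vary very little, and it is that smaller variability against which a repeated measures test judges the effect.

```python
from statistics import mean, variance

# Hypothetical scores for the same four subjects under each condition.
sober = [50, 60, 70, 80]
drunk = [46, 54, 66, 74]

# Each subject acts as his own control: work with the paired differences.
differences = [s - d for s, d in zip(sober, drunk)]

var_between_subjects = variance(sober)      # spread across different people
var_of_differences = variance(differences)  # spread of the within-subject change

print(f"mean difference: {mean(differences)}")
print(f"variance of raw sober scores: {var_between_subjects:.2f}")
print(f"variance of paired differences: {var_of_differences:.2f}")
```

An independent samples design would have to detect the same mean difference against the much larger spread of the raw scores.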

Replicability: This is the ease with which a piece of research can be precisely repeated using only the original write-up to go by. Given the requirements of the principle of falsification that good science is ultimately all about failing to find counterexamples of a test proposition, it follows that any author's research should be as precisely replicable as possible. [See now citing previous research.]

ROC: See receiver operator characteristics.

Role Expectations: Role expectations account for two of the eight types of confounding identified by Lewin (1979). With "good subject" role expectations, the possibility is that we try to do what we think we ought to do, and in the "bad subject" role we set out to respond anything but normally.

Rosenthal and Jacobson (1968): [See firstly bias and Pygmalion effect.] This is the classic study of expectancy bias. Teachers were given fabricated cover stories leading them to believe that some of their pupils were likely to be "academic spurters" during the coming year. Such students, who in fact were chosen at random from the available classroom population, showed an increase of 12 IQ points during the research year compared to 8 points in their "less gifted" colleagues. [Further discussion.]

Sample: This is the subset of the target population of objects or participants which is selected for research investigation. There are many practical procedures available for selecting one's sample, each with its own advantages and disadvantages.

Sampling Bias: [See firstly bias.] This is bias arising from logically flawed or carelessly executed sampling, resulting in a sample which does not fairly represent the population in question.

Scorer Reliability: This is a measure of how consistently a particular scorer will score the same raw data on different occasions.

Selection Bias: Same thing as sampling bias.

Sensitivity: [See firstly diagnostic tests and screening procedures.] This is a mathematically derived index of how good a test is at detecting true positives, that is to say, of how good that test is at detecting positives in a population of condition positives. It is calculated by substituting empirical observations into the formula TP / (TP+FN). High sensitivity is called for in tests where false negatives are either expensive or downright dangerous. False negatives in medicine result in missed opportunities for treatment, and in education they result in delayed or lost opportunities for personal development. In practice, however, highly sensitive tests often give high numbers of false positives, so in isolation they are less than perfect measures. [See now receiver operator characteristics.]
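
As a minimal sketch of the formula TP / (TP+FN), the following function computes sensitivity from two counts. The function name and the screening figures are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def sensitivity(tp, fn):
    """Proportion of condition positives which the test detects: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no condition positives were observed")
    return tp / (tp + fn)

# Hypothetical screening outcome: 90 true positives, 10 false negatives.
result = sensitivity(tp=90, fn=10)
print(result)  # 0.9
```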

Sign Test: See binomial sign test.

Single Blind: [See firstly blind.] A single blind study is one in which EITHER the experimenter(s) OR the participants are naive as to the true purpose of the research. [Compare double blind.]

Single Case Research: This is an experimental variant of the case study method used within correlational research (Cohen and Manion, 1989).

Solomon Four Group Design: This is an experimental design introduced by Solomon (1946) which uses three control groups to avoid the possibility that any pre-test might itself affect the dependent variable. The traditional (Lewin, 1979) experimental group is given the pre-test, the main treatment, and the post-test in the normal way, and the traditional control group is given the pre-test and post-test but given only a control treatment. There is then a second control group which is not pre-tested, but does get the main treatment and the post-test, and a third control group which just gets the post-test. Differences between the experimental group and the second control group, or between the first and third control groups, "must be caused by pretesting" (Lewin, 1979, p107).
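
The logic of the four groups can be laid out as a small sketch. The group labels and the helper function below are illustrative assumptions, and a control treatment is coded simply as "no main treatment". The code lists every pair of groups which differ only in whether they were pre-tested, which are the pairs named in the comparison above.

```python
from itertools import combinations

# Every group receives the post-test; they differ only in these two respects.
GROUPS = {
    "experimental": {"pretest": True, "treatment": True},
    "control 1": {"pretest": True, "treatment": False},
    "control 2": {"pretest": False, "treatment": True},
    "control 3": {"pretest": False, "treatment": False},
}

def pairs_differing_only_in(factor):
    """Return the pairs of groups which differ on `factor` but match on the other one."""
    other = "treatment" if factor == "pretest" else "pretest"
    return [
        (a, b)
        for a, b in combinations(GROUPS, 2)
        if GROUPS[a][factor] != GROUPS[b][factor]
        and GROUPS[a][other] == GROUPS[b][other]
    ]

pretest_pairs = pairs_differing_only_in("pretest")
print(pretest_pairs)
# [('experimental', 'control 2'), ('control 1', 'control 3')]
```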

Spearman, Charles: Charles Edward Spearman (1863-1945) was the statistician who devised factor analysis, and who then famously applied this method to the analysis of the factors of human intelligence. He also devised Spearman's rank correlation.

Specificity: [See firstly diagnostic tests and screening procedures.] A test's specificity is a measure of how good that test is at detecting negatives in a population of condition negatives. It is calculated by substituting empirical observations into the formula TN / (TN + FP). High specificity is called for in tests where false positives are either expensive or downright dangerous. False positives in medicine result in inappropriate treatment or unnecessary referral, and in education they result at best in a harder than necessary student experience and at worst in course failure.
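
The formula TN / (TN + FP) can also be applied directly to a list of individual judgements. The sketch below assumes, purely for illustration, that each case is recorded as a pair (test result, true condition). The function name and the data are hypothetical.

```python
def specificity_from_cases(cases):
    """cases: iterable of (test_positive, condition_positive) pairs of booleans."""
    cases = list(cases)
    tn = sum(1 for test, condition in cases if not test and not condition)
    fp = sum(1 for test, condition in cases if test and not condition)
    if tn + fp == 0:
        raise ValueError("no condition negatives were observed")
    return tn / (tn + fp)

# Hypothetical outcome: 50 condition negatives (45 correctly cleared, 5 wrongly flagged)
# and 10 condition positives, which play no part in the specificity calculation.
cases = (
    [(False, False)] * 45
    + [(True, False)] * 5
    + [(True, True)] * 8
    + [(False, True)] * 2
)
result = specificity_from_cases(cases)
print(result)  # 0.9
```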

Split-Half Reliability: [See firstly reliability]. This is a statistical procedure by which a random half of the test items is correlated with the other half. This is often a more useful measure than test-retest reliability, because the data comes from a single test session, so there will be no practice, illness progression, recovery, mood, or similar effects.
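
A minimal sketch of the procedure follows, assuming that the data are held as one list of item scores per participant and that the two halves are compared using Pearson's correlation on the half-test totals. The function names, the fixed random seed, and the response data are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(item_scores, seed=0):
    """Correlate participants' totals on a random half of the items with the other half."""
    n_items = len(item_scores[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)  # seeded so that the split is repeatable
    half_a, half_b = items[: n_items // 2], items[n_items // 2 :]
    totals_a = [sum(person[i] for i in half_a) for person in item_scores]
    totals_b = [sum(person[i] for i in half_b) for person in item_scores]
    return pearson_r(totals_a, totals_b)

# Hypothetical data: five participants, six items each scored 0 to 4.
responses = [
    [4, 3, 4, 4, 3, 4],
    [1, 0, 1, 2, 1, 0],
    [2, 3, 2, 2, 3, 3],
    [0, 1, 0, 1, 0, 1],
    [3, 3, 4, 3, 2, 3],
]
r = split_half_reliability(responses)
print(round(r, 2))
```

Different random splits will give slightly different coefficients, which is why the seed is fixed here.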

Standard Score: [See firstly standard deviation.] This is a score which indicates its own relative position within a distribution, that is to say, one which expresses "the individual's distance from the mean in terms of the standard deviation of the distribution" (Anastasi, 1990, p84). This typically requires that raw scores are mathematically converted in some way, and that the distribution approximates to normal.
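
One common conversion of this sort is the z-score, that is to say, the raw score minus the mean, divided by the standard deviation. The sketch below assumes that the reference scores are treated as the whole standardisation population (hence the population standard deviation), and both the function name and the data are hypothetical.

```python
from statistics import mean, pstdev

def standard_score(raw, distribution):
    """Distance of a raw score from the mean, in standard deviation units (a z-score)."""
    sd = pstdev(distribution)  # assumption: scores are treated as the full population
    if sd == 0:
        raise ValueError("all scores are identical, so no standard score is defined")
    return (raw - mean(distribution)) / sd

# Hypothetical reference scores with a mean of 70.
norm_scores = [50, 60, 70, 80, 90]
z = standard_score(90, norm_scores)
print(round(z, 2))  # 1.41
```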

Standardisation: This is the process of calibrating a measure to a given population so that the future performance of other samples can be norm-referenced.

Stanine: [Abbreviation of STAndard NINE point scale.] This is a measurement system in which the distribution in question is divided into nine subranges. If the distribution is normal, then the stanines take up 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of cases respectively. Stanine #1 is then classified as "poor", #2 and #3 are grouped together as "below average", #4, #5, and #6 are "average", #7 and #8 are "above average", and #9 is "superior". (After Durost, 1968.)
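
The percentages above can be turned into a simple lookup, sketched below. It assumes that a percentile rank is already available for the individual, and that a rank falling exactly on a boundary belongs to the lower stanine. Both that convention and the function names are illustrative assumptions.

```python
from bisect import bisect_left
from itertools import accumulate

STANINE_PERCENTAGES = [4, 7, 12, 17, 20, 17, 12, 7, 4]
# Cumulative upper bounds: [4, 11, 23, 40, 60, 77, 89, 96, 100]
UPPER_BOUNDS = list(accumulate(STANINE_PERCENTAGES))

def stanine(percentile_rank):
    """Map a percentile rank (0 to 100) onto a stanine (1 to 9)."""
    if not 0 <= percentile_rank <= 100:
        raise ValueError("percentile rank must lie between 0 and 100")
    return bisect_left(UPPER_BOUNDS, percentile_rank) + 1

def band(stanine_number):
    """Verbal classification of a stanine, after the grouping given above."""
    if stanine_number == 1:
        return "poor"
    if stanine_number <= 3:
        return "below average"
    if stanine_number <= 6:
        return "average"
    if stanine_number <= 8:
        return "above average"
    return "superior"

print(stanine(50), band(stanine(50)))  # 5 average
```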

Structured Interview: See interview.

"Student": See Gosset, William.

Student's t-Test: See t-test.

Syllogism: See this entry in our Rational Argument Glossary.

Test-Retest Reliability: [See firstly reliability]. This is one of the two aspects of reliability identified by French and Michael (1968) [the other being internal consistency]. A "time-associated reliability", which can readily be quantified using a correlation coefficient derived from correlating results from the same test administered twice to the same subjects, at least three months apart. The square of the correlation coefficient gives the "degree of agreement". [See now the advantage of having parallel forms of the test in question.]

Tests for the Difference of Two Means: This is one of the two basic types of tests for the difference of means (the other being tests for the difference of more than two means).

Theory: A theory is a body of empirically verified observation, plus a particular interpretation. It is thus an attempt to make sense of a number of confirmed hypotheses by drawing them together into a more meaningful whole (Lewin, 1979).

Thurstone, Louis: Louis Leon Thurstone (1887-1955) was a psychometrician who used factor analysis techniques in the study of intelligence and its internal structures.

True Experiment: [See firstly experimental methods.] This is the "ideal" form of experiment. One in which the researcher has the power, money, ethical approval, and ability to manipulate all the necessary independent variables.

True Negative: [See firstly diagnostic tests and screening procedures.] This is A CORRECT diagnostic judgement that an entity DOES NOT fall within a target category. [See now negative predictive value.]

True Positive: [See firstly diagnostic tests and screening procedures.] This is A CORRECT diagnostic judgement that an entity DOES fall within a target category. [See now positive predictive value.]

True Score/Value: [See firstly measurement.] This is the (in-fact-unattainable) ideal of a score which has no measurement error.

Two-Tailed Test: [See firstly tests for the difference of two means.] A non-directional application of one of the two-group inferential statistics, namely the t-test, the Wilcoxon, or the Mann-Whitney. So called, because the statistical procedure has to deal with separation of the two distributions down BOTH of the asymptotes (or "tails"). [Compare one-tailed test.]

Type 1 Error: [See firstly hypothesis testing.] This is a class of very bad science in which the null hypothesis is rejected when in fact it is true, thus causing its inverse, the hypothesis, to be accepted when in fact it should be rejected. In simple situations, one way to reduce the likelihood of Type 1 error is to select a tighter significance level (the 0.01 level, say, rather than the 0.05 level). [Compare Type 2 error, and see also confounding.]
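
The effect of moving from the 0.05 level to the 0.01 level can be seen in a small simulation, sketched below under stated assumptions. Both samples are always drawn from the same population, so the null hypothesis is true by construction and every rejection is a Type 1 error. For simplicity, and so as to need only the standard library, a z-test with a known standard deviation stands in for the t-test. The function name, sample size, number of trials, and seed are all arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def false_rejection_rates(trials=2000, n=20, seed=1):
    """Return the proportion of true null hypotheses rejected at the 0.05 and 0.01 levels."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    rejected_05 = rejected_01 = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # z statistic for the difference of two means, with a known sd of 1 in each group.
        z = (mean(a) - mean(b)) / sqrt(2 / n)
        p = 2 * (1 - standard_normal.cdf(abs(z)))  # two-tailed probability
        rejected_05 += p < 0.05
        rejected_01 += p < 0.01
    return rejected_05 / trials, rejected_01 / trials

rate_05, rate_01 = false_rejection_rates()
print(rate_05, rate_01)
```

The two proportions should come out close to 0.05 and 0.01 respectively, which is simply what those levels mean.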

Types of Research: Underwood (1966) identifies three basic types, namely naturalistic observation, the correlational method, and the experimental method. Cronbach (1957), however, disagrees that there is such a thing as an observational method, seeing observation as a type of measurement, not as a basic type of research. He recognises only correlational psychology and experimental psychology.

Unrelated t-Test: This is one of the two possible types of t-test (the other being the related t-test). The test of choice when a research design delivers two columns of scores from comparison groups (that is to say, a between-groups design). [For an e-tutorial on how to use SPSS to carry out your unrelated t-tests, click here. Note the suggested format for the final write-up.]

Validity: This is the issue of whether a test is measuring what you think it is measuring, and thus whether the piece of research in question is valuable science or not. Unfortunately, there are a large number of ways in which research can be invalid. To start with, there may be poor conceptualisation (i.e. an iffy psychological construct), poor build (i.e. iffy items which are ambiguous, unclear, or incomplete), or poor administration (bias). There are many aspects to this major issue of scientific quality, including construct validity, content validity, criterion validity, face validity, and predictive validity. Referring specifically to psychometric tests, rather than to scientific conclusions in general, French and Michael (1968) consider only content validity, criterion validity, and construct validity. Kline (2000b), on the other hand, emphasises face validity, concurrent validity, predictive validity, content validity, incremental and differential validity, and construct validity.

Variable: See the entry in Section 1 hereto.