Course Handout - An Introduction to
Data Modelling for Semantic Network Designers
Copyright Notice: This material was
written and published in Wales by Derek J. Smith (Chartered Engineer). It forms
part of a multifile e-learning resource, and subject
only to acknowledging Derek J. Smith's rights under international copyright law
to be identified as author may be freely downloaded and printed off in single
complete copies solely for the purposes of private study and/or review.
Commercial exploitation rights are reserved. The remote hyperlinks have been
selected for the academic appropriacy of their
contents; they were free of offensive and litigious content when selected, and
will be periodically checked to have remained so. Copyright
© 2010, High Tower Consultants Limited.
|
|
First published online 08:30 BST 1st October
2004, Copyright Derek J. Smith (Chartered Engineer). This
version [HT.1 - transfer of copyright] dated 12:00
13th January 2010
|
Although this paper is reasonably self-contained, it is primarily designed to be read as a subordinate file to our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (especially, Parts 5 to 7 thereof). |
1 - Introduction
"Only about 16% of IT
projects can be considered truly successful" (Computer Weekly, 27th April 2004).
The
"In no case can one yet
say that medical care of ill patients actually depends upon a computer or
information system. Why is this? [Firstly,] medical people have been extremely
slow to spell out in a cohesive and organised form
the conditions under which they wish to work with an information system. [And
secondly,] the flagrant and consistent 'over sell' of capability on the part of
manufacturers and computer enthusiasts." (Lindberg, 1967.)
Lindberg
was closely followed by G. Octo Barnett, an American cardiologist, who had been
involved during the late 1960s in trials of an early
time-sharing system at
So
why are we so bad at following advice? Well, one of the secrets of getting a
computer system to work is to understand that the management failures are only
ever the secondary cause of a systems disaster, for even the weakest
manager can look good if nothing ever goes wrong. What really does the damage
is an underlying technical problem combined with slack
management, for then all hell breaks loose. Which brings us to the point of
this section, which is that modern systems are by and large only as successful
as their "data model" .....
Key Concept - The Data
Model: We shall be looking at the concept of data models in
considerable detail over the coming pages, so here is a temporary definition to
be getting on with: A data model is a formally commissioned compilation of
empirical research findings concerning the data known to a system, such that it
becomes the sum total of an organisation's
understanding of the informational content of its world (or at least of a
particular precisely delineated part of that world). It is "a way of
representing a business enterprise (or other similar organisation
such as a government agency) on paper, much the same as a street map is a way
of representing a city on paper" (Relational Systems Corporation, 1989).
And if your metaphorical map is substandard, of course, then you will get lost.
In short, it is a system's investment in a high-specification data model which
most prevents the sort of technical problems referred to above.
A
data model is thus an excellent example of what modern commentators like
to call "metadata" .....
Key Concept -
Metadata: Metadata is data about data. It is facts about the
facts themselves, as when we support a simple proposition with a number of
ancillary propositions [Example: To say that oxygen is the eighth
element by atomic number is to state a bald chemical fact, but to add that it
was Joseph Priestley who discovered that fact is to have supporting knowledge].
The role of metadata in data modelling is vital and
includes such ancillary information as field sizes, character set, hierarchical
structure, usage, and retention period. For more on the systems analyst's view of data see Section 1.4 of our e-tutorial
on "IT Basic Concepts", and for more on
the programming aspects see Section 2.3 of our e-paper on "Short-Term Memory Subtypes in
Computing and Artificial Intelligence" (Part 6).
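As a minimal sketch of what such metadata might look like when written down, the following Python fragment records the ancillary facts listed above (field size, character set, hierarchical structure, usage, and retention period) for two data fields, and then uses them to vet a data value. The `FieldDefinition` class, the field names, and all the values are hypothetical, invented purely for illustration; they are not taken from any particular data dictionary product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDefinition:
    """Metadata about one data field: facts about the data, not the data itself."""
    name: str
    size: int                # maximum length in characters
    character_set: str       # "alphabetic" or "numeric" in this sketch
    parent: Optional[str]    # hierarchical structure: the owning group, if any
    usage: str               # what the field is for
    retention_years: int     # how long values must be kept


# A hypothetical two-entry data dictionary (all values are assumptions).
DATA_DICTIONARY = {
    "customer-surname": FieldDefinition(
        "customer-surname", 30, "alphabetic", "customer-name",
        "identifies the account holder", 7),
    "account-number": FieldDefinition(
        "account-number", 8, "numeric", None,
        "unique key of the account record", 7),
}


def conforms(field_name: str, value: str) -> bool:
    """Check a data value against the metadata held for its field."""
    definition = DATA_DICTIONARY[field_name]
    if len(value) > definition.size:
        return False
    if definition.character_set == "numeric":
        return value.isdigit()
    if definition.character_set == "alphabetic":
        # Allow hyphens and spaces inside otherwise alphabetic values.
        return value.replace("-", "").replace(" ", "").isalpha()
    return True
```

The point of the sketch is that `conforms` never looks at what an account number means; it consults only the data about the data.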
This
paper traces the evolution of the data model over the last half century, and
then considers the cross-disciplinary relevance of the underlying concepts to
two particular areas of cognitive scientific research, namely the philosophy of
mind and the use of semantic networks within artificial intelligence. We begin
with some historical perspective .....
2 - Historical Background
"The chain [Marley] drew
was clasped about his middle. It was long, and wound about him like a tail; and
it was made (for Scrooge observed it closely) of cash-boxes, keys, padlocks,
ledgers, deeds, and heavy purses wrought in steel." (Charles Dickens:
"A Christmas Carol"; bold emphasis added. Note the nature of
commercial "data processing" half a century before Herman Hollerith's
1890 punched card system ushered in the modern age.)
During
the Second World War, computer programmers tended to be university
mathematicians attached to the military. They worked typically on top-secret
projects like automated gun-laying, the atom bomb, or code-breaking [illustrative history],
and they were their own harshest critics because they knew exactly what they
were trying to achieve (in many instances better than anyone else on the planet
at the time). The unit of software development was therefore the individual
computer program, and the software development process relied on a technique
known as "functional decomposition" [tutorial],
in which the overall "functionality" of the system (as defined
by those who were paying for it) was progressively broken down into chunks of
logic precise enough to be coded. Only at the last moment was either the
location or the nature of the associated data ever taken into account.
As
the years went by, however, increasing hardware capacities allowed systems to
become more complex, and it eventually became clear that there were flaws
inherent in the function-first (or "function-driven") approach. On
any or all of three counts, systems whose component programs had been
constructed in this way were nightmares to keep integrated. Firstly, the data
output from program A was never what was needed by programs B, C, and D, either
because program A was breaking new ground, or because their respective boffins had failed to keep each other fully informed.
Secondly, when changes were required to one program it would bring down
perfectly good programs elsewhere, requiring compensatory changes to them as
well; which then brought down yet other programs in turn, and so on. And
thirdly, data duplications across the various programs were allowed to become
the rule, meaning that when you amended a data definition at one location you
had also to update all its duplications (providing you could remember where they
all were, and providing you could cope with the practical difficulties of
amending card- and tape-based files).
To
start with, bugs and irritations of this sort were dealt with reactively,
that is to say, as and when they arose; programmers spent all their time
correcting their programs anyway, so an additional stream of data definition
problems was just another occupational hazard. There was no proactive
error prevention, in short, because there were simply not enough systems around
to justify the effort. Then things started to change. By the mid-1950s, computers like the IBM 650,
the UNIVACs, and the LEOs
were starting to make major inroads into the commercial data processing
marketplace [illustrative
history], whereupon the demand for programming skills suddenly exceeded the
available supply of boffins. New programmers were
therefore recruited from amongst the technically minded and trained to order.
Unfortunately, being now just ordinary folks, they were no longer able to work
out for themselves what they needed to do - you could make them good at the
mechanics of programming, sure, but they nevertheless lacked the boffins' instinctive feel for the systems end of things. So
out went programming as the largely self-specifying side of applied
electronics, and in came the skills of "systems analysis" to
help draw up the specifications for the programmers to work to .....
Key Concept -
Systems Analysis: Systems analysis is the search for abstract
principles in real-world systems, especially information processing systems
[see alternative Wikipedia Definition]. It is
"operational research" [glossary] taken from the
factory floor and applied instead to the world of information flow, and to do
it well calls for many of the skills and mind sets of "systems
thinking", as detailed in our e-paper on "Systems Thinking: The Knowledge
Structures and the Cognitive Processes". Systems analysis
also requires a portfolio of information reduction tools, many of which
represent their conclusions in diagrams and tables [e-tutorial; further detail]. For a detailed
history of the "systems men", see Haigh
(2001).
The
shortage of programmers also meant that errors now needed to be more diligently
avoided, and it was soon realised how many of them
had been self-inflicted all along and would disappear if just a little more
attention was paid to the underlying data structure at the beginning of the
development process. Data "definitions" accordingly became part of
standard program documentation, central registries started to be set up to
co-ordinate the hundreds, if not thousands, of data fields found in modern organisations, and constructing one's "data
divisions" became a major aspect of program writing.
ASIDE: To see what the
Data Division looks like in the COBOL programming language, see Section 2.3 of
our e-paper on "Short-Term Memory Subtypes in
Computing and Artificial Intelligence" (Part 6).
At
the same time, it became important for senior designers to ensure that data
files were set up with as little duplication as possible, and then used in the
most efficient sequence possible; and the best way to achieve this was to
set up a carefully integrated central copy of the data. This data
"pool" or "bank" or "base" could then be
administered by a dedicated team of suitably experienced professionals, and
made available on a "shared access" basis to whoever had a valid
need. Individual applications programs dealt only with the particular subset of
the central pool that they happened to be concerned with, whilst locks and
privacy mechanisms protected the integrity and availability of the rest like
gold dust. We mention all this because these are the principles of what we
know today as the "database", and the teams who administer the
process are known as "database administrators" (DBAs).
ASIDE: To a
not-inconsiderable extent, the modern world is its databases, for
without them the banking, point of sale, e-commerce, logistics, and knowledge
industries would be put back half a century at a stroke. Nevertheless, the
battle against the duplication and sloppy definition of data has only partly
been won, for although large systems are now meticulously analysed
as a matter of routine [albeit this does not stop the five-out-of-six failure
rate mentioned in Section 1], the proliferation of desktop and mobile systems
in the modern corporate world has destroyed any pretence of data standardisation (at least we know of no organisation
on earth who has this under control).
As
things worked out, systems analysis skills were not the only factor missing
from the systems development equation. IT projects now involved so many people
that far greater managerial discipline was needed as well, and so "IT
project management" was born .....
ASIDE: Amongst the first civilian
IT project managers were the Lyons Company's John Simmons and David Caminer [full story]. For an introduction to the
skills and qualities demanded of the role see our e-tutorial
on "IT Project Management", and to see what happens when
the managers get it wrong [which is what really accounts for the
five-out-of-six failure rate mentioned in Section 1], see our "IT
Project Mismanagement Disasters" database.
What
the early project managers did was to chart the activities and deliverables
involved in putting a system together, and then simply to arrange for
everything to be done in the least wasteful sequence, using the "critical
path analysis" techniques being developed around that time in the
aerospace industry [full
story]. This established an infrastructure of standards to observe and
controls to administer, into which were firmly embedded the all-important
technical analyses of data and function. Systems development now had an
"industry best practice" of its own, in which everyone knew what
needed to be done and in what order, and this, by the 1980s,
gave us what came to be known as the "structured development
methodology" .....
Key Concept - The
Structured Development Methodologies: Here is how we
defined this topic in our e-tutorial on "IT Project Management": "A
structured methodology is a staged development, with the final stage producing
the finished product, and each preceding stage producing a logical subset of
prior components, in much the same way that a car is put together on a factory
assembly line. It is the staged machining of components. It is the engineering
of systems. It is a philosophy of system development. Key features
are that it is simultaneously user-oriented (being driven from the outset by
carefully derived statements of user requirement) as well as product-oriented
(in that it then carefully specifies what is to be delivered, and when and
how). While there may well be minor differences across the market range of
structured development methodologies, their common denominator has always been
their unrelenting emphasis on developmental sequence." The production
stages which were standardised upon were Feasibility
Assessment, Project Set-Up (or "Initiation"), Requirements
Specification, Logical Design, Physical Design, Development, Delivery, and
Operation.
In
Britain, the most widely known of the structured methodologies came from the
Central Computer and Telecommunications Agency (now part of the Office of
Government Commerce) in the shape of SSADM
[short for "Structured Systems Analysis and Design Method"] and its
associated project management package called PRINCE [short for
"Projects in Controlled Environments"].
ASIDE: SSADM was a government-sponsored development of LSDM [short for Learmonth
Structured Development Method], itself arguably the first structured
development methodology. LSDM was put together by Learmonth and Burchett Management Systems [LBMS - now part of Computer Associates International, Inc.]
in the late 1970s. SSADM
first appeared in January 1981 after a formal evaluation of no less than 47
competing products, and has been the gold standard for British civil service
and commercial applications ever since. For further details, some help is
available online [click
here], and the booklet SSADM Version 4
is a good introduction (CCTA, 1991). The current
version of the control methodology is PRINCE 2. To read a brief note from the
methodology's sponsor, click here, for more of the detail, click
here, and for routes to a formal qualification in the method, click
here.
So
there we have it - systems with inferior data models will always fail, no
matter how clever their programmers, because sooner or later the accumulation
of individually small problems will exceed the team's ability to cope, whilst
systems with superior data models, although not totally immune to disaster, are
at least immune to the greatest single error stream, namely that arising from
confusion as to what the subject matter of one's computation actually is.
3 - The Logical-Physical Divide
"Once management realises the relationship of reliable data to corporate
well-being, they will treat their data with the same care used to handle their
cash" (Cahill, 1970, p23; cited in Haigh, 2004/2004 online).
Structured
development, then, is essentially staged development, with management
break-points separating requirements specification and logical design, logical
design and physical design, physical design and development, and development
and delivery. All these breakpoints are important, but by far the most
fundamental is the one separating the "logical" and the
"physical" stages of design, because this is the one which allows
analysts not to have to worry about the physical design decisions which will
eventually be based on their findings. It allows them time to get their data
structures right in the abstract, and both allows and requires the wise
project management team to keep the entire project firmly "on hold"
until that abstract understanding is complete. The ensuing challenge lies in
then "implementing" the logical design, that is to say, in
turning the logical design documentation into a particular physical system
mounted on a particular physical platform in a particular physical way.
Key Concept -
"Logical Design" vs "Physical
Design": The logical design stage of a computerisation project allows the data and function within
the systems area in question to be thoroughly researched and analysed. The practical value of the resulting reference
documentation is that it allows better physical design decisions to be
made when the time comes, and thus more successful systems to be built. The
process of giving physical dimensions to a previously logical design is known
as "implementing" that logical design, and we shall be saying a lot
more about what is involved in Section 8.
And
this is where the data model comes in. Remembering what we said earlier about
data being more fundamental than process, the most important technique at the
logical design stage turns out to be the creation of a record of the real world
objects the organisation is concerned with, how they
behave, and what their properties are. Here are some definitions from the
literature .....
"Data models are
techniques for representing information, and are at the same time sufficiently
structured and simplistic as to fit well into computer technology" (Kent,
1978, p93).
"[Data modelling is] the process of structuring real world facts
into real world concepts [and of] deriving a conceptual model as a result"
(Oxborrow, 1989, pp15/19).
"The primary purpose of any
data model [.....] is of course to provide a formal means of representing
information and a formal means of manipulating such a representation. For a
particular model to be useful for a particular application, there must exist
some simple correspondence between the components of that model and the
elements of that application; that is, the process of mapping elements of the
application into constructs of the model must be reasonably
straightforward." (Date, 1983, pp182-183.)
Figure
1 shows us how the data model fits in to the broader process of systems
development .....
|
Figure 1 - The Sequence of Events during Structured Development: With the arrival of the structured development methodologies in the early 1980s, it became standard practice to partition systems development both "vertically" (with the early developmental stages routinely diagrammed to the left of the later ones) and "horizontally" (in terms of whether you were focussing on the supporting data or the supported function). Here are the four Johari quadrants [bottom right] which emerge when the two stages of development (i.e. the logical and the physical) intersect with the two fundamental design aspects (i.e. the data and the function). The logical view of a system's function is set out in the "functional decomposition" [upper left quadrant], whilst the logical view of its data is set out in the DATA MODEL [lower left quadrant]. These map in due course onto the program structures [upper right quadrant], and the program data definitions [lower right quadrant], respectively. We have set the data model quadrant into RED-BOLD, because it is the key to doing the whole job properly. |
| | Developmental Phase: Logical Design | | Developmental Phase: Physical Design |
| Function | FUNCTIONAL DECOMPOSITION | WE'LL TALK ABOUT WHAT GOES ON IN THIS TRANSITIONAL BOX IN SECTION 8 | PROGRAM STRUCTURES |
| Data | DATA MODEL | | DATA DEFINITIONS |
Figure
1 is important because it gives us a context against which we can state the purpose
of the present paper quite precisely. For the reasons set out in the caption
above, we shall be concerning ourselves only with the lower left quadrant of
Figure 1 (the data-logical quadrant), and looking in greater detail at how to
construct one of these data models. We shall then reflect upon the skills
needed during the construction process, and consider what cross-disciplinary
relevance those skills might have for psychology.
4 - The Bachman Diagram and the Database Management
System
"The Data Base Management
System (DBMS) is the foundation of almost every modern business information
system. Virtually every administrative process in business, science, or
government relies on a data base. The rise of the Internet has only accelerated
this trend - today a flurry of database transactions powers each content update
of a major website, literature search, or internet shopping trip. Yet very
little research addresses the history of this vital technology, or that of the
ideas behind it. We know little about its technical evolution, and still less
about its usage." (Haigh, 2004/2004 online; bold emphasis added.)
|
This section shares some content with Section 3.1 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 5). |
To
summarise our argument so far, we have tried to view
the process of data modelling against the broader
context of systems development, so that those outside the systems industry can
see how and why so much time and money needs to be spent on this particular
activity. We now look in more detail at what this obsession with data actually
entails. How, for example, did the industry progress from a simple sheaf of
data definition slips to the database as we know it today, and when did the
data-first (or "data driven") approach first start to take hold? The
answer, in most tellings of the story, takes us back
to the early 1960s, and to the General Electric
Corporation's computing laboratories in New York, where one of GE's recent
recruits, Charles W. Bachman, was more or less single-handedly inventing
data modelling as an adjunct to developing GE's in-house
"Integrated Data Store" (IDS) database management system .....
Biographical Aside - Charles
W. Bachman (1924-): [See fuller biography] Famous in several
areas, we mention one-time "triple-A" technician Charles Bachman here
for having devised a version of the data-structure diagram known as the "Bachman
Diagram" [more on which below], and for showing how such diagrams
could make for highly effective use of "direct access" data
storage devices. Bachman was awarded the 1973 Turing
Award by the Association for Computing Machinery for this achievement. <<AUTHOR'S
NOTE: Never underestimate the importance of gunnery control computation in the
history of the world. As we explained in detail in Section 2 of our e-paper on "Short-Term Memory Subtypes in
Computing and Artificial Intelligence" (Part 2), the
demand for effective anti-aircraft predictor technology was a major driving
force for computer development during the 1940s.
Bachman was with the US Army Anti-Aircraft Artillery Corps from 1943 to 1946,
and gained early computing experience with the predictor systems on the 90mm anti-aircraft gun [picture
and specifications].>>
Historical Aside - The
"Database Management System": A Database Management System,
or "DBMS", is a complex software product designed to manage large
pooled stocks of data for you, and especially to allow that data to be accessed
by lesser software products called "application programs" [tutorial]. Databases are
thus the computer equivalent of the old-fashioned file index systems, but with
the advantage of very rapid search times. Haigh
(2004/2004 online) argues that we should view
the DBMS as a coming together of three originally separate earlier trends,
namely (a) the idea of a common pool of data, (b) the development of "file
management" software, and (c) the growing sophistication of "report
generator" software. Bachman himself claims that GE's 1957 "Report
Generator System" "was the first production data base management
system" (Bachman, 1980, p7), and he was himself
responsible for building a similar product at the Dow Chemical Company in 1958.
What
made the IDS system tick was a clever combination of two ideas. On the one hand
there was the then-brand-new "direct access" facility provided
by disk storage devices [which we have described in detail in Section 1.2 of
our e-paper on
"Short-Term Memory Subtypes in Computing and Artificial Intelligence"
(Part 4)], and on the other hand there was what Bachman called the "data
structure set", a method of preparing your data so that it could make
the best possible use of that access technology. The essence of the data
structure set was that each record could be "logically associated, as a
trailer, with multiple header records" (Bachman, 1980, p7),
that is to say, on a set owner/set member basis. And when you brought
these two ideas together, the result was sheer engineering elegance - you could
store the owner record using the direct access technology and then pick up its
related members using externally invisible addressing. Above all, you could
toss a particular record in amongst a million similar ones, and still go
straight to it when you needed it again! The following worked example will
illustrate the power of this new method of access .....
|
Exercise 1 - An Everyday Example of Direct Access Storage and Retrieval in a Set-Structured "Database"
1 In the psychology office, a few feet from where we are typing this in, there
is a row of six conventional four-drawer filing cabinets. Each of the 24
drawers contains alphabetically sequenced hammock-slung student record files,
perhaps four files per hammock, and perhaps 60 hammocks per drawer. Five
separate alphabetically sorted runs of files are maintained, as follows: (1)
all extant first year students, (2) all extant second year students, (3) all
extant third year students, (4) all successful graduates, and (5) all
failures and withdrawals.
2 This sort of paper-based filing system is notoriously bad at coping with
certain types of update and retrieval transaction. For example, it has no
overall list of the members of any one category of student. That sort of list
must be carefully maintained elsewhere, and the two systems regularly
cross-checked. The risk is, of course, that any given file might be
"out" or misfiled at any given time. And when a file cannot be
found, the only real option is to go looking for it, scanning through the
files themselves one by one in what is termed a "serial", or
"exhaustive", search. It follows that while alphabetic misfiles are
bad enough (since they invite a serial search of at least a hammock or two), category
misfiles are even worse, since they invite a serial search of the entire six
cabinets!
3 Devise an indexing system capable of locating (a) the student you wrote to
three hours ago (into whose file you suspect you misfiled your pay cheque,
but whose name you have momentarily forgotten), (b) all 2001 graduates, (c)
the mathematics grades of all withdrawals within their first year of study.
[ANSWERS AT END.]
4 Have a look at Morton, Hammersley, and Bekerian (1985), noting especially their notion of "headings" as the access keys for biological memories. |
The
first lesson of data modelling is therefore that the
long-term success of a system is proportional to the amount of careful thought
put in during early development. Specifically, you need to keep careful track
of which records are set owners and which are set members (and also,
incidentally, to decide how you are going to handle any record types which are
both owners and members [like the hospital ward mentioned in Section
7]). Bachman's solution came in two parts - what to do, and how to document it.
The first part of the solution was to identify all the owner-member
relationships. This involved (1) identifying the attributes that mattered, (2)
deciding how these clustered together into entities, and (3) considering how
these entities might be related. The second part of the solution was to
display this priceless metadata graphically, combining all the individual data
structure sets into a single larger diagram known as a "data structure
diagram" (soon to become famous as the "Bachman Diagram").
The real beauty of the data structure diagram is that being a diagram it has
all the traditional advantages of pictorial matter for the rapid communication
of ideas - once you have grasped the visual "syntax", each picture
(without the slightest exaggeration) speaks a thousand words.
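The owner/member arrangement just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of how IDS itself was coded: a Python list stands in for the disk (so a list index plays the part of a disk address), a dictionary stands in for whatever direct access addressing the real system used, and each member belongs to only one set, whereas Bachman's trailers could be attached to several headers. The class names and the record keys are hypothetical.

```python
from typing import Dict, List, Optional


class Record:
    """One stored record, carrying the addresses that tie a set together."""
    def __init__(self, key: str, data: str) -> None:
        self.key = key
        self.data = data
        self.first_member: Optional[int] = None  # used by owner records only
        self.next_member: Optional[int] = None   # used by member records only


class SetStructuredStore:
    def __init__(self) -> None:
        self.slots: List[Record] = []             # list index = pretend disk address
        self.owner_address: Dict[str, int] = {}   # owner key -> address (direct access)

    def _store(self, record: Record) -> int:
        """Write a record to the next free slot and return its address."""
        self.slots.append(record)
        return len(self.slots) - 1

    def add_owner(self, key: str, data: str) -> None:
        self.owner_address[key] = self._store(Record(key, data))

    def add_member(self, owner_key: str, key: str, data: str) -> None:
        owner = self.slots[self.owner_address[owner_key]]
        member = Record(key, data)
        member.next_member = owner.first_member   # link in at the head of the set
        owner.first_member = self._store(member)

    def members_of(self, owner_key: str) -> List[str]:
        """Go straight to the owner, then follow the member addresses."""
        address = self.slots[self.owner_address[owner_key]].first_member
        found = []
        while address is not None:
            found.append(self.slots[address].key)
            address = self.slots[address].next_member
        return found


# Usage: members of different sets are interleaved in storage,
# yet each set can still be retrieved without a serial search.
store = SetStructuredStore()
store.add_owner("year-1", "first year students")
store.add_owner("year-2", "second year students")
store.add_member("year-1", "student-001", "A. Jones")
store.add_member("year-2", "student-002", "B. Evans")
store.add_member("year-1", "student-003", "C. Davies")
year_one = store.members_of("year-1")
```

Because new members are linked in at the head of their set, `year_one` lists the most recently added student first. The application program sees only keys; the addresses stay inside the store, which is what makes them externally invisible.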
Key Concept - The
Bachman Diagram: This was Bachman's personal notation for data
structure diagrams. It shows the record types needing to be stored in the
proposed system, together with their storage arguments and their owner-member
set relationships. Bachman Diagrams treat attributes as the properties of
things; as atomic items of data, each of which is capable of being named, but
not of being further divided. Entities are treated as the things which
matter to, and therefore need to be identified by, the system. Each entity is
thus a collection of attributes, and relationships are the reasons entities may
be associated. Relationships are assertions of truth about the subject area [we
shall therefore meet them again in Section 10 when discussing "propositional
networks"], and take the form "a man can own many dogs".
Both the subject and object of this truth are themselves entities, and there is
usually a one-to-many relationship between them. The notational conventions are
few: attributes are usually relegated to the supporting documentation for
clarity's sake, classes of entities are represented by suitably captioned
boxes, relationships by lines drawn between the entity boxes concerned, and the
pluralities by adding arrowheads or so-called "crows' feet" symbols
at the "many" end of these lines. The rule is that "the arrow
points from the entity class that owns the sets to the entity class that makes
up the membership of the sets" (Bachman, 1969, p5).
Here
is a specimen Bachman Diagram .....
|
Figure 2 - A Bachman Data Structure Diagram: Here is a small but nonetheless illustrative example of the Bachman Diagram expression of a typical data model. At a structural level it shows four entity types [the boxes] and three relationships [the arrows]. More specifically, it shows the books in a library, and the mechanism of their reservation by library users. Note how the natural pluralities of the real world are represented primarily by one-to-many relationships. For example, a library may hold many copies of a single catalogued title, as shown by the <library-has> relationship towards the left of the diagram. Similarly, there may be a queue of many reservations for each title, each one of which has associated with it the name and address of the corresponding library-user. There are a few specimen Bachman Diagrams available online - see Section 4 of Hitchman (2004/2004 online), Section 15.6 of Yourdon (2001/2004 online), or Maurer and Scherbakov (2004 online). |
|
Copyright © 2004, Derek J. Smith. Reverse-engineered from the ERD shown in Figure 3, itself a simplification of Oxborrow (1989, p36). |
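For readers without access to the graphic, the content of a Bachman Diagram can also be written down as plain data. The Python sketch below is one possible reading of the library example, not a transcription of the figure: the caption names only the library-has relationship, so the four entity type names and the other two relationship names are our assumptions, as is the `Relationship` class itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Relationship:
    """A one-to-many set: one owner occurrence has many member occurrences."""
    name: str
    owner: str    # entity type at the "one" end
    member: str   # entity type at the "many" end (where the arrowhead points)


# Assumed entity types (the boxes); only the count of four comes from the caption.
ENTITY_TYPES = ["TITLE", "COPY", "RESERVATION", "LIBRARY-USER"]

# Three relationships (the arrows); the last two names are hypothetical.
RELATIONSHIPS = [
    Relationship("library-has", "TITLE", "COPY"),
    Relationship("is-reserved-by", "TITLE", "RESERVATION"),
    Relationship("makes", "LIBRARY-USER", "RESERVATION"),
]


def sets_owned_by(entity: str) -> List[str]:
    """Names of the sets in which this entity type is the owner."""
    return [r.name for r in RELATIONSHIPS if r.owner == entity]


def owners_of(entity: str) -> List[str]:
    """Entity types that own a set in which this entity type is a member."""
    return [r.owner for r in RELATIONSHIPS if r.member == entity]


def as_sentences() -> List[str]:
    """Read each arrow back as an assertion about the subject area."""
    return [f"one {r.owner} can have many {r.member} occurrences ({r.name})"
            for r in RELATIONSHIPS]
```

On this reading `owners_of("RESERVATION")` gives both TITLE and LIBRARY-USER, which makes the reservation a record that belongs to two sets at once, one for each of its owners.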
Bachman
and his team had a prototype version of IDS up and running in "early
1963" (Olle, 1978), and had it reliable enough
for full operational use monitoring GE's own stock levels in 1964 (Bachman,
1980). Bachman then spent the mid-1960s at GE in
Historical Aside - "CODASYL" and the "Data Base Task Group": The Conference on
Data Systems Languages was set up by the US Department of Defence
in May 1959 at the suggestion of Charles A. Phillips, the Pentagon's Director
of Data Systems Research. Their remit was to produce a general purpose computer
programming language for business users, and they organised
themselves into three more precisely tasked sub-committees, namely (1) the Short
Range Committee (SRC), responsible for the
immediate specification of the language, (2) the Intermediate Range
Committee, responsible for its development in the medium term, and (3) the Long
Range Committee, responsible for its development in the longer term. In the
event, only the SRC actually sat, and their principal
success was the COBOL programming language, whose specifications were approved
in January 1960. The new language was not perfect, of course, and various
teething troubles were reported. It was bad, for example, at processing chain
pointer sets (or "lists") of records, and from October 1965 the CODASYL sub-committee structure was extended by the
addition of the List Processing Task Force (LPTF)
to look into improvements in this area of functionality. By 1967, however, the LPTF meetings were so dominated by database issues that the
committee's name was changed to the Data Base Task Group (DBTG). William Olle of RCA
was at the first DBTG meeting in
Key Concept - Chain
Pointers: A systems programming device used in the IDS and
subsequent DBMSs for implementing an owner-set
relationship. Involves adding space for one or two address fields to the basic
record length, such that each record in a set can "point" to the one
after and/or before it in the set. These are known as NEXT and PRIOR pointers,
respectively [there is an explanatory graphic in Schubert (1972; Figure 1), if
needed]. Here is an extract from Olle (1978) which
illustrates the relationship between COBOL and the set pointer: "The first
report from the DBTG came out in January 1968 and was
entitled 'COBOL extensions to handle data bases'. Some quotations from the one
page summary of recommendations show the thinking of the era. It was recommended
to 'Add a facility for declaring master and detail record relationships which
use circular chains as a means to provide the widest possible file structuring
capability'." (p4.)
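To make the chain pointer idea concrete, here is a minimal sketch of one owner record and its members linked as a circular chain. It is an illustration only, not the IDS or IDMS implementation: the class names, the integer "addresses", and the specimen data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Record:
    """A stored record plus the two extra address fields described above."""
    address: int                  # hypothetical storage address of this record
    data: str
    next: Optional[int] = None    # NEXT pointer: address of the record after this one
    prior: Optional[int] = None   # PRIOR pointer: address of the record before this one


class ChainedSet:
    """One owner record and its member records, linked as a circular chain."""

    def __init__(self, owner_data: str) -> None:
        self.store: Dict[int, Record] = {}
        self.owner = self._allocate(owner_data)
        # An empty set is a ring of one: the owner points at itself.
        self.owner.next = self.owner.prior = self.owner.address

    def _allocate(self, data: str) -> Record:
        record = Record(address=len(self.store) + 1, data=data)
        self.store[record.address] = record
        return record

    def add_member(self, data: str) -> Record:
        """Insert a new member at the end of the chain, just before the owner."""
        member = self._allocate(data)
        last = self.store[self.owner.prior]
        member.prior, member.next = last.address, self.owner.address
        last.next = member.address
        self.owner.prior = member.address
        return member

    def walk_next(self) -> Iterator[str]:
        """Follow NEXT pointers from the owner until the ring closes."""
        address = self.owner.next
        while address != self.owner.address:
            yield self.store[address].data
            address = self.store[address].next

    def walk_prior(self) -> Iterator[str]:
        """Follow PRIOR pointers from the owner until the ring closes."""
        address = self.owner.prior
        while address != self.owner.address:
            yield self.store[address].data
            address = self.store[address].prior


# Specimen data (invented): one master record with three detail records.
department = ChainedSet("DEPT: Stores")
for name in ("Evans", "Jones", "Price"):
    department.add_member("EMP: " + name)

forward = list(department.walk_next())
backward = list(department.walk_prior())
```

Note that no member record stores a list of its fellows. The set exists only as the chain of addresses, which is why the pointer fields have to be added to the basic record length.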
Then
came a curious turn of events which saw the development rights to IDS being
taken over by the B.F.
Goodrich Chemical Corporation of Cleveland, OH, (henceforth simply
"Goodrich"). The motivation for this reversal of roles seems to be
(a) that IDS had bugs in the software which GE had no time to cure, and (b)
that Goodrich preferred to remain IBM users. Goodrich therefore undertook a
repair-and-migrate exercise [see, for example, Karasz
(1998/2004 online)],
and (unfortunately for GE) did such a good job that they were able in 1969 to
field their own system under the name "Integrated Database Management
System" (IDMS). The development work was
carried out at Goodrich's
Biographical Aside - Richard F.
Schubert (): This from the biographical note at the end of Schubert (1972):
"Mr. Schubert is manager-information systems programming and operations
for B.F. Goodrich Chemical Co., Cleveland. He served on the CODASYL
Systems Committee from 1963 until this year and has been a member of the CODASYL Data Base Task Group since 1970. His B.S. in
chemical engineering is from Cleveland State Univ." (p47.)
CODASYL, meanwhile, had not been dragging its feet.
Between 1969 and 1971, it compiled two major statements of database principles
(CODASYL, 1969, 1971; subsequently incorporated into
ANSI/SPARC, 1976), inspired by the single central axiom that the internal
complexities of a database should at all times remain totally
"transparent" to the end-user. A DBMS, in other words, should allow
users to concentrate upon their data rather than upon the tool they happened to
be using to view it. This transparency was eventually obtained by implementing
the data model in three time-separated sub-stages, each separately programmed,
and each passing critical output to the one following. These three stages were
as follows .....
(1) Set Up a "Database
Schema": The first step is to convert the data model into a physically
equivalent set of declarations and descriptions known collectively as a "database
schema". Unlike the data model, however, the database schema is now in
a form which can be stored within, and manipulated by, the DBMS. This is a more
technical view of the data than hitherto, and constitutes the first major step
in bridging the gap between the data as the user knows it and the hardware on
which it is eventually to be stored.
(2) Set Up Database "Subschemas": The second step is to create a
"departmental" view of the data. This is another technical view, and
reflects the fact that no single application program will ever need access to
all the available data. This, of course, is where the sharing of the common
pool of data is enabled. Each individual end-user - and that includes even the
most senior executives - only needs access to a fraction of the total available
data, and for him/her to be shown too much is at best inefficient, and at worst
a breach of system security punishable by civil or criminal law (or both). This
"need to know" facility is provided by subsets of the schema known as
"subschemas", each one
allowing an individual application program to access only the data it is
legitimately concerned with.
(3) Set Up Database "Storage
Schemas": The third and final step is to create a "machine level" view
of the data. This is achieved by declaring what is known as a "storage
schema" to the DBMS, which the DBMS then uses to translate every
user-initiated store and retrieve instruction into a set of equivalent physical
store and retrieve instructions.
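The division of labour between the three stages can be sketched in a few lines. This is a toy illustration under our own assumptions, not DBTG syntax: the record types, field names, program names, and realm names are all invented.

```python
from typing import Dict, List

# (1) Schema: every record type and field declared to the DBMS (names hypothetical).
SCHEMA: Dict[str, List[str]] = {
    "EMPLOYEE": ["emp_no", "name", "salary", "dept_no"],
    "DEPARTMENT": ["dept_no", "dept_name"],
}

# (2) Subschemas: the "need to know" subset granted to each application program.
SUBSCHEMAS: Dict[str, Dict[str, List[str]]] = {
    "phone_directory": {
        "EMPLOYEE": ["emp_no", "name"],
        "DEPARTMENT": ["dept_no", "dept_name"],
    },
    "payroll": {"EMPLOYEE": ["emp_no", "name", "salary"]},
}

# (3) Storage schema: where each record type physically lives.
STORAGE_SCHEMA: Dict[str, str] = {"EMPLOYEE": "realm_A", "DEPARTMENT": "realm_B"}

# Stand-in for the physical storage itself: realm name -> stored records.
PHYSICAL: Dict[str, List[dict]] = {
    "realm_A": [{"emp_no": 1, "name": "Evans", "salary": 20000, "dept_no": 10}],
    "realm_B": [{"dept_no": 10, "dept_name": "Stores"}],
}


def retrieve(program: str, record_type: str) -> List[dict]:
    """Translate a logical request into a physical read, filtered by subschema."""
    visible = SUBSCHEMAS[program].get(record_type)
    if visible is None:
        raise PermissionError(f"{program} has no view of {record_type}")
    realm = STORAGE_SCHEMA[record_type]  # logical name -> physical location
    return [{field: row[field] for field in visible} for row in PHYSICAL[realm]]


directory_view = retrieve("phone_directory", "EMPLOYEE")
payroll_view = retrieve("payroll", "EMPLOYEE")
```

The application program names only a record type. It never sees the realm name, and it never sees a field outside its own subschema, which is the "transparency" the CODASYL statements were after.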
Schubert's
principal developers were Vaughn Austin, Ken Cunningham, Jim Gilliam, Peter Karasz, and Ron Phillips. From the outset, the product was
highly compatible with DBTG recommendations (not
surprisingly, Karasz explains, when you consider that
Schubert was a member of CODASYL and had an advance
copy of the DBTG's April 1971 database
specifications). The working principles of the Goodrich product were as follows
.....
Technical Aside - IDMS Internals: IDMS
databases are organised into "realms"
of sequentially numbered "pages" of known and identical
capacity (in bytes). Each page contains a small header index, followed by up to
256 separate records, or "lines", each with its own unique
page-and-line "database key". These records are typically organised into sets by one or more "chain
pointers" [actually more database keys] concatenated into the total
record length. Each page can be transferred on a random access basis as a data
block by a single disk read or write. Database realms are typically very large,
having many pages, large pages, or both, and so the art of traversing them
efficiently is (a) to establish what is known as an "entry point",
that is to say, a suitable starting page, and (b) to access no more pages than
is absolutely necessary. Here, enhanced from Bachman (1973), are the six
most commonly used traversal options: (1) A search can be started at the
beginning of a database realm, and then proceed line by line within page by
page until there are no more records to examine. This will retrieve records in
strict database key sequence with no reference to record type. It therefore
retrieves all possible records, and the order in which it retrieves them will
to all intents and purposes be random. (2) A search can be conducted by the
aforementioned database key, the known permanent address of the record in
question. This will retrieve the record at the specified line and page, again
without regard to its record type. This option can be used in conjunction with
option (1) to begin a realm sweep from part way through [this facility might be
useful, for example, if restarting a full sweep after an interruption]. (3) It
is also possible to retrieve the record at a specified line and page by using
the "database currency" mechanisms provided. The currency tables
maintained by the DBMS allow the last accessed record of a particular type or
in a particular set to be re-accessed by its database key without the
programmer needing to know that database key explicitly [that is to say,
the currency tables are a systems programming facility, and not an applications
programming one]. (4) A search can be conducted by key field "hashing
algorithm". This is the "direct access" method
previously referred to, only it is known within the IDMS
world as "CALC access", because of the calculations carried
out by the hashing algorithm. The algorithm takes the contents of the specified
key field, performs a mathematical conversion of the component characters, and
comes up with a number between 1 and the number of pages in the realm. Since
exactly the same algorithm had been used when the record was originally stored,
we now know exactly where to go to get it back [Karasz
credits Vaughn Austin with having perfected the IDMS
hashing algorithm]. The DBMS software can then retrieve the specified page, and
rapidly scan down it for the matching line. (5) Using one of the earlier
methods as initial entry point, a search can then proceed via a pre-established
set relationship. [
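The page-and-line addressing described in this aside, together with traversal options (1), (2), and (4), can be sketched as follows. This is a toy model under stated assumptions: the class and method names are hypothetical, `zlib.crc32` merely stands in for the real IDMS hashing algorithm (which is not given here), and page overflow is deliberately left out.

```python
import zlib
from typing import Dict, Iterator, List, Optional, Tuple

DbKey = Tuple[int, int]  # (page number, line number), both counted from 1


class Realm:
    """A realm of sequentially numbered pages, each holding a list of lines."""

    def __init__(self, page_count: int, lines_per_page: int = 256) -> None:
        self.page_count = page_count
        self.lines_per_page = lines_per_page
        self.pages: Dict[int, List[dict]] = {
            page: [] for page in range(1, page_count + 1)
        }

    def calc_page(self, key: str) -> int:
        """Hash a key field to a page number between 1 and page_count."""
        # crc32 is an assumed stand-in for the real hashing algorithm.
        return zlib.crc32(key.encode("utf-8")) % self.page_count + 1

    def store(self, key: str, data: str) -> DbKey:
        """Store a record on its CALC page and return its database key."""
        page = self.calc_page(key)
        lines = self.pages[page]
        if len(lines) >= self.lines_per_page:
            raise OverflowError("page full; overflow handling omitted from this sketch")
        lines.append({"key": key, "data": data})
        return (page, len(lines))

    def get_by_db_key(self, db_key: DbKey) -> dict:
        """Option (2): go straight to the known page and line."""
        page, line = db_key
        return self.pages[page][line - 1]

    def get_by_calc(self, key: str) -> Optional[dict]:
        """Option (4): rehash the key, fetch one page, scan it for the match."""
        for record in self.pages[self.calc_page(key)]:
            if record["key"] == key:
                return record
        return None

    def sweep(self, start_page: int = 1) -> Iterator[Tuple[DbKey, dict]]:
        """Option (1): line by line within page by page, in database key order."""
        for page in range(start_page, self.page_count + 1):
            for line, record in enumerate(self.pages[page], start=1):
                yield (page, line), record


realm = Realm(page_count=50)
db_key = realm.store("ISBN-0001", "A specimen title")  # invented specimen record
```

Because storing and retrieving use the same `calc_page` function, a CALC retrieval touches exactly one page, whereas a sweep may touch all fifty.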
The
first five customers for IDMS were ACME Cleveland,
Abbott Labs, General Motors, RCA, and Sperry Rand (Karasz,
1998/2004 online),
but in 1973, in order to concentrate on their core business, Goodrich sold the IDMS rights to John
Cullinane's Cullinane Corporation, later Cullinet
Software Inc., and now part of Computer Associates. The product survives there
to this day as the CA-IDMS proprietary DBMS, and
continues to support many of the world's heaviest duty "on line
transaction processing" (OLTP) systems.
IDS, meanwhile, went to Honeywell in a buy-out of GE's computing division, and
was then enhanced in 1974 as IDS-II.
Historical Aside - So is it
"Database" or "Data Base": So which is it, one word or
two? Well it certainly began life as two words, appearing as such in Head
(1970), Olle (1972), Schubert (1972), and in the name
of the DBTG itself, but nowadays it is certainly one
word for most commentators [at time of writing, there were nearly 70 million
Google hits for the one-word option but less than 4 million for the two-word
option]. But basically it remains a matter of taste, and Haigh
still prefers the two-word option. Rather inconsistently, therefore, current
practice is to use the word "database", but to retain the two-letter
abbreviation "DB" (as in DBMS). As for the term "Data Base
Management System", the initials DBMS had an instant popular appeal, and
usage of this acronym spread rapidly after the 1971 DBTG
report (Haigh, 2004/2004 online). Nevertheless, some caution is
needed, because there were many unscrupulous sales teams ready to jump on the
bandwagon, thus: "The term [was] applied retroactively to some existing
systems, and used to describe virtually every new file management system,
regardless of its fidelity to the specific ideas of the DBTG"
(Haigh, 2004/2004 online).
5 - The Network-Relational Schism
But
clouds were looming on the network database horizon in the shape of a "flat
file" implementation developed in 1969 by IBM's Edgar F.
("Ted") Codd (1923-2003).
Key Concept - The
"Flat File" or "Table": A "flat
file" is a computer file composed of relatively large, identically
formatted, data records, which, properly indexed, is ideal for random access
retrieval of uniquely keyed individual records. At heart, it is the technology
of the card index tray, made digital; brilliant for "read only"
applications, but guaranteed to struggle with "volatile" (i.e.
rapidly changing) data. The Internet is awash with illustrations of flat file
structures.
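For readers who prefer code to card index trays, here is a minimal sketch of a flat file: identically formatted records written end to end, plus an index from each unique key to its position. The record layout, keys, and names are invented for illustration.

```python
import io
import struct
from typing import Dict, Iterable, Optional, Tuple

# Every record has the same layout: a 6-byte key and a 20-byte name (layout invented).
RECORD = struct.Struct("6s20s")


def build_file(rows: Iterable[Tuple[str, str]]) -> Tuple[io.BytesIO, Dict[str, int]]:
    """Write identically formatted records and index each key to its byte offset."""
    flat_file = io.BytesIO()
    index: Dict[str, int] = {}
    for key, name in rows:
        index[key] = flat_file.tell()
        flat_file.write(RECORD.pack(key.encode("ascii"), name.encode("ascii")))
    return flat_file, index


def fetch(flat_file: io.BytesIO, index: Dict[str, int], key: str) -> Optional[str]:
    """Random access retrieval: one index lookup, one seek, one fixed-length read."""
    offset = index.get(key)
    if offset is None:
        return None
    flat_file.seek(offset)
    _, name = RECORD.unpack(flat_file.read(RECORD.size))
    return name.rstrip(b"\x00").decode("ascii")


flat_file, index = build_file(
    [("000001", "Evans"), ("000002", "Jones"), ("000003", "Price")]
)
found = fetch(flat_file, index, "000002")
```

Retrieval by key is a single seek. Nothing in the structure, however, relates one record to another, and that is the point of contrast with the chained sets of the network database.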
PERSONAL ASIDE: Between 1982 and
1989, the author was an IDMS database designer and
applications programmer, and found the product both versatile and robust once
you got used to it. It was admirably suited to systems needing to update
volatile data, such as booking systems, banking, and logistics.
Interrogation-only systems (e.g. marketing data) are better approached with a Codd-style tabular system. The reason we have concentrated
so intensely on the internals of the CODASYL-type
database is that it sets up what are, in effect, semantic networks, just like
those we introduced in Part 4 (Section 4.2), and this
functionality is currently in wide demand in the artificial intelligence world.
Codd had joined IBM in 1949, and served time on the
SSEC and Stretch teams before switching to research
into methods of data management. Here is how Hayes (2002/2003
online) explains what happened next .....
"Meanwhile, IBM researcher
Edgar F. ('Ted') Codd was looking for a better way to
organise databases. In 1969, Codd
came up with the idea of a relational database, organised
entirely in flat tables. IBM put more people to work on the project, codenamed
System/R, in its
From
the outset, the unique selling proposition of the "relational
database" (RDB) was that it was quicker to
set up and easier to maintain than its DBTG rivals,
but what happened next is a prime example of how words can often
unintentionally misinform. The word in question is "relational",
and the nature of its misuse was that the RDB
manufacturers allowed (and perhaps even encouraged) the perception that RDB was synonymous with well-designed.
ASIDE: Within British
Telecom, at least, it was not difficult to take large conventional files of
data and load them into simple flat file databases, whereupon it was then only
a matter of minutes before the system was responding to its first ad hoc interrogations. It was keeping the data up-to-date
which was the problem.
As
a result, the flat file implementations became so easy to market that everybody
bought one, including many for whom the technology was entirely
inappropriate because they had data update requirements as well. The fact
that the network database was equally tightly grounded in the analysis of
entities and their relationships - and had been since the very first Bachman
Diagram had been drawn - was conveniently overlooked. [For a detailed
comparison of the two technologies, see Michaels, Mittman,
and Carlson (1976), and for a balanced criticism of RDBs
from within IBM, see Borkin (1980).]
ASIDE: In our 1982 to
1989 system, we interfaced our central heavy duty IDMS
database with a number of ancillary flat file systems, thus largely separating
the update transactions from the enquiries. The network system handled the
primary second-by-second updates, the big prints, and the routine small
enquiries, whilst the flat file systems handled the ad hoc
enquiries and analyses using a structured query language. Each suite of
programs therefore played to its inherent strengths, to the ultimate benefit of
the company. Here is Olle again, who had foreseen
precisely this problem nearly two decades earlier: "The arguments which
were raging during the years 1967 and 1968 reflected the two principal types of
background from which contributors to the data base field came. People like
Bachman [.....] epitomised the manufacturing
environment and they saw the need for the more powerful structures which IDS
[and similar systems] offered. Others, [including] myself had seen the need for
easy to use retrieval languages which would enable easy access to data by
non-programmers." (Op. Cit., p3.) <<AUTHOR'S
NOTE: The supreme irony may yet prove to be that current attempts to build
semantic networks for artificial intelligence applications [a volatile OLTP environment if ever there was one] using relational
technology are bogging down in precisely the same problems of complexity that
the relational people accused the network people of 30 years beforehand. It may
or may not be relevant that Lehmann's (1992) microscopically thorough review of
semantic network applications in artificial intelligence contains in its 745
substantive pages not a single reference to Bachman, the Bachman Diagram, CODASYL, or IDMS, despite having
acknowledged on page 1 that networks are "a convenient way to organise information in a computer or database". There
is plenty on databases, to be sure, but mainly the hierarchical and relational
types.>>
6 - From the Entity-Relationship Diagram to er... the Entity-Relationship Diagram
Although
there was often bitter squabbling between the DBTG
and RDB people about the relative merits of their
respective products, there was one thing that both camps agreed upon, and that
was the need for a meticulously thorough entity-relationship analysis at the
data modelling stage. Whether you were looking at the
most intricate of data networks or at the tallest and widest of flat files, you
still needed to know what data elements clustered on what other data elements.
The next key player in the database story was another IBM researcher, Peter P.
Chen .....
Biographical Aside - Peter P.
Chen (): See Chen's
.....
who in 1976 gave us the "Entity-Relationship Diagram" (ERD) as we most commonly see it today .....
Key Concept - The
Entity-Relationship Diagram: "The Entity-Relationship Model is a
data model for high-level descriptions of conceptual data models and it
provides a graphical notation for representing such data models in the form of entity-relationship
diagrams. Such data models are typically used in the first stage of
information system design and are used for example to describe information
needs and/or the type of information that is to be stored in the database
[.....]. The modelling technique, however, can be
used to describe any ontology (i.e. an overview and classification of used
terms and their relationships) for a certain universe of discourse (i.e. an
area of interest). In the case of the design of an information system that is
based on a database, the conceptual model is at a later stage, usually called
logical design, mapped to a logical data model, such as the relational model,
which in turn is mapped to a physical model during physical design."
(Wikipedia, 2004 online; bold
emphasis original.)
And
here is Chen's own subsequent account of what he did .....
"There were several
competing data models that had been implemented as commercial products in the
early [1970s]: the file system model, the
hierarchical model (such as IBM's IMS database
system), and the Network model (such as Honeywell's IDS database system). The
Network model, also known as the CODASYL model, was
developed by Charles Bachman, who received the ACM Turing Award in 1973. Most organisations at that time used file systems, and not too many
used database systems. [Then] in 1970 the relational model was proposed, and it
generated considerable interest in the academic community. It is correct to say
that in the early '70s most people in the academic
world worked on relational model instead of other models. One of the main
reasons is that many professors had a difficult time to understand the long and
dry manuals of commercial database management systems, and Codd's
relational model [was] written in a much more concise and scientific
style." (Chen, 2002/2004 online; Section 2.1.)
But
as we saw in Figure 2, the Bachman Diagram is itself an entity-relationship
diagram, so what we actually have here is 15 "lost years" (1961 to 1976)
in which Bachman's seminal role in developing the entity-relationship network,
GE's IDS, Goodrich's IDMS, and all the DBTG-compliant systems by then in operation, suddenly
became academically invisible in the service of Mammon. So, lest we perpetuate
this confusion, we shall be working to the following naming standards for the
remainder of this paper .....
Bachman Diagram (or erd, in small letters) = Bachman's 1961 entity-relationship
diagram.
ERD (in big letters) =
Chen's 1976 entity-relationship diagram.
"Data structure
diagram" or entity-relationship diagram (unabbreviated) = either/both, or
the generic practice.
So
successfully did the 1976 ERD do the job of
sketch-mapping the physical world, and so quickly did it wring productive work
out of the newcomers being sucked into the ever-expanding database industry,
that it rapidly became the industry standard method. To get some idea of the RDB products currently on offer, and their fundamental
reliance on an ERD to point them in the right
direction, see Comsys
Information Technology Services, the Database Design Studio, or the Schreyer
Institute for Innovation in Learning (who, like many others across
education, prefer the title "concept map" for their data
structure diagrams). The DBTG systems, by contrast,
have been relegated to what is known as "legacy system" duty,
that is to say, they are doing what they were originally built to do, and they
will continue in this "
ERDs differ from Bachman Diagrams in three main
respects, namely (a) that relationships are now shown as diamonds, (b) that there
is no longer any attempt to add arrowheads or crows' feet symbols at the "many"
end of the relationship links, and (c) that all attributes are now shown on the
diagram itself rather than in the supporting documentation [see the ovals on the
specimen diagram below]. Using
these conventions, an example of a simple ERD is
given in Figure 3.
|
Figure 3 - A Simple ERD: Here is our Figure 2 Bachman Diagram recast as an ERD. As noted above, the main changes are as follows: (1) the relationships are shown as diamonds, (2) there is no longer any attempt to add arrowheads or crows' feet symbols at the "many" end of the relationship links, and (3) attributes are now shown clustered on their owning entities. There are a number of excellent specimen ERDs available online. |
|
Copyright © 2004, Derek J. Smith. Redrawn from a black-and-white original in Smith (1996; Figure 8.6B, p99), itself a simplification of Oxborrow (1989, p36). |
Commercially
speaking, Codd and Chen were very much in the right
place at the right time, therefore, and IBM were very astute to have put them
there. Yet the story as the textbooks and websites now tell it usually begins
with them, to the exclusion of the earlier figures. This is acceptable as sales
practice (where all is fair and truth is relative), but not for academic
purposes (where the source scholarship should be accurately identified), so we
fully sympathise with Hitchman
(2004/2004
online) when he suggests: "Every text that discusses an ER diagram
should be citing Bachman as the source of the technique and should be careful
to disentangle the ER model from the diagram technique".
7 - Normalising an
Entity-Relationship Diagram
"Entia
non sunt multiplicanda praeter necessitatem"
[i.e. "One should not increase, beyond what is necessary, the number of
entities required to explain anything"] (attrib. William
of Ockham, early 14th century).
In
practice, the task of creating a full-sized data model for a full-sized
business area is complex and time-consuming in the extreme [one recent Internet
discussant described a particular rather complex Bachman Diagram as "like
a friggin' schematic for a nuclear power
plant"]. This is because the business analysis phase of the exercise can
be relied upon to turn up a profusion of entities, each one wanting to be
related to every other. Which means, irritatingly enough, that the search for
neatness and implementability in one's data routinely
produces a dog's breakfast of a data network. Worse still, there are several
configurations within the diagram which need to be managed out of the way by
the creation, artificially, of yet more entities, and that means having yet
more relationships to go with them. Fortunately, decades of experience with the
method have given data modellers many tips on how to
tease out the underlying good sense. These are known as "data normalisation" procedures .....
Key Concept -
"Data Normalisation": Data normalisation is the process of removing duplications and
contradictions from early drafts of a data model. Remembering how much research
data will have been collected, and that different departments will almost
certainly have been involved, this means coalescing entity definitions where
possible, rationalising their relationships, and
generally "tuning" the data structure to the demands which will
eventually be made upon it.
In
fact, it is necessary to carry out a number of passes (typically up to five)
through the normalisation procedure, each time
ironing out a particular subset of difficulties. The techniques were applied
instinctively by the early data modellers, and not
formally described or named until Codd (1972). The first three of these are as
follows .....
(1) "First
EXAMPLE: A department
employs many employees. It would therefore not be possible in advance to
specify the maximum length of a department record if it was decided to include
employee attributes on it. Far better to analyse out
all the employee-relevant detail and store it instead on separate employee
records. (After Oxborrow, 1989, p39.)
(2) "Second
EXAMPLE: It would be
wasteful if the employee record from (1) were to contain the department name
(because it would be redundant on every employee record after the first). This
field should therefore be removed to a separate department record, stored once,
and cross-referenced when necessary. (After Oxborrow,
1989, p39.)
(3) "Third
EXAMPLE: If the employee
record from (2) contained project name and project deadline detail, then these
two fields would not be independent. This should be resolved by introducing a
project record to contain project-dependent data, and again to cross-reference
it when necessary. (After Oxborrow, 1989, pp39-40.)
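The three examples above can be run together as a single worked sketch. We start from a deliberately poor draft in which each department record carries its employees, and each employee entry carries project detail, and we finish with three separate record types that cross-reference each other. The field names and specimen values are invented, and for brevity the three passes are collapsed into one loop rather than carried out in formal sequence.

```python
from typing import Dict, List, Tuple

# Early draft: each department record carries a repeating group of employees,
# and each employee entry carries project name and deadline (all values invented).
draft: List[dict] = [
    {"dept_no": 10, "dept_name": "Stores", "employees": [
        {"emp_no": 1, "name": "Evans", "project": "Stocktake", "deadline": "31 March"},
        {"emp_no": 2, "name": "Jones", "project": "Stocktake", "deadline": "31 March"},
    ]},
    {"dept_no": 20, "dept_name": "Transport", "employees": [
        {"emp_no": 3, "name": "Price", "project": "Refit", "deadline": "30 June"},
    ]},
]


def normalise(
    records: List[dict],
) -> Tuple[Dict[int, dict], Dict[int, dict], Dict[str, dict]]:
    """Split the draft into department, employee, and project records."""
    departments: Dict[int, dict] = {}
    employees: Dict[int, dict] = {}
    projects: Dict[str, dict] = {}
    for dept in records:
        # Example (2): the department name is stored once, on its own record.
        departments[dept["dept_no"]] = {"dept_name": dept["dept_name"]}
        for emp in dept["employees"]:
            # Example (3): project-dependent data moves to a project record.
            projects[emp["project"]] = {"deadline": emp["deadline"]}
            # Example (1): employee detail moves off the department record,
            # keeping only cross-references to its department and project.
            employees[emp["emp_no"]] = {
                "name": emp["name"],
                "dept_no": dept["dept_no"],
                "project": emp["project"],
            }
    return departments, employees, projects


departments, employees, projects = normalise(draft)
```

The Stocktake deadline, which appeared twice in the draft, now appears once, so it can no longer be updated in one place and forgotten in another.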
As
a rule of thumb, entities should be multi-attribute but single key. This is (a)
because a single attribute is usually not an entity in its own right (but is
part of a larger, as yet unrecognised, entity), and
(b) because having two distinct names suggests two distinct entities. In
addition, entities should most definitely not be processes, and should obey Ockham's
Razor as far as practicable. Similarly, relationships should avoid obvious
hierarchical redundancies - if a hospital contains many wards, and a ward
contains many beds, it is usually safe to leave implicit the second-order truth
that a hospital must also contain many beds. Analysts should also beware of
time (because many relationships which are one-to-one or one-to-many at a given
moment, will prove to be many-to-many over the passage of time), and of
one-to-one relationship types (because this will frequently indicate that two
entities can be merged). There is also a very important and widely used
standard transformation which resolves a many-to-many relationship. This
transformation is frequently needed, and involves inserting an additional
entity between the original two - thus breaking the original relationship into
two separate parts - and then redefining the single many-to-many link as a compound
of a one-to-many and a many-to-one. Finally, if normalisation
results in a multiplication of entities it could well be that some form of
"data abstraction" exercise might also be warranted over and above
the normalisation procedure [which is where you
really earn your keep, but that is another story].
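The standard many-to-many transformation is easily sketched. Borrowing loosely from the library example, suppose a user may want many titles and a title may be wanted by many users. Inserting a reservation entity between the two leaves each reservation with exactly one user and exactly one title. The names and data below are invented for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# The many-to-many link as first drafted: (user, title) pairs, all invented.
wants: List[Tuple[str, str]] = [
    ("Evans", "Title A"),
    ("Evans", "Title B"),
    ("Jones", "Title A"),
]

# The inserted link entity: one reservation record per pair, with its own key.
reservations: List[Dict[str, object]] = [
    {"reservation_no": number, "user": user, "title": title}
    for number, (user, title) in enumerate(wants, start=1)
]

# The single many-to-many link is now a compound of two one-to-many links.
by_user: DefaultDict[str, List[int]] = defaultdict(list)    # user owns many reservations
by_title: DefaultDict[str, List[int]] = defaultdict(list)   # title owns many reservations
for reservation in reservations:
    by_user[reservation["user"]].append(reservation["reservation_no"])
    by_title[reservation["title"]].append(reservation["reservation_no"])
```

The link entity also gives a natural home to attributes which belong to neither original entity on its own, such as the date on which a reservation was made.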
After
judicious application and re-application of all these good procedures, the
model can no longer be improved upon. It is then in what is known as its "fully
normalised" form, and is ready to be mapped
forward onto the final physical system. It is now up to the physical designers
to do their bit, raising the necessary data storage and program specifications
before handing those specifications on, in turn, to the programming teams to
put it all together. Referring back to Figure 1, we are ready to cross from the
left-side quadrants to the right-side ones, and thereby to swap our logical
considerations for physical ones .....
8 - The Optionality of
Physical Implementation
This section is dedicated to
the late Geoff Hartnell, British Telecom Database
Administrator, Area Stores Module, who took the time to explain the concept of
hardware independence to me one sunny lunchtime in 1983, and who thereby
brought together for the first time in my mind what I already knew about
cognitive psychology and what I was being taught at the time about the
logical-physical divide in database design.
Before
we return to the issue of the physical implementation of a logical design, we
need to reflect carefully on the specimen data network shown in Figures 2 and
3. As we have already seen, a data model presents an abstract set of truths - a
formal specification of the world itself, including such commonsense truths as
the fact that you need to know library users' mailing addresses in order to
correspond with them! But Jacob Marley would have had to maintain just such an
array of data in Ebenezer Scrooge's mid-19th century address book. Data models
are thus a lot of things to a lot of people, but as yet have nothing to do
with computers. They could be implemented on papyrus or tablets of clay if
that is all your technology budget runs to, which is why they are invariably
described as "logical", or "conceptual", or
"machine-independent", depending on your preferred terminology. It is
what happens next that commits us to the computer age.
What
does happen next is that physical designers take the logical design they have
been given, and fit it as best they can to the particular technical capabilities
of a particular physical computer system. To do this, they have to devise what
is known as a "first-cut" design .....
Key Concept -
"First-Cut" Design: The first-cut design for a computer system
is its broad initial physical design specification. It is the point at which a
particular hardware range and file management strategy are decided upon - for
example, making the choice between a PC or a Mac, Windows or Linux, or a
network or relational DBMS. It is also the point where the "system
boundary" has to be established, because few organisational
models are small enough to be computerised in a
single physical implementation [and take care here, because many a systems
debacle can be traced back to a loosely defined system boundary at this stage
in the proceedings - see Section 2(#1) of our e-paper on "Systems Thinking"].
It
is when drawing up the first-cut design that true computer knowledge and
experience are called for, and by the mid-1980s it was
commonplace for IT training courses to include sessions detailing how to
produce first-cut designs for each of the dozen or so main implementation
options .....
ASIDE: We still have our
British Telecom course training notes on the subject of first-cut design, and
from a single data model they offer conversion routes to all major
implementation options. These were the conventional serial and
indexed-sequential file technologies, the CODASYL
network option (IDMS), IBM's hierarchical option (IMS), and a host of relational DBMSs
(such as Ingres, Adabas, DB2,
etc.). BT had to cover all the angles in this way, because, as a broad church
corporation, it had all the technologies in use somewhere or other.
We
are now ready to put the final touches to our previously incomplete Figure 1
....
|
Figure 4 - The Sequence of Events during Structured Development: Here is Figure 1 again, only this time with the transitional box between logical and physical design filled in, and the DATABASE SCHEMA (whose role was explained in Section 4) shown. The first-cut stage of design can now be seen to be setting the constraints for the detailed design work which is about to follow. It sets what we are about to start referring to as the "computational principles" of the system in question. |
Developmental Phase
| | Logical Design | | Physical Design |
| Function | FUNCTIONAL DECOMPOSITION | FIRST-CUT PHYSICAL DESIGN | PROGRAM STRUCTURES |
| Data | DATA MODEL | | DATABASE SCHEMA |
We
are also ready to extend the list of available first-cut physical
implementation options by one very important new one, namely the brain, so that
data modelling might henceforth be seen as a tool of
cognitive science as well as of database design .....
9 - The Logical-Physical Divide in the Philosophy of Mind
"The hard problem, in
contrast, is the question of how physical processes in the brain give rise to
subjective experience" (Chalmers, 1995, p63).
"..... the central tenet
of Marr's approach is that studying the hardware is in itself not enough. To do
that is to neglect the crucially important requirement of understanding the
nature of the task that the hardware is carrying out. [.....] He argued that,
without this topmost level of analysis (which he called the computational
theory level, and which he believed had been largely neglected by
neurophysiologists and psychophysicists), we will never have a deep
understanding of the phenomena and mechanisms of biological vision systems - we
will never know why the hardware they possess is designed the way it
is." (Frisby, 1986, p139;
italics original.)
|
An earlier version of the paragraphs on Marr appeared in Smith (1997; Chapter 3). |
We
closed Section 8 by making the point that the brain - to the extent that it is
a general purpose information processing architecture - will always be one candidate
system amongst many for the physical implementation of a logical design. This
is the essence of the logical-physical design split built into the computing
industry's structured development philosophy, and it is also a popular position
in cognitive psychology, being clearly seen in the theoretical writings of the
late David C.
Marr (1945-1980), whose position on the philosophy of mind we have
elsewhere described as "modern functionalism" .....
Historical Aside -
Functionalism: "Functionalism" was the name given to the philosophical
doctrine that the mind's mental operations exist thanks to their practical
value in satisfying the needs of a vulnerable organism in an hostile environment.
The term became popular around the turn of the 19th/20th centuries from the
writings of John Dewey and James Angell at the
In
fact, functionalism in its computer-influenced form is nothing less than modern
cognitive psychology's central philosophical standpoint. It is the study of
what the mind is doing, rather than what the brain is doing, and
it is predicated upon the presumption that the mind - once its basic principles
have been determined - then needs "implementing" on a machine capable
of "running" it. We have yet to establish whether Marr was aware of,
say, the pioneering work on structured development methodologies being carried
out by Learmonth and Burchett during the late 1970s, but if he was not, then he was independently
inventing many of their principles for himself. Marr also held that the brain
was not necessarily the only device capable of doing the job of cognition. Just
as software in general was transferable from one hardware platform to another,
so it follows that minds too, if they are software, ought to be transferable.
Theoretically - to extreme functionalists, at least - one has only to analyse the processes making up human consciousness to
implement it on machines other than the brain, and that would mean having
machines which could become as conscious as their creators .....
ASIDE: Johnson-Laird
(1987) traces similar ideas back to the writings of Kenneth Craik,
Alonzo Church (1903-1995), and Alan Turing.
Church was the
All
in all, Marr identified three discrete levels of analysis of cognition, the
first and highest of which was the level of "process computation".
Under the heading "Understanding Complex Information-Processing
Systems", he analysed how we should best
consider what a process might actually be .....
"The term process
is very broad. For example, addition is a process, and so is taking a Fourier
transform. But so is making a cup of tea, or going shopping. For the purposes
of this book, I want to restrict our attention to the meanings associated with
machines that are carrying out information-processing tasks. So let us examine
in depth the notions behind one simple such device, a cash register at the
checkout counter of a supermarket. There are several levels at which one needs
to understand such a device, and it is perhaps most useful to think in terms of
three of them. The most abstract is the level of what the device does
and why. [Some example arithmetic is then given.] This whole argument
is what I call the computational theory of the cash register.
[.....] In order that a process shall actually run, however, one has to realise it in some way and therefore choose a
representation for the entities that the process manipulates. The second
level of the analysis of a process, therefore, involves choosing two things:
(1) a representation for the input and for the output of the process and
(2) an algorithm by which the transformation may actually be
accomplished. [.....] This brings us to the third level, that of the device in
which the process is to be realised physically
....." (Marr, 1982, pp112-114; italics original;
bold emphasis added. Marr uses the phrase "to realise
it" in exactly the sense that systems designers use "to
implement".)
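Marr's three levels can be made concrete with a small sketch. Marr's own worked arithmetic is elided in the quotation above, so what follows is our illustration rather than his: the computational theory is "add two amounts", and two different representation-and-algorithm pairs realise it. The function names and the choice of decimal strings versus tally marks are assumptions made purely for the example; the third level, the device, is here whatever machine happens to run the Python interpreter.

```python
def add_decimal(a: str, b: str) -> str:
    """Representation: decimal digit strings.
    Algorithm: right-to-left column addition with carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    digits, carry = [], 0
    for x, y in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(x) + int(y) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


def to_tally(n: int) -> str:
    """Represent a non-negative whole number as n tally marks."""
    return "|" * n


def add_tally(a: str, b: str) -> str:
    """Representation: tally marks.
    Algorithm: simple concatenation."""
    return a + b


# Same computational theory (addition), two realisations:
print(add_decimal("58", "67"))                      # prints 125
print(len(add_tally(to_tally(58), to_tally(67))))   # prints 125
```

Both functions satisfy the same top-level specification (for instance, the order of the two amounts does not matter), yet they share no representation and no algorithm, which is the separation the quotation describes.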
Key Concept -
Computational Principles: The "computational principles" of
an information processing system are its basic working principles. They consist
of a number of fundamental decisions as to the system's functional and
structural architectures, which then give the system its essential nature [as
James Clerk Maxwell and Kenneth Craik would have had
it, they specify the "particular go" of that system (Sherwood,
1966)]. <<AUTHOR'S NOTE: The main problem with
explaining the workings of the mind is that nobody has yet succeeded in stating
its computational principles. We know a lot about the physical implementation -
the neuron - but very little about how neurons contribute towards the mind's
higher functions. This is the essence of the "explanatory gap" as
discussed by modern philosophers (see, for example, Chalmers, 1995).>>
But
even Church, Turing, and Craik were not the first to
have been interested in the progressive movement of data through the biological
mind, because many of the philosophical debates of the late 19th century were
concerned with what were in effect, if not in terminology, the same issues. We
refer specifically to the academic confrontation between "act"
psychologists such as Franz C. Brentano (1838-1917) and "content"
psychologists such as Wilhelm Wundt (1832-1920). As laid down in his 1874
monograph "Psychology from an Empirical Standpoint" (Brentano,
1874/1995), Brentano saw the mind busily at work classifying the momentary
contents of perceptions and intentions into one or other of three fundamental
classes of phenomena, namely (1) Ideating (e.g. seeing, hearing, etc.),
(2) Judging (e.g. agreeing, rejecting, etc.), and (3) Loving-Hating
(e.g. feeling, wishing, intending, etc.) (Titchener, 1921/2004 online).
His key concept was that of Vorstellung,
usually translated as "presentation" .....
Key Concepts - Vorstellung and Phenomenal Reality: The word
"presentation" is Brentano's translators' rendering of Vorstellung in the original German. The word
translates more fully as conceivability, image, imagination, association, and
hence presentation. However, there is a parallel usage of the word within a
theatrical context, where it relates to the giving of a performance, or
presentation in the sense of oration or display. One of the standard
usages of the word "phenomenon" is "cognisable
by the senses, or in the way of immediate experience; apparent, sensible,
perceptible" (O.E.D.). This allowed the
philosopher Immanuel Kant (1724-1804) to use the term "phenomenal
reality" to refer to our internal experience of the world about us. <<AUTHOR'S
NOTE: Again, we know a lot about phenomenal reality (because we experience it
directly), but little about how our neurons organise
themselves to make it happen that way.>>
Here,
from the 1995 translation, is Brentano's core argument .....
"Psychology, like the
natural sciences, has its basis in perception and experience. Above all,
however, its source is to be found in the inner perception of our own
mental phenomena. We would never know what a thought is, or a judgement, pleasure or pain, desires or aversions, hopes or
fears, courage or despair, decisions and voluntary intentions, if we did not
learn what they are through inner perception of our own phenomena. Note,
however, that we said that inner perception [Wahrnehmung]
and not introspection, i.e. inner observation [Beobachtung],
constitutes this primary and essential source of psychology. These two concepts
must be distinguished from one another. One of the characteristics of inner
perception is that it can never become inner observation. We can observe
objects which, as they say, are perceived externally. In observation, we direct
our full attention to a phenomenon in order to apprehend it accurately.
But with objects of inner perception this is absolutely impossible."
(Brentano, 1874/1995, pp29-30; italics original; bold
keywording added; square bracketing by the
translators.)
"Every idea or presentation
which we acquire either through sense perception or imagination is an example
of a mental phenomenon. By presentation I do not mean that which
is presented, but rather the act of presentation [nicht
das, was Vorstellung wird, sondern den Akt des Vorstellens]. Thus, hearing a sound, seeing a coloured object, feeling warmth or cold, as well as similar
states of imagination are examples of what I mean by this term. I also mean by
it the thinking of a general concept, provided such a thing actually does
occur. Furthermore, every judgement, every
recollection, every expectation, every inference, every conviction or opinion,
every doubt, is a mental phenomenon. Also to be included under this term
is every emotion [.....] Examples of physical phenomena, on the other
hand, are a colour, a figure, a landscape which I
see, a chord which I hear, warmth, cold, odour which
I sense; as well as similar images which appear in the imagination. [.....] It
is hardly necessary to mention again that by 'presentation' we do not
mean that which is presented, but rather the presenting of it. This act of presentation
forms the foundation not merely of the act of judging, but also of desiring and
of every other mental act." (Brentano, 1874/1995, pp78-80;
bold keywording added; square bracketing ours. Note
the now-famous word Akt.)
It
was Wundt's student Edward B. Titchener who popularised the Brentano-Wundt debate as a confrontation
between act and content. In Titchener (1921/2004 online),
he described the two theorists as alike because (a) "by happy chance"
they had published their first major book in the same year (1874), (b) they
agreed to focus on phenomena [see Key Concepts panel above], (c) they both
rejected the unconscious as a principle of psychological explanation, and (d)
they defined "the unity of consciousness in substantially the same
terms". He saw them as differing primarily over what they accepted as the
subject matter for their observations: Brentano focused on the mental act,
whilst Wundt focused on mental content. In spite of their many similarities,
therefore, Brentano and Wundt "psychologise in
different ways" (Titchener, 1921/2004 online)
.....
"For Wundt, psychology is
a part of the science of life. Vital processes may be viewed from the outside,
and then we have the subject-matter of physiology, or they may be viewed from
within, and then we have the subject-matter of psychology. The data, the items
of this subject-matter, are always complex, and the task of experimental
psychology is to analyse them into 'the elementary
psychical processes.' If we know the elements, and can compare them with the
resulting complexes, we may hope to understand the nature of integration, which
according to Wundt is the distinguishing character of consciousness. [.....]
His primary aim in all cases is to describe the phenomena of the mind as the
physiologist describes the phenomena of the living body, to write down what
is there, going on observably before him." (Titchener,
1921/2004 online; bold emphasis added.)
It
was this focus on what was there, rather than on what was going on, which
stopped Wundt being an Akt man. Titchener explicitly warns, however, that the distinction
is often only a fine one, and that in fact you can never act without
content. Nevertheless, the separation of act and content in the philosophy
of mind is conceptually close to the separation of data and function in
computing. Data (content) is what we store until process (act) comes along and
acts upon it in some way, and, like the chicken and the egg, it is hard to see
which is the more important.
10 - Data Models in Cognitive Science
"Knowledge representation
is one of the thorniest issues in cognitive science. If we are to have a theory
in which mental objects undergo transformations, we need to have some notation
to represent these objects. The difficulty is determining what it is about a
representation that amounts to a substantive theoretical claim, and what is
just notation." (Anderson, 1993, p17.)
So
if the logical-physical divide in computing is the same as the logical-physical
divide in psychology, then what of the flagship engineering techniques? Can
they be used too? Specifically, is there a role for the data structure diagram
in helping to devise biological database schemas? The answer, as it turns out,
is not only that there is, but that considerable progress has already
been made with it, and by two of the most exciting branches of cognitive
science at that, namely "semantic networks" and "production
systems".
We
have little to say here about the semantic networks, having already written
about them at length in the following papers .....
|
See the discussion of semantic networks in the entry for "Deep Dyslexia". |
|
|
See the entry for "Semantic Memory" and follow the links. |
|
|
Contains a sustained discussion of semantic networks by mainstream connectionist authors. |
|
|
Contains a sustained discussion of semantic networks by mainstream cognitive modellers. |
|
|
Short-Term Memory Subtypes in Computing and Artificial Intelligence (Part 4) |
Section 4.2 is dedicated to the history of semantic networks up to 1958. |
|
Short-Term Memory Subtypes in Computing and Artificial Intelligence (Part 5) |
Sections 1.9 and 3.9 are dedicated to the history of semantic networks since 1959. |
As
for production systems, the roots of the approach go back to Newell (1973) and
Key Concept -
Production System: A set of computational principles and
an associated processing architecture, proposed by
Biographical Aside - John Robert
Anderson (1947-): [See fuller biography] John R. Anderson
[academic
homepage] is the R.K. Mellon Professor of Psychology
and Computer Science at
ASIDE:
In
practice, however, it took
"The basic structure of
problem solving programs (and their associated interpreter) is quite simple. In
its most fundamental form a production system consists of two interacting data
structures, connected through a simple processing cycle: (1) A working
memory consisting of a collection of symbolic data items called working
memory elements. (2) A production memory consisting of
condition-action rules called productions, whose conditions describe
configurations of elements that might appear in working memory and whose
actions specify modifications to the contents of working memory." (p3; italics original.)
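The two data structures and the processing cycle described in this quotation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ACT-R and not any published production system interpreter: the Production class, the run function, the tea-making rules, and the "first matching rule fires" policy are all hypothetical choices made for the example.

```python
from typing import Callable, NamedTuple


class Production(NamedTuple):
    """A condition-action rule held in production memory."""
    name: str
    condition: Callable[[set], bool]   # tests the contents of working memory
    action: Callable[[set], None]      # modifies the contents of working memory


# Production memory (hypothetical rules; elements are tuples of symbols).
PRODUCTIONS = [
    Production(
        "boil-water",
        lambda wm: ("goal", "tea") in wm and ("water", "boiled") not in wm,
        lambda wm: wm.add(("water", "boiled")),
    ),
    Production(
        "brew-tea",
        lambda wm: ("water", "boiled") in wm and ("tea", "made") not in wm,
        lambda wm: wm.add(("tea", "made")),
    ),
    Production(
        "finish",
        lambda wm: ("tea", "made") in wm and ("goal", "tea") in wm,
        lambda wm: wm.discard(("goal", "tea")),
    ),
]


def run(productions, working_memory, max_cycles=10):
    """Recognise-act cycle: fire the first production whose condition matches,
    then start the next cycle; stop when nothing matches."""
    fired = []
    for _ in range(max_cycles):
        for production in productions:
            if production.condition(working_memory):
                production.action(working_memory)
                fired.append(production.name)
                break
        else:
            break  # no production matched on this cycle
    return fired


wm = {("goal", "tea")}
print(run(PRODUCTIONS, wm))  # prints ['boil-water', 'brew-tea', 'finish']
```

Note that the rules never call one another; they communicate only through what they leave in working memory, which is why the quotation can describe the whole arrangement as two data structures joined by a simple cycle.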
Technical Aside - The ACT-R
Production System: ACT-R is a model of the "architecture" of
human cognition. It is heavily grounded on the distinction between "declarative
memory" [glossary] and "procedural
memory" [glossary], thus: "There are three
essential theoretical commitments one makes in the ACT-R knowledge
representation. One is that there are two long-term repositories of knowledge:
a declarative memory and a procedural memory. The second is that the chunk
is the basic unit of knowledge in declarative memory. The third is that the production
is the basic unit of knowledge in procedural memory." (Anderson, 1993, p17; italics original.) Looking back on 35 years on the
case, Anderson's main lament seems to be how few researchers have managed to
acquire the hands-on programming skills required to master the system
[unfortunately, as with the network databases, the systems have "a
forbidding aura of esoteric mystery and complexity" (Neches, Langley, and Klahr, 1987, p1), resulting in
their being somewhat daunting to the uninitiated]. For specific examples of
production rules, see
Our
own interest in production systems arises from the fact that ACT practitioners
routinely found themselves considering data relationships in their research, and
soon adopted a form of entity-relationship diagramming of their own. Writing
software to simulate the production of a sentence, for example,
involved constantly dipping in and out of the mind's lexicon [glossary],
not just for the words in their root form, but for the rules by which they
could be linked to other words.
|
Figure 5 - A Simple Propositional Network: Here is a network showing the active utilisation of knowledge during the production of a sentence. The five circles represent individual propositions, the junction points represent people nodes [e.g. "Professor Jones", top left], object nodes [e.g. "Car", centre page, bottom], and relationship nodes [e.g. "Isa", centre page, bottom], and the arrows represent the role played by the nodes in the proposition(s) in question. The central proposition is that "X bought Y", and this is mapped by a trivalent proposition, X being the Agent, Y being the Object, and BOUGHT being the relation [see mauve highlight, lower left]. The other four propositions are those which might also need to be activated to take account of context, prior experience, and "implicature" [glossary]. The diagram as a whole represents the sentence-level structure which makes sense of the individual component propositions. More complex diagrams could, of course, be used to track the paragraph-level structuring of the many tens of simultaneous propositions which need to be properly sequenced during spoken or written discourse [glossary]. There are a number of nice specimen propositional networks available online - click here, or here, or here [see Section D], if interested. |
|
Copyright © 2004,
Derek J. Smith. Redrawn from a black-and-white original in |
What
Figure 5 shows us is how the static structures of semantic memory [the
"data", or "content"] can be kicked into action [the
"function", or "act"] in the expression of a complex
thought, and with this focus on the successive activation of the component
nodes - treating the act as just a touch more important than the content - we
may perhaps acclaim Anderson as the new Brentano. However, the process will
only proceed smoothly if the data has been laid out for ease of processing in
the first place, which is why we have ourselves long argued that cognitive
science needs the sort of data modelling skills
freely available within the computing industry [see, for example, Smith (1998b)].
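A propositional network of the kind shown in Figure 5 can be held as a very small data structure, and the "kicking into action" of static content can then be sketched as activation spreading from node to node. The sketch below is an assumption-laden illustration: it encodes only two propositions, the role labels and the instance node "X1" are hypothetical, and the spread function is a simple breadth-first walk rather than the activation arithmetic of any published model.

```python
from collections import defaultdict

# Static content: each proposition maps role labels to node names.
# (Hypothetical fragment: "Professor Jones bought X1" and "X1 isa Car".)
PROPOSITIONS = {
    "P1": {"relation": "BOUGHT", "agent": "Professor Jones", "object": "X1"},
    "P2": {"relation": "ISA", "subject": "X1", "object": "Car"},
}


def build_index(propositions):
    """Invert the network: for each node, which propositions is it part of?"""
    index = defaultdict(set)
    for prop_id, roles in propositions.items():
        for node in roles.values():
            index[node].add(prop_id)
    return index


def spread(propositions, start_node, steps):
    """Act on the content: activate propositions reachable from start_node
    within the given number of steps."""
    index = build_index(propositions)
    active_nodes, active_props = {start_node}, set()
    for _ in range(steps):
        newly_active = set()
        for node in active_nodes:
            newly_active |= index[node]
        active_props |= newly_active
        for prop_id in newly_active:
            active_nodes |= set(propositions[prop_id].values())
    return active_props


print(sorted(spread(PROPOSITIONS, "Professor Jones", 1)))  # prints ['P1']
print(sorted(spread(PROPOSITIONS, "Professor Jones", 2)))  # prints ['P1', 'P2']
```

How quickly the second proposition is reached depends entirely on how the first was laid out (here, through the shared node "X1"), which is the sense in which processing depends on the prior layout of the data.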
ASIDE: Data modelling skills are not the only potentially valuable
import from computer science. Despite some major demonstrations of computerised cognition [see the work of Seidenberg and McClelland (1989), Norris (1991), and Hinton, Plaut, and Shallice (1993)], the penetration of
computing concepts and vocabulary into cognitive science has been decidedly
patchy. Interested readers can find a provisional list of "missing
concepts" in Section 1 of our e-paper on "Short-Term Memory Subtypes in
Computing and Artificial Intelligence" (Part 7). To give praise
where it is due, Baddeley
(2000) has recently done good work with the concept of "buffer",
Johnson-Laird (1987) has called for mechanisms of "deadlock"
prevention to be included in parallel processing models, and Chalmers (1995) has daringly couched the
entire consciousness debate in the language of Shannonian
information theory.
11 - Conclusion
"It is a sign of the
immature state of psychology that we can scarcely utter a single sentence about
mental phenomena which will not be disputed by many people" (Brentano,
1874/1995, p80).
To
summarise, here is our core argument, step by step
.....
1. Between 1961 and 1964, the General Electric Corporation developed the IDS database
management system. This system was based on the principles (a) that individual
fragments of data could be stored and retrieved on a "direct access"
basis, but only when (b) their "data structure" had been fully
established by painstaking analysis beforehand. The IDS developers documented
such data structures on "Bachman Diagrams", and the product
subsequently made its way to market under a number of proprietary badges and
still powers much of the heavy end of the world's on-line transaction
processing industry.
2. During the 1970s, the computing industry responded to a
surge of systems debacles by gradually introducing stricter controls over the
systems development process. This culminated in the emergence of a number of
commercially competing "structured development methodologies", and
one of the principles of such methodologies was/is that process and data are
fundamentally different things and should be analysed
separately. The data part of the equation needs to be analysed
at a logical level prior to any consideration being given to the physical
implementation of the system. The results of this analysis are then set down
formally as the "data model" for said system. Data modellers regularly use the Bachman Diagram (or variants
thereof) to give a visual summary of their conclusions.
3. Data models purport to set down all you will ever need to know about the data in
your world - how its elements must necessarily be clustered together and
interrelated in order to become meaningful, and how you are then likely to have
to store and/or retrieve them. But they do so in the abstract, and without
reference to the hardware you end up using. It follows (a) that data models
of this sort could have been drawn up BEFORE THE COMPUTER HAD BEEN INVENTED and
would have looked just the same, and (b) that data modelling
is as much a branch of associationist philosophy as
it is an IT skill. It also follows that there is an element of "optionality" about the final choice of physical
system, at both the software and hardware architecture level.
4. At the insistence of the structured development methodologies, the process of
"implementing" a logical design - turning it into a physical system -
takes place in two stages. Firstly, a "first-cut design" is hammered
out, and then the detailed design work is carried out. The resolution of optionality occurs at the first-cut stage, when the number
of candidate physical information processing systems is gradually whittled down
to just one. A first-cut design thus establishes the "computational
principles" of a system in the sense that Marr (1982) used that term.
5. The brain is a physical information processing system. Carefully housed in human
beings known as "clerks", it has been the implementation of necessity
for business systems for all but the last 130 years of civilisation,
being then progressively displaced by such inventions as the cash register
(1878), the electromechanical calculator (1885 to 1886), the punched card (1884
to 1890), the analog computer (1915 to 1931), and the digital computer (1931 to
1945). The thrust of artificial intelligence research since 1945 has been to
simulate more and more higher cognitive processes.
12 - References
See the
Master References List
ANSWERS TO EXERCISES
Exercise 1.3 (a): You cannot do it with the
existing system. The access key is <STUDENT-NAME>, and the worst thing
you can do with a direct access system is to forget those keys. Perhaps there
is some sort of desktop work log which you could consult. Or you could ask a
colleague. Or even go and stand by the cabinets and hope that some contextual
cue will spur your recollection. A typical database designer's solution
would be to add a <RECENT-UPDATES> audit trail set with PRIOR pointers,
so that you could browse backwards through all the recent changes. Job done!
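A minimal sketch of such an audit trail follows, assuming that each logged update simply records the <STUDENT-NAME> key of the record that was changed. The UpdateEntry and RecentUpdates classes and their method names are hypothetical; the only feature carried over from the text is the PRIOR pointer, which links each entry back to the one logged before it.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class UpdateEntry:
    student_name: str                        # access key of the updated record
    prior: Optional["UpdateEntry"] = None    # PRIOR pointer to the previous entry


class RecentUpdates:
    """Audit trail set: remembers the most recent entry and chains backwards."""

    def __init__(self) -> None:
        self.last: Optional[UpdateEntry] = None

    def log(self, student_name: str) -> None:
        # The new entry points back at what was, until now, the latest one.
        self.last = UpdateEntry(student_name, prior=self.last)

    def browse_backwards(self) -> Iterator[str]:
        entry = self.last
        while entry is not None:
            yield entry.student_name
            entry = entry.prior


trail = RecentUpdates()
for name in ["ADAMS", "BAKER", "CLARK"]:
    trail.log(name)
print(list(trail.browse_backwards()))  # prints ['CLARK', 'BAKER', 'ADAMS']
```

Once a forgotten key has been recovered in this way, the record itself can be fetched by direct access as usual.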
Exercise 1.3 (b): You cannot do it with the
existing system, because there are no "superchief"
records. Perhaps there is some sort of contents sheet - a sort of "Guide
to our Archive" - to consult, otherwise you are going to have to go
through the entire category selecting the ones you want by eye. A typical
database designer's solution would be to add a <YEAR-OF-GRADUATION>
entity class, with occurrences for 2001, 2002, 2003, etc., each owning its
particular year's records.
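The owning entity class suggested here might be sketched as follows. This assumes, hypothetically, that student records are reached by direct access on <STUDENT-NAME> and that each record has exactly one graduation year; the Archive class, its method names, and the sample data are illustrative only.

```python
from collections import defaultdict


class Archive:
    def __init__(self) -> None:
        self.by_name = {}                  # direct access on <STUDENT-NAME>
        self.by_year = defaultdict(list)   # one <YEAR-OF-GRADUATION> occurrence
                                           # per year, owning that year's records

    def store(self, student_name: str, year: int, record: dict) -> None:
        self.by_name[student_name] = record
        self.by_year[year].append(student_name)

    def graduates_of(self, year: int) -> list:
        """Walk the owner's member list instead of scanning the whole archive."""
        return [self.by_name[name] for name in self.by_year.get(year, [])]


archive = Archive()
archive.store("ADAMS", 2001, {"name": "ADAMS", "degree": "BSc"})
archive.store("BAKER", 2002, {"name": "BAKER", "degree": "BA"})
archive.store("CLARK", 2001, {"name": "CLARK", "degree": "BSc"})
print([r["name"] for r in archive.graduates_of(2001)])  # prints ['ADAMS', 'CLARK']
```

The cost of the extra entity class is paid at storage time, when each new record must be connected to its owner; the benefit is that retrieval by year no longer requires selecting records by eye.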
Exercise 1.3 (c): Comments as for (b).