
Course Handout - An Introduction to Data Modelling for Semantic Network Designers

Copyright Notice: This material was written and published in Wales by Derek J. Smith (Chartered Engineer). It forms part of a multifile e-learning resource, and subject only to acknowledging Derek J. Smith's rights under international copyright law to be identified as author may be freely downloaded and printed off in single complete copies solely for the purposes of private study and/or review. Commercial exploitation rights are reserved. The remote hyperlinks have been selected for the academic appropriacy of their contents; they were free of offensive and litigious content when selected, and will be periodically checked to have remained so. Copyright © 2004-2018, Derek J. Smith.

First published online 08:30 BST 1st October 2004, Copyright Derek J. Smith (Chartered Engineer). This version [2.0 - copyright] 09:00 BST 5th July 2018.

Although this paper is reasonably self-contained, it is primarily designed to be read as a subordinate file to our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (especially, Parts 5 to 7 thereof).

1 - Introduction

"Only about 16% of IT projects can be considered truly successful" (Computer Weekly, 27th April 2004).

The UK currently [August 2004] spends some £22.6 billion a year on computer systems, a high proportion of which [roughly five sixths of it, if we are to believe our header text] is more or less wasted. The situation is so bad, indeed, that it has become something of an annual ritual amongst software industry pundits to produce the latest failure statistics and horror stories. One early researcher into the causes of computerisation failure was the University of Missouri's Donald A.B. Lindberg (1933-). He drew on the experiences of the US healthcare industry with computerisation projects, and placed vagueness and (on occasions) deliberate misinformation high on the list of root causes of failure, thus .....

"In no case can one yet say that medical care of ill patients actually depends upon a computer or information system. Why is this? [Firstly,] medical people have been extremely slow to spell out in a cohesive and organised form the conditions under which they wish to work with an information system. [And secondly,] the flagrant and consistent 'over sell' of capability on the part of manufacturers and computer enthusiasts." (Lindberg, 1967.)

Lindberg was closely followed by G. Octo Barnett, an American cardiologist, who had been involved during the late 1960s in trials of an early time-sharing system at Massachusetts General Hospital. Barnett had seen at first hand how easy it was for ostensibly good ideas to degenerate into expensive systems debacles, and when he investigated the sort of things which had been going wrong, he identified a number of recurring factors. He duly summarised these as the "Ten Commandments" of successful medical informatics (Barnett, 1970), and they covered such areas of engineering best practice as obtaining an agreed statement of requirements and explicitly designing in the appropriate levels of robustness and reliability. More than three decades later, Lindberg's and Barnett's advice still defies our corporate ability to deliver on it, and Lorenzo and Riley (2000/2004 online) were not ashamed to start their analysis of the psychology of systems failures by reminding us of Barnett's Ten Commandments.

So why are we so bad at following advice? Well, one of the secrets of getting a computer system to work is to understand that management failures are only ever the secondary cause of a systems disaster, for even the weakest manager can look good if nothing ever goes wrong. What really does the damage is an underlying technical problem combined with slack management, for then all hell breaks loose. Which brings us to the point of this section, which is that modern systems are by and large only as successful as their "data model" .....

Key Concept - The Data Model: We shall be looking at the concept of data models in considerable detail over the coming pages, so here is a temporary definition to be getting on with: A data model is a formally commissioned compilation of empirical research findings concerning the data known to a system, such that it becomes the sum total of an organisation's understanding of the informational content of its world (or at least of a particular precisely delineated part of that world). It is "a way of representing a business enterprise (or other similar organisation such as a government agency) on paper, much the same as a street map is a way of representing a city on paper" (Relational Systems Corporation, 1989). And if your metaphorical map is substandard, of course, then you will get lost. In short, it is a system's investment in a high-specification data model which most prevents the sort of technical problems referred to above.

A data model is thus an excellent example of what modern commentators like to call "metadata" .....

Key Concept - Metadata: Metadata is data about data. It is facts about the facts themselves, as when we support a simple proposition with a number of ancillary propositions [Example: To say that oxygen is the eighth element by atomic number is to state a bald chemical fact, but to add that it was Joseph Priestley who discovered that fact is to have supporting knowledge]. The role of metadata in data modelling is vital and includes such ancillary information as field sizes, character set, hierarchical structure, usage, and retention period. For more on the systems analyst's view of data see Section 1.4 of our e-tutorial on "IT Basic Concepts", and for more on the programming aspects see Section 2.3 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 6).
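
As a minimal sketch of what such ancillary information might look like when written down formally, the following Python fragment records the metadata for a single field and uses it to vet candidate values. The class name FieldMetadata, the field student_surname, and every value shown are our own illustrative assumptions; they are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Data about one data field (hypothetical structure, for illustration only)."""
    name: str             # what the field is called
    size: int             # maximum length in characters
    character_set: str    # permitted encoding
    parent: str           # record type the field belongs to (hierarchical structure)
    usage: str            # what the field is used for
    retention_years: int  # how long values must be kept

    def accepts(self, value: str) -> bool:
        """Check a candidate value against the size and character-set metadata."""
        if len(value) > self.size:
            return False
        try:
            value.encode(self.character_set)
        except UnicodeEncodeError:
            return False
        return True


# An invented field definition.
surname = FieldMetadata(
    name="student_surname", size=30, character_set="ascii",
    parent="STUDENT", usage="identification", retention_years=6,
)
```

Note that none of these six items is itself a fact about any student; each is a fact about the facts the system is prepared to hold.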

This paper traces the evolution of the data model over the last half century, and then considers the cross-disciplinary relevance of the underlying concepts to two particular areas of cognitive scientific research, namely the philosophy of mind and the use of semantic networks within artificial intelligence. We begin with some historical perspective .....

2 - Historical Background

"The chain [Marley] drew was clasped about his middle. It was long, and wound about him like a tail; and it was made (for Scrooge observed it closely) of cash-boxes, keys, padlocks, ledgers, deeds, and heavy purses wrought in steel." (Charles Dickens: "A Christmas Carol"; bold emphasis added. Note the nature of commercial "data processing" half a century before Herman Hollerith's 1890 punched card system ushered in the modern age.)

During the Second World War, computer programmers tended to be university mathematicians attached to the military. They worked typically on top-secret projects like automated gun-laying, the atom bomb, or code-breaking [illustrative history], and they were their own harshest critics because they knew exactly what they were trying to achieve (in many instances better than anyone else on the planet at the time). The unit of software development was therefore the individual computer program, and the software development process relied on a technique known as "functional decomposition" [tutorial], in which the overall "functionality" of the system (as defined by those who were paying for it) was progressively broken down into chunks of logic precise enough to be coded. Only at the last moment was either the location or the nature of the associated data ever taken into account.

As the years went by, however, increasing hardware capacities allowed systems to become more complex, and it eventually became clear that there were flaws inherent in the function-first (or "function-driven") approach. On any or all of three counts, systems whose component programs had been constructed in this way were nightmares to keep integrated. Firstly, the data output from program A was never what was needed by programs B, C, and D, either because program A was breaking new ground, or because their respective boffins had failed to keep each other fully informed. Secondly, when changes were required to one program it would bring down perfectly good programs elsewhere, requiring compensatory changes to them as well; which then brought down yet other programs in turn, and so on. And thirdly, data duplications across the various programs were allowed to become the rule, meaning that when you amended a data definition at one location you had also to update all its duplications (providing you could remember where they all were, and providing you could cope with the practical difficulties of amending card- and tape-based files).

To start with, bugs and irritations of this sort were dealt with reactively, that is to say, as and when they arose; programmers spent all their time correcting their programs anyway, so an additional stream of data definition problems was just another occupational hazard. There was no proactive error prevention, in short, because there were simply not enough systems around to justify the effort. Then things started to change. By the mid-1950s, computers like the IBM 650, the UNIVACs, and the LEOs were starting to make major inroads into the commercial data processing marketplace [illustrative history], whereupon the demand for programming skills suddenly exceeded the available supply of boffins. New programmers were therefore recruited from amongst the technically minded and trained to order. Unfortunately, being now just ordinary folks, they were no longer able to work out for themselves what they needed to do - you could make them good at the mechanics of programming, sure, but they nevertheless lacked the boffins' instinctive feel for the systems end of things. So out went programming as the largely self-specifying side of applied electronics, and in came the skills of "systems analysis" to help draw up the specifications for the programmers to work to .....

Key Concept - Systems Analysis: Systems analysis is the search for abstract principles in real-world systems, especially information processing systems [see alternative Wikipedia Definition]. It is "operational research" [glossary] taken from the factory floor and applied instead to the world of information flow, and to do it well calls for many of the skills and mind sets of "systems thinking", as detailed in our e-paper on "Systems Thinking: The Knowledge Structures and the Cognitive Processes". Systems analysis also requires a portfolio of information reduction tools, many of which represent their conclusions in diagrams and tables [e-tutorial; further detail]. For a detailed history of the "systems men", see Haigh (2001).

The shortage of programmers also meant that errors now needed to be more diligently avoided, and it was soon realised how many of them had been self-inflicted all along and would disappear if just a little more attention was paid to the underlying data structure at the beginning of the development process. Data "definitions" accordingly became part of standard program documentation, central registries started to be set up to co-ordinate the hundreds, if not thousands, of data fields found in modern organisations, and constructing one's "data divisions" became a major aspect of program writing.

ASIDE: To see what the Data Division looks like in the COBOL programming language, see Section 2.3 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 6).

At the same time, it became important for senior designers to ensure that data files were set up with as little duplication as possible, and then used in the most efficient sequence possible; and the best way to achieve this was to set up a carefully integrated central copy of the data. This data "pool" or "bank" or "base" could then be administered by a dedicated team of suitably experienced professionals, and made available on a "shared access" basis to whoever had a valid need. Individual applications programs dealt only with the particular subset of the central pool that they happened to be concerned with, whilst locks and privacy mechanisms protected the integrity and availability of the rest like gold dust. We mention all this because these are the principles of what we know today as the "database", and the teams who administer the process are known as "database administrators" (DBAs).

ASIDE: To a not-inconsiderable extent, the modern world is its databases, for without them the banking, point of sale, e-commerce, logistics, and knowledge industries would be put back half a century at a stroke. Nevertheless, the battle against the duplication and sloppy definition of data has only partly been won, for although large systems are now meticulously analysed as a matter of routine [albeit this does not stop the five-out-of-six failure rate mentioned in Section 1], the proliferation of desktop and mobile systems in the modern corporate world has destroyed any pretence of data standardisation (at least we know of no organisation on earth who has this under control).

As things worked out, systems analysis skills were not the only factor missing from the systems development equation. IT projects now involved so many people that far greater managerial discipline was needed as well, and so "IT project management" was born .....

ASIDE: Amongst the first civilian IT project managers were the Lyons Company's John Simmons and David Caminer [full story]. For an introduction to the skills and qualities demanded of the role see our e-tutorial on "IT Project Management", and to see what happens when the managers get it wrong [which is what really accounts for the five-out-of-six failure rate mentioned in Section 1], see our "IT Project Mismanagement Disasters" database.

What the early project managers did was to chart the activities and deliverables involved in putting a system together, and then simply to arrange for everything to be done in the least wasteful sequence, using the "critical path analysis" techniques being developed around that time in the aerospace industry [full story]. This established an infrastructure of standards to observe and controls to administer, into which were firmly embedded the all-important technical analyses of data and function. Systems development now had an "industry best practice" of its own, in which everyone knew what needed to be done and in what order, and this, by the 1980s, gave us what came to be known as the "structured development methodology" .....

Key Concept - The Structured Development Methodologies: Here is how we defined this topic in our e-tutorial on "IT Project Management": "A structured methodology is a staged development, with the final stage producing the finished product, and each preceding stage producing a logical subset of prior components, in much the same way that a car is put together on a factory assembly line. It is the staged machining of components. It is the engineering of systems. It is a philosophy of system development. Key features are that it is simultaneously user-oriented (being driven from the outset by carefully derived statements of user requirement) as well as product-oriented (in that it then carefully specifies what is to be delivered, and when and how). While there may well be minor differences across the market range of structured development methodologies, their common denominator has always been their unrelenting emphasis on developmental sequence." The production stages which were standardised upon were Feasibility Assessment, Project Set-Up (or "Initiation"), Requirements Specification, Logical Design, Physical Design, Development, Delivery, and Operation.

In Britain, the most widely known of the structured methodologies came from the Central Computer and Telecommunications Agency (now part of the Office of Government Commerce) in the shape of SSADM [short for "Structured Systems Analysis and Design Method"] and its associated project management package called PRINCE [short for "Projects in Controlled Environments"].

ASIDE: SSADM was a government-sponsored development of LSDM [short for Learmonth Structured Development Method], itself arguably the first structured development methodology. LSDM was put together by Learmonth and Burchett Management Systems [LBMS - now part of Computer Associates International, Inc.] in the late 1970s. SSADM first appeared in January 1981 after a formal evaluation of no less than 47 competing products, and has been the gold standard for British civil service and commercial applications ever since. For further details, the booklet SSADM Version 4 is a good introduction (CCTA, 1991). The current version of the control methodology is PRINCE 2.

So there we have it - systems with inferior data models will always fail, no matter how clever their programmers, because sooner or later the accumulation of individually small problems will exceed the team's ability to cope, whilst systems with superior data models, although not totally immune to disaster, are at least immune to the greatest single error stream, namely that arising from confusion as to what the subject matter of one's computation actually is.

3 - The Logical-Physical Divide

"Once management realises the relationship of reliable data to corporate well-being, they will treat their data with the same care used to handle their cash" (Cahill, 1970, p23; cited in Haigh, 2004/2004 online).

Structured development, then, is essentially staged development, with management break-points separating requirements specification and logical design, logical design and physical design, physical design and development, and development and delivery. All these breakpoints are important, but by far the most fundamental is the one separating the "logical" and the "physical" stages of design, because this is the one which allows analysts not to have to worry about the physical design decisions which will eventually be based on their findings. It allows them time to get their data structures right in the abstract, and both allows and requires the wise project management team to keep the entire project firmly "on hold" until that abstract understanding is complete. The ensuing challenge lies in then "implementing" the logical design, that is to say, in turning the logical design documentation into a particular physical system mounted on a particular physical platform in a particular physical way.

Key Concept - "Logical Design" vs "Physical Design": The logical design stage of a computerisation project allows the data and function within the systems area in question to be thoroughly researched and analysed. The practical value of the resulting reference documentation is that it allows better physical design decisions to be made when the time comes, and thus more successful systems to be built. The process of giving physical dimensions to a previously logical design is known as "implementing" that logical design, and we shall be saying a lot more about what is involved in Section 8.
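
To make the idea of "implementing" a logical design a little more concrete, here is a minimal sketch in Python. The logical description of a single entity says nothing about storage; a separate step then turns it into one particular physical form, in this case a table in an in-memory SQLite database. The entity STUDENT, its attributes, and the mapping of logical types onto SQLite column types are all assumptions made purely for illustration.

```python
import sqlite3

# Logical design: what the organisation needs to know, with no storage decisions.
# The entity and attribute names are hypothetical.
LOGICAL_ENTITY = {
    "name": "STUDENT",
    "key": "student_id",
    "attributes": {
        "student_id": "number",
        "surname": "text",
        "year_of_study": "number",
    },
}

# Physical design decision: one possible mapping of logical types onto SQLite types.
SQLITE_TYPES = {"number": "INTEGER", "text": "TEXT"}


def implement(entity: dict) -> str:
    """Turn a logical entity description into a CREATE TABLE statement."""
    columns = []
    for attribute, logical_type in entity["attributes"].items():
        column = f"{attribute} {SQLITE_TYPES[logical_type]}"
        if attribute == entity["key"]:
            column += " PRIMARY KEY"
        columns.append(column)
    return f"CREATE TABLE {entity['name']} ({', '.join(columns)})"


ddl = implement(LOGICAL_ENTITY)
connection = sqlite3.connect(":memory:")  # the "particular physical platform"
connection.execute(ddl)
connection.execute("INSERT INTO STUDENT VALUES (1, 'Jones', 2)")
rows = connection.execute("SELECT surname FROM STUDENT").fetchall()
```

The point of the separation is that the same logical description could equally be implemented on a quite different platform simply by replacing the mapping and the implement step, leaving the analysts' findings untouched.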

And this is where the data model comes in. Remembering what we said earlier about data being more fundamental than process, the most important technique at the logical design stage turns out to be the creation of a record of the real world objects the organisation is concerned with, how they behave, and what their properties are. Here are some definitions from the literature .....

"Data models are techniques for representing information, and are at the same time sufficiently structured and simplistic as to fit well into computer technology" (Kent, 1978, p93).

"[Data modelling is] the process of structuring real world facts into real world concepts [and of] deriving a conceptual model as a result" (Oxborrow, 1989, pp15/19).

"The primary purpose of any data model [.....] is of course to provide a formal means of representing information and a formal means of manipulating such a representation. For a particular model to be useful for a particular application, there must exist some simple correspondence between the components of that model and the elements of that application; that is, the process of mapping elements of the application into constructs of the model must be reasonably straightforward." (Date, 1983, pp182-183.)

Figure 1 shows us how the data model fits in to the broader process of systems development .....

Figure 1 - The Sequence of Events during Structured Development: With the arrival of the structured development methodologies in the early 1980s, it became standard practice to partition systems development both "vertically" (with the early developmental stages routinely diagrammed to the left of the later ones) and "horizontally" (in terms of whether you were focussing on the supporting data or the supported function). Here are the four Johari quadrants [bottom right] which emerge when the two stages of development (i.e. the logical and the physical) intersect with the two fundamental design aspects (i.e. the data and the function). The logical view of a system's function is set out in the "functional decomposition" [upper left quadrant], whilst the logical view of its data is set out in the DATA MODEL [lower left quadrant]. These map in due course onto the program structures [upper right quadrant], and the program data definitions [lower right quadrant], respectively. We have set the data model quadrant into RED-BOLD, because it is the key to doing the whole job properly.
	Developmental Phase
	Logical Design	[Transition]	Physical Design
Function	FUNCTIONAL DECOMPOSITION	[WE'LL TALK ABOUT WHAT GOES ON IN THIS TRANSITIONAL BOX IN SECTION 8]	PROGRAM STRUCTURES
Data	DATA MODEL	[AS ABOVE]	DATA DEFINITIONS

Figure 1 is important because it gives us a context against which we can state the purpose of the present paper quite precisely. For the reasons set out in the caption above, we shall be concerning ourselves only with the lower left quadrant of Figure 1 (the data-logical quadrant), and looking in greater detail at how to construct one of these data models. We shall then reflect upon the skills needed during the construction process, and consider what cross-disciplinary relevance those skills might have for psychology.

4 - The Bachman Diagram and the Database Management System

"The Data Base Management System (DBMS) is the foundation of almost every modern business information system. Virtually every administrative process in business, science, or government relies on a data base. The rise of the Internet has only accelerated this trend - today a flurry of database transactions powers each content update of a major website, literature search, or internet shopping trip. Yet very little research addresses the history of this vital technology, or that of the ideas behind it. We know little about its technical evolution, and still less about its usage." (Haigh, 2004/2004 online; bold emphasis added.)

This section shares some content with Section 3.1 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 5).

To summarise our argument so far, we have tried to view the process of data modelling against the broader context of systems development, so that those outside the systems industry can see how and why so much time and money needs to be spent on this particular activity. We now look in more detail at what this obsession with data actually entails. How, for example, did the industry progress from a simple sheaf of data definition slips to the database as we know it today, and when did the data-first (or "data driven") approach first start to take hold? The answer, in most tellings of the story, takes us back to the early 1960s, and to the General Electric Corporation's computing laboratories in New York, where one of GE's recent recruits, Charles W. Bachman, was more or less single-handedly inventing data modelling as an adjunct to developing GE's in-house "Integrated Data Store" (IDS) database management system .....

Biographical Aside - Charles W. Bachman (1924-): [See fuller biography] Famous in several areas, we mention one-time "triple-A" technician Charles Bachman here for having devised a version of the data-structure diagram known as the "Bachman Diagram" [more on which below], and for showing how such diagrams could make for highly effective use of "direct access" data storage devices. Bachman was awarded the 1973 Turing Award by the Association for Computing Machinery for this achievement. <<AUTHOR'S NOTE: Never underestimate the importance of gunnery control computation in the history of the world. As we explained in detail in Section 2 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 2), the demand for effective anti-aircraft predictor technology was a major driving force for computer development during the 1940s. Bachman was with the US Army Anti-Aircraft Artillery Corps from 1943 to 1946, and gained early computing experience with the predictor systems on the 90mm anti-aircraft gun [picture and specifications].>>

Historical Aside - The "Database Management System": A Database Management System, or "DBMS", is a complex software product designed to manage large pooled stocks of data for you, and especially to allow that data to be accessed by lesser software products called "application programs" [tutorial]. Databases are thus the computer equivalent of the old-fashioned file index systems, but with the advantage of very rapid search times. Haigh (2004/2004 online) argues that we should view the DBMS as a coming together of three originally separate earlier trends, namely (a) the idea of a common pool of data, (b) the development of "file management" software, and (c) the growing sophistication of "report generator" software. Bachman himself claims that GE's 1957 "Report Generator System" "was the first production data base management system" (Bachman, 1980, p7), and was himself responsible for building a similar product at the Dow Chemical Company in 1958.

What made the IDS system tick was a clever combination of two ideas. On the one hand there was the then-brand-new "direct access" facility provided by disk storage devices [which we have described in detail in Section 1.2 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 4)], and on the other hand there was what Bachman called the "data structure set", a method of preparing your data so that it could make the best possible use of that access technology. The essence of the data structure set was that each record could be "logically associated, as a trailer, with multiple header records" (Bachman, 1980, p7), that is to say, on a set owner/set member basis. And when you brought these two ideas together, the result was sheer engineering elegance - you could store the owner record using the direct access technology and then pick up its related members using externally invisible addressing. Above all, you could toss a particular record in amongst a million similar ones, and still go straight to it when you needed it again! The following worked example will illustrate the power of this new method of access .....

Exercise 1 - An Everyday Example of Direct Access Storage and Retrieval in a Set-Structured "Database"

1 In the psychology office, a few feet from where we are typing this in, there is a row of six conventional four-drawer filing cabinets. Each of the 24 drawers contains alphabetically sequenced hammock-slung student record files, perhaps four files per hammock, and perhaps 60 hammocks per drawer. Five separate alphabetically sorted runs of files are maintained, as follows: (1) all extant first year students, (2) all extant second year students, (3) all extant third year students, (4) all successful graduates, and (5) all failures and withdrawals.

2 This sort of paper-based filing system is notoriously bad at coping with certain types of update and retrieval transaction. For example, it has no overall list of the members of any one category of student. That sort of list must be carefully maintained elsewhere, and the two systems regularly cross-checked. The risk is, of course, that any given file might be "out" or misfiled at any given time. And when a file cannot be found, the only real option is to go looking for it, scanning through the files themselves one by one in what is termed a "serial", or "exhaustive", search. It follows that while alphabetic misfiles are bad enough (since they invite a serial search of at least a hammock or two), category misfiles are even worse, since they invite a serial search of the entire six cabinets!

3 Devise an indexing system capable of locating (a) the student you wrote to three hours ago (into whose file you suspect you misfiled your pay cheque, but whose name you have momentarily forgotten), (b) all 2001 graduates, (c) the mathematics grades of all withdrawals within their first year of study. [ANSWERS AT END.]

4 Have a look at Morton, Hammersley, and Bekerian (1985), noting especially their notion of "headings" as the access keys for biological memories.
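
The contrast drawn in step 2 between a serial search and keyed access can be sketched in a few lines of Python. This is not offered as the answer to step 3; it merely shows why an index changes the cost of retrieval. The file contents, the function names, and the choice of surname as the key are all invented for the purpose.

```python
# Hypothetical student files: (surname, category). All values are invented.
FILES = [
    ("Adams", "first year"),
    ("Baker", "graduate"),
    ("Clark", "withdrawn"),
    ("Davies", "second year"),
    ("Evans", "graduate"),
]


def serial_search(files, surname):
    """Scan the files one by one; return the record and the number inspected."""
    inspected = 0
    for record in files:
        inspected += 1
        if record[0] == surname:
            return record, inspected
    return None, inspected


# An index maps a key straight to a position, so no scanning is needed.
INDEX = {surname: position for position, (surname, _) in enumerate(FILES)}


def indexed_lookup(files, index, surname):
    """Go straight to the record via the index, without inspecting the others."""
    position = index.get(surname)
    return None if position is None else files[position]
```

With only five files the saving is trivial, but the serial search grows with the size of the filing cabinet whereas the indexed lookup does not. The price, as noted in step 2, is that the index is a second system which has to be kept in step with the first.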

The first lesson of data modelling is therefore that the long-term success of a system is proportional to the amount of careful thought put in during early development. Specifically, you need to keep careful track of which records are set owners and which are set members (and also, incidentally, to decide how you are going to handle any record types which are both owners and members [like the hospital ward mentioned in Section 7]). Bachman's solution came in two parts - what to do, and how to document it. The first part of the solution was to identify all the owner-member relationships. This involved (1) identifying the attributes that mattered, (2) deciding how these clustered together into entities, and (3) considering how these entities might be related. The second part of the solution was to display this priceless metadata graphically, combining all the individual data structure sets into a single larger diagram known as a "data structure diagram" (soon to become famous as the "Bachman Diagram"). The real beauty of the data structure diagram is that being a diagram it has all the traditional advantages of pictorial matter for the rapid communication of ideas - once you have grasped the visual "syntax", each picture (without the slightest exaggeration) speaks a thousand words.
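
The owner-member idea can be sketched as follows. We assume, purely for illustration, that each owner record holds the address of its first member and that each member holds the address of the next, so that the set is a chain of pointers laid over a directly addressable store. This is a simplified toy in Python and should not be read as a description of how IDS itself was coded; the addresses and record contents are invented.

```python
# A toy "disk": records addressed by number, so any record can be fetched directly.
# Addresses and contents are invented for illustration.
STORE = {
    100: {"type": "OWNER", "name": "Supplier X", "first_member": 205},
    205: {"type": "MEMBER", "name": "Part A", "next": 317},
    317: {"type": "MEMBER", "name": "Part B", "next": 142},
    142: {"type": "MEMBER", "name": "Part C", "next": None},
    999: {"type": "MEMBER", "name": "Unrelated part", "next": None},
}


def members_of(store, owner_address):
    """Fetch the owner directly, then follow the pointer chain through its members."""
    owner = store[owner_address]
    names = []
    address = owner["first_member"]
    while address is not None:
        record = store[address]
        names.append(record["name"])
        address = record["next"]
    return names
```

Calling members_of(STORE, 100) visits only the owner and its three members; the unrelated record at address 999 is never touched, however many such records the store may contain.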

Key Concept - The Bachman Diagram: This was Bachman's personal notation for data structure diagrams. It shows the record types needing to be stored in the proposed system, together with their storage arguments and their owner-member set relationships. Bachman Diagrams treat attributes as the properties of things; as atomic items of data, each of which is capable of being named, but not of being further divided. Entities are treated as the things which matter to, and therefore need to be identified by, the system. Each entity is thus a collection of attributes, and relationships are the reasons entities may be associated. Relationships are assertions of truth about the subject area [we shall therefore meet them again in Section 10 when discussing "propositional networks"], and take the form "a man can own many dogs". Both the subject and object of this truth are themselves entities, and there is usually a one-to-many relationship between them. The notational conventions are few: attributes are usually relegated to the supporting documentation for clarity's sake, classes of entities are represented by suitably captioned boxes, relationships by lines drawn between the entity boxes concerned, and the pluralities by adding arrowheads or so-called "crows' feet" symbols at the "many" end of these lines. The rule is that "the arrow points from the entity class that owns the sets to the entity class that makes up the membership of the sets" (Bachman, 1969, p5).
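
The notational rule just quoted is simple enough to be captured in a few lines. The sketch below, in Python, holds a diagram as a set of entity classes plus a list of owner-to-member relationships, and renders each arrow as text. The names Relationship, check, and render, and the text form of the arrow, are our own illustrative choices rather than any standard notation.

```python
from typing import NamedTuple


class Relationship(NamedTuple):
    owner: str   # the "one" end: the entity class that owns the sets
    name: str    # the assertion linking the two entity classes
    member: str  # the "many" end: the entity class making up the sets


# "A man can own many dogs", expressed as two entity classes and one relationship.
ENTITIES = {"MAN", "DOG"}
RELATIONSHIPS = [Relationship(owner="MAN", name="owns", member="DOG")]


def check(entities, relationships):
    """Every relationship must join two declared entity classes."""
    return all(r.owner in entities and r.member in entities for r in relationships)


def render(relationship):
    """Draw one arrow as text, pointing from the owner to the member."""
    return f"[{relationship.owner}] --{relationship.name}--> [{relationship.member}]"
```

Attributes are deliberately absent from this structure, mirroring the convention that they are relegated to the supporting documentation.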

Here is a specimen Bachman Diagram .....

Figure 2 - A Bachman Data Structure Diagram: Here is a small but nonetheless illustrative example of the Bachman Diagram expression of a typical data model. At a structural level it shows four entity types [the boxes] and three relationships [the arrows]. More specifically, it shows the books in a library, and the mechanism of their reservation by library users. Note how the natural pluralities of the real world are represented primarily by one-to-many relationships. For example, a library may hold many copies of a single catalogued title, as shown by the <library-has> relationship towards the left of the diagram. Similarly, there may be a queue of many reservations for each title, each one of which has associated with it the name and address of the corresponding library-user. There are a few specimen Bachman Diagrams available online - see Section 4 of Hitchman (2004/2004 online), Section 15.6 of Yourdon (2001/2004 online), or Maurer and Scherbakov (2004 online).

If this diagram fails to load automatically, it may be accessed separately at

http://www.smithsrisca.co.uk/DMODELS-fig2.gif
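For readers who prefer code to pictures, the owner-member reading of a Bachman Diagram can be sketched in a few lines of Python. This is a minimal sketch only: the four entity names, and the set names other than library-has, are our assumptions inferred from the caption to Figure 2, and are not copied from the diagram itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetType:
    """One arrow on a Bachman Diagram: owner box -> member box (one-to-many)."""
    name: str
    owner: str
    member: str


# Hypothetical reading of Figure 2: four boxes and three arrows.
ENTITIES = ["Title", "Copy", "Reservation", "LibraryUser"]
SETS = [
    SetType("library-has", owner="Title", member="Copy"),
    SetType("title-reserved", owner="Title", member="Reservation"),       # assumed name
    SetType("user-reserves", owner="LibraryUser", member="Reservation"),  # assumed name
]


def owners_and_members(sets):
    """Classify entity names as owner-only, member-only, or both."""
    owners = {s.owner for s in sets}
    members = {s.member for s in sets}
    return owners - members, members - owners, owners & members


owner_only, member_only, both = owners_and_members(SETS)
print("owners only: ", sorted(owner_only))
print("members only:", sorted(member_only))
print("both:        ", sorted(both))
```

The third category is the one to watch: a record type which is both an owner and a member (none in this small model) is exactly the case the designer has to decide how to handle.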

Bachman and his team had a prototype version of IDS up and running in "early 1963" (Olle, 1978), and had it reliable enough for full operational use monitoring GE's own stock levels in 1964 (Bachman, 1980). Bachman then spent the mid-1960s at GE in Phoenix, AZ, upgrading the system for bigger and ever faster machines, and so enthusiastic was the initial user feedback that the Bachman-GE approach soon came to the attention of the computing industry's de facto steering committee, the Conference on Data Systems Languages (CODASYL) .....

Historical Aside - "CODASYL" and the "Data Base Task Group": The Conference on Data Systems Languages was set up by the US Department of Defence in May 1959 at the suggestion of Charles A. Phillips, the Pentagon's Director of Data Systems Research. Their remit was to produce a general purpose computer programming language for business users, and they organised themselves into three more precisely tasked sub-committees, namely (1) the Short Range Committee (SRC), responsible for the immediate specification of the language, (2) the Intermediate Range Committee, responsible for its development in the medium term, and (3) the Long Range Committee, responsible for its development in the longer term. In the event, only the SRC actually sat, and their principal success was the COBOL programming language, whose specifications were approved in January 1960. The new language was not perfect, of course, and various teething troubles were reported. It was bad, for example, at processing chain pointer sets (or "lists") of records, and from October 1965 the CODASYL sub-committee structure was extended by the addition of the List Processing Task Force (LPTF) to look into improvements in this area of functionality. By 1967, however, the LPTF meetings were so dominated by database issues that the committee's name was changed to the Data Base Task Group (DBTG). William Olle of RCA was at the first DBTG meeting in Minneapolis in May 1967 and served to January 1968, an experience he described recently as "enriching and instructive" (Olle, 2004; personal communication).

Key Concept - Chain Pointers: A systems programming device used in the IDS and subsequent DBMSs for implementing an owner-member set relationship. Involves adding space for one or two address fields to the basic record length, such that each record in a set can "point" to the one after and/or before it in the set. These are known as NEXT and PRIOR pointers, respectively [there is an explanatory graphic in Schubert (1972; Figure 1), if needed]. Here is an extract from Olle (1978) which illustrates the relationship between COBOL and the set pointer: "The first report from the DBTG came out in January 1968 and was entitled 'COBOL extensions to handle data bases'. Some quotations from the one page summary of recommendations show the thinking of the era. It was recommended to 'Add a facility for declaring master and detail record relationships which use circular chains as a means to provide the widest possible file structuring capability.'" (p4.)
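The circular chain just described can be sketched as follows. This is an illustration under stated assumptions, not IDS or IDMS code: Python object references stand in for the stored address fields, and the class and function names (Record, connect_last, walk_next, walk_prior) are hypothetical.

```python
class Record:
    """A stored record with space reserved for NEXT and PRIOR pointers."""

    def __init__(self, name):
        self.name = name
        self.next = self   # an owner with no members points to itself
        self.prior = self


def connect_last(owner, member):
    """Link a new member in at the end of the owner's circular chain."""
    last = owner.prior
    last.next = member
    member.prior = last
    member.next = owner
    owner.prior = member


def walk_next(owner):
    """Follow NEXT pointers until the chain arrives back at the owner."""
    names, rec = [], owner.next
    while rec is not owner:
        names.append(rec.name)
        rec = rec.next
    return names


def walk_prior(owner):
    """Follow PRIOR pointers, visiting the members in reverse order."""
    names, rec = [], owner.prior
    while rec is not owner:
        names.append(rec.name)
        rec = rec.prior
    return names


ward = Record("Ward 7")
for bed in ("Bed A", "Bed B", "Bed C"):
    connect_last(ward, Record(bed))

print(walk_next(ward))
print(walk_prior(ward))
```

Because the chain is circular, the owner record doubles as the end-of-set marker, so no separate terminator is needed.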

Then came a curious turn of events which saw the development rights to IDS being taken over by the B.F. Goodrich Chemical Corporation of Cleveland, OH, (henceforth simply "Goodrich"). The motivation for this reversal of roles seems to be (a) that IDS had bugs in the software which GE had no time to cure, and (b) that Goodrich preferred to remain IBM users. Goodrich therefore undertook a repair-and-migrate exercise [see, for example, Karasz (1998/2004 online)], and (unfortunately for GE) did such a good job that they were able in 1969 to field their own system under the name "Integrated Database Management System" (IDMS). The development work was carried out at Goodrich's Cleveland office, and the project manager was Richard F. Schubert .....

Biographical Aside - Richard F. Schubert (): This from the biographical note at the end of Schubert (1972): "Mr. Schubert is manager-information systems programming and operations for B.F. Goodrich Chemical Co., Cleveland. He served on the CODASYL Systems Committee from 1963 until this year and has been a member of the CODASYL Data Base Task Group since 1970. His B.S. in chemical engineering is from Cleveland State Univ." (p47.)

CODASYL, meanwhile, had not been dragging its feet. Between 1969 and 1971, it compiled two major statements of database principles (CODASYL, 1969, 1971; subsequently incorporated into ANSI/SPARC, 1976), inspired by the single central axiom that the internal complexities of a database should at all times remain totally "transparent" to the end-user. A DBMS, in other words, should allow users to concentrate upon their data rather than upon the tool they happened to be using to view it. This transparency was eventually obtained by implementing the data model in three time-separated sub-stages, each separately programmed, and each passing critical output to the one following. These three stages were as follows .....

(1) Set Up a "Database Schema": The first step is to convert the data model into a physically equivalent set of declarations and descriptions known collectively as a "database schema". Unlike the data model, however, the database schema is now in a form which can be stored within, and manipulated by, the DBMS. This is a more technical view of the data than hitherto, and constitutes the first major step in bridging the gap between the data as the user knows it and the hardware on which it is eventually to be stored.

(2) Set Up Database "Subschemas": The second step is to create a "departmental" view of the data. This is another technical view, and reflects the fact that no single application program will ever need access to all the available data. This, of course, is where the sharing of the common pool of data is enabled. Each individual end-user - and that includes even the most senior executives - only needs access to a fraction of the total available data, and for him/her to be shown too much is at best inefficient, and at worst a breach of system security punishable by civil or criminal law (or both). This "need to know" facility is provided by subsets of the schema known as "subschemas", each one allowing an individual application program to access only the data it is legitimately concerned with.

(3) Set Up Database "Storage Schemas": The third and final step is to create a "machine level" view of the data. This is achieved by declaring what is known as a "storage schema" to the DBMS, which the DBMS then uses to translate every user-initiated store and retrieve instruction into a set of equivalent physical store and retrieve instructions.
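The relationship between a schema and its subschemas can be illustrated with a small sketch. Everything here is hypothetical (the record types, the field names, and the functions validate_subschema and view); it shows only the "need to know" principle, and makes no attempt to reproduce DBTG schema syntax or the storage schema stage.

```python
# Schema: every record type and field known to the (imaginary) DBMS.
SCHEMA = {
    "Employee": ["emp_no", "name", "salary", "dept_no"],
    "Department": ["dept_no", "dept_name"],
}

# Subschema: the slice of the schema one application program may see.
PHONE_LIST_SUBSCHEMA = {
    "Employee": ["emp_no", "name"],
    "Department": ["dept_no", "dept_name"],
}


def validate_subschema(schema, subschema):
    """A subschema may only name record types and fields found in the schema."""
    for record_type, fields in subschema.items():
        if record_type not in schema:
            raise ValueError(f"unknown record type: {record_type}")
        unknown = set(fields) - set(schema[record_type])
        if unknown:
            raise ValueError(f"unknown fields in {record_type}: {sorted(unknown)}")
    return True


def view(record_type, record, subschema):
    """Return only those fields of a record which the subschema exposes."""
    allowed = subschema.get(record_type)
    if allowed is None:
        raise PermissionError(f"{record_type} is not in this subschema")
    return {field: record[field] for field in allowed}


validate_subschema(SCHEMA, PHONE_LIST_SUBSCHEMA)
stored = {"emp_no": 1, "name": "Ann", "salary": 30000, "dept_no": "D1"}
print(view("Employee", stored, PHONE_LIST_SUBSCHEMA))
```

The salary field never reaches the phone list program, which is the whole point of the exercise.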

Schubert's principal developers were Vaughn Austin, Ken Cunningham, Jim Gilliam, Peter Karasz, and Ron Phillips. From the outset, the product was highly compatible with DBTG recommendations (not surprisingly, Karasz explains, when you consider that Schubert was a member of CODASYL and had an advance copy of the DBTG's April 1971 database specifications). The working principles of the Goodrich product were as follows .....

Technical Aside - IDMS Internals: IDMS databases are organised into "realms" of sequentially numbered "pages" of known and identical capacity (in bytes). Each page contains a small header index, followed by up to 256 separate records, or "lines", each with its own unique page-and-line "database key". These records are typically organised into sets by one or more "chain pointers" [actually more database keys] concatenated into the total record length. Each page can be transferred on a random access basis as a data block by a single disk read or write. Database realms are typically very large, having many pages, large pages, or both, and so the art of traversing them efficiently is (a) to establish what is known as an "entry point", that is to say, a suitable starting page, and (b) to access no more pages than is absolutely necessary. Here, enhanced from Bachman (1973), are the six most commonly used traversal options: (1) A search can be started at the beginning of a database realm, and then proceed line by line within page by page until there are no more records to examine. This will retrieve records in strict database key sequence with no reference to record type. It therefore retrieves all possible records, and the order in which it retrieves them will to all intents and purposes be random. (2) A search can be conducted by the aforementioned database key, the known permanent address of the record in question. This will retrieve the record at the specified line and page, again without regard to its record type. This option can be used in conjunction with option (1) to begin a realm sweep from part way through [this facility might be useful, for example, if restarting a full sweep after an interruption]. (3) It is also possible to retrieve the record at a specified line and page by using the "database currency" mechanisms provided [details]. The currency tables maintained by the DBMS allow the last accessed record of a particular type or in a particular set to be re-accessed by its database key without the programmer needing to know that database key explicitly [that is to say, the currency tables are a systems programming facility, and not an applications programming one]. (4) A search can be conducted by key field "hashing algorithm". This is the "direct access" method previously referred to, only it is known within the IDMS world as "CALC access", because of the calculations carried out by the hashing algorithm. The algorithm takes the contents of the specified key field, performs a mathematical conversion of the component characters, and comes up with a number between 1 and the number of pages in the realm. Since exactly the same algorithm had been used when the record was originally stored, we now know exactly where to go to get it back [Karasz credits Vaughn Austin with having perfected the IDMS hashing algorithm]. The DBMS software can then retrieve the specified page, and rapidly scan down it for the matching line. (5) Using one of the earlier methods as initial entry point, a search can then proceed via a pre-established set relationship. [Kent (1978) assesses IDMS's central innovation as "the named relationship" (p143), because that is what the IDMS set really is, and it derives, of course, from Bachman's original IDS architecture.] A set search may go in either direction around the set according to which NEXT or PRIOR pointers have been designed in. (6) Using one of the earlier methods as initial entry point, a search can also proceed using NEXT, PRIOR, or OWNER pointers to the target OWNER record. OWNER pointers are a system and storage overhead, and so would usually only be provided for very long sets distributed across many pages, when there is accordingly a response time payoff to be had [OWNER pointers take you straight to the OWNER record, rather than leaving you to get there by walking the set]. For a worked example traversal, see our e-paper on "Database Navigation and the IDMS Semantic Net", and for an alternative introduction to CODASYL internals, see Maurer and Scherbakov (2004 online).
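Traversal options (1), (2), and (4) lend themselves to a short sketch. Note the assumptions: the real IDMS hashing algorithm is not given in our sources, so CRC-32 from the Python standard library stands in for it; the realm size of 100 pages is arbitrary; and the function names (calc_page, store, fetch_calc, fetch_by_db_key, sweep) are hypothetical. Page overflow is not handled.

```python
import zlib

PAGES_IN_REALM = 100   # assumed realm size, for illustration only
LINES_PER_PAGE = 256   # "up to 256 separate records" per page

# The realm: page number -> list of (key, data) lines.
realm = {page: [] for page in range(1, PAGES_IN_REALM + 1)}


def calc_page(key, pages=PAGES_IN_REALM):
    """Convert a key field into a page number between 1 and `pages`.

    CRC-32 is a stand-in for the real algorithm, which is not reproduced here.
    """
    return zlib.crc32(key.encode("utf-8")) % pages + 1


def store(key, data):
    """Store a record on its CALC page and return its page-and-line database key."""
    page = calc_page(key)
    lines = realm[page]
    if len(lines) >= LINES_PER_PAGE:
        raise RuntimeError("page full (overflow handling not sketched)")
    lines.append((key, data))
    return (page, len(lines))   # line numbers start at 1


def fetch_calc(key):
    """Option (4): rehash the key, read one page, scan down it for a match."""
    page = calc_page(key)
    for line_no, (stored_key, data) in enumerate(realm[page], start=1):
        if stored_key == key:
            return (page, line_no), data
    return None


def fetch_by_db_key(db_key):
    """Option (2): go straight to a known page and line."""
    page, line_no = db_key
    return realm[page][line_no - 1]


def sweep():
    """Option (1): line by line within page by page, ignoring record type."""
    for page in sorted(realm):
        for line_no, (key, data) in enumerate(realm[page], start=1):
            yield (page, line_no), key, data


db_key = store("P-1001", {"description": "widget"})
print(db_key, fetch_calc("P-1001"))
```

The store and fetch routines must share the same hashing function, which is why CALC retrieval normally costs a single page read.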

The first five customers for IDMS were ACME Cleveland, Abbott Labs, General Motors, RCA, and Sperry Rand (Karasz, 1998/2004 online), but in 1973, in order to concentrate on their core business, Goodrich sold the IDMS rights to John Cullinane's Cullinane Corporation, later Cullinet Software Inc., and now part of Computer Associates. The product survives there to this day as the CA-IDMS proprietary DBMS, and continues to support many of the world's heaviest duty "on line transaction processing" (OLTP) systems [lots of history]. IDS, meanwhile, went to Honeywell in a buy-out of GE's computing division, and was then enhanced in 1974 as IDS-II.

Historical Aside - So is it "Database" or "Data Base": So which is it, one word or two? Well it certainly began life as two words, appearing as such in Head (1970), Olle (1972), Schubert (1972), and in the name of the DBTG itself, but nowadays it is certainly one word for most commentators [at time of writing, there were nearly 70 million Google hits for the one-word option but less than 4 million for the two-word option]. But basically it remains a matter of taste, and Haigh still prefers the two-word option. Rather inconsistently, therefore, current practice is to use the word "database", but to retain the two-letter abbreviation "DB" (as in DBMS). As for the term "Data Base Management System", the initials DBMS had an instant popular appeal, and usage of this acronym spread rapidly after the 1971 DBTG report (Haigh, 2004/2004 online). Nevertheless, some caution is needed, because there were many unscrupulous sales teams ready to jump on the bandwagon, thus: "The term [was] applied retroactively to some existing systems, and used to describe virtually every new file management system, regardless of its fidelity to the specific ideas of the DBTG" (Haigh, 2004/2004 online).

5 - The Network-Relational Schism

But clouds were looming on the network database horizon in the shape of a "flat file" implementation developed in 1969 by IBM's Edgar F. ("Ted") Codd (1923-2003).

Key Concept - The "Flat File" or "Table": A "flat file" is a computer file composed of relatively large, identically formatted, data records, which, properly indexed, is ideal for random access retrieval of uniquely keyed individual records. At heart, it is the technology of the card index tray, made digital; brilliant for "read only" applications, but guaranteed to struggle with "volatile" (i.e. rapidly changing) data. The Internet is awash with illustrations of flat file structures.

PERSONAL ASIDE: Between 1982 and 1989, the author was an IDMS database designer and applications programmer, and found the product both versatile and robust once you got used to it. It was admirably suited to systems needing to update volatile data, such as booking systems, banking, and logistics. Interrogation-only systems (e.g. marketing data) are better approached with a Codd-style tabular system. The reason we have concentrated so intensely on the internals of the CODASYL-type database is that it sets up what are, in effect, semantic networks, just like those we introduced in Part 4 (Section 4.2), and this functionality is currently in wide demand in the artificial intelligence world.

Codd had joined IBM in 1949, and served time on the SSEC and Stretch teams before switching to research into methods of data management. Here is how Hayes (2002/2003 online) explains what happened next .....

"Meanwhile, IBM researcher Edgar F. ('Ted') Codd was looking for a better way to organise databases. In 1969, Codd came up with the idea of a relational database, organised entirely in flat tables. IBM put more people to work on the project, codenamed System/R, in its San Jose labs. However, IBM's commitment to IMS kept System/R from becoming a product until 1980. [//] But at the University of California, Berkeley, in 1973, Michael Stonebraker and Eugene Wong used published information on System/R to begin work on their own relational database. Their Ingres project would eventually be commercialised by Oracle Corp., Ingres Corp. and other Silicon Valley vendors. And in 1976, Honeywell Inc. shipped Multics Relational Data Store, the first commercial relational database."

From the outset, the unique selling proposition of the "relational database" (RDB) was that it was quicker to set up and easier to maintain than its DBTG rivals, but what happened next is a prime example of how words can often unintentionally misinform. The word in question is "relational", and the nature of its misuse was that the RDB manufacturers allowed (and perhaps even encouraged) the perception that RDB was synonymous with well-designed.

ASIDE: Within British Telecom, at least, it was not difficult to take large conventional files of data and load them into simple flat file databases, whereupon it was then only a matter of minutes before the system was responding to its first ad hoc interrogations. It was keeping the data up-to-date which was the problem.

As a result, the flat file implementations became so easy to market that everybody bought one, including many for whom the technology was entirely inappropriate because they had data update requirements as well. The fact that the network database was equally tightly grounded in the analysis of entities and their relationships - and had been since the very first Bachman Diagram had been drawn - was conveniently overlooked. [For a detailed comparison of the two technologies, see Michaels, Mittman, and Carlson (1976), and for a balanced criticism of RDBs from within IBM, see Borkin (1980).]

ASIDE: In our 1982 to 1989 system, we interfaced our central heavy duty IDMS database with a number of ancillary flat file systems, thus largely separating the update transactions from the enquiries. The network system handled the primary second-by-second updates, the big prints, and the routine small enquiries, whilst the flat file systems handled the ad hoc enquiries and analyses using a structured query language. Each suite of programs therefore played to its inherent strengths, to the ultimate benefit of the company. Here is Olle again, who had foreseen precisely this problem nearly two decades earlier: "The arguments which were raging during the years 1967 and 1968 reflected the two principal types of background from which contributors to the data base field came. People like Bachman [.....] epitomised the manufacturing environment and they saw the need for the more powerful structures which IDS [and similar systems] offered. Others, [including] myself had seen the need for easy to use retrieval languages which would enable easy access to data by non-programmers." (Op. Cit., p3.) <<AUTHOR'S NOTE: The supreme irony may yet prove to be that current attempts to build semantic networks for artificial intelligence applications [a volatile OLTP environment if ever there was one] using relational technology are bogging down in precisely the same problems of complexity that the relational people accused the network people of 30 years beforehand. It may or may not be relevant that Lehmann's (1992) microscopically thorough review of semantic network applications in artificial intelligence contains in its 745 substantive pages not a single reference to Bachman, the Bachman Diagram, CODASYL, or IDMS, despite having acknowledged on page 1 that networks are "a convenient way to organise information in a computer or database". There is plenty on databases, to be sure, but mainly the hierarchical and relational types.>>

6 - From the Entity-Relationship Diagram to er... the Entity-Relationship Diagram

Although there was often bitter squabbling between the DBTG and RDB people about the relative merits of their respective products, there was one thing that both camps agreed upon, and that was the need for a meticulously thorough entity-relationship analysis at the data modelling stage. Whether you were looking at the most intricate of data networks or at the tallest and widest of flat files, you still needed to know what data elements clustered on what other data elements. The next key player in the database story was another IBM researcher, Peter P. Chen .....

Biographical Aside - Peter P. Chen (): See Chen's Louisiana State University homepage for a biography.

..... who in 1976 gave us the "Entity-Relationship Diagram" (ERD) as we most commonly see it today .....

Key Concept - The Entity-Relationship Diagram: "The Entity-Relationship Model is a data model for high-level descriptions of conceptual data models and it provides a graphical notation for representing such data models in the form of entity-relationship diagrams. Such data models are typically used in the first stage of information system design and are used for example to describe information needs and/or the type of information that is to be stored in the database [.....]. The modelling technique, however, can be used to describe any ontology (i.e. an overview and classification of used terms and their relationships) for a certain universe of discourse (i.e. an area of interest). In the case of the design of an information system that is based on a database, the conceptual model is at a later stage, usually called logical design, mapped to a logical data model, such as the relational model, which in turn is mapped to a physical model during physical design." (Wikipedia, 2004 online; bold emphasis original.)

And here is Chen's own subsequent account of what he did .....

"There were several competing data models that had been implemented as commercial products in the early [1970s]: the file system model, the hierarchical model (such as IBM's IMS database system), and the Network model (such as Honeywell's IDS database system). The Network model, also known as the CODASYL model, was developed by Charles Bachman, who received the ACM Turing Award in 1973. Most organisations at that time used file systems, and not too many used database systems. [Then] in 1970 the relational model was proposed, and it generated considerable interest in the academic community. It is correct to say that in the early '70s most people in the academic world worked on relational model instead of other models. One of the main reasons is that many professors had a difficult time to understand the long and dry manuals of commercial database management systems, and Codd's relational model [was] written in a much more concise and scientific style." (Chen, 2002/2004 online; Section 2.1.)

But as we saw in Figure 2, the Bachman Diagram is itself an entity-relationship diagram, so what we actually have here is 15 "lost years" (1961 to 1976) in which Bachman's seminal role in developing the entity-relationship network, GE's IDS, Goodrich's IDMS, and all the DBTG-compliant systems by then in operation, suddenly became academically invisible in the service of Mammon. So, lest we perpetuate this confusion, we shall be working to the following naming standards for the remainder of this paper .....

Bachman Diagram (or erd, in small letters) = Bachman's 1961 entity-relationship diagram.

ERD (in big letters) = Chen's 1976 entity-relationship diagram.

"Data structure diagram" or entity-relationship diagram (unabbreviated) = either/both, or the generic practice.

So successfully did the 1976 ERD do the job of sketch-mapping the physical world, and so quickly did it wring productive work out of the newcomers being sucked into the ever-expanding database industry, that it rapidly became the industry standard method. To get some idea of the RDB products currently on offer, and their fundamental reliance on an ERD to point them in the right direction, see Comsys Information Technology Services, the Database Design Studio, or the Schreyer Institute for Innovation in Learning (who, like many others across education, prefer the title "concept map" for their data structure diagrams). The DBTG systems, by contrast, have been relegated to what is known as "legacy system" duty, that is to say, they are doing what they were originally built to do, and they will continue in this "Old Faithful" role until someone invents a better way of handling online update transactions.

ERDs differ from Bachman Diagrams in three main respects, namely (a) that relationships are now shown as diamonds, (b) that there is no longer any attempt to add arrowheads or crows' feet symbols at the "many" end of the relationship links, and (c) that all attributes are now shown on the diagram itself rather than in the supporting documentation [see the ovals on the specimen diagram below]. Using these conventions, an example of a simple ERD is given in Figure 3.

Figure 3 - A Simple ERD: Here is our Figure 2 Bachman Diagram recast as an ERD. As noted above, the main changes are as follows: (1) the relationships are shown as diamonds, (2) there is no longer any attempt to add arrowheads or crows' feet symbols at the "many" end of the relationship links, and (3) attributes are now shown clustered on their owning entities. There are a number of excellent specimen ERDs available online [check out Google - there are hundreds].

If this diagram fails to load automatically, it may be accessed separately at

http://www.smithsrisca.co.uk/DMODELS-fig3.gif

Commercially speaking, Codd and Chen were very much in the right place at the right time, therefore, and IBM were very astute to have put them there. Yet the story as the textbooks and websites now tell it usually begins with them, to the exclusion of the earlier figures. This is acceptable as sales practice (where all is fair and truth is relative), but not for academic purposes (where the source scholarship should be accurately identified), so we fully sympathise with Hitchman (2004/2004 online) when he suggests: "Every text that discusses an ER diagram should be citing Bachman as the source of the technique and should be careful to disentangle the ER model from the diagram technique".

7 - Normalising an Entity-Relationship Diagram

"Entia non sunt multiplicanda praeter necessitatem" [i.e. "One should not increase, beyond what is necessary, the number of entities required to explain anything"] (attrib. William of Ockham, early 14th century).

In practice, the task of creating a full-sized data model for a full-sized business area is complex and time-consuming in the extreme [one recent Internet discussant described a particular rather complex Bachman Diagram as "like a friggin' schematic for a nuclear power plant"]. This is because the business analysis phase of the exercise can be relied upon to turn up a profusion of entities, each one wanting to be related to every other. Which means, irritatingly enough, that the search for neatness and implementability in one's data routinely produces a dog's breakfast of a data network. Worse still, there are several configurations within the diagram which need to be managed out of the way by the creation, artificially, of yet more entities, and that means having yet more relationships to go with them. Fortunately, decades of experience with the method have given data modellers many tips on how to tease out the underlying good sense. These are known as "data normalisation" procedures .....

Key Concept - "Data Normalisation": Data normalisation is the process of removing duplications and contradictions from early drafts of a data model. Remembering how much research data will have been collected, and that different departments will almost certainly have been involved, this means coalescing entity definitions where possible, rationalising their relationships, and generally "tuning" the data structure to the demands which will eventually be made upon it.

In fact, it is necessary to carry out a number of passes (typically up to five) through the normalisation procedure, each time ironing out a particular subset of difficulties. The techniques were applied instinctively by the early data modellers, and not formally described or named until Codd (1972). The first three of these are as follows .....

(1) "First Normal Form" (1NF): This is the first attempt to simplify the draft data model, and is designed to remove repeated data fields from within a record definition, as now shown .....

EXAMPLE: A department employs many employees. It would therefore not be possible in advance to specify the maximum length of a department record if it was decided to include employee attributes on it. Far better to analyse out all the employee-relevant detail and store it instead on separate employee records. (After Oxborrow, 1989, p39.)

(2) "Second Normal Form" (2NF): This is the second pass through the normalisation procedure, and is designed to remove any attributes from a record which are not fully dependent on that record's primary key.

EXAMPLE: It would be wasteful if the employee record from (1) were to contain the department name (because it would be redundant on every employee record after the first). This field should therefore be removed to a separate department record, stored once, and cross-referenced when necessary. (After Oxborrow, 1989, p39.)

(3) "Third Normal Form" (3NF): This is the third pass through the normalisation procedure, and is designed to make non-key attributes "mutually independent".

EXAMPLE: If the employee record from (2) contained project name and project deadline detail, then these two fields would not be independent. This should be resolved by introducing a project record to contain project-dependent data, and again to cross-reference it when necessary. (After Oxborrow, 1989, pp39-40.)
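The three examples above can be run together as a single sketch. It follows the handout's department-employee-project examples rather than the formal definitions of the normal forms, and the field names, the sample data, and the normalise function are all our own inventions.

```python
# Draft record: employee detail repeated inside the department record, with
# project detail repeated in turn on every employee entry.
draft = {
    "dept_no": "D1",
    "dept_name": "Stores",
    "employees": [
        {"emp_no": 1, "name": "Ann", "project": "Audit", "deadline": "2004-12-01"},
        {"emp_no": 2, "name": "Bob", "project": "Audit", "deadline": "2004-12-01"},
    ],
}


def normalise(draft_record):
    """Split one draft department record into three cross-referenced record types."""
    departments = {draft_record["dept_no"]: {"dept_name": draft_record["dept_name"]}}
    employees = {}
    projects = {}
    for entry in draft_record["employees"]:
        # Example (3): the deadline belongs to the project, so store it once there.
        projects[entry["project"]] = {"deadline": entry["deadline"]}
        # Example (1): each repeated employee entry becomes a record of its own.
        # Example (2): it carries the department number as a cross-reference,
        # and not the department name itself.
        employees[entry["emp_no"]] = {
            "name": entry["name"],
            "dept_no": draft_record["dept_no"],
            "project": entry["project"],
        }
    return departments, employees, projects


departments, employees, projects = normalise(draft)
print(departments)
print(employees)
print(projects)
```

Note that the deadline for the Audit project, which appeared twice in the draft, is now held in one place only, so it can no longer contradict itself.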

As a rule of thumb, entities should be multi-attribute but single key. This is (a) because a single attribute is usually not an entity in its own right (but is part of a larger, as yet unrecognised, entity), and (b) because having two distinct names suggests two distinct entities. In addition, entities should most definitely not be processes, and should obey Ockham's Razor as far as practicable. Similarly, relationships should avoid obvious hierarchical redundancies - if a hospital contains many wards, and a ward contains many beds, it is usually safe to leave implicit the second-order truth that a hospital must also contain many beds. Analysts should also beware of time (because many relationships which are one-to-one or one-to-many at a given moment, will prove to be many-to-many over the passage of time), and of one-to-one relationship types (because this will frequently indicate that two entities can be merged). There is also a very important and widely used standard transformation which resolves a many-to-many relationship. This transformation is frequently needed, and involves inserting an additional entity between the original two - thus breaking the original relationship into two separate parts - and then redefining the single many-to-many link as a compound of a one-to-many and a many-to-one. Finally, if normalisation results in a multiplication of entities it could well be that some form of "data abstraction" exercise might also be warranted over and above the normalisation procedure [which is where you really earn your keep, but that is another story].
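The standard transformation for a many-to-many relationship can be sketched as follows. The example is hypothetical throughout: we assume employees who may work on many projects and projects which may involve many employees, and we call the inserted entity an "assignment".

```python
from collections import defaultdict

# The many-to-many truth as first analysed: (employee, project) pairs.
pairs = [("Ann", "Audit"), ("Ann", "Billing"), ("Bob", "Audit")]

# The inserted link entity: one assignment record per pair. Each assignment
# has exactly one owning employee and exactly one owning project.
assignments = [
    {"assignment_no": number, "employee": employee, "project": project}
    for number, (employee, project) in enumerate(pairs, start=1)
]

# The two resulting one-to-many relationships.
by_employee = defaultdict(list)   # employee -> many assignments
by_project = defaultdict(list)    # project  -> many assignments
for assignment in assignments:
    by_employee[assignment["employee"]].append(assignment["assignment_no"])
    by_project[assignment["project"]].append(assignment["assignment_no"])

print(dict(by_employee))
print(dict(by_project))
```

The link entity also gives a natural home to any attribute which belongs to the pairing rather than to either party, such as the hours an employee has booked to a particular project.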

After judicious application and re-application of all these good procedures, the model can no longer be improved upon. It is then in what is known as its "fully normalised" form, and is ready to be mapped forward onto the final physical system. It is now up to the physical designers to do their bit, raising the necessary data storage and program specifications before handing those specifications on, in turn, to the programming teams to put it all together. Referring back to Figure 1, we are ready to cross from the left-side quadrants to the right-side ones, and thereby to swap our logical considerations for physical ones .....

8 - The Optionality of Physical Implementation

This section is dedicated to the late Geoff Hartnell, British Telecom Database Administrator, Area Stores Module, who took the time to explain the concept of hardware independence to me one sunny lunchtime in 1983, and who thereby brought together for the first time in my mind what I already knew about cognitive psychology and what I was being taught at the time about the logical-physical divide in database design.

Before we return to the issue of the physical implementation of a logical design, we need to reflect carefully on the specimen data network shown in Figures 2 and 3. As we have already seen, a data model presents an abstract set of truths - a formal specification of the world itself, including such commonsense truths as the fact that you need to know library users' mailing addresses in order to correspond with them! But Jacob Marley would have had to maintain just such an array of data in Ebenezer Scrooge's mid-19th century address book. Data models are thus a lot of things to a lot of people, but as yet have nothing to do with computers. They could be implemented on papyrus or tablets of clay if that is all your technology budget runs to, which is why they are invariably described as "logical", or "conceptual", or "machine-independent", depending on your preferred terminology. It is what happens next that commits us to the computer age.

What does happen next is that physical designers take the logical design they have been given, and fit it as best they can to the particular technical capabilities of a particular physical computer system. To do this, they have to devise what is known as a "first-cut" design .....

Key Concept - "First-Cut" Design: The first-cut design for a computer system is its broad initial physical design specification. It is the point at which a particular hardware range and file management strategy are decided upon - for example, making the choice between a PC or a MAC, WINDOWS or LINUX, or a network or relational DBMS. It is also the point where the "system boundary" has to be established, because few organisational models are small enough to be computerised in a single physical implementation [and take care here, because many a systems debacle can be traced back to a loosely defined system boundary at this stage in the proceedings - see Section 2(#1) of our e-paper on "Systems Thinking"].

It is when drawing up the first-cut design that true computer knowledge and experience are called for, and by the mid-1980s it was commonplace for IT training courses to include sessions detailing how to produce first-cut designs for each of the dozen or so main implementation options .....

ASIDE: We still have our British Telecom course training notes on the subject of first-cut design, and from a single data model they offer conversion routes to all major implementation options. These were the conventional serial and indexed-sequential file technologies, the CODASYL network option (IDMS), IBM's hierarchical option (IMS), and a host of relational DBMSs (such as Ingres, Adabas, DB2, etc.). BT had to cover all the angles in this way, because, as a broad church corporation, it had all the technologies in use somewhere or other.
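The idea that one logical model can be carried forward onto more than one physical option can be illustrated with a toy sketch. The code below is our own assumption-laden illustration, not a rendering of any real DBMS: the USER and LOAN entities, the sample data, and both "implementations" are hypothetical, and the two storage layouts only gesture at the relational and network styles.

```python
# Logical model (hypothetical): one USER owns many LOANs.
# Each fact is (user id, user name, loan id, book title).
logical_facts = [
    ("U1", "Marley", "L1", "Bleak House"),
    ("U1", "Marley", "L2", "Hard Times"),
    ("U2", "Scrooge", "L3", "Ledger Keeping"),
]

def build_relational(facts):
    """Relational-style option: two flat tables linked by a key VALUE."""
    users, loans = {}, []
    for user_id, name, loan_id, title in facts:
        users[user_id] = {"user_id": user_id, "name": name}
        loans.append({"loan_id": loan_id, "title": title, "user_id": user_id})
    return users, loans

def build_network(facts):
    """Network-style option: each owner record holds direct references
    to its member records."""
    owners = {}
    for user_id, name, loan_id, title in facts:
        owner = owners.setdefault(user_id, {"name": name, "members": []})
        owner["members"].append({"loan_id": loan_id, "title": title})
    return owners

def titles_relational(db, user_id):
    _users, loans = db
    return [row["title"] for row in loans if row["user_id"] == user_id]

def titles_network(db, user_id):
    return [member["title"] for member in db[user_id]["members"]]

relational_db = build_relational(logical_facts)
network_db = build_network(logical_facts)

# The same logical question gets the same answer from either physical layout.
print(titles_relational(relational_db, "U1"))  # ['Bleak House', 'Hard Times']
print(titles_network(network_db, "U1"))        # ['Bleak House', 'Hard Times']
```

The logical facts never change; only the way they are laid out for access does, which is the sense in which the choice of physical implementation is "optional".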

We are now ready to put the final touches to our previously incomplete Figure 1 ....

Figure 4 - The Sequence of Events during Structured Development: Here is Figure 1 again, only this time with the transitional box between logical and physical design filled in, and the DATABASE SCHEMA (whose role was explained in Section 4) shown. The first-cut stage of design can now be seen to be setting the constraints for the detailed design work which is about to follow. It sets what we are about to start referring to as the "computational principles" of the system in question.
	Developmental Phase
	Logical Design		Physical Design
Function	FUNCTIONAL DECOMPOSITION	FIRST-CUT PHYSICAL DESIGN	PROGRAM STRUCTURES
Data	DATA MODEL		DATABASE SCHEMA

We are also ready to extend the list of available first-cut physical implementation options by one very important new one, namely the brain, so that data modelling might henceforth be seen as a tool of cognitive science as well as of database design .....

9 - The Logical-Physical Divide in the Philosophy of Mind

"The hard problem, in contrast, is the question of how physical processes in the brain give rise to subjective experience" (Chalmers, 1995, p63).

"..... the central tenet of Marr's approach is that studying the hardware is in itself not enough. To do that is to neglect the crucially important requirement of understanding the nature of the task that the hardware is carrying out. [.....] He argued that, without this topmost level of analysis (which he called the computational theory level, and which he believed had been largely neglected by neurophysiologists and psychophysicists), we will never have a deep understanding of the phenomena and mechanisms of biological vision systems - we will never know why the hardware they possess is designed the way it is." (Frisby, 1986, p139; italics original.)

An earlier version of the paragraphs on Marr appeared in Smith (1997; Chapter 3).

We closed Section 8 by making the point that the brain - to the extent that it is a general purpose information processing architecture - will always be one candidate system amongst many for the physical implementation of a logical design. This is the essence of the logical-physical design split built into the computing industry's structured development philosophy, and it is also a popular position in cognitive psychology, being clearly seen in the theoretical writings of the late David C. Marr (1945-1980), whose position on the philosophy of mind we have elsewhere described as "modern functionalism" .....

Historical Aside - Functionalism: "Functionalism" was the name given to the philosophical doctrine that the mind's mental operations exist thanks to their practical value in satisfying the needs of a vulnerable organism in a hostile environment. The term became popular around the turn of the 19th/20th centuries from the writings of John Dewey and James Angell at the University of Chicago [full history], but did not directly address mental information processing as we would nowadays understand it. The coming of the computer age changed all this, and more recently the term has been extended to include the belief that there is value to be had from analysing cognitive processes in isolation, i.e. separated from considerations of brain anatomy. In this respect, modern philosophies of mind borrow heavily from computer science [e.g. Chalmers (1995)]. The first stirrings of modern functionalism can be seen in the early 1950s, in computer-influenced theories of attention and memory [e.g. Cherry (1953), Broadbent (1958), and Sperling (1960)], and the approach came to full fruition in the writings of David Marr, and in his notion of the computational level of cognition (see below) in particular. [Full discussion. For a nice introduction to Marr's work, we recommend McClamrock (1991/2004 online).]

In fact, functionalism in its computer-influenced form is nothing less than modern cognitive psychology's central philosophical standpoint. It is the study of what the mind is doing, rather than what the brain is doing, and it is predicated upon the presumption that the mind - once its basic principles have been determined - then needs "implementing" on a machine capable of "running" it. We have yet to establish whether Marr was aware of, say, the pioneering work on structured development methodologies being carried out by Learmonth and Burchett during the late 1970s, but if he was not, then he was independently inventing many of their principles for himself. Marr also held that the brain was not necessarily the only device capable of doing the job of cognition. Just as software in general was transferable from one hardware platform to another, so it follows that minds too, if they are software, ought to be transferable. Theoretically - to extreme functionalists, at least - one has only to analyse the processes making up human consciousness to implement it on machines other than the brain, and that would mean having machines which could become as conscious as their creators .....

ASIDE: Johnson-Laird (1987) traces similar ideas back to the writings of Kenneth Craik, Alonzo Church (1903-1995), and Alan Turing. Church was the Princeton mathematician who in 1936 stimulated the "unsolvable problems" debate, and who contributed in the 1950s to the development of recursive computing. Turing - whose story we have told elsewhere [early history; late history] - became Church's student at Princeton shortly after writing his Entscheidungsproblem paper, in which he put across the idea of the Turing Machine [details]. Craik - whose story we have also told elsewhere [click here] - was another who argued that the essence of the mind lay in its functional organisation.

All in all, Marr identified three discrete levels of analysis of cognition, the first and highest of which was the level of "process computation". Under the heading "Understanding Complex Information-Processing Systems", he analysed how we should best consider what a process might actually be .....

"The term process is very broad. For example, addition is a process, and so is taking a Fourier transform. But so is making a cup of tea, or going shopping. For the purposes of this book, I want to restrict our attention to the meanings associated with machines that are carrying out information-processing tasks. So let us examine in depth the notions behind one simple such device, a cash register at the checkout counter of a supermarket. There are several levels at which one needs to understand such a device, and it is perhaps most useful to think in terms of three of them. The most abstract is the level of what the device does and why. [Some example arithmetic is then given.] This whole argument is what I call the computational theory of the cash register. [.....] In order that a process shall actually run, however, one has to realise it in some way and therefore choose a representation for the entities that the process manipulates. The second level of the analysis of a process, therefore, involves choosing two things: (1) a representation for the input and for the output of the process and (2) an algorithm by which the transformation may actually be accomplished. [.....] This brings us to the third level, that of the device in which the process is to be realised physically ....." (Marr, 1982, pp112-114; italics original; bold emphasis added. Marr uses the phrase "to realise it" in exactly the sense that systems designers use "to implement".)

Key Concept - Computational Principles: The "computational principles" of an information processing system are its basic working principles. They consist of a number of fundamental decisions as to the system's functional and structural architectures, which then give the system its essential nature [as James Clerk Maxwell and Kenneth Craik would have had it, they specify the "particular go" of that system (Sherwood, 1966)]. <<AUTHOR'S NOTE: The main problem with explaining the workings of the mind is that nobody has yet succeeded in stating its computational principles. We know a lot about the physical implementation - the neuron - but very little about how neurons contribute towards the mind's higher functions. This is the essence of the "explanatory gap" as discussed by modern philosophers (see, for example, Chalmers, 1995).>>
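Marr's three levels, as quoted above for the cash register, can be made concrete with a small sketch. The code is ours, not Marr's: the two representations (decimal digit strings and tally marks), the function names, and the sample values are all hypothetical. The point is only that one computational theory (addition, with its defining constraints) can be realised by quite different representation-and-algorithm pairs, each of which could in turn run on different hardware.

```python
def add_decimal_digits(a: str, b: str) -> str:
    """Representation 1: decimal digit strings.
    Algorithm: right-to-left column addition with carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, digits = 0, []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

def add_tally(a: str, b: str) -> str:
    """Representation 2: tally marks. Algorithm: concatenation."""
    return a + b

def to_tally(n: int) -> str:
    return "|" * n

def from_tally(t: str) -> int:
    return len(t)

def satisfies_theory(add, encode, decode, samples):
    """Computational-theory level: WHAT must be computed, whatever the
    representation. We check the sum itself, commutativity, and zero."""
    for x in samples:
        if decode(add(encode(x), encode(0))) != x:
            return False
        for y in samples:
            if decode(add(encode(x), encode(y))) != x + y:
                return False
            if decode(add(encode(x), encode(y))) != decode(add(encode(y), encode(x))):
                return False
    return True

samples = [0, 1, 7, 99, 250]
print(satisfies_theory(add_decimal_digits, str, int, samples))   # True
print(satisfies_theory(add_tally, to_tally, from_tally, samples))  # True
print(add_decimal_digits("99", "1"))  # 100
```

The third level, physical realisation, does not appear in the code at all: it is whatever machine happens to be running the interpreter.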

But even Church, Turing, and Craik were not the first to have been interested in the progressive movement of data through the biological mind, because many of the philosophical debates of the late 19th century were concerned with what in effect, if not terminology, were the same issues. We refer specifically to the academic confrontation between "act" psychologists such as Franz C. Brentano (1838-1917) and "content" psychologists such as Wilhelm Wundt (1832-1920). As laid down in his 1874 monograph "Psychology from an Empirical Standpoint" (Brentano, 1874/1995), Brentano saw the mind busily at work classifying the momentary contents of perceptions and intentions into one or other of three fundamental classes of phenomena, namely (1) Ideating (e.g. seeing, hearing, etc.), (2) Judging (e.g. agreeing, rejecting, etc.), and (3) Loving-Hating (e.g. feeling, wishing, intending, etc.) (Titchener, 1921/2004 online). His key concept was that of Vorstellung, usually translated as "presentation" .....

Key Concepts - Vorstellung and Phenomenal Reality: The word "presentation" is Brentano's translators' rendering of Vorstellung in the original German. The word translates more fully as conceivability, image, imagination, association, and hence presentation. However there is a parallel usage of the word within a theatrical context, where it relates to the giving of a performance, or presentation in the sense of oration or display. One of the standard usages of the word "phenomenon" is "cognisable by the senses, or in the way of immediate experience; apparent, sensible, perceptible" (O.E.D.). This allowed the philosopher Immanuel Kant (1724-1804) to use the term "phenomenal reality" to refer to our internal experience of the world about us. <<AUTHOR'S NOTE: Again, we know a lot about phenomenal reality (because we experience it directly), but little about how our neurons organise themselves to make it happen that way.>>

Here, from the 1995 translation, is Brentano's core argument .....

"Psychology, like the natural sciences, has its basis in perception and experience. Above all, however, its source is to be found in the inner perception of our own mental phenomena. We would never know what a thought is, or a judgement, pleasure or pain, desires or aversions, hopes or fears, courage or despair, decisions and voluntary intentions, if we did not learn what they are through inner perception of our own phenomena. Note, however, that we said that inner perception [Wahrnehmung] and not introspection, i.e. inner observation [Beobachtung], constitutes this primary and essential source of psychology. These two concepts must be distinguished from one another. One of the characteristics of inner perception is that it can never become inner observation. We can observe objects which, as they say, are perceived externally. In observation, we direct our full attention to a phenomenon in order to apprehend it accurately. But with objects of inner perception this is absolutely impossible." (Brentano, 1874/1995, pp29-30; italics original; bold keywording added; square bracketing by the translators.)

"Every idea or presentation which we acquire either through sense perception or imagination is an example of a mental phenomenon. By presentation I do not mean that which is presented, but rather the act of presentation [nicht das, was Vorstellung wird, sondern den Akt des Vorstellens]. Thus, hearing a sound, seeing a coloured object, feeling warmth or cold, as well as similar states of imagination are examples of what I mean by this term. I also mean by it the thinking of a general concept, provided such a thing actually does occur. Furthermore, every judgement, every recollection, every expectation, every inference, every conviction or opinion, every doubt, is a mental phenomenon. Also to be included under this term is every emotion [.....] Examples of physical phenomena, on the other hand, are a colour, a figure, a landscape which I see, a chord which I hear, warmth, cold, odour which I sense; as well as similar images which appear in the imagination. [.....] It is hardly necessary to mention again that by 'presentation' we do not mean that which is presented, but rather the presenting of it. This act of presentation forms the foundation not merely of the act of judging, but also of desiring and of every other mental act." (Brentano, 1874/1995, pp78-80; bold keywording added; square bracketing ours. Note the now-famous word Akt.)

It was Wundt's student Edward B. Titchener who popularised the Brentano-Wundt debate as a confrontation between act and content. In Titchener (1921/2004 online), he described the two theorists as alike because (a) "by happy chance" they had published their first major book in the same year (1874), (b) they agreed to focus on phenomena [see Key Concepts panel above], (c) they both rejected the unconscious as a principle of psychological explanation, and (d) they defined "the unity of consciousness in substantially the same terms". He saw them as differing primarily over what they accepted as the subject matter for their observations: Brentano focused on the mental act, whilst Wundt focused on mental content. In spite of their many similarities, therefore, Brentano and Wundt "psychologise in different ways" (Titchener, 1921/2004 online) .....

"For Wundt, psychology is a part of the science of life. Vital processes may be viewed from the outside, and then we have the subject-matter of physiology, or they may be viewed from within, and then we have the subject-matter of psychology. The data, the items of this subject-matter, are always complex, and the task of experimental psychology is to analyse them into 'the elementary psychical processes.' If we know the elements, and can compare them with the resulting complexes, we may hope to understand the nature of integration, which according to Wundt is the distinguishing character of consciousness. [.....] His primary aim in all cases is to describe the phenomena of the mind as the physiologist describes the phenomena of the living body, to write down what is there, going on observably before him. (Titchener, 1921/2004 online; bold emphasis added.)

It was this focus on what was there, rather than on what was going on, which stopped Wundt being an Akt man. Titchener explicitly warns, however, that the distinction is often only a fine one, and that in fact you can never act without content. Nevertheless, the separation of act and content in the philosophy of mind is conceptually close to the separation of data and function in computing. Data (content) is what we store until process (act) comes along and acts upon it in some way, and, like the chicken and the egg, it is hard to see which is the more important.

10 - Data Models in Cognitive Science

"Knowledge representation is one of the thorniest issues in cognitive science. If we are to have a theory in which mental objects undergo transformations, we need to have some notation to represent these objects. The difficulty is determining what it is about a representation that amounts to a substantive theoretical claim, and what is just notation." (Anderson, 1993, p17.)

So if the logical-physical divide in computing is the same as the logical-physical divide in psychology, then what of the flagship engineering techniques - can they be used too? Specifically, is there a role for the data structure diagram in helping to devise biological database schemas? The answer, as it turns out, is not only that there is, but that considerable progress has already been made with it, and by two of the most exciting branches of cognitive science at that, namely "semantic networks" and "production systems".

We have little to say here about the semantic networks, having already written about them at length in the following papers .....

Neuropsychology Glossary	See the discussion of semantic networks in the entry for "Deep Dyslexia".
Memory Glossary	See the entry for "Semantic Memory" and follow the links.
Lecturer's Précis - Hinton, Plaut, and Shallice (1993)	Contains a sustained discussion of semantic networks by mainstream connectionist authors.
Lecturer's Précis - Morton, Hammersley, and Bekerian (1985)	Contains a sustained discussion of semantic networks by mainstream cognitive modellers.
Short-Term Memory Subtypes in Computing and Artificial Intelligence (Part 4)	Section 4.2 is dedicated to the history of semantic networks up to 1958.
Short-Term Memory Subtypes in Computing and Artificial Intelligence (Part 5)	Sections 1.9 and 3.9 are dedicated to the history of semantic networks since 1959.

As for production systems, the roots of the approach go back to Newell (1973) and Anderson and Bower (1973) .....

Key Concept - Production System: A set of computational principles and an associated processing architecture, proposed by Anderson (e.g. 1983) as the basis of all biological cognition. Combines the best of modern memory theory with some basic cybernetics and a programming language capable of producing working simulations. A package of very good things, therefore. For a full history of production architectures, see Neches, Langley, and Klahr (1987).

Biographical Aside - John Robert Anderson (1947-): [See fuller biography] John R. Anderson [academic homepage] is the R.K. Mellon Professor of Psychology and Computer Science at Carnegie Mellon University. He studied at Stanford University under Gordon Bower, and helped build the FRAN computer simulation of free recall. His 1972 dissertation included a lengthy literature review of the various forms of semantic network theory, much of which he subsequently incorporated into his first major publication, "Human Associative Memory" (Anderson and Bower, 1973). He joined Carnegie Mellon in 1978, and proceeded to make famous the ACT-series of production systems. The goal of his research is "to understand how people organise knowledge that they acquire from their diverse experiences to produce intelligent behaviour" (website).

ASIDE: Anderson and Bower's (1973) "Human Associative Memory" is a 500-page monograph on the structures and functions of long-term memory. In brief, it sets out to show how semantic memory [glossary] would have to be structured in order for a semantic network of the sort described by the associationist philosophers [glossary] to support language production. It even explicitly claimed the title "neo-associationist" to describe the modern form of this philosophy. The book begins with a historical review of the associationist position all the way back to Aristotle, notes the parallel development of artificial intelligence and cognitive psychology during the post-war years, and then combines a detailed review of the major theories of long-term memory organisation (as they stood at that time) with insights into various attempts to simulate the use of memory during higher cognition. There is a particularly useful description of Rumelhart, Lindsay, and Norman's (1972) ELINOR computer program, which adopted the "n-ary relation" as its basic building block, as shown in Figure 5 below. The authors' early thoughts on "propositional networks" [discussed in greater detail below] are contained in Chapter 7, if interested.

In practice, however, it took Anderson another ten years to turn the 1973 monograph into a more rounded product, tightly grounded in theory, but nevertheless capable of generating specific and coherent programs of empirical research. The exposition now most usually cited is Anderson (1983), as summarised by Neches, Langley, and Klahr (1987) .....

"The basic structure of problem solving programs (and their associated interpreter) is quite simple. In its most fundamental form a production system consists of two interacting data structures, connected through a simple processing cycle: (1) A working memory consisting of a collection of symbolic data items called working memory elements. (2) A production memory consisting of condition-action rules called productions, whose conditions describe configurations of elements that might appear in working memory and whose actions specify modifications to the contents of working memory." (p3; italics original.)

Anderson's theory-system is known as ACT (short for "Adaptive Control of Thought"), and has been progressively improved over the years. To start with, it was just ACT, but then a series of ACT systems codenamed ACT-A to ACT-H was developed during the 1970s (published, primarily, in Anderson, 1976). This experience led to system improvements published as ACT* (pronounced "act-star") in 1983, and it was this book which popularised Anderson's work. Still further improvements were announced in 1993 under the name ACT-R (where the R stands for "rational") .....

Technical Aside - The ACT-R Production System: ACT-R is a model of the "architecture" of human cognition. It is heavily grounded on the distinction between "declarative memory" [glossary] and "procedural memory" [glossary], thus: "There are three essential theoretical commitments one makes in the ACT-R knowledge representation. One is that there are two long-term repositories of knowledge: a declarative memory and a procedural memory. The second is that the chunk is the basic unit of knowledge in declarative memory. The third is that the production is the basic unit of knowledge in procedural memory." (Anderson, 1993, p17; italics original.) Looking back on 35 years on the case, Anderson's main lament seems to be how few researchers have managed to acquire the hands-on programming skills required to master the system [unfortunately, as with the network databases, the systems have "a forbidding aura of esoteric mystery and complexity" (Neches, Langley, and Klahr, 1987, p1), resulting in their being somewhat daunting to the uninitiated]. For specific examples of production rules, see Anderson (1989/2004 online), and for an indication of the current state of the art, see Salvucci and Lee (2003/2004 online).
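The two-part structure described by Neches, Langley, and Klahr above (a working memory, a production memory, and a simple processing cycle connecting them) can be sketched as follows. This is a deliberately stripped-down illustration under our own assumptions, not ACT*, ACT-R, or any published interpreter: working memory elements are plain strings, conditions are matched without pattern variables, and the conflict-resolution rule (fire the first production that would change working memory) is a hypothetical simplification. The tea-making rules are likewise invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Production:
    """A condition-action rule over working memory elements."""
    name: str
    conditions: frozenset              # elements that must all be present
    additions: frozenset = frozenset() # elements the action adds
    deletions: frozenset = frozenset() # elements the action removes

    def matches(self, wm):
        return self.conditions <= wm

    def fire(self, wm):
        return (wm - self.deletions) | self.additions

def run(working_memory, productions, max_cycles=20):
    """Recognise-act cycle: find a matching production, apply its action,
    repeat until nothing changes (or max_cycles is reached)."""
    wm = frozenset(working_memory)
    trace = []
    for _ in range(max_cycles):
        for p in productions:
            if p.matches(wm):
                new_wm = p.fire(wm)
                if new_wm != wm:       # hypothetical conflict-resolution rule
                    wm = new_wm
                    trace.append(p.name)
                    break
        else:
            break                      # no production changed anything: halt
    return wm, trace

production_memory = [
    Production("boil",
               conditions=frozenset({"goal: tea", "kettle: cold"}),
               additions=frozenset({"kettle: boiled"}),
               deletions=frozenset({"kettle: cold"})),
    Production("brew",
               conditions=frozenset({"goal: tea", "kettle: boiled"}),
               additions=frozenset({"tea: made"}),
               deletions=frozenset({"goal: tea"})),
]

final_wm, trace = run({"goal: tea", "kettle: cold"}, production_memory)
print(trace)             # ['boil', 'brew']
print(sorted(final_wm))  # ['kettle: boiled', 'tea: made']
```

Even in this toy form the act-content distinction is visible: the productions are the "act", the working memory elements are the "content", and neither does anything useful without the other.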

Our own interest in production systems arises from the fact that ACT practitioners routinely found themselves considering data relationships in their research, and soon adopted a form of entity-relationship diagramming of their own. When writing software to simulate the production of a sentence, for example, they had to dip constantly in and out of the mind's lexicon [glossary], not just for the words in their root form, but for the rules by which they could be linked to other words. Anderson called these "propositional networks", and we reproduce part of one of them in Figure 5 .....

Figure 5 - A Simple Propositional Network: Here is a network showing the active utilisation of knowledge during the production of a sentence. The five circles represent individual propositions, the junction points represent people nodes [e.g. "Professor Jones", top left], object nodes [e.g. "Car", centre page, bottom], and relationship nodes [e.g. "Isa", centre page, bottom], and the arrows represent the role played by the nodes in the proposition(s) in question. The central proposition is that "X bought Y", and this is mapped by a trivalent proposition, X being the Agent, Y being the Object, and BOUGHT being the relation [see mauve highlight, lower left]. The other four propositions are those which might also need to be activated to take account of context, prior experience, and "implicature" [glossary]. The diagram as a whole represents the sentence-level structure which makes sense of the individual component propositions. More complex diagrams could, of course, be used to track the paragraph-level structuring of the many tens of simultaneous propositions which need to be properly sequenced during spoken or written discourse [glossary]. There are a number of nice specimen propositional networks available online - click here, or here, or here [see Section D], if interested.

If this diagram fails to load automatically, it may be accessed separately at

http://www.smithsrisca.co.uk/DMODELS-fig5.gif

What Figure 5 shows us is how the static structures of semantic memory [the "data", or "content"] can be kicked into action [the "function", or "act"] in the expression of a complex thought, and with this focus on the successive activation of the component nodes - treating the act as just a touch more important than the content - we may perhaps acclaim Anderson as the new Brentano. However, the process will only proceed smoothly if the data has been laid out for ease of processing in the first place, which is why we have ourselves long argued that cognitive science needs the sort of data modelling skills freely available within the computing industry [see, for example, Smith (1998b)].
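For readers who think more easily in data structures than in diagrams, a propositional network of the kind shown in Figure 5 can be approximated as a set of propositions, each holding role-labelled arrows to nodes. The sketch below is our own simplification: the two propositions, the role names other than Agent, Object, and Relation, and the placeholder node "item-1" are hypothetical and do not reproduce the figure itself.

```python
# Each proposition is a set of role-labelled arrows pointing at nodes.
propositions = {
    "P1": {"relation": "BOUGHT", "agent": "Professor Jones", "object": "item-1"},
    "P2": {"relation": "ISA", "subject": "item-1", "category": "Car"},
}

# Invert the structure: for each node, which propositions does it take
# part in, and in which role? This is the static "content".
node_index = {}
for prop_id, roles in propositions.items():
    for role, node in roles.items():
        node_index.setdefault(node, []).append((prop_id, role))

def propositions_about(node):
    """Propositions in which the given node plays any role."""
    return sorted(prop_id for prop_id, _role in node_index.get(node, []))

def neighbours(node):
    """Nodes reachable by crossing a single proposition: a crude stand-in
    for one step of activation spreading outwards from the node (the "act")."""
    found = set()
    for prop_id, _role in node_index.get(node, []):
        found.update(propositions[prop_id].values())
    found.discard(node)
    return sorted(found)

print(propositions_about("item-1"))  # ['P1', 'P2']
print(neighbours("Professor Jones")) # ['BOUGHT', 'item-1']
print(neighbours("item-1"))          # ['BOUGHT', 'Car', 'ISA', 'Professor Jones']
```

How quickly such lookups run depends entirely on how the index has been laid out beforehand, which is the data modeller's point in miniature.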

ASIDE: Data modelling skills are not the only potentially valuable import from computer science. Despite some major demonstrations of computerised cognition [see the work of Seidenberg and McClelland (1989), Norris (1991), and Hinton, Plaut, and Shallice (1993)], the penetration of computing concepts and vocabulary into cognitive science has been decidedly patchy. Interested readers can find a provisional list of "missing concepts" in Section 1 of our e-paper on "Short-Term Memory Subtypes in Computing and Artificial Intelligence" (Part 7). To give praise where it is due, Baddeley (2000) has recently done good work with the concept of "buffer", Johnson-Laird (1987) has called for mechanisms of "deadlock" prevention to be included in parallel processing models, and Chalmers (1995) has daringly couched the entire consciousness debate in the language of Shannonian information theory.

11 - Conclusion

"It is a sign of the immature state of psychology that we can scarcely utter a single sentence about mental phenomena which will not be disputed by many people" (Brentano, 1874/1995, p80).

To summarise, here is our core argument, step by step .....

1. Between 1961 and 1964, the General Electric Corporation developed the IDS database management system. This system was based on the principles (a) that individual fragments of data could be stored and retrieved on a "direct access" basis, but only when (b) their "data structure" had been fully established by painstaking analysis beforehand. The IDS developers documented such data structures on "Bachman Diagrams", and the product subsequently made its way to market under a number of proprietary badges and still powers much of the heavy end of the world's on-line transaction processing industry.

2. During the 1970s, the computing industry responded to a surge of systems debacles by gradually introducing stricter controls over the systems development process. This culminated in the emergence of a number of commercially competing "structured development methodologies", and one of the principles of such methodologies was/is that process and data are fundamentally different things and should be analysed separately. The data part of the equation needs to be analysed at a logical level prior to any consideration being given to the physical implementation of the system. The results of this analysis are then set down formally as the "data model" for said system. Data modellers regularly use the Bachman Diagram (or variants thereof) to give a visual summary of their conclusions.

3. Data models purport to set down all you will ever need to know about the data in your world - how its elements must necessarily be clustered together and interrelated in order to become meaningful, and how you are then likely to have to store and/or retrieve them. But they do so in the abstract, and without reference to the hardware you end up using. It follows (a) that data models of this sort could have been drawn up BEFORE THE COMPUTER HAD BEEN INVENTED and would have looked just the same, and (b) that data modelling is as much a branch of associationist philosophy as it is an IT skill. It also follows that there is an element of "optionality" about the final choice of physical system, at both the software and hardware architecture level.

4. At the insistence of the structured development methodologies, the process of "implementing" a logical design - turning it into a physical system - takes place in two stages. Firstly, a "first-cut design" is hammered out, and then the detailed design work is carried out. The resolution of optionality occurs at the first-cut stage, when the number of candidate physical information processing systems is gradually whittled down to just one. A first-cut design thus establishes the "computational principles" of a system in the sense that Marr (1982) used that term.

5. The brain is a physical information processing system. Carefully housed in human beings known as "clerks", it has been the implementation of necessity for business systems for all but the last 130 years of civilisation, being then progressively displaced by such inventions as the cash register (1878), the electromechanical calculator (1885 to 1886), the punched card (1884 to 1890), the analog computer (1915 to 1931), and the digital computer (1931 to 1945). The thrust of artificial intelligence research since 1945 has been to simulate more and more higher cognitive processes.

In many of its activities the mind is clearly a semantic network on the move, and yet progress in semantic network simulations is routinely held back by the sheer complexity (a) of the data analysis, and (b) of the programming. Ergo, cognitive science would be well advised (a) to improve its data modelling skills, and (b) to keep broadening its grasp of computing concepts and vocabulary.

11 - References

See the Master References List

ANSWERS TO EXERCISES

Exercise 1.3 (a): You cannot do it with the existing system. The access key is <STUDENT-NAME>, and the worst thing you can do with a direct access system is to forget those keys. Perhaps there is some sort of desktop work log which you could consult. Or you could ask a colleague. Or even go and stand by the cabinets and hope that some contextual cue will spur your recollection. A typical database designer's solution would be to add a <RECENT-UPDATES> audit trail set with PRIOR pointers, so that you could browse backwards through all the recent changes. Job done!
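The audit trail idea in answer (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handout's own design: the class names UpdateEntry and RecentUpdates, the field names, and the sample student names and notes are all hypothetical. Each new entry carries a PRIOR pointer to the entry logged before it, so the clerk need only remember where the newest entry is in order to browse backwards through the rest.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class UpdateEntry:
    """One occurrence in the hypothetical <RECENT-UPDATES> set."""
    student_name: str                      # the <STUDENT-NAME> key that was touched
    note: str                              # free-text description of the change
    prior: Optional["UpdateEntry"] = None  # PRIOR pointer to the previous update


class RecentUpdates:
    """Audit trail holding only the newest entry; older ones are reached via PRIOR."""

    def __init__(self) -> None:
        self.latest: Optional[UpdateEntry] = None

    def log(self, student_name: str, note: str) -> None:
        # The new entry points back at what was, until now, the newest entry.
        self.latest = UpdateEntry(student_name, note, prior=self.latest)

    def browse_backwards(self) -> Iterator[UpdateEntry]:
        # Follow the PRIOR pointers from the newest entry to the oldest.
        entry = self.latest
        while entry is not None:
            yield entry
            entry = entry.prior


# Hypothetical sample data, logged oldest first.
trail = RecentUpdates()
trail.log("SMITH J", "address changed")
trail.log("JONES A", "module mark entered")
trail.log("EVANS M", "file created")

recent_keys = [entry.student_name for entry in trail.browse_backwards()]
print(recent_keys)  # ['EVANS M', 'JONES A', 'SMITH J']
```

The forgotten <STUDENT-NAME> key is thus recovered by walking the trail, newest first, until the relevant change is recognised.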

Exercise 1.3 (b): You cannot do it with the existing system, because there are no "superchief" records. Perhaps there is some sort of contents sheet - a sort of "Guide to our Archive" - to consult, otherwise you are going to have to go through the entire category selecting the ones you want by eye. A typical database designer's solution would be to add a <YEAR-OF-GRADUATION> entity class, with occurrences for 2001, 2002, 2003, etc., each owning its particular year's records.

Exercise 1.3 (c): Comments as for (b).
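
The owner-record solution suggested for (b) and (c) can be sketched in the same way. Again this is a minimal illustration, assuming that the task is to retrieve every student file belonging to a given year; the names student_files and year_of_graduation, the "graduated" field, and the sample data are hypothetical and do not come from the exercise itself. Each <YEAR-OF-GRADUATION> occurrence "owns" the keys of its members, so a whole year can be pulled out without scanning the entire category by eye.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical member records, filed (as in the exercise) under <STUDENT-NAME>.
student_files = {
    "SMITH J": {"graduated": 2001},
    "JONES A": {"graduated": 2002},
    "EVANS M": {"graduated": 2001},
}

# One <YEAR-OF-GRADUATION> occurrence per year, each owning its members' keys.
year_of_graduation: Dict[int, List[str]] = defaultdict(list)
for name, record in student_files.items():
    year_of_graduation[record["graduated"]].append(name)

# Retrieval now starts from the owner record rather than from a student name.
class_of_2001 = sorted(year_of_graduation[2001])
print(class_of_2001)  # ['EVANS M', 'SMITH J']
```

The owner set is built once and must then be maintained whenever a file is added or removed, which is the usual price of an additional access path.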