Last updated February 18, 2010.

The following article explains how all of the internal data structures of language, including words, can be built from nothing more than simple links and nodes.

Brainchild 5 And The Internal Data Structures Of Language.

1. The Fundamental Linguistic Link.

The internal data structures of language can be modeled using naught but links and nodes. The nodes are connecting points, and the links are directed connections between them. Each link consists of a source, a type, and a destination. Using a computer programming language such as standard C, these fundamental links can be defined as follows:

    typedef struct {
        unsigned char  typ;   /* A single byte for link type */
        unsigned short dst;   /* A two-byte integer for destination */
    } lnk_t;

The source is not stored in the structure; it is implied by the position of the link within the array that holds it.

In the human imagination it may be useful to picture such links as tiny arrows of various colors, connected at their ends to other identical arrows in various ways. In such a model, the color of the arrow would represent link type.

The human linguistic apparatus relies upon two internal data structures consisting of links and nodes, the first of which is the ONTOLOGY. The nodes of this structure are associated with (linked to) fundamental meanings equivalent to the word-sense definitions found within the pages of a dictionary. They are also connected to each other, the links between them being of the following types:

holonymy (something is part of something else)
hypernymy (something is a kind of something else)
synonymy (something is the same thing as something else)
antonymy (something is the opposite of something else)

To these may be added other negative link types as necessary for the functioning of the system, for example negative synonymy (something is NOT the same thing as something else).

Definition: semnod = semantic node.
Definition: radlink = link between two semnods.
Definition: lexlink = link from a semnod to an ordinal number representing a natural-language word.
The Brainchild 5 ontology, which is a rather large data structure, is really nothing but a very long array of the tiny data structures we have just defined as lnk_t; for as it turns out, this same tiny data structure can be used to represent both links AND nodes. Each entry in the ontology consists of an lnk_t structure representing the node itself, followed by one or more structures of type lnk_t representing the links emanating from that node. For example:

lnk.typ, number of lnk_t structures (3) in the entry for this semnod.
lnk.dst, ordinal number identifying this semnod.
lnk.typ, singular noun (this lexlink goes to a natural-language word).
lnk.dst, ordinal number representing the natural-language word.
lnk.typ, hypernym (this radlink goes to another semnod).
lnk.dst, an ordinal number identifying the hypernym of this semnod.

A reader, whether human or machine, traversing the ontology in search of the entry for a particular semantic node would look at the first entry and compare its identifying ordinal number (lnk.dst) to the number being sought. If the numbers did not match, the reader would skip to the start of the next entry by adding the value in the type field (lnk.typ) to the current position, and would go on repeating this process until either a match was found or the lnk.dst value turned out to be 0, which signals that the desired value is not to be found in the ontology.

The limited ontology being used in the current English version of Brainchild 5 is an array of 58,145 elements of type lnk_t as described above. It is a large data structure built up of many instantiations of the tiny data structure, lnk_t.

The second internal data structure of language, called the CORPUS, is considerably more complex, and yet it is also just an array of identical tiny data structures, each of which is a word.
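Before turning to the corpus, the ontology traversal just described can be sketched in C. This is only an illustration under stated assumptions, not the Brainchild 5 source: the link-type codes (LT_NOUN_SG, LT_HYPERNYM), the sample array demo_ontology, and the function names find_semnod and link_target are all invented here. The sketch also assumes that the count stored in a header includes the header itself, as the example entry of three structures suggests.

```c
#include <assert.h>
#include <stddef.h>

/* The fundamental link, as defined in the article. */
typedef struct {
    unsigned char  typ;  /* link type; in an entry header, the entry length */
    unsigned short dst;  /* destination; in an entry header, the semnod id  */
} lnk_t;

/* Hypothetical link-type codes (the article does not give the real values). */
enum { LT_NOUN_SG = 1, LT_HYPERNYM = 2 };

/* A made-up two-entry ontology, ended by a header whose dst is 0. */
const lnk_t demo_ontology[] = {
    { 3, 7 },             /* header: 3 structures, semnod 7         */
    { LT_NOUN_SG, 101 },  /* lexlink to word number 101             */
    { LT_HYPERNYM, 9 },   /* radlink to semnod 9, the hypernym of 7 */
    { 2, 9 },             /* header: 2 structures, semnod 9         */
    { LT_NOUN_SG, 102 },  /* lexlink to word number 102             */
    { 0, 0 }              /* terminator: dst == 0 ends the ontology */
};

/* Hop from header to header until the id matches or the terminator is hit.
 * Returns a pointer to the entry header, or NULL if the id is absent. */
const lnk_t *find_semnod(const lnk_t *ont, unsigned short id)
{
    assert(ont != NULL);
    for (const lnk_t *p = ont; p->dst != 0; p += p->typ) {
        if (p->dst == id)
            return p;
        if (p->typ == 0)  /* malformed header: stop rather than loop forever */
            break;
    }
    return NULL;
}

/* Return the destination of the first link of the given type in an entry,
 * or 0 if the entry has no such link. */
unsigned short link_target(const lnk_t *entry, unsigned char typ)
{
    for (unsigned i = 1; i < entry->typ; i++)
        if (entry[i].typ == typ)
            return entry[i].dst;
    return 0;
}
```

With this sample data, find_semnod(demo_ontology, 9) lands on the fourth array element after a single hop of three structures, and a search for 101 fails because only entry headers are ever compared, never the links inside an entry.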
We often think of words as sounds or written symbols, but here we are speaking of words at their most fundamental level as described in the following theorem:

Every word that is part of a coherent thought in any language is a structure consisting of two links having the same source node.

One of these links, called the SEMANTIC LINK, or SEMLINK, goes to a meaning in the ontology, and the other, which is the SYNTACTIC LINK or SYNLINK, usually goes to another word within the same sentence called the REGENT of the current word. The top word of many sentences has no regent, therefore its synlink has a type but goes nowhere.

Definition: synlink = syntactic link.
Definition: semlink = link from a word to its meaning.
Definition: semnod = semantic node, or meaning.

So it is essentially true that all the words of linear texts can also be modeled using nothing else but simple links and nodes. Yet in order to accurately model texts, there remain a few additional considerations as follows:

1. Sequencing and linearity.

If we were simply to retain the two links that compose each word and let everything else drop away, we would be left with a pure "tree structure" with no sequence or linearity whatsoever. We might let all the links hang from the synlink of the top word and just as it were blow freely in the wind. The complete meaning of the sentence would remain, but it would be hard to know how to rearrange the links and nodes back into a string of words. And it is clear from various observations we can make that this is not how language works. Some kind of sequencing or linearity must be retained, even if it is just to maintain the chronological order of a narrative. So we need to find a way to create tree structures that encode linearity.

2. Punctuation, Capitalization, Emphasis, Quotation, Apostrophe, and Contraction.
Words can be capitalized, emphasized, punctuated, quoted, contracted, and said or written in apostrophe (between parentheses); so we need a system that can encode these conditions.

3. We need a code that is maximally efficient for machines and maximally readable by machines, but readily convertible to linear text for humans.

Definition: atom = word in the subsurface structure of language.
Definition: Interlinguish = an array of atoms from which coherent text can readily be generated.

With these requirements in mind, the following C structure has been adopted for Brainchild 5:

    typedef struct {            /* The Interlinguish atom */
        unsigned char  sib;     /* Number of atoms in the subtree */
        unsigned char  syn;     /* Syntactic link type */
        unsigned char  sem;     /* Semantic link type */
        unsigned char  flg;     /* Various flags */
        unsigned short tag;     /* Semantic node identifier */
    } atm_t;

Definition: regent, or head = word that has a dependent.
Definition: dependent = word that depends from another word in the same sentence called its regent, or head.

It may not seem obvious at first, but in fact this structure represents two linguistic links emanating from the same node, arranged in such a way as to be maximally traversable by computers. Here is how it works:

The corpus, which is a large internal data structure, is actually nothing but a long array of identical elements, each of which is a tiny data structure of type atm_t. Within this array, every regent, if it has a dependent or dependents, is followed by its dependents without regard for the arrangement of words in an equivalent linear text. In other words, the arrangement of atoms in an Interlinguish array is rigorously "head first," whereas in natural languages the head, or regent, can appear anywhere according to the conventions of the particular language.
But the linear text arrangement is remembered at all times by a bit set in the flags field of the atom that should be used to generate a word just after the regent in the linear text. Thus, for example, the atoms for the sentence, "John loves Mary," would be arranged in the order, "loves John Mary," in an Interlinguish array, with the appropriate marker bit set in the atom for "Mary."

The "sib" field of each atom always contains the number of atoms in its subtree, counting the atom itself. Thus for an atom with no dependents, the sib value is 1; for an atom with two dependents that have no dependents of their own, the sib value is 3; and so on. In the example, "loves John Mary," the atom for loves would have a sib value of 3, the atom for John would have a sib value of 1, and the atom for Mary would have a sib value of 1. The mnemonic, "sib," may seem counterintuitive, but in fact this value can be added to the address of any atom in order to skip all of its dependents and go to its right sibling, which is an atom of equal rank to the right.

Definition: rank, or grammatical rank = two words that share the same regent are of equal rank.

From the above it can be seen why no destination field is necessary to complete the synlink whose type is the "syn" field of an atom. If one atom points around another, that is, if its sib value, when added to its address, points beyond that other atom, then the other atom lies within its subtree; and the nearest preceding atom that points around a given atom is the regent of that atom. This method of encoding, though seemingly convoluted, lends itself amazingly well to computer data processing, and is well based in certain points of linguistic theory which I will not go into here.

As an example, in order to search for some thought in the corpus, instead of reading laboriously through every word in the corpus, the computer can just jump from head to head to head until a head match is found, and then start matching dependents in the same fashion to see whether the whole thought matches.
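The sib arithmetic described above can be sketched in C as follows. This is a minimal illustration, not the Brainchild 5 code: the tag numbers, the FLG_AFTER bit, the sample array demo_corpus, and the function names are all assumptions made for the example, and indices are used in place of addresses for readability.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {            /* The Interlinguish atom, as defined in the article */
    unsigned char  sib;     /* Number of atoms in the subtree */
    unsigned char  syn;     /* Syntactic link type */
    unsigned char  sem;     /* Semantic link type */
    unsigned char  flg;     /* Various flags */
    unsigned short tag;     /* Semantic node identifier */
} atm_t;

/* Hypothetical values, invented for this sketch only. */
enum { T_LOVE = 10, T_JOHN = 20, T_MARY = 30, T_SEE = 40, T_DOG = 50, T_BIG = 60 };
#define FLG_AFTER 0x01      /* assumed "comes just after the regent" bit */

/* Two head-first sentences: "loves John Mary" and "sees Mary dog big". */
const atm_t demo_corpus[] = {
    { 3, 0, 0, 0,         T_LOVE },
    { 1, 0, 0, 0,         T_JOHN },
    { 1, 0, 0, FLG_AFTER, T_MARY },
    { 4, 0, 0, 0,         T_SEE  },
    { 1, 0, 0, 0,         T_MARY },
    { 2, 0, 0, FLG_AFTER, T_DOG  },
    { 1, 0, 0, 0,         T_BIG  },
};
enum { DEMO_LEN = sizeof demo_corpus / sizeof demo_corpus[0] };

/* Index just past atom i's subtree. This is its right sibling only while
 * the result is still inside the regent's subtree; callers must check. */
int right_sibling(const atm_t *a, int i)
{
    return i + a[i].sib;
}

/* The regent of atom i is the nearest preceding atom whose subtree
 * extends beyond i. Returns -1 for the top word of a sentence. */
int find_regent(const atm_t *a, int i)
{
    for (int j = i - 1; j >= 0; j--)
        if (j + a[j].sib > i)
            return j;
    return -1;
}

/* Jump from sentence head to sentence head, never visiting dependents. */
int find_head(const atm_t *a, int n, unsigned short tag)
{
    assert(a != NULL);
    for (int i = 0; i < n && a[i].sib > 0; i += a[i].sib)
        if (a[i].tag == tag)
            return i;
    return -1;
}

/* Check the direct dependents of a head in the same hopping fashion. */
int has_dependent(const atm_t *a, int head, unsigned short tag)
{
    int end = head + a[head].sib;
    for (int j = head + 1; j < end && a[j].sib > 0; j += a[j].sib)
        if (a[j].tag == tag)
            return 1;
    return 0;
}
```

In the sample data, find_regent(demo_corpus, 6) returns 5 (the regent of "big" is "dog"), and find_head(demo_corpus, DEMO_LEN, T_SEE) reaches index 3 after examining only the two sentence heads. A full thought match would repeat has_dependent recursively down each subtree; that step is left out to keep the sketch short.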
Another feature whose desirability is immediately obvious is that in order to "grab" any subtree from a large Interlinguish array, all that one has to do is find the head of that subtree, no matter whether it be the head of a sentence, a phrase, or even just a word, and by looking at the sib value of the regent, one can immediately know how many atoms to take. And if one has set up the required algorithms for printing, all one need do to print out any sentence, phrase, or word from within a large Interlinguish array is simply to pass the printing routine its address.

In short, this is an encoding method that has been optimized through years of testing, trial, and error, and its benefits, only a few of which I have mentioned here, make the trouble of learning it well worthwhile. The corpus Brainchild 5 is currently using is composed of 15,912 atoms of type atm_t, each atom being six bytes long.

Summary Of Link Types.

Brainchild 5 employs four linguistic link types:

radlink = link from one semnod to another.
lexlink = link from a semnod to a natural-language word.
semlink = link from an atom, or word, to its meaning (a semnod).
synlink = link from one atom to another or from one word to another.

Interlinguish meets all the requirements for internal representation mentioned above, and uses only six bytes per word. It is so amenable to examination by computers that several thousand sentences can be checked for various purposes within a single second, and yet it is immediately convertible to written/spoken text. Its strength lies in the fact that it has been carefully crafted to mimic the internal workings of real human language. It is limited only by the fact that certain languages, such as Greek, have interleaved dependencies, by which I mean that they do not obey the "one Interlinguish subtree per text segment" rule. It is my hope that we will be able to find some way of correcting this shortcoming as we learn more.
In the meantime Interlinguish is an excellent resource that can be applied to most of the world's languages today. And by knowing Interlinguish, it is possible to take full advantage of the linguistic foundation provided by Brainchild 5, upon which a great number of linguistic artifacts and tests might be designed with all paths leading towards the ultimate goal of true artificial intelligence.
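As a closing illustration, the earlier remarks about grabbing a subtree and printing it from nothing more than the address of its head can be sketched in C. This is a guess at one workable scheme, not the actual Brainchild 5 printing routine. In particular it assumes that dependents stored before the flagged atom are printed before the regent and that the flagged atom and everything after it are printed after the regent; the article only states that the flag marks the atom generated just after the regent. The FLG_AFTER bit, the tag numbers, and the word_for table (standing in for a lexlink lookup in the ontology) are likewise invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {            /* The Interlinguish atom, as defined in the article */
    unsigned char  sib, syn, sem, flg;
    unsigned short tag;
} atm_t;

#define FLG_AFTER 0x01      /* assumed "comes just after the regent" bit */
enum { T_LOVE = 10, T_JOHN = 20, T_MARY = 30, T_SEE = 40, T_DOG = 50, T_BIG = 60 };

/* Head-first arrays for "John loves Mary" and "Mary sees big dog". */
const atm_t demo_text[] = {
    { 3, 0, 0, 0, T_LOVE }, { 1, 0, 0, 0, T_JOHN }, { 1, 0, 0, FLG_AFTER, T_MARY },
    { 4, 0, 0, 0, T_SEE  }, { 1, 0, 0, 0, T_MARY }, { 2, 0, 0, FLG_AFTER, T_DOG  },
    { 1, 0, 0, 0, T_BIG  },
};

/* Hypothetical stand-in for following a lexlink to a written word. */
static const char *word_for(unsigned short tag)
{
    switch (tag) {
    case T_LOVE: return "loves";  case T_JOHN: return "John";
    case T_MARY: return "Mary";   case T_SEE:  return "sees";
    case T_DOG:  return "dog";    case T_BIG:  return "big";
    default:     return "?";
    }
}

/* Append a word to out, separated by a space, never overflowing cap bytes. */
static void append_word(char *out, size_t cap, const char *w)
{
    size_t used = strlen(out);
    if (used > 0 && used + 1 < cap) {
        out[used++] = ' ';
        out[used] = '\0';
    }
    strncat(out, w, cap - used - 1);
}

/* Copy a whole subtree: the head's sib value says how many atoms to take.
 * Returns the number of atoms copied, or 0 if dst is too small. */
size_t grab_subtree(const atm_t *head, atm_t *dst, size_t cap)
{
    if (head->sib > cap)
        return 0;
    memcpy(dst, head, head->sib * sizeof *head);
    return head->sib;
}

/* Append the linear text of the subtree at head to out, which must
 * already hold a valid (possibly empty) string of capacity cap. */
void linearize(const atm_t *head, char *out, size_t cap)
{
    assert(head->sib > 0);                  /* a well-formed atom counts itself */
    const atm_t *end = head + head->sib;
    const atm_t *dep = head + 1;
    for (; dep < end && !(dep->flg & FLG_AFTER); dep += dep->sib)
        linearize(dep, out, cap);           /* dependents placed before the regent */
    append_word(out, cap, word_for(head->tag));
    for (; dep < end; dep += dep->sib)
        linearize(dep, out, cap);           /* flagged dependent and those after it */
}
```

Passing the address of the atom for "dog" prints just the phrase "big dog", while passing the address of the atom for "sees" prints the whole sentence; no length argument is needed in either case because the sib value of the head bounds the work.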