
The goals and principles of AUTOTYP

AUTOTYP is a large-scale research program with goals in both quantitative and qualitative typology. In quantitative typology, we are interested in detecting and explaining geographical distributions of typological features and in producing statistical estimates of universal preferences as well as of genealogical inheritance and areal diffusion potentials. In qualitative typology, we aim at a systematic analysis of the kinds of variation found in various typological domains.

AUTOTYP was developed in response to two problems faced by traditional typological databases:

Traditional databases typically rely on a static, pre-defined category list, which tends to conflict with the data as more languages are entered and which restricts the database to research in theories that completely sanction the category list. (For discussion of some of the problems arising, see our paper on the use of AUTOTYP in field linguistics.)
Traditional databases are typically integrated into a single file containing a wide variety of information, making it difficult, if not impossible, to re-use any part of this information (e.g., genetic affiliation) in other databases or to search for typological correlations across databases.

The AUTOTYP program addresses these shortcomings and proposes general design principles for the development of typological databases. For a presentation of all projects currently adopting the AUTOTYP principles, see the projects page.

The principles of AUTOTYP

Autotypology. AUTOTYP databases are autotypologizing: rather than starting with a pre-defined list of categories, AUTOTYP databases rely on an automatic generation of category lists during data input. When entering a new language, one first checks whether the previously established notions are sufficient for this language. If not, new notions are postulated in consultation with the PI(s) of the project and are carefully defined in a definition file, corresponding to but separate from the data file containing the actual data for each language. As a result, definition files reflect at any time an empirically well-supported and detailed typology of the phenomenon at hand. If necessary, previous entries are revised with more fine-grained analyses that reflect the new typology. This procedure is time-consuming in the beginning because each new type requires review (and possibly revision) of all previous entries, but after a few dozen languages, new types become less likely to emerge and the typology stabilizes. In our experience this happens after about 40 languages have been entered. The advantage of the procedure is data accuracy on a level that is impossible in databases with pre-defined typologies. If descriptive needs go beyond the revision of definition files, entirely new fields can even be added during data input. As a general design principle, AUTOTYP lets the size and complexity of the database increase rather than the complexity of coding decisions. This has the additional benefit of keeping the data compatible with a wide range of theoretical frameworks.
Precision. AUTOTYP databases strive for as detailed a breakdown of descriptive notions into unambiguous terms as possible. Notions like 'relative clause' figure only as practical labels; the actual information behind such notions is distributed over several fields (e.g., values in fields such as clause linkage type, part of speech, finiteness, and argument representation). Moreover, definition files are designed in such a way that they allow both very specific terms (e.g., 'head-marking with both stem alternation and possessor agreement', 'experiencer transitive subject') and very general terms (e.g., 'head-marking', 'subject'). The choice between specific and general terms, which is logged in prose fields, depends on the language itself or on the quality of its description. Both distributed information and variable degrees of specificity have the advantage that more precise questions can be answered while still allowing research on broad traditional categories, now defined as specific constellations of fine-grained choices. The database will also be largely immune to any theoretical move in the definition of terms, since in most cases such moves can be captured by searching for fine-grained choices in new constellations.

Exemplar-Based Method. Paradigms are often heterogeneous: some parts of a case paradigm may be highly fusional and polyexponential, other parts more isolating and mono-exponential. Also, languages often have competing constructions in one and the same structural domain, e.g., analytic tense constructions along with periphrastic ones or multiple relativization strategies. While AUTOTYP allows recording all this variation, for typological surveys it is often desirable to have only one record per language. To choose comparable data in each language, we follow what we call the Exemplar-Based Method: we choose one particular exemplar of a paradigm or structural domain, and this exemplar is identified following a standard algorithmic definition. For example, the AUTOTYP exemplar definition of TAM (tense/aspect/mood) markers is: “If any of the TAM markers differs from others in their morphological behavior (here: exponence), pick TENSE; within TENSES, pick PAST (or whatever is chiefly used for simple, independent, past time reference); if there is none, pick FUTURE. If there is no TENSE, pick the closest ASPECT equivalent of past tense (e.g. perfective aspect). If there is no ASPECT, pick that MOOD, STATUS, or EVIDENTIALITY marker that is mostly used for past time reference (e.g. realis status).”
Modularity. In order to achieve maximal flexibility in formulating queries, AUTOTYP databases distribute information over several separate, thematically defined files that are linked together in a relational network via standardized language ID codes. Each module can also function as a stand-alone database and can easily be linked to other databases, existing or new. This gives AUTOTYP databases the potential to grow in any direction without necessitating revisions of their basic design structures. Central access files do not contain information of their own but rather collect data from different modules. Any number of fields from any number of databases can thus be combined in one or several access files to allow for a large variety of possible query and browsing needs.
Connectivity. AUTOTYP database modules are linked together via numerical language ID codes that can be mapped onto other codes, such as the Ethnologue ID code. Any database, new or existing, that follows this standard will be directly compatible with AUTOTYP databases; others will need adjustments. Existing databases that are not modular can be directly linked to AUTOTYP databases, although breaking them up into narrow modules will broaden query possibilities.
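
The TAM exemplar definition quoted under the Exemplar-Based Method reads like a priority rule, so it may help to see it spelled out as code. The following is only a minimal sketch of one possible reading, not the AUTOTYP implementation: the `Marker` record, its field names, and the function `pick_exemplar` are hypothetical, the `past_reference` flag stands in for the coder's judgment about past time reference, and two cases the quoted definition leaves open (all markers behaving alike; tense markers that are neither past nor future) are resolved here by assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Marker:
    """Hypothetical record for one TAM marker of a language."""
    category: str         # "tense", "aspect", "mood", "status" or "evidentiality"
    label: str            # e.g. "past", "future", "perfective", "realis"
    past_reference: bool  # coder's judgment: chiefly used for past time reference
    exponence: str        # the morphological behavior under survey


def _first(markers: Sequence[Marker],
           test: Callable[[Marker], bool]) -> Optional[Marker]:
    """Return the first marker passing the test, or None."""
    return next((m for m in markers if test(m)), None)


def pick_exemplar(markers: Sequence[Marker]) -> Optional[Marker]:
    """Pick one exemplar marker following the quoted priority rule."""
    if not markers:
        return None
    # Assumption: if no marker differs in exponence, any one is representative.
    if len({m.exponence for m in markers}) == 1:
        return markers[0]

    tenses = [m for m in markers if m.category == "tense"]
    if tenses:
        # Within tenses: past first, then future.
        # Assumption: otherwise fall back to the first tense listed.
        return (_first(tenses, lambda m: m.past_reference)
                or _first(tenses, lambda m: m.label == "future")
                or tenses[0])

    aspects = [m for m in markers if m.category == "aspect"]
    if aspects:
        # Closest aspectual equivalent of past tense, e.g. perfective.
        return _first(aspects, lambda m: m.past_reference) or aspects[0]

    # No tense and no aspect: mood, status or evidentiality markers remain.
    return _first(markers, lambda m: m.past_reference) or markers[0]
```

Because the rule is a fixed priority order, every language in a survey is reduced to one record by the same steps, which is what makes the resulting records comparable across languages.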

The Autotypology principle requires that databases differentiate between data files, which contain records on specific issues by language, and definition files, which contain records of notions and their definitions that prove to be necessary in the data files. The two file types allow for a dual use of the database in research: the data files allow quantitative typological inquiry into statistical correlations between structural, genealogical or geographical features, while the definition files produce contributions to qualitative typology since they contain all and only notions that are cross-linguistically relevant and viable.
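
The split between definition files and data files, and the linking of modules through language ID codes, can be illustrated with a small sketch. It is an assumption-laden toy, not the actual AUTOTYP file format: the `Module` class, its methods, the function `access_view`, the module names, and the language IDs are all hypothetical, and the definitions are placeholders rather than real typological definitions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Module:
    """Hypothetical thematic module: a definition file plus a data file."""
    name: str
    # Definition file: notion -> prose definition.
    definitions: Dict[str, str] = field(default_factory=dict)
    # Data file: numerical language ID -> notion recorded for that language.
    data: Dict[int, str] = field(default_factory=dict)

    def define(self, notion: str, definition: str) -> None:
        """Add a notion to the definition file (done when a language needs it)."""
        self.definitions[notion] = definition

    def enter(self, language_id: int, notion: str) -> None:
        """Record a value; it must already be defined in the definition file."""
        if notion not in self.definitions:
            raise KeyError(f"{notion!r} is not yet defined in {self.name}")
        self.data[language_id] = notion


def access_view(*modules: Module) -> Dict[int, Dict[str, str]]:
    """Collect values from several modules for the languages coded in all of them.

    The view stores nothing of its own; it is rebuilt from the modules.
    """
    if not modules:
        return {}
    shared = set.intersection(*(set(m.data) for m in modules))
    return {
        lid: {m.name: m.data[lid] for m in modules}
        for lid in sorted(shared)
    }


if __name__ == "__main__":
    locus = Module("locus")
    locus.define("head-marking", "placeholder definition")
    locus.enter(1, "head-marking")

    exponence = Module("tam_exponence")
    exponence.define("polyexponential", "placeholder definition")
    exponence.enter(1, "polyexponential")

    print(access_view(locus, exponence))
```

In this sketch the list of notions grows only when `define` is called for a language that needs a new one, so the definition file records the typology that the data have actually required, while the data files stay joinable on the language ID alone.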


	last modified © by Balthasar Bickel and Johanna Nichols 2000