The goals and principles of AUTOTYP
AUTOTYP is a large-scale research program with goals in both quantitative and qualitative typology. In quantitative typology, we are interested in detecting and explaining geographical distributions of typological features and in producing statistical estimates of universal preferences as well as of genealogical inheritance and areal diffusion potentials. In qualitative typology, we aim at a systematic analysis of the kinds of variation found in various typological domains.
AUTOTYP was developed in response to two problems faced by traditional databases:
- Traditional databases typically rely on a static and pre-defined
category list which tends to conflict with the data as more languages
are entered and which restricts the database to research in theories that
completely sanction the category list. (For discussion of some of the problems
arising, download our paper on the use of AUTOTYP
in field linguistics.)
- Traditional databases are typically integrated into a single
file containing a wide variety of information, making it difficult if not
impossible to re-use any part of this information, e.g., genetic affiliation,
in other databases or to search for typological correlations across databases.
The AUTOTYP program addresses these shortcomings and proposes general
design principles for the development of typological databases. For a
presentation of all projects currently adopting the AUTOTYP principles, go
to the projects page.
The principles of AUTOTYP
- Autotypology. AUTOTYP databases are autotypologizing:
Rather than starting with a pre-defined list of categories, AUTOTYP
databases rely on an automatic generation of category lists during data
input. When entering a new language, one first checks whether the previously
established notions are sufficient for this language. If not, new notions
are postulated in consultation with the PI(s) of the project and are
carefully defined in a definition file, corresponding to but separate
from the data file containing the actual data for each language. Thereby,
definition files reflect at any time an empirically well-supported and
detailed typology of the phenomenon at hand. If necessary, previous entries
will be revised by more fine-grained analyses that reflect the new typology.
This procedure is time-consuming in the beginning because each new type
requires review (and possibly revision) of all previous entries, but after
a few dozen languages, new types become less likely to emerge and the typology
stabilizes. In our experience this happens after about 40 languages are
entered. The advantage of the procedure is data accuracy on a level that
is impossible in databases with predefined typologies. If descriptive needs
go beyond the revision of definition files, entire new fields can even be
added during inputting. As a general design principle, AUTOTYP lets the size
and complexity of the database increase rather than the complexity of coding
decisions. This has the additional benefit of keeping the data compatible
with a wide range of theoretical frameworks.
- Precision. AUTOTYP databases strive for as detailed
as possible a break-down of descriptive notions into unambiguous terms.
Notions like 'relative clause' figure only as practical labels; the actual
information behind such notions is distributed over several fields (e.g.
values in fields such as clause linkage type, part of speech, finiteness,
and argument representation). Moreover, definition
files are designed in such a way that they allow both very specific terms
(e.g., 'head-marking with both stem alternation and possessor agreement',
'experiencer transitive subject') and very general terms (e.g., 'head-marking',
'subject'). The choice of specific or general, which will be logged
in prose fields, depends on the language itself or the quality of its
description. Both distributed information and variable degrees of specificity
have the advantage that more precise questions can be answered while still
allowing research on broad traditional categories, defined now as specific
constellations of fine-grained choices. The database will also be largely
immune to any theoretical move in the definition of terms since in most
cases such moves can be captured by searching for fine-grained choices
in new constellations.
- Exemplar-based Method. Paradigms are often heterogeneous: some
parts of a case paradigm may be highly fusional and polyexponential, other
parts more isolating and mono-exponential. Also languages often have competing
constructions in one and the same structural domain, e.g., synthetic tense
constructions along with periphrastic ones or multiple relativization strategies.
While AUTOTYP allows recording all this variation, for typological surveys
it is often desirable to have one record per language only. In order to
choose, in each language, comparable data, we follow what we call the Exemplar-Based
Method: we choose one particular exemplar of paradigms or structural domains,
and this exemplar is identified following a standard algorithmic definition.
For example, the AUTOTYP exemplar definition of TAM (tense/aspect/mood)
markers is: “If any of the TAM markers differs from others in their morphological
behavior (here: exponence), pick TENSE; within TENSES, pick PAST (or whatever
is chiefly used for simple, independent, past time reference); if there
is none, pick FUTURE. If there is no TENSE, pick the closest ASPECT equivalent
of past tense (e.g. perfective aspect). If there is no ASPECT, pick that
MOOD, STATUS, or EVIDENTIALITY marker that is mostly used for past time reference
(e.g. realis status).”
- Modularity. In order to achieve maximal flexibility
in formulating queries, AUTOTYP databases distribute information over
a network of several separate, thematically defined files linked together
in a relational network via standardized language ID codes. Each module
can also function as a stand-alone database itself and can easily be linked
to other databases, existing or new. This gives AUTOTYP databases the potential
to grow in any possible direction without necessitating revisions of their
basic design structures. Central access files do not contain information
on their own but rather collect data from different modules. Any number of
fields from any number of databases can thus be combined in one or several
access files to allow for a large variety of possible query and browsing options.
- Connectivity. AUTOTYP database modules are linked together via numerical language ID codes that can be mapped onto other codes, such as the Ethnologue ID code. Any database, new or existing, that follows this standard will be directly compatible with AUTOTYP databases; others will need adjustments. Existing databases which are not modular can be directly linked to AUTOTYP databases, although breaking them up into narrow modules will broaden query possibilities.
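The exemplar definition for TAM markers quoted under the Exemplar-based Method reads like a decision procedure, so it may help to see it spelled out as one. The following Python sketch is only an illustration under stated assumptions: the Marker record, its field names, and the pick_exemplar function are hypothetical and not part of AUTOTYP. It also assumes that when all markers share the same exponence any of them can serve as the exemplar, and it returns None where the quoted definition is silent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Marker:
    """Hypothetical record for one TAM marker of a language."""
    label: str                    # e.g. "past", "future", "perfective", "realis"
    category: str                 # "tense", "aspect", "mood", "status", "evidentiality"
    exponence: str                # hypothetical coding of morphological behavior
    past_reference: bool = False  # chiefly used for past time reference?


def pick_exemplar(markers: List[Marker]) -> Optional[Marker]:
    """Pick one exemplar marker following the quoted TAM definition."""
    if not markers:
        return None
    # Assumption: if no marker differs in exponence, any marker is representative.
    if len({m.exponence for m in markers}) == 1:
        return markers[0]
    tenses = [m for m in markers if m.category == "tense"]
    if tenses:
        # Within tenses, pick past (or whatever chiefly has past reference).
        past = [m for m in tenses if m.past_reference]
        if past:
            return past[0]
        # If there is no past, pick future.
        future = [m for m in tenses if m.label == "future"]
        # The quoted definition does not say what to do otherwise.
        return future[0] if future else None
    aspects = [m for m in markers if m.category == "aspect"]
    if aspects:
        # No tense: pick the closest aspect equivalent of past tense.
        past_like = [m for m in aspects if m.past_reference]
        return past_like[0] if past_like else None
    # No aspect: pick the mood, status, or evidentiality marker
    # mostly used for past time reference.
    others = [
        m for m in markers
        if m.category in {"mood", "status", "evidentiality"} and m.past_reference
    ]
    return others[0] if others else None
```

Whether a marker counts as "chiefly used for past time reference" remains an analytical decision made by the coder; the sketch only fixes the order in which candidates are considered.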
The Autotypology principle requires that databases differentiate between
data files, which contain records on specific issues by language,
and definition files, which contain records of notions and their definitions that prove to be necessary in the data files. The two file types allow for a dual use of the database in research: the data files allow quantitative typological inquiry into statistical correlations between structural, genealogical or geographical features, while the definition files produce contributions to qualitative typology since they contain all and only notions that are cross-linguistically relevant and viable.
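The division of labor between definition files and data files, and the linking of modules through language ID codes, can be made concrete with a small Python sketch. It is only an illustration, not the actual AUTOTYP implementation: the class names DefinitionFile and DataFile, the link function, the numeric IDs, and the placeholder definition texts are all assumptions. The notion labels are taken from the examples given under Precision.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DefinitionFile:
    """Notions and their prose definitions, extended during data input."""
    notions: Dict[str, str] = field(default_factory=dict)

    def postulate(self, notion: str, definition: str) -> None:
        # A new notion must be defined before any language can use it.
        if notion in self.notions:
            raise ValueError(f"notion already defined: {notion!r}")
        self.notions[notion] = definition


@dataclass
class DataFile:
    """One module: a record per language ID, restricted to defined notions."""
    definitions: DefinitionFile
    records: Dict[int, str] = field(default_factory=dict)

    def enter(self, language_id: int, notion: str) -> None:
        # Entering again for the same ID revises the earlier record.
        if notion not in self.definitions.notions:
            raise KeyError(f"undefined notion: {notion!r}")
        self.records[language_id] = notion


def link(*modules: DataFile) -> Dict[int, Tuple[str, ...]]:
    """Combine modules on shared language IDs, like a central access file."""
    if not modules:
        return {}
    shared = set(modules[0].records)
    for module in modules[1:]:
        shared &= set(module.records)
    return {lid: tuple(m.records[lid] for m in modules) for lid in sorted(shared)}


# Hypothetical usage: language IDs 1 and 2 are made-up numbers.
marking_defs = DefinitionFile()
marking_defs.postulate("head-marking", "placeholder definition")
marking = DataFile(marking_defs)
marking.enter(1, "head-marking")

# A later language needs a finer notion: define it first, then enter it.
specific = "head-marking with both stem alternation and possessor agreement"
marking_defs.postulate(specific, "placeholder definition")
marking.enter(2, specific)
```

In this sketch the definition file grows only when a language requires a new notion, and the data file rejects any value that has not been defined, so the two files stay in step. The link function holds no data of its own; it only collects values from whichever modules are passed to it.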
© by Balthasar Bickel and Johanna Nichols 2000