***********************************************
Load CPDP toolbox files into R in a useful way:
***********************************************

1. [UNIX/MAC/LINUX only] If you need part-of-speech tags, insert them by
first moving mgl_ps.sh and mgl_ps.pl into the same directory as the toolbox
files and the lexicon, and then (in the shell, i.e. on the Mac through
Terminal.app):

$ path/to/toolbox_files/mgl_ps.sh path/to/toolbox_files/lexicon.db path/to/clean_toolbox_files

This creates two new subdirectories: 'path/to/toolbox_files/new' (containing
the parsed and tagged files) and 'path/to/toolbox_files/log' (containing
error logs).

2. [UNIX/MAC/LINUX only] If you need age codes, insert them by adding all
IMDI files to the new directory and then, in the shell:

$ path/to/toolbox_files/new/codeAge.sh path/to/toolbox_files/new

The resulting files will be in a new directory: path/to/toolbox_files/new/conv

3. Set up R:

> source('http://www.spw.uzh.ch/autotyp/tb.r') # getting the toolbox file reader function
> setwd('path/to/toolbox_files/new/conv') # or use cmd-D; we need to tell R where the files are

3.1 Load a single file into R:

> my_file = read.tb('my_toolbox_file.txt', format.desc = list(ref='id', EUDICOp = 'single', mph = 'morpheme', mgl = 'morpheme', lg = 'morpheme', gw = 'word', tx = 'single', eng = 'single', nep = 'single')) # Puma needs engl='single'!

Errors in glosses etc. are logged in:

> attr(my_file, 'log')

The function is flexible. For example, you can also read gloss and morpheme
strings as words, so that each row is a gloss string or morpheme string
instead of an unanalyzed word -- useful, for example, for exploring word
forms [UNIX/MAC/LINUX only]:

> my_file = read.tb(pipe("cat my_toolbox_file.txt | sed -e '/\\mgl/s/- */-/g' -e '/\\mgl/s/ *-/-/g' -e '/\\mph/s/- */-/g' -e '/\\mph/s/ *-/-/g'"), format.desc = list(ref='id', EUDICOp = 'single', lg='morpheme', gw='word', mph= 'word', mgl = 'word', tx='single', eng='single', nep='single'))

Here we first remove spaces (' *') between stems and affixes so that read.tb
can recognize stem-affix sequences as words between spaces. But be careful in
the analysis: adding lg='morpheme', as is done here, lets you access words
that have affixes, or stems, from a certain language (which is great), but it
also means that the 'word' entries will be replicated as many times as they
have morphemes. "!duplicated()" is your friend, e.g.

> my_corpus[!duplicated(my_corpus$mgl),]

NOTE: the same basic format with the embedded pipe() command is also good for
reading gloss and morpheme strings as words within whole sentences (i.e. so
that each row is one sentence of gloss strings or morpheme strings instead of
unanalyzed words), i.e. as mph='single', mgl='single' -- useful, for example,
for exploring repeating patterns and constructions in syntax; a sketch is
given below. (Alternatively, you can load the data in the standard way and
then perform complex multi-morpheme searches, as explained below.)
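For instance, a minimal sketch of this sentence-level variant, assuming the
same tier inventory as in the examples above (each row then corresponds to
one full sentence):

> my_file = read.tb(pipe("cat my_toolbox_file.txt | sed -e '/\\mgl/s/- */-/g' -e '/\\mgl/s/ *-/-/g' -e '/\\mph/s/- */-/g' -e '/\\mph/s/ *-/-/g'"), format.desc = list(ref='id', EUDICOp = 'single', lg='single', gw='single', mph='single', mgl='single', tx='single', eng='single', nep='single')) # assumed: same tier names as above; all tiers read as 'single', so each row is one sentence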
3.2 Load multiple files at once:

> files = dir() # or, for loading specific files only: files = grep('regex', dir(), value=T)
> my_corpus = do.call(rbind, lapply(files, function(x) read.tb(paste(x), format.desc = list(ref='id', EUDICOp = 'single', mph = 'morpheme', mgl = 'morpheme', lg = 'morpheme', gw = 'word', tx = 'single', eng = 'single', nep = 'single'))))

Or, again treating gloss and morpheme strings as single words
[UNIX/MAC/LINUX only]:

> my_corpus = do.call(rbind, lapply(files, function(x) read.tb(pipe(paste("cat ", x, " | sed -e '/\\mgl/s/- */-/g' -e '/\\mgl/s/ *-/-/g' -e '/\\mph/s/- */-/g' -e '/\\mph/s/ *-/-/g'", sep="")), format.desc = list(ref='id', EUDICOp = 'single', lg='morpheme', gw='word', mph= 'word', mgl = 'word', tx='single', eng='single', nep='single'))))

NOTE: for collecting all error logs, the lapply() technique doesn't work
(because it limits the output to a list of the same length as the number of
files; see the documentation of lapply). Alternative:

> files <- dir()
> for (i in 1:length(files)) { assign(files[i], read.tb(paste(files[i]), format.desc = list(ref='id', EUDICOp = 'single', mph = 'morpheme', mgl = 'morpheme', lg = 'morpheme', gw = 'word', tx = 'single', eng = 'single', nep = 'single'))) } # This may take a long time; count on about 20 minutes for 100 files...!
> logs <- lapply(files, function(x) attr(get(x), 'log'))
> logs # now prints all errors in all files, with exact information on the location and kind of each error

HINT: One often needs to count the number of words in the entire corpus
(where 'corpus' means 'a set of files in one directory'). If that's all you
need, it doesn't make sense to read the entire corpus. Use "wc" on the
toolbox files in the shell for this, or, from within R, the following
convenience function:

> source('http://www.spw.uzh.ch/autotyp/word.counter.R')
> setwd('path/to/toolbox_files')
> countwords() # returns the number of g-words; countwords(type='g') does the same
> countwords(type='p') # returns the number of p-words

(Note that reading the corpus just for counting words is actually a bit
tricky: you would have to set gw='word' or tx='word' and avoid all tiers of
type 'morpheme', because such tiers would lead to duplicates, which are not
easy to get rid of again. A sketch is given below.)
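If you nevertheless want to count words by reading the files, a minimal
sketch; this assumes that read.tb accepts a format.desc restricted to just
these two tiers:

> files <- dir()
> gw.only <- do.call(rbind, lapply(files, function(x) read.tb(paste(x), format.desc = list(ref='id', gw='word')))) # assumption: a reduced format.desc is accepted; no 'morpheme' tiers, so no duplicated rows
> nrow(gw.only) # number of g-words in the corpus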
*********************************************
Perform searches in R and display found data:
*********************************************

The following examples assume the following sample corpus:

> test = read.tb('CLLDCh4R13S04.txt', format.desc = list(ref='id', EUDICOp = 'single', agegroup = 'single', age = 'single', eng= 'single', nep='single', lg='morpheme', mph='morpheme', mgl='morpheme', gw='word', tx='word'))

For convenient data extraction on UNIX/MAC/LINUX, also do:

> source('http://www.spw.uzh.ch/autotyp/extract.r') # for displaying found data, using Jedit X; you can change the editor by changing the code of extract.r

Note: for extract() to work, you need to have GNU sed installed (gsed,
available through http://www.macports.org, possibly made easier by
http://porticus.alittledrop.com), and you need to be in the right directory
in both R *and* the shell, i.e.

> setwd('path/to/toolbox_files/new/conv')

*and*:

$ cd path/to/toolbox_files/new/conv

For help with regular expressions, see sites like
http://www.ida.liu.se/~lensa/nikolaj/egrep_for_linguists.html
http://www.regular-expressions.info/
or our own CPDP-targeted page at
http://www.uni-leipzig.de/~bickel/grep_help.txt

Note: all complex searches below can of course also be done through sequences
of simple searches: search A and store the result as x (e.g.
x = test[grep(' A ', test$mph),]), then search within x (e.g.
x[grep('B', x$mgl),]).

###################################################################

We now have a convenience function:

> source('http://www.spw.uzh.ch/autotyp/search.tb.R')

1. Morphological searches:

1.1 single item defined by content in one tier:

> search.tb(what=list('ERG'), tiers=list('mgl'), corpus=test)

1.2 single item simultaneously defined by content in two tiers:

> search.tb(what=list(c('ERG', 'ŋa')), tiers=list('mgl', 'mph'), corpus=test)

2. Syntactic searches:

2.1 two items co-occurring in one clause, defined by content in one tier:

> search.tb(what=list('ERG', '3P'), tiers=list('mgl'), test)

2.2 two items co-occurring in one clause, defined simultaneously by content
in two tiers:

> search.tb(what=list(c('ERG','-ŋa'), c('3P', '-u')), tiers=list('mgl', 'mph'), test)

(Current limitations: no more than 2 tiers; for syntactic searches, no more
than 4 items within a clause.)
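If you run into these limitations, chain simple searches instead, as in the
note above. A minimal sketch, reusing the sample corpus and the tier values
from 1.2:

> x <- test[grep('ERG', test$mgl),] # first search: all rows glossed ERG
> x[grep('ŋa', x$mph),] # then search within the result, by form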
###################################################################

Doing it all by hand:

1. Search for a single morpheme

1.1 exact search, single tier (e.g. form or gloss):

> test[test$mph=='-ko',]
> extract(test[test$mph=='-ko','ref']) # look at tokens in context
> length(test[test$mph=='-ko','ref']) # count tokens

1.2 exact search, two tiers (e.g. form and gloss):

> test[test$mph=='-ko' & test$mgl=='-NMLZ.gm',]
> extract(test[test$mph=='-ko' & test$mgl=='-NMLZ.gm','ref'])
> length(test[test$mph=='-ko' & test$mgl=='-NMLZ.gm','ref'])

1.3 regex search, single tier (e.g. form or gloss):

> test[grep('-ko',test$mph),]
> extract(test[grep('-ko',test$mph),'ref'])
> length(test[grep('-ko',test$mph),'ref'])

1.4 regex search, two tiers (e.g. form and gloss):

> test[intersect(grep('-ko',test$mph), grep('NMLZ', test$mgl)),]
> extract(test[intersect(grep('-ko',test$mph), grep('NMLZ', test$mgl)),'ref'])
> length(test[intersect(grep('-ko',test$mph), grep('NMLZ', test$mgl)),'ref'])

2. Search for multiple morphemes occurring together within clauses (i.e.
'ref' units), regardless of order:

2.1 regex search, single tier (e.g. form or gloss):

> unlist(tapply(test$mgl, list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('\\.vt', x))) paste(x, collapse=" "))) # non-matching clauses return NULL and are dropped by unlist()
> extract(names(unlist(tapply(test$mgl, list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('\\.vt', x))) paste(x, collapse=" "))))) # look at tokens in context
> length(unlist(tapply(test$mgl, list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('\\.vt', x))) paste(x, collapse=" ")))) # count tokens of matching clauses

2.2 regex search, two tiers (e.g. form and gloss):

> unlist(tapply(paste(test$mph,test$mgl,sep="#"), list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('-ce.*?#.*3nsP', x))) paste(gsub('.*?#','',x), collapse=" "))) # form-gloss pairs need to be entered as 'my_form_regex.*?#.*my_gloss_regex'!
> extract(names(unlist(tapply(paste(test$mph,test$mgl,sep="#"), list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('-ce.*?#.*3nsP', x))) paste(gsub('.*?#','',x), collapse=" ")))))
> length(unlist(tapply(paste(test$mph,test$mgl,sep="#"), list(test$ref), function(x) if (any(grepl('ERG', x)) && any(grepl('-ce.*?#.*3nsP', x))) paste(gsub('.*?#','',x), collapse=" "))))
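A handy follow-up to any of these searches is a quick frequency table, e.g.
of the forms that realize a given gloss. A sketch in plain base R, reusing
the sample corpus:

> sort(table(test[grep('ERG', test$mgl), 'mph']), decreasing=TRUE) # how often each ergative form occurs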