Start | Recordings | MP3 listing | Transcripts | Background | Pictures | Movies

 

THE CORPUS AND WORD ORDER
The Pear-Chaplin Basque Corpus

© Jon Aske

 

 

Excerpts from Aske 1997/to appear, Chapter 2

Characteristics of the corpus

Introduction

Because of the variable nature of word order in Basque, and because of the well known difficulties involved in introspecting about the discourse pragmatic factors affecting word order, I felt that in addition to introspection, data of a different type would be needed in order to reach generalizations about the different ordering possibilities available to Basque speakers, their relative frequencies, the contexts in which they are found, and their functions, both in general and for different styles, genres, and media (speech vs. writing).

 

Previous studies of Basque word order have relied for the most part on introspection, and those which have actually looked at data in context have been either impressionistic or based on limited written sources. No previous study seems to have looked systematically at data in context (discourse). To fill this gap, I recorded samples of narrative speech from 46 speakers. These speakers can be divided into two, not exactly homogeneous, groups: one composed of children with native and near-native degrees of fluency and the other composed of highly fluent and educated adults (in college or with college degrees). In a more complete and less preliminary study it would be desirable to include a much wider sample of the population, but this will have to remain a long-term objective for the Basque linguistics community, given the amount of tedious work involved in analyzing recorded speech.

 

The topic of the sample narratives was controlled for, since the speakers were asked to recount a short silent film that they had just seen. The audience for each of these narratives consisted of a school- or work-mate of the narrator, who had not watched the film. Two films were used in this study. The first one was the Pear Film, a 7-minute silent film made specifically for the purpose of obtaining narratives for linguistic analysis, as reported in Chafe, ed., 1980. The second film was one which I made myself from clips from Charlie Chaplin’s silent movie Modern Times. The whole sequence for this movie lasted 12 minutes. The several hours of recorded tapes were transcribed and turned into a database of intonation units, all of which were coded for characteristics and properties which might be of relevance to their analysis for this particular study on word order.

 

Having different speakers recount the same objective story has the added advantage of controlling for the characters (referents) and the action of the story, allowing for a large number of inter-speaker generalizations, only some of which were taken advantage of in this study. This methodology also makes it possible to draw interesting generalizations about how different speakers communicate objectively identical situations and to see what the differences and similarities are. It also opens up the possibility of comparisons with other languages.

 

In addition to this corpus, which I will refer to from now on as the spoken corpus, I collected a smaller written Basque corpus, which I will refer to as the written corpus. This written corpus consisted of the first chapters of two recent Basque books, containing narrative as well as dialogue: Atxaga 1991, a novel, and Elexpuru 1994, a travel diary. In addition, occasionally I will refer in this study to a larger corpus of written Basque, one composed of the following three novels: Amuriza 1984, Atxaga 1988, and Lertxundi 1994. Concordance searches were made in this extended written corpus for certain uncommon constructions.

 

In the next two sections I will present some general statistics about the clausal composition of the spoken and written corpora. Next I will explore the general word order characteristics of statement clauses in the corpora from a statistical point of view, comparing the spoken and written corpora, as well as the different subcorpora among each other where applicable.

The units of the spoken corpus

The Spoken corpus was analyzed and divided into a total of 6,760 intonation units of all types. This corpus contains 3,409 finite clauses and 632 non-finite clauses, as we can see in Table 3 below.

 

| Finite clause type | Pear | Chaplin | Both | % Both |
|---|---|---|---|---|
| Main statements, affirmative | 689 | 1991 | 2680 | 78.6% |
| Main statements, negative | 22 | 109 | 131 | 3.8% |
| Completive statements, affirmative | 52 | 187 | 239 | 7.0% |
| Completive statements, negative | 4 | 25 | 29 | 0.9% |
| Head adverbial clauses | 30 | 102 | 132 | 3.9% |
| Tail adverbial clauses | 12 | 53 | 65 | 1.9% |
| Post presentative predicates | 10 | 29 | 39 | 1.1% |
| Relative clauses (finite) | 18 | 48 | 66 | 1.9% |
| Other finite clauses | 8 | 20 | 28 | 0.8% |
| TOTAL | 845 | 2564 | 3409 | 100.0% |

Table 3: Finite clauses in the Spoken corpus.

 

As we can see, the majority of these clauses are affirmative statements in main clauses (declarative assertions). Negative statements constitute a much smaller number. Other finite clauses include pre-clausal adverbial clauses, post-clausal adverbial clauses, post-presentative predicates, and relative clauses. Their relative numbers can also be seen in this table.

 

In Table 4 we can see the statistics for non-finite clauses, most of which are clause-internal complements and secondary predicates. Head and tail adverbials are the next most numerous group. Another type of head clause consists of what I have called ‘head statements’, a peculiar Basque construction expressing a ‘preparatory action’ preceding a second, main one. (‘Complements’ are subjects or objects or complements of speech verbs, for instance. ‘Predicates’ are all other complements and ‘secondary predicates’.)

 

| Non-finite clause type | Pear | Chaplin | Both | % Both |
|---|---|---|---|---|
| Predicate/complement clauses | 93 | 279 | 372 | 59.6% |
| Statements | 1 | 6 | 7 | 1.1% |
| Head adverbial clauses | 14 | 32 | 46 | 7.4% |
| Tail adverbial clauses | 49 | 89 | 138 | 22.1% |
| Head statements | 7 | 29 | 36 | 5.8% |
| Relative clause | 2 | 5 | 7 | 1.1% |
| Other | 4 | 14 | 18 | 2.9% |
| TOTAL | 170 | 454 | 624 | 100.0% |

Table 4: Non-finite clauses in the Spoken corpus.

Constituent orders in main clauses in the spoken corpus

All the clauses in the spoken and written corpora were coded for the order of constituents and a number of other characteristics, both global ones and characteristics of their component parts. Word order coding was done in terms of a number of categories. The non-clausal categories used are: verb (V), ergative ‘subject’ (A), absolutive ‘subject’ (S), absolutive object (O), dative object (I), and other constituent (X). The clausal categories are primarily: finite completive clauses (Xla), with the subordinator -la; finite adverbial clauses, such as temporal ones (Xnean), with the subordinator -nean, and causal ones (Xlako); and a variety of non-finite complement clauses: Xten, Xtea, Xtzeko, etc.
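The way an order code of this kind maps onto the four verb-position classes used in the later tables (V..., ...V..., ...V, and single V) can be sketched as follows. This is only an illustration: the list-of-labels representation and the function name `verb_position` are assumptions made for the sketch, not a description of the database actually used in the study.

```python
# Hypothetical representation: each clause is a list of category labels
# such as ["S", "X", "V"]. The labels come from the text; the data
# structure and the function name are assumptions for this sketch.

def verb_position(order):
    """Return the verb-position class of a clause with one verb token."""
    if order.count("V") != 1:
        raise ValueError("expected exactly one verb token 'V'")
    index = order.index("V")
    if len(order) == 1:
        return "V"        # single-verb clause
    if index == 0:
        return "V..."     # verb-initial
    if index == len(order) - 1:
        return "...V"     # verb-final
    return "...V..."      # verb-medial


examples = [["V"], ["V", "O"], ["S", "X", "V"], ["X", "V", "S"], ["V", "Xla"]]
for order in examples:
    print("".join(order), "->", verb_position(order))
```

Clausal labels such as Xla are kept as single tokens, so they never clash with the verb label V.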

 

From this rich coding, about 140 different orderings were found among affirmative statements in the spoken corpus, many of them obviously small variants of each other and most of them with very low frequency. In Table 7 we can see the types and frequencies of the 28 most common orders in three types of clauses: main affirmative statements, dependent affirmative statements (completive clauses), and non-finite clauses. In addition, and for purposes of comparison, in the last column I have added the statistics for those same orderings for affirmative statements in the written corpus. The 28 orders account for 83.4% of all the clauses in the first group, about 90% of the second and third groups, and about ¾ of the written statements.

 

 

 

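| # | Order | Main aff. statements | % | Completive aff. statements | % | Non-finite clauses | % | Main aff. statements, written | % |
|---|---|---|---|---|---|---|---|---|---|
| 1 | V | 305 | 11.4% | 20 | 8.4% | 210 | 33.7% | 14 | 2.2% |
| 2 | OV | 137 | 5.1% | 13 | 5.4% | 151 | 24.2% | 55 | 8.6% |
| 3 | VO | 233 | 8.7% | 9 | 3.8% | 15 | 2.4% | 11 | 1.7% |
| 4 | SV | 146 | 5.4% | 21 | 8.8% | 5 | 0.8% | 30 | 4.7% |
| 5 | VS | 159 | 5.9% | 9 | 3.8% | 0 | | 3 | 0.5% |
| 6 | VX | 275 | 10.3% | 8 | 3.3% | 20 | 3.2% | 9 | 1.4% |
| 7 | XV | 234 | 8.7% | 61 | 25.5% | 99 | 15.9% | 122 | 19.2% |
| 8 | AV | 47 | 1.8% | 12 | 5.0% | 9 | 1.4% | 5 | 0.8% |
| 9 | VA | 17 | 0.6% | 1 | 0.4% | 0 | | 1 | 0.2% |
| 10 | AVO | 46 | 1.7% | 6 | 2.5% | 0 | | 11 | 1.7% |
| 11 | AOV | 14 | 0.5% | 2 | 0.8% | 4 | 0.6% | 12 | 1.9% |
| 12 | VAO | 9 | 0.3% | 0 | | 0 | | 0 | |
| 13 | VOA | 1 | 0.0% | 0 | | 0 | | 0 | |
| 14 | OVX | 30 | 1.1% | 0 | | 7 | 1.1% | 33 | 5.2% |
| 15 | VOX | 29 | 1.1% | 0 | | 5 | 0.8% | 0 | |
| 16 | VXO | 15 | 0.5% | 1 | 0.4% | 0 | | 0 | |
| 17 | XVO | 21 | 0.8% | 0 | | 2 | 0.3% | 14 | 2.2% |
| 18 | SVX | 88 | 3.3% | 8 | 3.3% | 0 | | 21 | 3.3% |
| 19 | SXV | 58 | 2.2% | 11 | 4.6% | 3 | 0.5% | 44 | 6.9% |
| 20 | VSX | 41 | 1.5% | 1 | 0.4% | 2 | 0.3% | 1 | 0.2% |
| 21 | XVS | 39 | 1.5% | 24 | 10.0% | 0 | | 34 | 5.3% |
| 22 | VXla | 78 | 2.9% | 1 | 0.4% | 7 | 1.1% | 6 | 0.9% |
| 23 | XlaV | 1 | 0.0% | 0 | | 1 | 0.2% | 7 | 1.1% |
| 24 | VXtzen | 90 | 3.4% | 2 | 0.8% | 1 | 0.2% | 0 | |
| 25 | XtzenV | 27 | 1.0% | 3 | 1.3% | 2 | 0.3% | 7 | 1.1% |
| 26 | SVXtzen | 19 | 0.7% | 3 | 1.3% | 0 | | 1 | 0.2% |
| 27 | V: | 52 | 1.9% | 0 | | 5 | 0.8% | 0 | |
| 28 | XVX | 23 | 0.8% | 1 | 0.4% | 4 | 0.6% | 38 | 6.0% |
| | Total | | 83.4% | | 90.8% | | 88.5% | | 75.3% |
| | Others | 446 | 16.6% | 22 | 9.2% | 72 | 11.5% | 157 | 24.7% |
| | ALL | 2680 | 100.0% | 239 | 100.0% | 624 | 100.0% | 636 | 100.0% |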

Table 7: The 28 most common word orders in the spoken corpus, for main affirmative statements, dependent affirmative statements (completive clauses), and all affirmative non-finite clauses; and for main affirmative statements in the written corpus.

 

As we can see in Table 7, the vast majority of clauses consist of only a verb or a verb plus one additional complement. Thus, for instance, 11.4% of main affirmative clauses consist of just a verb, 13.8% consist of a verb and a direct object, 11.3% consist of a verb and an intransitive subject, and 19% consist of a verb and one other complement (not S or O). These clauses constitute 55.5% of all the affirmative statements. Also, among the many other observations we can make, we see that although clauses with more than two complements after the verb (in the same intonation unit) do exist, they do not make it into this list.
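The shares quoted in the paragraph above can be recomputed from the raw counts in Table 7, as in the following sketch. The variable and function names are illustrative only. Because the text adds up percentages that were already rounded, figures recomputed from the counts may differ from the quoted ones by a tenth of a percentage point.

```python
# Counts for main affirmative statements in the spoken corpus,
# copied from the first count column of Table 7.
TOTAL_MAIN_AFFIRMATIVE = 2680
counts = {"V": 305, "OV": 137, "VO": 233, "SV": 146, "VS": 159,
          "VX": 275, "XV": 234}


def share(*orders):
    """Percentage of main affirmative statements with the given orders."""
    return 100 * sum(counts[o] for o in orders) / TOTAL_MAIN_AFFIRMATIVE


groups = {
    "verb only": ("V",),
    "verb + direct object": ("OV", "VO"),
    "verb + intransitive subject": ("SV", "VS"),
    "verb + other complement": ("XV", "VX"),
}
for label, orders in groups.items():
    print(f"{label}: {share(*orders):.1f}%")

combined = share(*counts)  # all seven orders together
print(f"combined: {combined:.1f}%")
```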

 

Many of the orderings in Table 7 are arranged in pairs, triplets, or even quadruplets, to allow us to see the relative frequency of preverbal vs. postverbal placement of particular types of complements in different combinations. Concentrating first on the spoken corpus, we can see in lines 2 and 3, for example, that finite VO clauses are more common than OV ones, but that the reverse is true for non-finite clauses (see also lines 10-17 for the order of the O). We can also see that VS order is slightly more common than SV order in main clauses, but much less common in non-main clauses (whether finite or non-finite). In other words, subject inversion is found almost exclusively in main clauses (see also lines 24-25). We also see in 8-9 and 10-11 that ergative subjects are much less likely to invert than absolutive (intransitive) ones.

 

We can also see that ordering patterns, or ‘preferences’, differ somewhat in the written and spoken corpora. Thus for instance, subject inversion is very rare in the written corpus. In the written corpus, XV is the single most common order by far, and, in general, verb-final clauses are significantly more numerous than in speech.

 

As for the order of complement clauses with respect to the verb in speech, we can see two types of complement in Table 7: (a) Xla: finite completive clauses with the -la complementizer; and (b) Xtzen clauses: non-finite imperfective clauses, the most common type of non-finite clause, which can act as objects, setting adverbials, or secondary predicates, much like -ing clauses in English. In speech, the completive complements are almost exclusively postverbal, for all clause types (cf. 22-23), although very few dependent clauses have such complements. The non-finite clauses are also much more common postverbally in main and finite dependent clauses (cf. 24-26). In writing, on the other hand, complement clauses are much more likely to be preverbal than in speech, particularly non-finite ones.

 

In Table 8 below we can see the frequencies with which the verb is clause-initial, clause-medial, clause-final, or alone in finite and non-finite clauses in the spoken corpus. Post-verbal elements are found in a majority of main clauses: 60.2% overall (68% if we discount the 11.4% of clauses which are single verbs). Verb-final clauses, however, are a majority for all other clause types: completive clauses, with 56.9%, other finite dependent clauses, with 59.8%, and non-finite clauses, with 52.3%. If we add single-verb clauses, which naturally are also verb-final, the proportion of verb-final clauses increases to 65.3%, 68.1%, and 86.1% respectively (but still only 39.8% of main clauses).

 

 

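| Verb position | Main statements | % | Completives | % | Other finite | % | Non-finite | % |
|---|---|---|---|---|---|---|---|---|
| V... | 1164 | 43.4% | 36 | 15.1% | 38 | 14.3% | 67 | 10.8% |
| ...V... | 451 | 16.8% | 47 | 19.7% | 47 | 17.7% | 19 | 3.1% |
| ...V | 760 | 28.4% | 136 | 56.9% | 159 | 59.8% | 325 | 52.3% |
| V | 305 | 11.4% | 20 | 8.4% | 22 | 8.3% | 210 | 33.8% |
| Total | 2680 | 100.0% | 239 | 100.0% | 266 | 100.0% | 621 | 100.0% |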

Table 8: Single-verb, verb-initial, verb-final, and verb-medial affirmative clauses in the spoken Basque corpus by clause type: main (finite) statements, completive (finite) statements, other dependent finite clauses, and non-finite clauses.
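The derived figures discussed for Table 8 (clauses with postverbal material, and verb-final clauses counted together with single-verb clauses) can be sketched from the table's counts as follows. The dictionary layout and names are assumptions for illustration. Since the text sums rounded percentages, values recomputed from counts may be off by a tenth of a point.

```python
# Counts copied from Table 8 (spoken corpus, affirmative clauses).
table8 = {
    "Main statements": {"V...": 1164, "...V...": 451, "...V": 760, "V": 305},
    "Completives":     {"V...": 36,   "...V...": 47,  "...V": 136, "V": 20},
    "Other finite":    {"V...": 38,   "...V...": 47,  "...V": 159, "V": 22},
    "Non-finite":      {"V...": 67,   "...V...": 19,  "...V": 325, "V": 210},
}

postverbal = {}        # clauses with material after the verb
verb_final_broad = {}  # verb-final clauses plus single-verb clauses
for clause_type, row in table8.items():
    total = sum(row.values())
    postverbal[clause_type] = 100 * (row["V..."] + row["...V..."]) / total
    verb_final_broad[clause_type] = 100 * (row["...V"] + row["V"]) / total
    print(f"{clause_type:16} postverbal {postverbal[clause_type]:5.1f}%  "
          f"verb-final or single verb {verb_final_broad[clause_type]:5.1f}%")

# Postverbal share of main statements once single-verb clauses are set aside.
main = table8["Main statements"]
multiword_total = sum(main.values()) - main["V"]
postverbal_multiword = 100 * (main["V..."] + main["...V..."]) / multiword_total
print(f"main statements, excluding single verbs: {postverbal_multiword:.1f}%")
```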

 

The larger percentage of verb-initial clauses in main (‘root’) clauses is probably due in part to the fact that existential-presentational and other thetic clauses are often (though not always) verb-initial in speech (but not in writing), and these clauses are almost exclusively main clauses (see chapters 3 and 4). Also, the relatively larger number of verb-final dependent clauses, especially non-finite clauses, is no doubt due in great part to the quite strong verb-final constraint that many of these clauses, the non-asserted ones, have, a constraint that is somewhat weakened in speech and more so in the speech of some speakers than others. Some finite dependent clauses, however, are not strictly verb-final, even in the standard written language, as we have already seen.

 

As we can surmise from Table 7, the written corpus has a greater proportion of verb-final utterances than spoken Basque. In Table 9 we can see the statistics for the different positions of the verb for different clause-types in the written corpus.

 

 

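| Verb position | Affirmative statements | % | Affirmative completives | % | Other finite | % | Non-finite | % |
|---|---|---|---|---|---|---|---|---|
| V... | 52 | 8.0% | 0 | | 0 | | 0 | |
| ...V... | 253 | 38.7% | 14 | 21.5% | 8 | 9.4% | 0 | |
| ...V | 333 | 51.0% | 49 | 75.4% | 66 | 77.6% | 286 | 80.1% |
| V | 15 | 2.3% | 2 | 3.1% | 11 | 12.9% | 71 | 19.9% |
| Total | 653 | 100.0% | 65 | 100.0% | 85 | 100.0% | 357 | 100.0% |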

Table 9: Single-verb, verb-final, verb-medial, and verb-initial clauses in the written corpus by clause type: main (finite) statements, completive statements, other dependent finite clauses, and non-finite clauses.

 

In main affirmative statements, the verb is clause-final more than half of the time, significantly more often than for any of the spoken corpus groups (28.4% on average, cf. Table 13 below). The percentage of verb-final clauses is more than ¾ in affirmative completive clauses. Moving on to other finite dependent clauses, but excluding relative clauses, which are all verb-final, and embedded questions, we find that the vast majority of them are verb-final or single-verb. Finally, we can see that all non-finite clauses are verb-final or single-verb, for all types of clauses, whether affirmative or negative.

Additional comparisons between the spoken and written corpora

The general difference we have observed between the spoken and the written corpora, in the direction of more verb-final utterances for the latter, is most striking when we compare clauses made up of the verb and one other constituent, repeated here in Table 10 for convenience.

 

 

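| Row | Order | Spoken corpus | % | Written corpus | % |
|---|---|---|---|---|---|
| 1 | SV | 146 | 47.9% | 30 | 90.9% |
| 2 | VS | 159 | 52.1% | 3 | 9.1% |
| 3 | \*S\*V\* | 354 | 55.8% | 116 | 73.0% |
| 4 | \*V\*S\* | 280 | 44.2% | 43 | 27.0% |
| 4a | V\*S\* | 232 | -- | 5 | -- |
| 5 | \*A\*V\* | 202 | 82.4% | 56 | 78.9% |
| 6 | \*V\*A\* | 43 | 17.6% | 15 | 21.1% |
| 7 | OV | 137 | 37.0% | 58 | 82.9% |
| 8 | VO | 233 | 63.0% | 12 | 17.1% |
| 9 | XV | 234 | 46.0% | 122 | 93.1% |
| 10 | VX | 275 | 54.0% | 9 | 6.9% |
| 11 | AOV | 14 | 23.3% | 12 | 50.0% |
| 12 | AVO | 46 | 76.7% | 12 | 50.0% |
| 13 | SXV | 58 | 39.7% | 44 | 66.7% |
| 14 | SVX | 88 | 60.3% | 22 | 33.3% |
| 15 | \*Xla\*V\* | 1 | 1.0% | 12 | 34.3% |
| 16 | \*V\*Xla\* | 103 | 99.0% | 23 | 65.7% |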

Table 10: Percentages for different pairs of clausal orderings in main affirmative statements in the written and spoken corpora.

 

We can see in rows 1 and 2 that the first alternation, SV ~ VS, is evenly distributed in the spoken corpus, but extremely uneven in the written corpus in favor of SV. We will see that this is due in great part to the tendency in spoken Basque, but not in written Basque, to postpose subjects in presentative sentences.

 

This does not mean that intransitive subject inversion is rare in writing. As we can see in rows 3 and 4, when there are other elements before the verb, be they topics or not, these subjects invert in writing as well, for about ¼ of all subjects. The percentage is higher, 44.2%, for the spoken corpus. Notice, however, that whereas 82.9% of these clauses in the spoken corpus (232/280) are verb-initial, only 11.6% of those in the written corpus (5/43) are verb-initial (row 4a).
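The percentages in Table 10 are computed within each pair of competing orders rather than over the whole corpus. A minimal sketch of that calculation, using three of the pairs and the row 4a ratios, might look as follows; the names `pairs` and `split` are hypothetical and chosen only for this example.

```python
# Counts copied from Table 10: (first order of the pair, second order).
pairs = {
    "SV ~ VS": {"spoken": (146, 159), "written": (30, 3)},
    "OV ~ VO": {"spoken": (137, 233), "written": (58, 12)},
    "XV ~ VX": {"spoken": (234, 275), "written": (122, 9)},
}


def split(first, second):
    """Return each member's percentage share within its pair."""
    total = first + second
    return 100 * first / total, 100 * second / total


for name, corpora in pairs.items():
    for corpus, (first, second) in corpora.items():
        a, b = split(first, second)
        print(f"{name:8} {corpus:8} {a:5.1f}% / {b:5.1f}%")

# Row 4a: verb-initial clauses as a share of all *V*S* clauses (row 4).
verb_initial_spoken = 100 * 232 / 280
verb_initial_written = 100 * 5 / 43
print(f"verb-initial among *V*S*: spoken {verb_initial_spoken:.1f}%, "
      f"written {verb_initial_written:.1f}%")
```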

 

As we can see in rows 5-6, overt ergative arguments (A) are less likely to invert with the verb than are absolutive arguments in intransitive sentences (S). The difference is quite small for the written corpus, but significant for the spoken corpus. We will see later on that this has to do with the stronger correlation between A and topic than between S and topic.

 

As for the OV ~ VO alternation, Table 10 shows that whereas a majority of objects (in this particular configuration) are preverbal in writing, a majority are postverbal in speech. We will see later on that there is a tendency in speech to place after the verb those objects which have given referents, i.e. which are not ‘important’ enough to ‘deserve’ to be placed in rheme-initial position. For some speakers this extends even to objects with ‘new’ referents. In more ‘careful’ Basque, such as in writing, this only applies to cases in which the verb is noticeably more ‘important’ than the object, such as when it is contrastive, or when the whole assertion is emphatic.

 

The XV~VX alternation shows some of the same tendencies as the OV~VO one, but with a lesser degree of postposing. This seems to be due to the fact that the ideas that X’s represent (e.g. predicates and adverbials) are more likely to be the most ‘important’ element of the assertion (what I will call the focus) than are the referents of objects.

 

When a ‘subject’ (A or S) is added as a third element to the previous two alternations we find some differences in the degree to which postverbal complements are found. In rows 11-12 and 13-14 we can compare the AOV~AVO and the SXV~SVX alternations for these triplets (not the only possible ones). First of all we see that the overall number of tokens is much smaller, due to the fact that overt subjects are relatively rare. Furthermore, we see that O’s and X’s are more likely to be postverbal when there is a ‘subject’ than when there isn’t one. The reason for the increase, especially noticeable in the written corpus, is that in many of these clauses, the ‘subject’ is actually a rhematic element, that is, it is the focus and not the topic, and it occupies the only preverbal rhematic ‘slot’, which causes any and all other complements to be placed after the verb.

 

In the following chapters I will attempt to give a full account of these trends, correlations, and variation, based on a theory of information structure to be developed in Chapter 3 and expanded in subsequent chapters.

Differences among the subcorpora

A great many other interesting statistical facts in need of an explanation may be gleaned from the spoken and written corpus databases. In the previous sections we saw comparisons between the spoken and the written corpus. In this section and the one that follows we will look at the differences, or variation, among the sub-texts and groups of sub-texts of which the spoken corpus database is composed.

 

It should be clear that the frequency statistics can be different for different subsets of the individual texts in each of the corpora. For example, in Table 11, we can see the statistics for the different positions of the verb for affirmative main clauses in the written corpus by subcorpus chapter. Here the most noticeable difference in word order between the two written samples is in the number of verb-initial clauses.

 

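| Sample | V... | % | ...V... | % | ...V | % | V | % | Total |
|---|---|---|---|---|---|---|---|---|---|
| Behi... | 31 | 13.0% | 86 | 36.1% | 113 | 47.5% | 8 | 3.4% | 238 |
| Kuba... | 21 | 5.1% | 167 | 40.2% | 220 | 53.0% | 7 | 1.7% | 415 |
| Total | 52 | 8.0% | 253 | 38.7% | 333 | 51.0% | 15 | 2.3% | 653 |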

Table 11: Single-verb, verb-final, verb-medial, and verb-initial affirmative main clauses in the written corpus (two sources combined) by sub-sample.

 

In general the differences between the written and the spoken corpora are of the same general type as those among the fluent groups of adults and the less fluent groups of children in the spoken corpus, in that the former in each case has a higher percentage of verb-final clauses, a smaller percentage of subject inversion, more rigidly verb-final dependent clauses, and so on. Later in this section we will look at these differences between speech and writing in more detail.

 

If we look at the two sets of stories that the spoken corpus is composed of, the Pear stories and the Chaplin stories, we find significant differences as to the position of the verb in affirmative statements, as we can see in Table 12.

 

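| Story | V... | % | ...V... | % | ...V | % | V | % | Total | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Pear | 336 | 48.8% | 107 | 15.5% | 144 | 20.9% | 102 | 14.8% | 689 | 100% |
| Chaplin | 828 | 41.6% | 344 | 17.3% | 616 | 30.9% | 203 | 10.2% | 1991 | 100% |
| Total | 1164 | 43.4% | 451 | 16.8% | 760 | 28.4% | 305 | 11.4% | 2680 | 100% |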

Table 12: Verb position for all affirmative, main clause statements in the spoken corpus according to the story used (across sentence types).

 

For some reason, there are some obvious differences between the two subsets of stories. Thus we can see, for instance, that the percentage of single-verb clauses is about 50% higher in the Pear stories subset. More interestingly, the percentage of verb-final clauses is about 50% higher in the Chaplin set and, conversely, the percentage of verb-initial clauses is somewhat higher in the Pear set. Perhaps a higher percentage of presentatives and other similar clauses (see Chapter 3) in the Pear story accounts at least in part for this discrepancy.
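The "about 50% higher" comparisons above are ratios between percentages, not between raw counts (the Chaplin set is almost three times larger). As a rough sketch, with illustrative names, the ratios can be derived from the counts in Table 12 like this:

```python
# Counts copied from Table 12 (main affirmative statements by story).
table12 = {
    "Pear":    {"V...": 336, "...V...": 107, "...V": 144, "V": 102},
    "Chaplin": {"V...": 828, "...V...": 344, "...V": 616, "V": 203},
}


def shares(row):
    """Convert a row of counts into percentages of the row total."""
    total = sum(row.values())
    return {position: 100 * n / total for position, n in row.items()}


pear = shares(table12["Pear"])
chaplin = shares(table12["Chaplin"])

single_verb_ratio = pear["V"] / chaplin["V"]         # Pear relative to Chaplin
verb_final_ratio = chaplin["...V"] / pear["...V"]    # Chaplin relative to Pear
verb_initial_ratio = pear["V..."] / chaplin["V..."]  # Pear relative to Chaplin

print(f"single-verb, Pear vs. Chaplin:  x{single_verb_ratio:.2f}")
print(f"verb-final, Chaplin vs. Pear:   x{verb_final_ratio:.2f}")
print(f"verb-initial, Pear vs. Chaplin: x{verb_initial_ratio:.2f}")
```

A ratio of 1.5 corresponds to "50% higher".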

 

These comparisons are important because they show how significant the differences among seemingly comparable texts can be and urge us to be cautious about making hasty generalizations about the language from any one set of texts. Thus, although the differences that we saw between the spoken and the written corpora seemed significant, the reasons for these differences may not be so easy to ascertain. We may also conclude that classifying Basque as having a particular basic order cannot be done convincingly from a statistical perspective, since different clause types differ as to their preferences, as do different samples and, in particular in this case, samples from different media (speech vs. writing).

Differences among the speakers and group of speakers

Another interesting, and perhaps more fruitful, comparison that we can make is the one among the different speakers, 45 in all, and between different groups of speakers, of which there are four major ones. Four different groups of speakers were used in this study: two composed of children, Hendaia and Ikasbide, and two composed of adult speakers, Deustu and Lur. In each of the groups, half of the speakers related the Pear story and the other half related the Chaplin story.

 

As we can see in Table 13, there are interesting differences in word order frequencies among the four groups. The greatest differences are those between the Ikasbide children’s group and the Deustu adult group.

 

 

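| Group | V... | % | ...V... | % | ...V | % | V | % | Total | % |
|---|---|---|---|---|---|---|---|---|---|---|
| Hendaia | 237 | 38.5% | 95 | 15.4% | 186 | 30.2% | 97 | 15.8% | 615 | 100% |
| Ikasbide | 462 | 59.4% | 118 | 15.2% | 111 | 14.3% | 87 | 11.2% | 778 | 100% |
| Deustu | 321 | 34.6% | 167 | 18.0% | 352 | 37.9% | 88 | 9.5% | 928 | 100% |
| Lur | 144 | 40.1% | 71 | 19.8% | 111 | 30.9% | 33 | 9.2% | 359 | 100% |
| Total | 1164 | 43.4% | 451 | 16.8% | 760 | 28.4% | 305 | 11.4% | 2680 | 100% |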

Table 13: Verb position for all affirmative, main clause statements in the spoken corpus by subcorpora

 

In the Ikasbide group, for example, postverbal elements are found in 74.6% of the (main affirmative) clauses (59.4 + 15.2), whereas in the Deustu group, this happens in only 52.6% (34.6 + 18). The differences between these two groups are greatest for verb-initial and verb-final clauses, since the Deustu group has more than twice as many verb-final clauses and the Ikasbide group has almost twice as many verb-initial clauses as the Deustu group.
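One informal way to check which two groups in Table 13 differ most is to compare their verb-position distributions pairwise. The sketch below does this with a simple measure, the sum of absolute differences in percentage points across the four positions. That measure is an assumption chosen for illustration and is not the method used in the study.

```python
from itertools import combinations

# Counts copied from Table 13, in the order V..., ...V..., ...V, V.
table13 = {
    "Hendaia":  (237, 95, 186, 97),
    "Ikasbide": (462, 118, 111, 87),
    "Deustu":   (321, 167, 352, 88),
    "Lur":      (144, 71, 111, 33),
}


def shares(counts):
    """Convert a tuple of counts into percentages of its total."""
    total = sum(counts)
    return [100 * n / total for n in counts]


def distance(a, b):
    # Assumed measure: summed absolute difference in percentage points.
    return sum(abs(x - y) for x, y in zip(shares(a), shares(b)))


distances = {
    (g1, g2): distance(table13[g1], table13[g2])
    for g1, g2 in combinations(table13, 2)
}
most_different = max(distances, key=distances.get)
print("most different pair:", most_different)
for pair, value in sorted(distances.items(), key=lambda item: -item[1]):
    print(f"{pair[0]:8} vs {pair[1]:8} {value:5.1f} points")
```

Other distance measures could rank the closer pairs differently, so the ordering beyond the top pair should be read with caution.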

 
