Sequence Analysis in Demographic Research

This paper examines the salient features of sequence analysis in demographic research. The new approach allows a holistic perspective on life course analysis, and is based on a representation of lives as sequences of states. Some of the methods for analysing such data are sketched, from complex description to optimal matching to monothetic divisive algorithms. After a short illustration of a demographically relevant example, the needs in terms of data collection and the opportunities of applying the same approach to synthetic data are discussed.


Holistic Perspectives on Life Course Data Analysis: The Strong vs. the Pragmatic View
The study of life courses has in recent years focused increasingly on event histories. This is naturally connected to the way life events are seen as the ultimate explanandum by the life course research stream (Mayer and Tuma, 1990; Giele and Elder, 1998), as well as to the productiveness of this approach.
In fact, with the help of event history analysis it is possible to study the intersection and mutual interdependencies between parallel careers of individuals and between potentially interdependent individuals such as couples.
In addition, one can analyse the impact of variables situated at different levels of aggregation on individual behaviour (Blossfeld and Rohwer, 1995). Event history analysis has gradually become important in demographic research. For instance, a substantial part of the design of the recent Fertility and Family Survey Project co-ordinated by the United Nations Economic Commission for Europe was influenced by the wish to use event history modelling techniques in the analyses.
Though it has played a crucial role, event history analysis cannot address every interesting issue that arises in life course research¹. Event history analysis, as its name suggests, focuses on the time to occurrence (or non-occurrence) of specific events in the life course (Yamaguchi, 1991). By looking at its (latent) dependent variable, the transition or hazard rate, event history analysis expresses well the idea of what we may call an 'event-oriented' approach in life course research. Nevertheless, by focusing on time-to-event, researchers may miss a general overview of life courses, thus failing to adopt a 'holistic' perspective that sees life courses as one meaningful conceptual unit. A holistic perspective calls for efforts to analyse life courses in their wholeness, instead of specific events or combinations of events, as dependent variables. Taking this as a starting point, I shall focus in this paper on the idea of coupling the widespread event-oriented perspective with a holistic perspective on life courses. I shall review and discuss an approach that may prove useful for this purpose; in fact, it has already proved useful.
Different theoretical approaches share a common goal: to analyse the life courses of individuals starting from a certain point in time and, where applicable, up to a certain point in time. But why should one be interested in choosing, in addition to an event-oriented approach, a holistic perspective on life courses? I see at least two theoretically significant ways of justifying this interest, each with a distinct emphasis on the strength of underlying hypotheses on individual behaviour. For simplicity, I shall call them the 'strong' and the 'pragmatic' views.
The 'strong' viewpoint, which is present for instance in some of the traditional literature in neo-classical economics, sees life courses as subject to accurate inter-temporal planning and hence as outcomes of utility maximisation (see, for instance, the review of Deaton and Muellbauer, 1980, Ch. 12). Modigliani's classical life-cycle hypothesis, Friedman's permanent income concept, and the ensuing stream of research are good examples. Such a view has also led to empirical study of long-term plans in life courses and their consistency, as well as to critiques of a dynamic programming view of lives from experimental economists (see the literature review of Camerer, 1995). In other words, according to the 'strong' view, life courses are mainly the outcomes of planning (in an uncertain world with possibly external stochastic shocks). A holistic perspective is thus hypothesised to be present in the behaviour of individuals themselves, directly embedded in the behavioural assumptions.
In psychology too, individual life courses are considered as emerging from internalised timetables (Heckhausen, 1999). In the sociodemographic literature, the notion of strategy, or 'life-planning', has been emphasised. As Giddens says: "In a world of alternative lifestyle options, strategic life-planning becomes of special importance. (…) Life-planning is a means of preparing a course of future actions mobilised in terms of the self's biography" (1991, p. 85). Settersten and Mayer (1997) maintain that the older concepts of the 'life course' were holistic, with more or less explicit reference to biological structures. This is no longer true in the life course literature, which also started out with a strong emphasis on internalised life calendars and schedules. So, for a theoretical approach that assumes that individuals look holistically at their own lives, it is undoubtedly necessary to have tools that allow us to follow the same perspective, and to treat the life course as a conceptual unit.
According to the 'pragmatic' viewpoint, the life course as a conceptual unit is thought of as being a contingent result of a sequence of events. Following this viewpoint, researchers focus principally on events when they wish to explain individual behaviour. This view (see, e.g., Rohwer, 1994 and Blossfeld and Rohwer, 1995) is justified by its links to the philosophical notions of causality and time. Even if one takes such a position, a holistic perspective is useful as a way to describe and to summarise the past history of individuals. One can also play the role of an historian. In life course research, it is fundamental to study the timing of events, their sequencing, the duration of time spent in states, and the spacing between events (Settersten and Mayer, 1997). Comparative research across countries, regions or cohorts is one of the examples where the life course as a whole conceptual unit might provide particular insights.
If we accept that studying life courses in their wholeness is worthwhile and important, it is then necessary to complement the event-oriented techniques with those that can analyse them as a unit. In this paper, I deal with the sequence analysis approach to this topic, introduced to the social sciences by Abbott. In Section 2, I give a brief introduction to the sequence representation of life courses. Then, I review some of the approaches to analysing life courses represented as sequences. In Section 4, I briefly describe a demographic application of this approach. The issue of data collection is dealt with in Section 5. Section 6 discusses the use of sequence analysis as a tool for analysing synthetic biographies. Finally, I consider some of the open questions and opportunities. In the appendix the reader will find a concise reference to existing software on sequence analysis.

Life Courses in a "Word": The Sequence Representation
Different approaches² have been proposed in the literature for studying life courses as whole conceptual units from a quantitative point of view. Nowadays, at least in sociology, there is an increasing agreement about focusing on the set of techniques known as sequence analysis (for a review, see Abbott and Tsay, 2000; Abbott, 1995). The basic idea is to represent each life course or trajectory in the life course as a 'word' or, to be precise, as a string of characters (also numerical). This representation is thus identical to the one used to code DNA molecules (Waterman, 1995). By and large, one focuses on a time window with a precise beginning and endpoint (two specific ages). Then, the technical question arising is how to analyse such strings in a meaningful and easily interpretable way. Abbott (1995) makes a distinction between 'non-recurrent sequences', where a character may not repeat at all, and 'recurrent sequences', where repetition is possible, exactly as in molecular biology. As a simple example, think of two characters: A and B. These two characters can give rise to five non-recurrent sequences: an empty sequence, A, B, AB and BA. There is an infinite number of possible sequences if the characters may be repeated. For our purposes, the distinction between recurrent and non-recurrent sequences can be extended to the concepts of 'sequences of events' and 'sequences of states' (Billari and Piccarreta, 2000). We shall make very brief reference to sequences of events and then discuss sequences of states in greater detail.
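The count of five non-recurrent sequences over the characters A and B can be checked by enumeration. The following is a minimal sketch; the function name `non_recurrent_sequences` is a hypothetical helper introduced here for illustration only.

```python
from itertools import permutations

def non_recurrent_sequences(alphabet):
    """List every sequence in which no character repeats, the empty one included."""
    sequences = []
    for length in range(len(alphabet) + 1):
        # permutations() never reuses a character, which is exactly
        # the non-recurrence condition
        for arrangement in permutations(alphabet, length):
            sequences.append("".join(arrangement))
    return sequences

two_letter = non_recurrent_sequences("AB")
print(two_letter)  # ['', 'A', 'B', 'AB', 'BA']
```

With recurrent sequences no such finite list exists unless the length is fixed, which is what the time window of a sequence of states provides.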
When representing a life course as a sequence of events, one normally assigns a letter (or a number) to an event, and the ordering of events gives the ordering of letters in the word. Let us, for example, represent the union behaviour of an individual. He/she first forms a cohabiting union (event denoted by C), then gets married (M), then gets divorced (D) and remarries. A representation of this life course via a sequence of events would be: CMDM. The main advantages of such a representation are simplicity and compactness. The main problem with it is that one cannot explicitly take into account the time between events; in fact, this approach makes use of time implicitly in the sense that events are ordered. In particular, it is not possible to represent events that occur simultaneously, which can sometimes be the case when analysing demographic behaviour. The representation is, however, interesting when either the number of events is low or the complexity of life courses is limited, in the sense that they are concentrated in patterns that have exactly the same representation.
To overcome the limitations of the above representation and to take explicitly into account the duration between events, recent research efforts have focused on representing life courses as (recurrent) sequences of states. The origin of this approach can be traced back to computational biology, where this representation is used for macromolecules such as DNA (see Sankoff and Kruskal, 1983 and Waterman, 1995). According to this approach, one explicitly views life course data as being framed into discrete time units. The corresponding assumption is that either there is a "natural" discrete time unit (e.g. month or year) in the data, or that some discretisation has been performed.
Let $t$ refer to time, and consider a sequence of state variables such that $y_{it}$ represents the state $s$ of individual $i$ at time $t$, $t = 1, \ldots, h$, with finite $h$ and $s \in S$.
We shall assume that $S$ is a state space consisting of a finite number $k$ of discrete states, $S = \{s_1, s_2, \ldots, s_k\}$. A vector can then represent the life course of the $i$-th individual as: $Y_i = (y_{i1}, y_{i2}, \ldots, y_{ih})$, $i = 1, \ldots, n$. The sequence representation for the individual is then $y_{i1} y_{i2} \ldots y_{ih}$ and the number of possible sequences is $k^h$.
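To give a feeling for how quickly the number of possible sequences (k raised to the power h) grows, the sketch below enumerates all sequences for a small window and compares the count with the formula. The state labels and the window lengths are illustrative assumptions, not values taken from any data set.

```python
from itertools import product

def count_state_sequences(states, h):
    """Count sequences of length h by brute-force enumeration."""
    return sum(1 for _ in product(states, repeat=h))

states = ["S", "C", "M"]           # k = 3 illustrative states
enumerated = count_state_sequences(states, 4)
by_formula = len(states) ** 4      # k ** h with h = 4

# For a longer window, enumeration is hopeless and only the formula is usable,
# e.g. an assumed window of 60 monthly time points:
five_years_monthly = len(states) ** 60
print(enumerated, by_formula, five_years_monthly)
```

Even with three states, a five-year monthly window admits far more sequences than any sample could contain, which is why identical sequences are rare in practice.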
As a simple example, we shall consider three states: single (S), cohabiting (C), married (M), on a monthly time scale from 20 to 24 years and 12 months. The sequence representation of an individual life course could thus be:

SSSSSSCCCCCCCSSSSSSSSSSSSSSSSSSSSSSCCC SSSSSSSSSSSSMMMMMMMMM
This individual, starting as a single on his/her 20th birthday, entered a cohabiting union at the age of 20 years and 6 months, ended the cohabitation at 21 years and 2 months, entered a new phase of cohabitation at 22 years and 12 months which ended at 23 years and 3 months, and got married at the age of 24 years and 3 months. The representation as a sequence of states can thus be easily reverted back into an event history representation with discrete time scale and vice-versa (we shall come back to this point later, when dealing with data collection).
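The two-way conversion between a sequence of states and a discrete-time event history can be sketched as follows. The short 12-month sequence and the helper names `to_spells` and `to_sequence` are hypothetical and serve only to illustrate the idea; time points are counted from 0 at the start of the window.

```python
from itertools import groupby

def to_spells(sequence):
    """Run-length encode a state sequence into (state, start, duration) spells."""
    spells, position = [], 0
    for state, run in groupby(sequence):
        duration = len(list(run))
        spells.append((state, position, duration))
        position += duration
    return spells

def to_sequence(spells):
    """Rebuild the state sequence from its spells."""
    return "".join(state * duration for state, _, duration in spells)

example = "SSSCCSSSSMMM"  # hypothetical 12-month window
spells = to_spells(example)
# Each change of state is an event: (time point, state left, state entered)
events = [(start, previous, state)
          for (previous, _, _), (state, start, _) in zip(spells, spells[1:])]

print(spells)  # [('S', 0, 3), ('C', 3, 2), ('S', 5, 4), ('M', 9, 3)]
print(events)  # [(3, 'S', 'C'), (5, 'C', 'S'), (9, 'S', 'M')]
```

Because `to_sequence(to_spells(x))` returns `x` unchanged, no information is lost in either direction as long as the time scale is the same.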
The representation can be further generalised to a possibly small set of parallel words. It is common to consider life courses as being composed of several, say $d$, parallel domains, each characterised by its own state space $S_j$, $j = 1, \ldots, d$. This happens for instance when union history and reproductive history are studied jointly. We can then have a representation where each individual is represented by a vector of states at each point in time: $Y_{ij} = (y_{ij1}, y_{ij2}, \ldots, y_{ijh})$, $i = 1, \ldots, n$, and $j = 1, \ldots, d$.
Moreover, this idea can be used when looking at parallel careers of different individuals. One possibility could be to join the state spaces of each separate domain to obtain a joint space $S = S_1 \times S_2 \times \ldots \times S_d$. We can then represent individuals by a single letter for each point in time. The drawback is that the number of states, and thus the size of the alphabet required, then grows quickly.
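The growth of the joint alphabet can be made concrete with a small sketch. The three domains and their state labels below are assumptions chosen for illustration; the size of the joint space is simply the product of the sizes of the domain-specific spaces.

```python
from itertools import product

# Hypothetical domain-specific state spaces
union_states = ["S", "C", "M"]      # single, cohabiting, married
parity_states = ["0", "1", "2+"]    # number of children
work_states = ["E", "N"]            # employed, not employed

# Joint space as the Cartesian product of the domain spaces
joint_states = list(product(union_states, parity_states, work_states))
print(len(joint_states))  # 3 * 3 * 2 = 18 joint states
print(joint_states[0])    # ('S', '0', 'E')
```

Adding one more domain with, say, four states would already require an alphabet of 72 letters, which illustrates the drawback mentioned above.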
In demographic analysis (see, e.g., Wunsch and Termote, 1978), a key distinction is made between renewable and non-renewable events. Renewable events can be converted into (different) non-renewable events by distinguishing them by their respective rank orders. Let us briefly discuss a particular type of recurrent sequence, using as an example non-renewable events concerning birth histories in marriages. Let us suppose that there are four sequences, each representing whether a birth of the 1st, 2nd, 3rd, or 4th order has occurred. Let us indicate as 0 the initial state of childlessness, and as 1 the state afterwards for each birth order, that is, $S_j = \{0, 1\}$, for $j = 1, 2, 3, 4$. A simple example of a sequence is the following one:

00000000000001111111111111111111111111111111111111111111111
00000000000000000000000000000000001111111111111111111111111
00000000000000000000000000000000000000000000000000000000111
00000000000000000000000000000000000000000000000000000000000
Each of the above sequences represents a non-renewable (and possibly never occurring) event, which potentially ends in an absorbing state. Such sequences have the property: if $y_{ijt} = 1$, then $y_{iju} = 1$ for $u \geq t$. In such cases, the state at a certain point in time contains a summary of the past history of the process. This does not happen when events are renewable. This property allows us to study sequences in a specific fashion, as we shall see later in this paper. It is also interesting because it outlines the distinctive feature of sequences representing demographic behaviour such as fertility. In this specific case, the time order between sequences can also be specified. In addition, there are restrictions on the possible sequences (e.g., one cannot have the first child after the second). As an example, consider the sequence

00000000000001111111111111111111111222222222222222222222333
In this case, the joint sequence has only four reachable states that might be characterised simply by the sum of the single sequences.
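The absorbing-state property, the ordering restriction between birth orders, and the construction of the joint sequence as a sum can all be checked mechanically. The sketch below uses a shortened, hypothetical set of four parallel sequences (ten time points instead of the full window); the function names are illustrative.

```python
def is_absorbing(sequence):
    """True if a 0/1 sequence never returns to 0 once it has reached 1."""
    return "10" not in sequence

def respects_birth_order(parallel):
    """True if a birth of order j+1 is never recorded before one of order j."""
    return all(int(higher_state) <= int(lower_state)
               for lower, higher in zip(parallel, parallel[1:])
               for lower_state, higher_state in zip(lower, higher))

def joint_sequence(parallel):
    """Sum the parallel 0/1 sequences position by position."""
    return "".join(str(sum(int(state) for state in column))
                   for column in zip(*parallel))

births = [
    "0011111111",  # 1st birth
    "0000011111",  # 2nd birth
    "0000000011",  # 3rd birth
    "0000000000",  # 4th birth (never occurs in the window)
]
parity = joint_sequence(births)
print(parity)  # 0011122233
```

The joint sequence is simply the parity (number of children ever born) at each time point, so the four binary sequences and the single summed sequence carry the same information.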

Reading Behind the Words: Analysing the Life Course as a Sequence of States
What analytical strategies can we follow when we have access to a set of sequences of states? Standard distribution-based methods will not work simply as a consequence of the complexity of the problem. With sequences on a monthly time scale, or on a yearly time scale with a long enough time span (e.g. 20 years), the probability that two sample members can be represented by the same sequence becomes very low, tending towards zero. We therefore need techniques tailored to the problem. In addition, it is not a straightforward step to directly employ the methods used in computational biology, because the problems of the two fields are clearly of a different nature. Biological sequences are typically very long, and each of them has complex internal patterns with a vast number of states and state changes and a specific meaning (e.g., see Myers, 1995). In the social sciences, it is normal to have a large number of sequences of relatively short length (compared to biological sequences), and we are hardly interested in each individual sequence. The aim is rather to discover regularities in the behaviour of a group of individuals. Here, I shall briefly review some of the approaches proposed in the social science literature.

Description Based on Features of Individual Sequences
While the description of individual life courses may be effective, especially with graphical tools (see, e.g., the descriptions provided by BioBrowser, Statistics Canada, 1999³ and the monograph by Wehner, 1999), it is difficult to represent more than a handful of life courses. So, we need some other tools. Following Rohwer and Trappe (1999), we may distinguish two different ways of describing sequences of states, one 'cross-sectional' and the other 'longitudinal'. The cross-sectional approach to sequence description takes as its starting point an origin on the time scale that is common to all sequences (say, a specific age like 15 for fertility analysis)⁴. The purpose is to synthesise, in various ways, the state distribution at that point in time.
The longitudinal approach is based on the idea that individuals cannot be simply classified according to their characteristics observed at only one point in time. So, observation time and past life history should be considered as being meaningful when describing sequence data. The purpose is then to synthesise the past history of individuals up to a certain point in time. The longitudinal approach can also be seen in a dynamic way: the synthesis we construct evolves across time.
While I cannot give a full account of the different paths that can be followed (see Rohwer and Pötter, 2000), I wish to mention here an interesting idea commonly used with biosequences, namely the search for meaningful patterns. To give an example, let us go back to the single-cohabitation-marriage sequence illustrated above. It would be interesting, for instance, to know how often first marriages are preceded by unmarried cohabitation. We can indicate this as a pattern in a sequence: S...SC...CM, where '...' means permanence in a given state. Another example of the idea would be the pattern: S*C*M, where * stands for an arbitrary (and possibly empty) sequence of states. The occurrence of patterns in different sequences or within individual sequences might provide insights for the analyses we are interested in. Obviously, the search for patterns can also be seen from a dynamic perspective, namely the presence of a pattern as a function of time.
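One possible way to search for such patterns, assuming sequences are stored as plain strings, is to translate them into regular expressions. In the sketch below, permanence in a state is rendered as "one or more repetitions" (`S+`, `C+`) and the wildcard * as `.*`; this translation, the sample sequences and the helper name `share_with_pattern` are assumptions made for illustration.

```python
import re

# Permanence in S, then permanence in C, then marriage
COHABITATION_THEN_MARRIAGE = re.compile(r"S+C+M")
# S, later C, later M, with arbitrary (possibly empty) states in between
ANY_PATH_TO_MARRIAGE = re.compile(r"S.*C.*M")

def share_with_pattern(sequences, pattern):
    """Proportion of sequences in which the pattern occurs at least once."""
    hits = sum(1 for sequence in sequences if pattern.search(sequence))
    return hits / len(sequences)

sample = ["SSSCCCMMM", "SSCCSSMM", "SSSSMMMM", "SSSSSSSS"]  # hypothetical
direct_share = share_with_pattern(sample, COHABITATION_THEN_MARRIAGE)
loose_share = share_with_pattern(sample, ANY_PATH_TO_MARRIAGE)
print(direct_share, loose_share)  # 0.25 0.5
```

Applying the same search to sequences truncated at successive time points would give the dynamic view, that is, the share of individuals showing the pattern as a function of time.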
Computer graphics, optimally with colour capability, are particularly useful in the description of sequences of states. We refer the reader to Rohwer and Trappe (1999), Rohwer and Pötter (2000) as well as Billari and Piccarreta (2000) for a comprehensive discussion of this issue.

Optimal Matching Analysis
The 'optimal matching analysis' is based on the notion of similarity or dissimilarity between pairs of sequences. This method has been used for the alignment of biosequences. The initial question is the following: what would it mean to say that two sequences (life courses) are more similar than two other sequences? This is, of course, a complicated question, one that cannot be answered easily. We may well come to the conclusion that the complexity of whole life courses does not allow for comparison in terms of a single (one-dimensional) metric. We shall, however, discuss optimal matching as one possible way of analysing sequences.
The basic idea behind optimal matching is to measure the dissimilarity of two sequences by considering how much effort is required to transform one sequence into the other. In its most elementary form, transforming sequences entails three basic operations: 1) insertion: a state is inserted into the sequence; 2) deletion: a state is deleted from the sequence; 3) substitution: a state is substituted by another one.
To each elementary operation a specific cost can be assigned, and the cost of applying a series of elementary operations can be computed as the sum of the costs of single operations. The distance between two sequences can thus be defined as the minimum cost of transforming one sequence into the other one. For example, if insertions and deletions cost one unit and substitutions two units, the cost of transforming the sequence SSCCMMM into the sequence SCCMM is 2 units. Specific dynamic programming algorithms assure that the minimum cost is effectively sought out (Sankoff and Kruskal, 1983; Waterman, 1995).
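The dynamic programming idea can be sketched as follows. This is a textbook edit-distance recursion with constant costs, written here as an illustration; it is not taken from any of the cited software, and the function name `om_distance` and its parameters are assumptions. Applications often use substitution costs that depend on the pair of states, which this sketch does not cover.

```python
def om_distance(a, b, indel=1.0, substitution=2.0):
    """Minimum total cost of turning sequence a into sequence b."""
    rows, cols = len(a) + 1, len(b) + 1
    # cost[i][j] = cheapest way to turn the first i states of a
    # into the first j states of b
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * indel           # delete everything
    for j in range(1, cols):
        cost[0][j] = j * indel           # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0.0 if a[i - 1] == b[j - 1] else substitution
            cost[i][j] = min(
                cost[i - 1][j] + indel,       # deletion
                cost[i][j - 1] + indel,       # insertion
                cost[i - 1][j - 1] + change,  # match or substitution
            )
    return cost[-1][-1]

print(om_distance("SSCCMMM", "SCCMM"))  # 2.0
```

The value 2.0 reproduces the example in the text: one S and one M are deleted, each at a cost of one unit.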
The computed distance thus takes into account the entire sequence and not just present states. As a result, one obtains a distance matrix. This can be employed as an input for any kind of analysis requiring proximity data (e.g., clustering and multidimensional scaling). As a result of a series of seminal papers by Abbott, most of the sociological literature makes use of this method. Chan (1999) gives a compact and useful review of the application of optimal matching analysis in life course research. Wu (2000) gives a critical view on this approach.
Unfortunately, this approach also has some drawbacks. First, we must take each sequence as a whole, which makes it difficult to view life courses as dynamically evolving through time. One possible way to cope with this problem is to carry out a separate sequence analysis for each time period in the observational window. This would then result in a sequence of distance matrices. The second drawback is that it can be difficult to understand which variables in the definition of specific clusters are more relevant than the others. As Halpin and Chan (1998) state, "while the clusters are very easy to characterise in a general way, it is impossible to characterise them formally and exhaustively, that is, to define rules which will replicate the clusters exactly or close to exactly."

Clustering Binary Sequences
A proposal by Billari and Piccarreta (2001) overcomes this last-mentioned problem of the optimal matching method by explicitly building meaningful groups and by using algorithms for the clustering of binary variables. The algorithm applies to a series of parallel sequences that can be represented by binary variables (such as in the example with births discussed in an earlier section). For a meaningful interpretation of the algorithm it is necessary that the events be non-renewable. For example, a value of 1 at a certain point in time must imply a 1 at all the following points in time. Billari and Piccarreta's proposal is to use a monothetic divisive clustering algorithm. The algorithm is hierarchically divisive, in the sense that the entire sample is first divided into two groups, and each group is further split into two subgroups. The procedure is iterated until each individual belongs to his/her own group. The algorithm is also monothetic: each group is divided into two subgroups according to the values of a single binary variable. To perform the splitting, it is necessary to select a single relevant variable in such a way that the two subgroups are characterised by the maximum homogeneity within and the maximum heterogeneity between. Heterogeneity can be measured using either the Gini index or an entropy measure. The splitting procedure can also be represented by means of a tree diagram.
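A single splitting step of a monothetic divisive procedure can be sketched as follows. The exact criterion used by Billari and Piccarreta is not spelled out here, so the sketch makes its own assumption: the heterogeneity of a group is its size times the sum, over all binary variables, of the Gini index 2p(1-p), and the chosen variable is the one that minimises the total heterogeneity within the two resulting subgroups. The data and variable names are hypothetical.

```python
def gini_heterogeneity(rows):
    """Group size times the summed Gini index 2p(1-p) of every binary variable."""
    n = len(rows)
    if n == 0:
        return 0.0
    total = 0.0
    for column in zip(*rows):
        p = sum(column) / n
        total += 2 * p * (1 - p)
    return n * total

def best_split(rows):
    """Return (within heterogeneity, variable index, group with 0, group with 1)."""
    best = None
    for j in range(len(rows[0])):
        zeros = [row for row in rows if row[j] == 0]
        ones = [row for row in rows if row[j] == 1]
        if not zeros or not ones:
            continue  # a constant variable cannot split the group
        within = gini_heterogeneity(zeros) + gini_heterogeneity(ones)
        if best is None or within < best[0]:
            best = (within, j, zeros, ones)
    return best

names = ["UNI25", "JOB28", "CHI30"]  # hypothetical indicator variables
sample = [(1, 1, 1), (1, 1, 1), (1, 0, 1), (1, 1, 0),
          (0, 1, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0)]
within, index, group_without, group_with = best_split(sample)
print(names[index], len(group_without), len(group_with))
```

Applying `best_split` again to each resulting subgroup yields the hierarchical tree; every group is then defined exactly by the values of the variables chosen along its branch.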
The main advantage of this algorithm is that it leads to easily interpretable clusters: the groups obtained are in fact perfectly characterised by the presence (or absence) of certain attributes (those implied by the splitting variables). Another interesting feature of the algorithm is that it enables us to identify the most relevant variables in the clustering process and to rank these variables according to their importance. One will in fact attribute a higher discriminating power to variables that induce the first splits. A major limitation of this algorithm, however, is that it can be applied only when sequences are initiated by non-renewable events. We shall come back to this question later, in the context of an example.

Multiple Correspondence Analysis of Sequences
van der Heijden (1987, Ch. 8) illustrates and advocates the use of multiple correspondence analysis in the study of sequences⁵. This technique has become widespread in the analysis of qualitative data in the social sciences. The type of input data is similar to that which we discussed for binary sequences.
In the context of life course research, multiple correspondence analysis is useful for synthesising the cross-sectional situation at each point in time, as well as for analysing the differences between individuals and identifying those individuals who are particularly "distant" from the mean. Graphical inspection is fundamental to this method. The applications that have been carried out up to now (which have generally focused on diaries and time-budgets) are substantially cross-sectional, and sometimes the time points need to be aggregated. Nevertheless, this technique can become particularly useful when sequences are generated by non-renewable events, since cross-sectional situations themselves depend on the past.

A Demographic Application
In the social sciences, most applications of sequence analysis have focused on the topic of working histories. There is still a lack of applications related to demographic trajectories such as union histories, childbearing, and residential mobility. We shall illustrate here one example. Billari and Piccarreta (2001) applied the monothetic divisive algorithm illustrated in the last section to study the transition to adulthood, using data from the Italian Fertility and Family Survey. The domains considered in their paper are education, the labour market, living arrangement, behaviour regarding sexual intercourse, union formation, and parenthood. Time is measured in years, and each variable is an indicator that represents whether one has experienced the marker event for each domain in a given year between the ages of 20 and 35 (d = 6 and h = 16, using the notation introduced in Section 2). The six variables that describe the state occupied in each domain at time t are:
• EDUt = having finished education by the t-th year of age
• JOBt = having entered the labour market by the t-th year of age
• LEAt = having left the parental home by the t-th year of age
• SEXt = having had sexual intercourse by the t-th year of age
• UNIt = having entered a union by the t-th year of age
• CHIt = having had a child by the t-th year of age.
The property of sequences initiated by non-renewable events thus holds. Applying the divisive algorithm yields 8 groups (see Figure 1 for a tree representation of the splitting).
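The indicator coding described above (EDUt, JOBt, and so on, for ages 20 to 35) can be sketched from ages at the marker events. The person below is fictitious, `None` is used here to mean that the event was not observed, and the reading of "by the t-th year of age" as "age at event less than or equal to t" is an assumption of this sketch.

```python
AGES = range(20, 36)  # 16 yearly time points, ages 20 to 35

def indicator_sequence(age_at_event):
    """0/1 indicator per age: 1 once the event has occurred, 0 before."""
    return [int(age_at_event is not None and age_at_event <= age)
            for age in AGES]

# Fictitious ages at each marker event (None = not experienced)
person = {"EDU": 19, "JOB": 22, "LEA": 27, "SEX": 18, "UNI": 27, "CHI": None}
indicators = {domain: indicator_sequence(age) for domain, age in person.items()}

def value_at(domain, age):
    """Look up a variable such as UNI25 as value_at('UNI', 25)."""
    return indicators[domain][age - AGES.start]

print(value_at("UNI", 25), value_at("UNI", 31))  # 0 1
```

Each indicator sequence is non-decreasing by construction, so the non-renewable property required by the divisive algorithm is satisfied.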
Our interest here is in seeing which variables are intervening at each stage of the split in order to be able to characterise the groups. At the first step, the splitting variable is UNI25 (having had a union by the age of 25). Hence, it seems that entering a union at a young age (relative to the local standards) is the most important indicator of differences between young Italians. This is consistent with certain views expressed in the literature that emphasised the role of unions in Southern European countries. Then, this 'early union' group is split according to JOB28. In the case of late (or possibly no) labour force participation, this corresponds to a traditional female pattern (in fact, this group with early union and late or no job has the highest share of women among all the groups). The third step separates out those who have not entered a union before age 25 according to UNI31; again, the central role of marriage is being underlined here. The fourth step splits those entering an early union and having an early job according to EDU26, then the group with a comparatively early end of schooling is split according to CHI23. This leads to the formation of an 'early-on-everything' group and of two other groups with a comparatively late end of schooling and parenthood. At the sixth step, EDU23 splits the group of people entering a union between ages 25 and 31. The last step divides people not having entered a union at age 31 according to SEX29; this leads to the establishment of a group of people who have had no sexual relation before their thirtieth birthday.
Although it is not easy to interpret every single split, some features appear to be constant. Among young Italians, union formation plays a crucial role in classifying life courses in the transition to adulthood. Leaving the parental home is not important, which must certainly be connected with the high degree of synchronisation of this event with union formation. The transition to parenthood plays only a marginal role. Sexual behaviour matters only for people outside unions. As a consequence of having used the monothetic divisive algorithm, group membership is perfectly defined for specific life courses, and it becomes feasible to define groups of individuals according to the timing and sequencing of events in their transition to adulthood.
Billari and Piccarreta also analysed the basic demographic determinants (gender and cohort) of group membership. Group 1, with early union and with late (or no) labour force participation, largely corresponds to the traditional absence of women from the labour force. There are only very few men in this group, and younger cohorts are underrepresented. Group 3 represents an average of sorts, without cohort and gender particularities, and it represents more than a fifth of the sample. Group 4 exhibits a faster transition to adulthood, and it is mainly composed of women; however, no cohort pattern is evident. Group 6, with unions between ages 25 and 31, has comparatively more men but no clear cohort dynamics. Group 8 has the highest share of "young" males; it comprises those who seem to be in line with the standard view on patterns of transitions to adulthood for more recent cohorts, which means that they have had sexual relationships but not within unions.
Empirical Research and Applications - Francesco C. Billari
The main weakness of this approach is that it does not explicitly allow splitting by using the sequencing of events that belong to parallel careers. This is handled only in an indirect way, by focusing on the best possible split according to age.
A suggestion to handle simultaneously the timing, sequence and the number of events in a holistic perspective can be seen in Billari et al. (2000).

Data Collection: Not Just Event Histories
What data collection procedures should be used if one wishes to construct a sequence representation of life courses? The answer is the same as for any longitudinal analysis: we need data that allow us to follow individuals over time.
In this section, I briefly discuss this topic; the next section deals with 'synthetic data'. First, it is useful to note that, when the time scale for the available information is discrete, a sequence representation is equivalent to that of event histories (Rohwer and Pötter, 2000). I have already discussed this in the example seen in Section 2. Another example may be of some help. In a fictitious life course, there are four possible states denoted by: 0 (in school, not employed), 1 (out of school, not employed), 2 (in school, employed) and 3 (out of school, employed). An event history consists of only two events: an individual starts a job at month 5, and stops schooling at month 8. The monthly sequence can be derived from the event history as: 0000022233333. Since the representation of the life course as an event history is equivalent to a representation of the life course as a sequence using the same time scale, any instrument that allows the construction of event histories can also be used to produce sequences of states. This means that retrospective surveys can be, and in fact have been, used also to produce sequence data. It is not surprising that the technique of data collection known as the "life history calendar" (Freeman et al., 1988; Axinn et al., 1999) is based on the idea of representing life courses in a fashion similar to the sequence of states. Such methods are embedded in the broader spectrum of the collection of "life history matrices" (Settersten and Mayer, 1997) in life course research.
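The derivation of the monthly sequence from the two-event history can be written out explicitly. The sketch assumes that months are numbered from 0 and that the observation window covers 13 months, which is what the printed sequence suggests; the function name `state_at` is hypothetical.

```python
def state_at(month, job_start, school_end):
    """State code: 0 in school/not employed, 1 out of school/not employed,
    2 in school/employed, 3 out of school/employed."""
    employed = job_start is not None and month >= job_start
    out_of_school = school_end is not None and month >= school_end
    return 2 * int(employed) + int(out_of_school)

# Event history from the text: job starts at month 5, schooling ends at month 8
sequence = "".join(str(state_at(month, job_start=5, school_end=8))
                   for month in range(13))
print(sequence)  # 0000022233333
```

The coding works because the two binary conditions (employed or not, in school or not) jointly define the four states, so the state at any month follows directly from the dates of the two events.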
With retrospective data, we have the usual problems of having to rely on the memory of interviewees and of missing data. The problem of missing data can have different consequences in sequence analysis from those it has for event history analysis. In fact, if we wish to study a life course in its entirety, the information is incomplete even if only one of the events is wrongly dated. Possible strategies to deal with this problem include both imputation techniques and defining an extra 'missing state'. In a retrospective survey, asking for the time order (sequencing) of events rather than a simple recording of events can serve as an advantage both in terms of quality checks and in terms of performing analyses based on more reliable data.
The ideal source for sequence data is a population register, provided that it contains the information we are interested in. It has the advantage of providing the same data as a retrospective survey but without the problems of recall. In addition, a population register usually has fewer missing data. The disadvantage is, however, twofold: first, population registers are rare and costly, and in some countries it is not even legal to keep them; second, we can only construct sequences for trajectories that are officially recorded.
Nevertheless, less information is required for constructing a sequence representation than for constructing a full event history. The linkage of records from different censuses can provide a sequence that is long and complex enough to require specific techniques. For instance, if one links three census records, each with the present state (e.g., residential location) at the census and two recalled states for the pre-census period, one will have a sequence of 9 time points (three states for each of the three censuses). Of course, the information about the intervals between the time points is lost in this case.
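The census example can be sketched as follows. The record layout, the location labels A, B and C, and the function name are hypothetical; the sketch assumes that the records are sorted by census date and that the recalled states are given in chronological order and precede the state at the census.

```python
def link_census_records(records):
    """Concatenate linked census records into one sequence of states."""
    sequence = []
    for record in records:
        sequence.extend(record["recalled"])  # recalled pre-census states
        sequence.append(record["present"])   # state at the census date
    return sequence


# Three linked census records for one (fictitious) individual.
records = [
    {"recalled": ["A", "A"], "present": "B"},
    {"recalled": ["B", "B"], "present": "B"},
    {"recalled": ["B", "C"], "present": "C"},
]

sequence = link_census_records(records)
print(len(sequence), "".join(sequence))
```

The resulting sequence orders the nine states in time but, as noted above, says nothing about the length of the intervals between them.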
An additional and more widespread source of sequence data is provided by panel surveys. Such surveys normally collect the states occupied by individuals at various points in time. They do not necessarily provide full event histories, which is why discrete-time event history models have been used to analyse such data. Consequently, this type of source may prove to be rather powerful for the construction of sequences of states.

Synthetic Data and Sequence Analysis: The Case of Demographic Projections
Another type of data that might prove useful for sequence analysis consists of 'artificially' created life courses, known as 'synthetic biographies' 6. Demographic projections based on microsimulation methods can actually produce such biographies when several states (e.g., education, labour force status, living arrangement, family status) are taken into account 7.
As Lutz (1997) states: "Computers can be also used to generate new virtual people. Such computer-generated individuals are only partial, inadequate images of real people. They may, nevertheless, be generated in a way that they carry some of the characteristics of real people that we consider decisive in determining their own behaviour and their impact on other people".
Once the results of population projections obtained through microsimulation are available, we can apply the techniques of sequence analysis to a) check the projection results for consistency and for sensitivity to different hypotheses; and b) present the results from a longitudinal perspective. The latter can be done by identifying individuals with specific life courses and/or by grouping individuals according to their life courses, using the clustering algorithms developed for sequences.
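A toy sketch may make these two uses more concrete. It is not a demographic projection model: it assumes a first-order Markov chain over the four school and employment states of the earlier fictitious example, with invented transition probabilities and the invented rule that nobody returns to school after leaving it. All names and numbers are hypothetical. Grouping biographies by a sequence clustering algorithm would additionally require a distance between sequences (for instance from optimal matching), which is not shown here.

```python
import random
from collections import Counter

# Hypothetical monthly transition probabilities between the states
# 0 (in school, not employed), 1 (out of school, not employed),
# 2 (in school, employed) and 3 (out of school, employed).
TRANSITIONS = {
    "0": {"0": 0.85, "1": 0.05, "2": 0.10},
    "1": {"1": 0.70, "3": 0.30},
    "2": {"2": 0.80, "0": 0.05, "3": 0.15},
    "3": {"3": 0.95, "1": 0.05},
}


def simulate_biography(n_periods, rng, start="0"):
    """Simulate one synthetic biography as a string of states."""
    state = start
    path = [state]
    for _ in range(n_periods - 1):
        targets = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][t] for t in targets]
        state = rng.choices(targets, weights=weights, k=1)[0]
        path.append(state)
    return "".join(path)


def is_consistent(sequence):
    """Check the assumed rule: no return to school after leaving it."""
    left_school = False
    for state in sequence:
        if state in "13":
            left_school = True
        elif left_school:
            return False
    return True


def state_distribution(biographies, period):
    """Share of simulated individuals in each state at a given period."""
    counts = Counter(b[period] for b in biographies)
    return {s: counts[s] / len(biographies) for s in sorted(counts)}


rng = random.Random(42)  # fixed seed so that the run is reproducible
biographies = [simulate_biography(24, rng) for _ in range(1000)]

# a) consistency check of the simulated results
inconsistent = [b for b in biographies if not is_consistent(b)]
# b) longitudinal presentation: individuals who left school in the first year
early_leavers = [b for b in biographies if "1" in b[:12] or "3" in b[:12]]

print(len(inconsistent), len(early_leavers))
print(state_distribution(biographies, period=23))
```

Rerunning the simulation with different transition probabilities and comparing the resulting state distributions would be a simple way of examining sensitivity to different hypotheses.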

Final Remarks
The representation of life courses as sequences and the related analytical methods described in this paper can serve as powerful tools for studying the demographic components of life courses, as they have in the study of work careers. We can expect that in the near future demographers will make greater use of sequence analysis in their research.
A few questions regarding the use of sequence analysis in demography still remain to be addressed (Wu, 2000). First, demographers rarely use microsimulation for projection purposes; as a result, there has so far been no effort to produce and study synthetic biographies, let alone to apply sequence analysis to them. Second, further research is needed to examine whether and how the time scale used for sequence representations distorts the results. The 'ideal' time scale depends on the topic under investigation and on the availability of data. Demographic data are generally available only on a yearly basis, whereas event histories are often available as monthly data; considerations of confidentiality, however, may place restrictions on the release of such data. This is one of the impediments that researchers interested in sequence analysis may face everywhere. Third, how can one sensibly deal with right censoring, which is almost always present when analysing life courses 8? Given that the length of sequences itself has an impact on the analyses, we need to distinguish clearly between censored sequences and short sequences. The problem then is how to perform a joint analysis of censored and non-censored sequences. Finally, it would be useful to connect the sequence analysis outlined in this paper to other, more explanatory types of longitudinal analysis, as has been done in genetics, where, for example, sequence analysis and Markov models are used as distinct tools for analysing outcomes and for investigating how such outcomes are generated, respectively (Waterman, 1995).
3. BioBrowser (ModGen Biography Browser) is a tool developed by Statistics Canada to supplement the ModGen language used for dynamic longitudinal microsimulation modelling. It allows the user to examine graphically, in different representations, the life course of an actor and of her/his close relatives. Even though it is not based on discrete time, the representation produced can also be used in such a framework. The software and the documentation can be downloaded at http://www.statcan.ca/english/spsd/model.htm.
4. This term does not necessarily correspond to the traditional notion of "cross-sectional" in demography.
5. In fact, van der Heijden refers to such data as 'event history data'.
6. I borrow the term "synthetic biography" from Frans Willekens.
7. Macrosimulation methods for demographic projection, such as those based on cohort-component methods or on more sophisticated multistate dynamics embedded in software like LIPRO (van Imhoff, 1994), provide pictures of the population at aggregate levels. They do not reflect the longitudinal evolution of (groups of) individuals, nor do they produce individual synthetic biographies; hence, sequence analysis cannot be applied to their results. Other methods that keep track of the evolution of individuals in a state space across time are also possible (e.g., more theoretically oriented agent-based models).
8. Some authors (e.g., Halpin and Chan, 1998) seem to argue that, in principle, censoring should not be considered a problem when one is dealing with techniques that can handle sequences of different length. However, the conceptual difference between having sequences of different length because one's labour or union career is short, on the one hand, and because we do not know how long it is, on the other, cannot be disregarded.
Figure 1. Tree Representation of the Monothetic Divisive Algorithm Application