EMPIRICAL STUDY
CLASSIC Utterance Boundary: A Chunking-Based Model of Early Naturalistic Word Segmentation

Abstract: Word segmentation is a crucial step in children’s vocabulary learning. While computational models of word segmentation can capture infants’ performance in small-scale artificial tasks, the examination of early word segmentation in naturalistic settings has been limited by the lack of measures that can relate models’ performance to developmental data. Here, we extended CLASSIC (Chunking Lexical and Sublexical Sequences in Children; Jones et al., 2021), a corpus-trained chunking model that can simulate several memory, phonological, and vocabulary learning phenomena, to allow it to perform word segmentation using utterance boundary information, and we have


Introduction
Word segmentation is a fundamental process in infant language development. Phonological word forms are not given a priori but must be extracted from continuous speech input. While several computational models have captured basic word segmentation phenomena displayed by infants in small-scale artificial tasks, assessing whether models can scale up to naturalistic inputs has been hampered by limited sets of measures against which to compare performance. We present a new word segmentation model that extends CLASSIC (Chunking Lexical and Sublexical Sequences in Children; Jones & Rowland, 2017; Jones et al., 2021; Jones, 2016; Jones, Justice, et al., 2020), a chunking model that uses naturalistic inputs to successfully simulate key developmental phenomena in memory and language. Our extended model, CLASSIC utterance boundary (CLASSIC-UB), performs unsupervised word segmentation using large-scale naturalistic inputs. Importantly, we have assessed our model against existing segmentation models using both standard evaluation metrics and novel developmental measures to provide a more comprehensive assessment of segmentation performance.
Chunking models successfully account for adult (e.g., Frank et al., 2010) and infant (e.g., French et al., 2011; Perruchet & Vinter, 1998) word segmentation in laboratory tasks by extracting and storing frequent input sequences (chunks) as candidate words that guide subsequent segmentation. This allows chunking models (e.g., Kurumada et al., 2013) to account for lexical effects in infant segmentation such as easier extraction of novel words when they are preceded by familiar words (e.g., Bortfeld et al., 2005). Lexical effects are not predicted by competing models that assume a dedicated mechanism that estimates the location of word boundaries in speech by tracking sublexical regularities, such as through forward and backward sound transitional probabilities (e.g., Cleeremans & McClelland, 1991; Saksida et al., 2016). Further, chunking also accounts for infants' sensitivity to sublexical regularities (e.g.,
Typically, computational investigations have used artificial language tasks to assess the plausibility of learning mechanisms involved in infant (e.g., French et al., 2011; Perruchet & Vinter, 1998) and adult word segmentation (e.g., Endress & Langus, 2017; Frank et al., 2010). Although modelers have also examined scale-up to naturalistic input (e.g., Daland & Pierrehumbert, 2011; Monaghan & Christiansen, 2010; Saksida et al., 2016), such investigations have suffered from one important limitation: The benchmark for models' segmentation accuracy has been the word boundaries present in adult vocabularies, but these word boundaries are unlikely to accurately reflect infants' and children's segmentation (e.g., Monaghan & Christiansen, 2010). In contrast, we have introduced new measures based on developmental data and specifically on the composition of children's early vocabularies. The key insight is that children's vocabularies should reflect early word segmentation processes: Word forms that are more easily discovered in the input should enter children's vocabulary earlier in development. We used these novel developmental measures alongside traditional evaluation measures to provide a much richer assessment of the developmental plausibility of word segmentation mechanisms. Specifically, we used this suite of measures to compare CLASSIC-UB to other models that have shown different strengths in modeling early naturalistic segmentation.

CLASSIC
Language Learning 73:3, September 2023, pp. 942-975
CLASSIC uses a domain-general chunking mechanism (Gobet et al., 2001) to model linguistic knowledge acquisition via experience with the sequential structure of the language. It is not a model of auditory perception or production per se (as basic processes that transfer information to the learning mechanism are not modeled) but a learning model representing performance increases derived from perceptual learning and efficiency in production (Jones, Justice, et al., 2020). The accumulation of language experience is essentially represented by the chunking of adjacent items, gradually shifting the model's representations from sublexical to lexical and multiword units. A key assumption in CLASSIC is that children already know how to identify word boundaries. This has been implemented in CLASSIC because past simulations have investigated phenomena at an age where children are likely to have already learned how to segment speech into words. We can illustrate how CLASSIC works using a simplified example in which the model repeatedly processes the phonetically transcribed utterance [d, ae, d | ɪ, z | k, ʌ, m, ɪ, ŋ]¹ (i.e., dad is coming) where | demarcates word boundaries that, as we explained above, are given as input to the model. CLASSIC first chunks adjacent phonemes that do not cross a word boundary and forms biphone representations: [dae, aed | ɪz | kʌ, ʌm, mɪ, ɪŋ]. Any learned chunks can subsequently be used to encode the input. For example, at the second iteration, the model would represent the utterance as [dae, d | ɪz | kʌ, mɪ, ŋ], that is, proceeding from left to right, it uses the longest available chunks to encode each demarcated word. This way of encoding preserves the input temporal structure and represents a proxy for the increased processing efficiency derived from acquired knowledge.² The model then continues to join adjacent chunks; for example, the third iteration would result in the representation [daed | ɪz | kʌmɪ, ŋ], where CLASSIC has learned two of three words in the utterance. When two adjacent chunks are words themselves, CLASSIC crosses word boundaries and learns multiword sequences (i.e., daed|ɪz in the example); thus, at the fourth iteration, CLASSIC would encode the utterance as a two-word sequence followed by a word: [daed|ɪz, kʌmɪŋ]. Finally, in a last iteration the model would represent the whole utterance as a single multiword chunk: [daed|ɪz|kʌmɪŋ].
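The worked example can be traced with a minimal sketch. It is not the published CLASSIC code: the function names (`encode`, `learn`, `simulate`) and the data layout are assumptions made for illustration, and the sketch implements only the two rules stated in the text, namely greedy left-to-right encoding with the longest available chunk and joining of adjacent encoded items, crossing a word boundary only when both items are whole words.

```python
BOUNDARY = "|"  # word-boundary symbol supplied with the input


def encode(utt, lexicon):
    """Encode `utt` left to right, always taking the longest known chunk.

    Returns (start, end) spans over `utt`; single phonemes are always usable.
    """
    spans, i, n = [], 0, len(utt)
    while i < n:
        if utt[i] == BOUNDARY:      # boundaries separate items; skip them
            i += 1
            continue
        for j in range(n, i, -1):   # try the longest candidate first
            if j - i == 1 or utt[i:j] in lexicon:
                spans.append((i, j))
                i = j
                break
    return spans


def is_whole_word_span(utt, start, end):
    """True if the span starts and ends exactly at word edges."""
    starts = start == 0 or utt[start - 1] == BOUNDARY
    ends = end == len(utt) or utt[end] == BOUNDARY
    return starts and ends


def learn(utt, spans, lexicon):
    """Join each pair of adjacent encoded items into a new, larger chunk."""
    for (a0, a1), (b0, b1) in zip(spans, spans[1:]):
        same_word = a1 == b0
        both_words = (is_whole_word_span(utt, a0, a1)
                      and is_whole_word_span(utt, b0, b1))
        if same_word or both_words:  # cross a boundary only between whole words
            lexicon.add(utt[a0:b1])


def simulate(utt, iterations):
    """Repeatedly encode the same utterance, learning after each pass."""
    lexicon, history = set(), []
    for _ in range(iterations):
        spans = encode(utt, lexicon)
        history.append(["".join(utt[s:e]) for s, e in spans])
        learn(utt, spans, lexicon)
    return history


utterance = tuple("d ae d | ɪ z | k ʌ m ɪ ŋ".split())
history = simulate(utterance, 5)
for step, items in enumerate(history, start=1):
    print(step, items)
```

Each printed line is the encoding used on that pass. The first pass uses single phonemes (the biphones are what is learned from it), and passes two to five correspond to the representations listed in the text for the second iteration onward.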
CLASSIC accounts for the role of sublexical, lexical, and multiword sequences in language development. For example, in Jones's (2016) study, incremental exposure to naturalistic speech supported CLASSIC's building up of chunks at different grain sizes, capturing 85% of variance in nonword repetition performance (a task closely related to vocabulary learning; e.g., Hoff et al., 2008) from six studies involving 2- to 6-year-old children. CLASSIC has also simulated vocabulary learning more directly (Jones et al., 2021). Similar to the way 2-3-year-old children learn to produce words, CLASSIC gradually learns longer, more infrequent words that have a smaller number of similar words in the language (i.e., lower neighborhood density) and higher internal predictability (i.e., higher average biphone probability or phonotactic probability). Jones et al. (2021) also showed that novel words entering children's productive vocabularies are more likely to share large phonological chunks with words that they already use, indicating a pivotal role for phonological knowledge in vocabulary learning. In sum, these studies have shown that sublexical knowledge can be used to learn and produce pseudowords and real words (see Baayen et al., 2019; Chuang et al., 2021, for similar conclusions using linear discriminative learning). Finally, Jones, Justice, et al. (2020) showed that phonological knowledge plays an important role in learning multiword sequences. CLASSIC captured the faster increase in children's short-term memory for digit over word sequences likely because chunks that span multiple digits are learned more quickly from random combinations of digits occurring in naturalistic speech. This study also showed how knowledge of multiword sequences facilitates lexical processing (e.g., processing of the individual items five and six becomes more efficient when the two are presented within a familiar multiword sequence five-six).
In sum, CLASSIC is a chunking-based model that has captured important developmental phenomena in word learning but has not yet been applied to word segmentation. We showed how CLASSIC can be extended to perform word segmentation, thus making the model more developmentally plausible: Infants must of course discover word forms before they can learn novel words and integrate them into their existing vocabulary (Newman et al., 2016).

CLASSIC-UB
To extend CLASSIC to perform word segmentation, we retained CLASSIC's architecture but removed word boundary information from the model input (i.e., the model was not constrained to chunk items within demarcated words). We also added utterance boundary information using positional markers (ê) that signal utterance start or end. Transcribers of the input corpora used in this study coded such positional markers based on various syntactic (e.g., utterances are centered around a main clause) and prosodic cues (e.g., pauses, intonation patterns distinguishing declarative, interrogative, or other clauses). Only written transcriptions were available for most of the input, not the original speech recordings, so it was not possible to automatically assign positional markers based on, for example, changes in phonetic features. Positional markers have been used in previous computational work (e.g., Aslin et al., 1996; Christiansen et al., 1998; Saksida et al., 2016) as a proxy for the increased saliency that phonological units at utterance boundaries gain in child-directed speech (e.g., Fernald & Mazzie, 1991). This has been modeled via conjunctive use of utterance-boundary markers and phonological units to perform distributional learning (e.g., utterance-boundary + syllable constitutes a pair of units for which transitional probabilities can be obtained; Saksida et al., 2016). In a similar way, CLASSIC-UB treats utterance-boundary markers as additional units that can be used to form chunks (i.e., a chunk becomes longer when an utterance-boundary marker is attached to a phonological sequence).
We present a version of CLASSIC-UB that uses utterance-final markers and a version that uses both initial and final markers. Infants may privilege utterance-final words (e.g., Aslin et al., 1996; Christiansen et al., 1998) because these gain perceptual prominence from syllable lengthening (Wightman et al., 1992) and sentential accent in English (Cinque, 1993). However, some studies have suggested that infants may use both initial and final markers in segmentation (Seidl & Johnson, 2006, 2008). In fact, different cues could facilitate segmentation of utterance-initial words (e.g., exaggerated amplitude, duration, pitch, and formant structure; Cruttenden, 1986). Therefore, the presence of initial markers should provide additional facilitation over utterance-final cues. We are not aware of any computational studies assessing the relative contribution of initial and final boundaries; thus, comparing CLASSIC-UB with final markers to CLASSIC-UB with both initial and final markers could shed light on the variables that facilitate word segmentation at utterance edges.
Figure 1 illustrates how CLASSIC-UB segments input after the input has been transcribed using the CMU Pronouncing Dictionary (Lenzo, 2007), which contains over 134,000 phonetic transcriptions of English words and provides an automatic way to convert large orthographic input into phonetic form using alphabetic codes for phonemes rather than IPA (e.g., AE instead of ae). When encoding the utterance-final biphone AED in the first utterance, the model learns the chunk with an associated utterance-final marker (i.e., AEDê). If the chunk AED appears in later utterances, even in word-medial positions, the model will recognize that it can be used in word-final position assuming a word boundary at this location (see the third utterance dad is coming). This also shows how the following phone IH is marked as "can begin a word" based on the model flagging AED as ending the preceding word DAED (bolded chunk of Figure 1). The same logic applies to utterance-initial markers. In essence, the function of the ê markers within chunks is akin to "this chunk can appear at the [beginning/end] of a word."
Like CLASSIC, CLASSIC-UB processes phonemic input. As such, it assumes that children already know phoneme categories in line with an early phonetic category learning approach (e.g., Werker, 2018) and previous computational studies in word segmentation (e.g., Batchelder, 2002; Daland & Pierrehumbert, 2011; Goldwater et al., 2009; but there are alternative approaches that we briefly refer to in the Discussion section). Knowledge of sound categories and co-occurrences of sounds might begin to develop at the same time or soon after infants start segmenting speech into words at around  (Lenzo, 2007).
As with CLASSIC, items that co-occur often will have more opportunities to be chunked together by CLASSIC-UB. This facilitates subsequent segmentation in two ways. First, when a word is frequent in the input, its sublexical components will have more opportunities to be chunked together, reaching a whole-word representation faster. This makes the model frequency sensitive, Supporting Information online for a detailed description of this model). Second, learning words that share phonological material with other words will be facilitated by the reuse of existing chunks (e.g., learning just can make the sequence ust available to subsequently learn crust). Other models, such as PUDDLE, do not include this mechanism and rely on frequency information alone.
The number and size of chunks change as more input is processed. CLASSIC-UB processes input incrementally (i.e., one utterance at a time), as do other segmentation models (e.g., French et al., 2011; Monaghan & Christiansen, 2010; Perruchet & Vinter, 1998). As Figure 1 shows, each utterance is encoded from left to right by using existing chunks present in the model lexicon. Consistent with previous chunking models (e.g., Batchelder, 2002; French et al., 2011; Perruchet & Vinter, 1998), preference is given to encoding larger chunks over shorter ones. For example, the chunk AED ê that contains a boundary marker is preferred over the shorter chunk AED that does not contain a boundary marker. At the same time, new/larger chunks are stored in the model lexicon by joining adjacent encoded items together, facilitating subsequent segmentation. This makes the learning process plausible because children's learning happens incrementally as a function of their accumulating knowledge of the language (e.g., Jones et al., 2021).
Crucially, selecting larger chunks over shorter ones means that chunks formed by sublexical sequences and utterance-boundary markers are dispreferred to words, thus avoiding oversegmentation. At the same time, the presence of utterance-boundary markers prevents the model from building large undersegmented chunks. Together, these two mechanisms favor segmentation at the (intermediate) word level. However, there is no explicit rule defining when the model should stop building chunks of increasing size. In fact, at later stages, the model stores multiword chunks, which is consistent with representation of multiword sequences from 11 months of age (e.g., Jones, Cabiddu, & Avila-Varela, 2020; Skarabela et al., 2021). Notably, such longer chunks can include multiple boundary markers, which means the model can represent multiword sequences while also retaining knowledge of the individual words composing a sequence. For example, an utterance such as I'll do it later could be encoded using the two chunks ê I'll ê do ê it ê and later ê. In sum, CLASSIC-UB learns chunks including both phonological and utterance-boundary information. Chunks gradually increase in size, facilitating subsequent segmentation.
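The role of an utterance-final marker can be illustrated with a deliberately reduced sketch. It is not CLASSIC-UB: it has no incremental chunk growth and no preference for longer chunks, and the function names and the fixed chunk size of two phonemes are assumptions made for illustration. It shows only the idea described above, that a chunk learned with an ê marker at the end of one utterance licenses a word boundary when the same chunk is met inside a later utterance.

```python
END = "ê"  # utterance-final marker, as in the text


def learn_final_chunks(utterances, size=2):
    """Store the last `size` phonemes of each utterance as an end-marked chunk."""
    return {tuple(u[-size:]) + (END,) for u in utterances if len(u) >= size}


def segment(utterance, marked_chunks, size=2):
    """Posit a word boundary wherever an end-marked chunk is recognised."""
    words, current = [], []
    for phoneme in utterance:
        current.append(phoneme)
        if tuple(current[-size:]) + (END,) in marked_chunks:
            words.append(current)   # the chunk "can end a word": close it here
            current = []            # the next phoneme "can begin a word"
    if current:
        words.append(current)
    return words


heard = [["D", "AE", "D"]]                  # an earlier utterance: "dad"
marked = learn_final_chunks(heard)          # contains the chunk AE D ê
later = ["D", "AE", "D", "IH", "Z", "K", "AH", "M", "IH", "NG"]
segmented = segment(later, marked)
print(segmented)
```

In this toy run the boundary after DAED is recovered from the end-marked chunk alone, and IH becomes the first phoneme of the following item, mirroring the "can begin a word" reading described for Figure 1.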

Evaluation of Naturalistic Speech Segmentation
Corpus-based evaluations of segmentation models usually compare models' output to segmented transcriptions of child-directed speech (e.g., Monaghan & Christiansen, 2010). Precision and recall are two widely used measures. Precision is the number of words segmented by a model divided by the total number of items segmented, including segmentation errors (i.e., how many of the items found are words). Recall is the number of words segmented by a model divided by the total number of words in the input (i.e., how many words present in the input are found). In these two measures, chunking models perform better than do models that segment speech randomly (e.g., Bernard et al., 2020; Monaghan & Christiansen, 2010), which is in line with results from computational studies capturing artificial language learning (e.g., French et al., 2011). For example, in Larsen et al.'s (2017) study, the chunking model PUDDLE showed the highest performance, reaching 82% for precision and 80% for recall. In contrast, another class of models that track sound transitional probabilities (see Appendix S1 in the Supporting Information online for a detailed description) perform better than the random baseline models (e.g., Bernard et al., 2020) but less well than chunking models (e.g., 43% for precision and 51% for recall in Larsen et al.'s, 2017, study). Although these measures capture how accurately models segment the input, they do not capture their developmental plausibility.
The use of segmented input to evaluate model performance makes the implicit assumption that infants segment speech in an adult-like way. However, as discussed by Larsen et al. (2017), this assumption is likely to be wrong, given evidence that infants' initial protolexicons contain words and frequent phonotactically legal nonword sequences (e.g., Ngon et al., 2013). Addressing this problem is not straightforward because how infants segment speech in naturalistic settings is not known. Larsen et al.'s (2017) solution was to link model accuracy to word age of acquisition. For example, dog was understood by a higher proportion of children at 13 months of age than was deer, and this should be reflected by a more accurate segmentation of dog than deer (i.e., dog is correctly segmented on more occasions). Theoretically, the reasoning behind using word learning as a proxy for segmentation performance is that vocabulary knowledge (word-meaning mapping) is facilitated by word segmentation (e.g., Estes et al., 2007; Hay et al., 2011). For example, in Estes et al.'s (2007) study, infants were able to extract, store, and recognize word forms previously presented in fluent speech to successfully perform a label-object association task. In sum, words that are acquired early must also be accurately segmented at earlier ages.
We also capitalized on the link between vocabulary knowledge and segmentation as suggested by Larsen et al. (2017), but instead of age of acquisition derived from parental reports, we used age of first production derived from child speech (Grimm et al., 2017). Looking at production rather than comprehension has drawbacks, but it also has important advantages. The words children produce are, of course, not a direct reflection of their segmentation abilities. Production involves additional variables related to recalling stored instances from the lexicon and to articulation, and, of course, what children spontaneously produce at the time of recording does not reflect the entirety of their comprehension vocabularies. Further, there are limitations inherent in estimating children's knowledge from a small number of relatively short samples of speech filtered through adult transcribers' potentially biased judgement (e.g., leading to the omission of nonlexical productions). Nevertheless, using production vocabularies has two key advantages. First, it dramatically increases the number of words examined: The British communicative development inventory (CDI; Alcock, 2020), a parent-report measure of age of acquisition, contains only 330 words,⁴ lacking sufficient statistical sensitivity. Second, we found that the CDI word sample has a word frequency distribution shifted toward high-frequency words not reflecting the Zipfian input that infants hear, that is, many low-frequency and few high-frequency word types (Hendrickson & Perfors, 2019).⁵ Using such a sample might bias results because transitional probability models might perform well only because the distribution considered is less skewed toward low-frequency words (Kurumada et al., 2013).
We have additionally proposed a new measure examining whether a model can capture word-level characteristics of child vocabularies. Previous measures did not examine whether a model capitalized on sublexical/lexical regularities (similarly to how learning is evaluated in laboratory settings). Traditional measures have focused on finding a mechanism that minimizes segmentation errors, while the age of acquisition/production measure is focused on the time course of acquisition. In contrast, with our final set of analyses, we assessed whether the characteristics of the vocabulary learned by a model reflected what children had produced in the language corpora. In other words, we assessed whether the models and children were sensitive to input characteristics in a similar way. We focused on three lexical measures (word frequency, word length, and neighborhood density) and one sublexical measure (phonotactic probability). These characteristics have explained approximately 50% of variance in word learning (Stokes, 2010, 2014; Storkel, 2009). Finally, although word comprehension as a marker of vocabulary growth has been predominant (e.g., Fernald & Marchman, 2012), the use of evaluation measures based on early production was reasonable given both the relation between early vocalizations and vocabulary growth (McGillion et al., 2017) and the relation between early segmentation abilities and later expressive vocabularies (Newman et al., 2006, 2016).
In summary, we asked whether a novel chunking account of word segmentation could scale up to naturalistic speech in a developmentally plausible way by comparing CLASSIC-UB to PUDDLE, a model that has shown high performance in traditional measures of naturalistic segmentation, and to backward and forward transitional probability models that might account for a high proportion of variance in child word knowledge (Larsen et al., 2017). We also asked whether utterance-initial edges play a role in segmentation beyond final edges by comparing two different implementations of CLASSIC-UB. Finally, we asked whether transitional probability models could capture developmental data better than chunking accounts by comparing PUDDLE to transitional probability models to test whether we had replicated previous results (Larsen et al., 2017) using different corpora and performance measures.

Computational Models
We compared CLASSIC-UB to forward and backward transitional probability (Saksida et al., 2016), PUDDLE (Monaghan & Christiansen, 2010), and a random baseline relying on a coin toss to place a boundary after each input unit (Lignos, 2012). A full description of these models can be found in Appendix S1 in the Supporting Information online. We implemented the models to process syllables or phonemes as basic units (see Appendix S2 in the Supporting Information online for details). Python and R scripts for preparing the input, running the models, and analyzing the output are available at the project's OSF page (https://doi.org/10.17605/osf.io/kbnep).
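The random baseline is simple enough to sketch directly. The version below is an illustrative assumption rather than the project's own script: the function name, the `p` and `seed` parameters, and the list-of-lists output format are all hypothetical, and only the coin-toss rule itself comes from the description above.

```python
import random


def random_baseline(units, p=0.5, seed=None):
    """Place a boundary after each unit with probability `p` (a coin toss at 0.5)."""
    rng = random.Random(seed)   # a private generator keeps runs reproducible
    items, current = [], []
    for unit in units:
        current.append(unit)
        if rng.random() < p:    # heads: close the current item here
            items.append(current)
            current = []
    if current:                 # the utterance end always closes the last item
        items.append(current)
    return items


phonemes = ["D", "AE", "D", "IH", "Z"]
print(random_baseline(phonemes, seed=1))
```

The same function works on syllables if `units` holds syllables instead of phonemes, which matches the two unit types the models were run on.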

Corpora
We used seven English corpora following Grimm et al.'s (2017) study (see Appendix S2 in the Supporting Information online for input preprocessing and characteristics). We downloaded the corpora from the CHILDES database (MacWhinney, 2000). As target input for the models, we considered only transcripts of children aged 2 years. While infants start segmenting speech much earlier than 2 years of age, our choice to focus on this age group was motivated by the much smaller size of corpora of speech directed at children of younger ages (e.g., 54,274 utterances at age 1 year vs. 604,000 utterances at age 2 years). As we show in Appendix S2 in the Supporting Information online, this limits the representativeness of input directed at children of younger ages. In total, the input to models contained 604,000 utterances (mean length of utterance = 4.39) from 332 different speakers, directed to 53 target children. Such input was 3 to 60 times larger than input used in previous studies (Christiansen et al., 1998; Daland & Pierrehumbert, 2011; Larsen et al., 2017; Monaghan & Christiansen, 2010; Saksida et al., 2016).

Measures of Model Performance

Precision and Recall
We compared the models' performance by looking at the pairwise differences in mean precision and recall scores (e.g., Monaghan & Christiansen, 2010). We tested the last 10,000 utterances of output because the models' performance was stable (see Figure 2) and because testing the entire output (i.e., 604,000) would have led to significant results even for trivial differences. We used a Welch's t test for unequal variances, with p values and bootstrap 95% confidence intervals corrected for multiple comparisons using Holm's correction.
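As a concrete reading of the two scores, the sketch below computes token-level precision and recall for one utterance. It assumes that a segmented item counts as a word only when both of its edges coincide with gold-standard word edges; that scoring rule, the function names, and the toy segmentation are illustrative assumptions, not details taken from the project's scripts.

```python
def word_spans(words):
    """Map a segmentation to the set of (start, end) positions of its items."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def precision_recall(gold, predicted):
    """Precision = hits / items segmented; recall = hits / words in the input."""
    gold_spans, predicted_spans = word_spans(gold), word_spans(predicted)
    hits = len(gold_spans & predicted_spans)   # items matching a gold word exactly
    return hits / len(predicted_spans), hits / len(gold_spans)


gold = [["D", "AE", "D"], ["IH", "Z"], ["K", "AH", "M", "IH", "NG"]]
predicted = [["D", "AE", "D"], ["IH", "Z"], ["K", "AH", "M"], ["IH", "NG"]]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here two of the four segmented items are words (precision 2/4) and two of the three input words are found (recall 2/3); averaging such per-utterance scores over a test window is one way to obtain mean scores like those compared above.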

Word Age of First Production
We used the mean length of utterance for transcripts as a proxy of word age of first production following Grimm et al.'s (2017) study (see Appendix S3 in the Supporting Information online for details). Mean length of utterance is a useful estimator of child gross linguistic skills (i.e., developmental stage), controlling for the fact that children with a similar age might be far apart in their language development. The sample contained 5,480 words. We fitted linear regression models predicting word age of first production as a function of the log10 number of times a target word was correctly segmented by each algorithm (Larsen et al., 2017). We weighted the number of times a word was correctly segmented by dividing it by input word frequency before fitting the regression models as the two variables correlated highly (e.g., for a random baseline, r = .92). Word frequency correlates highly with the age of word acquisition (e.g., Morrison et al., 1997), therefore failing to control for its effect might have led to results that were an artifact of frequency. Indeed, input frequency tended to strongly affect models' performance; for example, for the random model, the correlation between the number of correct segmentations and age of first production dropped from .58 to .20 after we controlled for frequency. Therefore, controlling for input frequency allowed us to assess the performance of each segmentation algorithm over and above the fact that words that appear more often are acquired earlier. Since previous studies had not used weighting by word frequency, we also included analyses for the unweighted measure in Appendix S6 in the Supporting Information online to facilitate comparison. To foreshadow our findings, differences between models were consistent when we used either the weighted or unweighted measure, with only one exception pertaining to transitional probability models that we address in the Discussion section. We based comparisons between models on pairwise differences in adjusted R² from the regression models; we bootstrapped the 95% confidence interval of the difference between coefficients and corrected the interval using Holm's correction (Grimm et al., 2017). We concluded that two coefficients differed significantly from one another if the corrected 95% confidence interval did not include 0.
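The predictor and the fit statistic can be sketched as follows. Two points are assumptions rather than details given in the text: the division by input frequency is applied before the log10 transform, and words that were never correctly segmented (for which the log is undefined) are simply not handled. The function names are hypothetical, and the adjusted R² formula is the standard one for a single predictor.

```python
import math


def weighted_predictor(correct_segmentations, input_frequency):
    """log10 of the share of a word's occurrences that were segmented correctly."""
    return math.log10(correct_segmentations / input_frequency)


def adjusted_r2(x, y):
    """Adjusted R^2 for a one-predictor least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)                 # squared Pearson correlation
    return 1 - (1 - r2) * (n - 1) / (n - 2)     # penalise for the one predictor


# A word heard 100 times and correctly segmented 10 times.
print(weighted_predictor(10, 100))
```

Dividing by frequency turns the raw count into a proportion, so a frequent word no longer scores highly merely because it offers more opportunities to be segmented correctly.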

Word-Level Measures
We compared the distributions of unique words discovered by each model to children's actual vocabulary (i.e., the words produced by children in the corpus) for phonemic length, word frequency, neighborhood density, and phonotactic probability. According to Jones et al. (2021), the distribution of words relative to sublexical and lexical characteristics should be similar between children and model if the model's learning mechanism is developmentally plausible. As in previous studies (e.g., Storkel, 2009; Swingley & Humphrey, 2018; Vitevitch & Luce, 1998), word length referred to the number of phonemes in a word; word frequency was the log10 frequency of a word across the input; phonotactic probability was the mean probability of a phoneme pair's appearing in a word; neighborhood density was the raw count of phonemic words that differed from a target word by one phoneme (i.e., by deletion, insertion, or substitution). We left phonotactic probability and neighborhood density unmarked for stress to be consistent with previous work (e.g., Storkel, 2009; Swingley & Humphrey, 2018).
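Neighborhood density as defined above can be computed with a short sketch. Words are represented as tuples of phoneme codes; the function names and the toy lexicon are assumptions made for illustration only.

```python
def is_neighbor(a, b):
    """True if `a` and `b` differ by exactly one phoneme
    (one substitution, one deletion, or one insertion)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):                        # substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)   # deletion or insertion
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def neighborhood_density(target, lexicon):
    """Raw count of word types in `lexicon` that are neighbors of `target`."""
    return sum(is_neighbor(target, word) for word in set(lexicon))


toy_lexicon = [
    ("K", "AE", "T"),        # cat (the target itself is never counted)
    ("B", "AE", "T"),        # bat: substitution
    ("AE", "T"),             # at: deletion
    ("K", "AE", "S", "T"),   # cast: insertion
    ("K", "AE", "B"),        # cab: substitution
    ("D", "AO", "G"),        # dog: not a neighbor
]
density = neighborhood_density(("K", "AE", "T"), toy_lexicon)
print(density)  # 4
```

Word length in this representation is simply `len(word)`, so the same tuples can feed both measures.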
We carried out a chi-square goodness of fit test to compare observed probabilities of a word's being of a certain length (in the output of a segmentation model) to the expected probabilities in children's utterances; we focused on lengths of two to eight phonemes due to the low number of words at other phonemic lengths. We defined probabilities as the proportion of types at each length. We then looked at the pairwise differences in chi-square test statistics, using bootstrap confidence intervals as we described in the previous section. In other words, this analysis first looked at how close each model was to children's performance and then used the estimates of such distance to compare models to one another. For word frequency, neighborhood density, and phonotactic probability, which are continuous measures, we followed a similar procedure to the one that we used for word-level measures, but we used a Kolmogorov-Smirnov test statistic. Following Piantadosi et al.'s (2012) study, we divided each of these measures by word length. Word length tends to be anticorrelated with word frequency (e.g., Zipf, 1936) and neighborhood density (Storkel, 2004) and positively correlated with phonotactic probability (Storkel, 2004). In our dataset, the correlations varied from moderate to strong: length and frequency (r_s = −.37), length and neighborhood density (r_s = −.86), and length and phonotactic probability (r_s = .42).
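The word-length comparison can be sketched as a goodness-of-fit statistic over type counts. The sketch assumes that expected counts are obtained by scaling the children's length proportions to the number of model types, and it skips any length that no child word type has; both choices, like the function names, are illustrative assumptions rather than details confirmed by the text.

```python
from collections import Counter

LENGTHS = range(2, 9)  # two to eight phonemes, as in the text


def length_counts(word_types):
    """Count word types (not tokens) at each phonemic length of interest."""
    return Counter(len(w) for w in set(word_types) if len(w) in LENGTHS)


def chi_square_statistic(model_types, child_types):
    """Distance between the model's length distribution and the children's."""
    observed = length_counts(model_types)
    child = length_counts(child_types)
    n_model, n_child = sum(observed.values()), sum(child.values())
    statistic = 0.0
    for k in LENGTHS:
        expected = n_model * child[k] / n_child   # child proportion, rescaled
        if expected > 0:                          # lengths absent in children are skipped
            statistic += (observed[k] - expected) ** 2 / expected
    return statistic


child_types = [("A", "B"), ("A", "B", "C")]
model_types = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "B", "C")]
print(chi_square_statistic(model_types, child_types))  # 1.0
```

A smaller statistic means the model's vocabulary is closer to the children's in its length profile, which is the quantity the pairwise model comparisons operate on.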

Results
We first report results for precision/recall and age of first production and finally for word-level measures. For ease of readability, in each subsection we give only a discursive presentation of key results and point to statistical results in the appendices in the Supporting Information online. We have included both CLASSIC-UB final and CLASSIC-UB initial-final in this section; however, for reasons of space, we have provided a discursive comparison between the two models in Appendix S11 in the Supporting Information online.

Precision and Recall
All models showed rapid learning (see Figure 2), reaching a ceiling in performance after approximately 40,000 utterances and indicating that the quantity of the input did not affect their performance (consistent with Daland & Pierrehumbert, 2011). We have provided pairwise statistical comparisons for the models in Appendix S4 in the Supporting Information online. All models segmented the input above chance (baseline), except for the transitional probability models when the input was syllabified (see Panel B in Figure 2 and Appendix S4 in the Supporting Information online).
In line with Larsen et al.'s (2017) findings, PUDDLE showed the best performance, outperforming the baseline, transitional probability, and CLASSIC-UB models.When we used phonemic input, PUDDLE found 73% of items were words for the precision measure and 79% of items were words for the recall measure.This model's accuracy was higher when segmenting syllabified input, reaching 85% for the precision measure and 89% for the recall measure.CLASSIC-UB's performance lay between the PUDDLE and the transitional probability models, with CLASSIC-UB initial-final reaching 50% for precision and recall with phonemic input, and 66% for precision and 58% for recall with syllabified input.
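For readers unfamiliar with these two measures, the sketch below shows one common way of scoring them at the token level: a proposed word counts as correct only when both of its boundaries match a word in the gold-standard segmentation. The function names and the span-matching criterion are assumptions for illustration and may differ in detail from the scoring used in the study.

```python
def word_spans(words):
    """Convert a segmented utterance (a list of words, each a sequence of
    units) into a set of (start, end) positions over the unsegmented input."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_precision_recall(gold_utterances, predicted_utterances):
    """Precision = correct tokens / proposed tokens;
    recall = correct tokens / gold-standard tokens."""
    correct = proposed = gold_total = 0
    for gold, predicted in zip(gold_utterances, predicted_utterances):
        gold_spans = word_spans(gold)
        predicted_spans = word_spans(predicted)
        correct += len(gold_spans & predicted_spans)
        proposed += len(predicted_spans)
        gold_total += len(gold_spans)
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall
```

Under this scoring, a model that oversegments is penalized mainly on precision, whereas a model that undersegments is penalized mainly on recall.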

Language Learning 73:3, September 2023, pp. 942-975

Note. Heteroskedasticity-robust standard errors were computed using an HC2 estimator. The 95% confidence intervals indicate lower and upper limits of bootstrap confidence intervals around the estimate based on 1,000 iterations. Holm's correction was applied by expanding the confidence intervals.
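The bootstrap intervals mentioned in the note can be illustrated with a generic percentile bootstrap. This is a sketch under stated assumptions, not the procedure used in the study: the function name, the fixed seed, and the percentile method are choices made for the example, and the Holm adjustment is not implemented (it could only be approximated here by passing a smaller `alpha` for some comparisons).

```python
import random


def bootstrap_ci(data, statistic, n_iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic(data)`.

    Items are resampled with replacement `n_iterations` times; the interval
    limits are the alpha/2 and 1 - alpha/2 quantiles of the resampled
    estimates.
    """
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_iterations):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower_index = int((alpha / 2) * n_iterations)
    upper_index = min(n_iterations - 1, int((1 - alpha / 2) * n_iterations))
    return estimates[lower_index], estimates[upper_index]
```

To compare two models, `statistic` could return the difference between their estimates on the resampled items; an interval that excludes zero would then indicate a reliable difference.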
Overall, the models segmented naturalistic speech above chance. However, while traditional measures examined models' accuracy, they told us nothing regarding whether a model's segmentations reflected how infants segment speech, and we were not able to make any claim regarding the plausibility of one model compared to another. To address this issue, we turned to the next set of measures that related model performance to child data.

Word Age of First Production
Table 1 shows the adjusted R² estimates for all linear regression models. Although the sizes of the estimates were small, they were in line with the results of Larsen et al. (2017), who, for example, showed that PUDDLE explained .067 of variance in child age of acquisition.⁶ After carrying out all pairwise comparisons between adjusted R² estimates (see Appendix S5 in the Supporting Information online), we found that only CLASSIC-UB initial-final, CLASSIC-UB final, and PUDDLE (and only when we ran the models on phonemic input) outperformed the baseline at predicting word age of first production. Surprisingly, when the models were run on syllabic input, none of them passed the baseline test (see Appendix S5 in the Supporting Information online). We discuss this unexpected finding in Appendix S13 in the Supporting Information online. Also, the results that we have reported above were based on weighting the predictor measure by frequency as we explained in the Method section. We have reported the results for the unweighted measure in Appendix S6 in the Supporting Information online.
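Because the comparison with Larsen et al.'s (2017) estimates rests on adjusted R² and R² being nearly identical in large samples, the standard adjustment formula is sketched below. The function name and the sample sizes in the usage note are hypothetical and are not taken from the study.

```python
def adjusted_r2(r2, n_observations, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_observations - 1) / degrees_of_freedom
```

With one predictor and a hypothetical sample of 10,001 words, an R² of .08 is adjusted by less than .001, whereas with only 12 words the same R² would be adjusted to below zero. This is why the two estimates can be treated as comparable only when the sample is large.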
Crucially, while CLASSIC-UB had lower precision and lower recall scores compared to PUDDLE (see Figure 2), the two models explained the same proportion of variance in child word age of first production (about 8%), suggesting that achieving lower segmentation accuracy might not necessarily lead to lower developmental plausibility. Nevertheless, age of first production did not consider the characteristics of the model's vocabulary, nor did it answer questions about whether the models and children are sensitive to similar sublexical and lexical characteristics. The following fine-grained word-level measures addressed these questions.

Word-Level Measures
In line with the previous analysis, the models approximated children's vocabularies better than the baseline only when we ran them on phonemic input. Therefore, in the following sections we report results for the phonemic analysis. We have included the results of the syllabic analysis in Appendices S7-S10 in the Supporting Information online, and we also discuss this finding in Appendix S13 in the Supporting Information online.

Phonemic Length
Qualitatively, all models learned more short than long words (see Figure 3) as children do (e.g., Storkel, 2009). However, CLASSIC-UB (both initial and initial-final) approximated the proportion of long words learned by children better than either PUDDLE or the transitional probability models did. The two CLASSIC-UB models were also the only ones to outperform the baseline (see Appendix S7 in the Supporting Information online). Finally, PUDDLE's performance at approximating children's vocabularies by phonemic length did not differ from forward and backward transitional probability models.

Word Frequency
Children's vocabularies are Zipfian like the input that they receive (e.g., Hendrickson & Perfors, 2019), and as such their vocabularies contain more low frequency words than high frequency words. We found no significant difference between PUDDLE and CLASSIC-UB at approximating child vocabularies by word frequency (see Figure 4 and Appendix S8 in the Supporting Information online), but chunking models outperformed transitional probability models. This result was in line with empirical evidence showing that chunking models are better than transitional probability models at capturing lexical effects (e.g., Frank et al., 2010).

Neighborhood Density
In line with the fact that the majority of words in the language have zero or few lexical neighbors (e.g., Vitevitch, 2008), child vocabularies are populated by a high number of low-neighborhood words. On this measure, only CLASSIC-UB final outperformed the baseline at approximating child vocabularies by neighborhood density, and this model performed significantly better than all other models (see Figure 5 and Appendix S9 in the Supporting Information online).
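To make the measure concrete, the sketch below counts lexical neighbors under the widely used one-phoneme definition (words that differ from the target by a single phoneme substitution, addition, or deletion) and then divides the count by phonemic length. The definition, the function names, and the toy lexicon in the usage note are assumptions for illustration; the study's own computation may differ.

```python
def is_neighbor(word_a, word_b):
    """True if the two phoneme tuples differ by exactly one substitution,
    addition, or deletion."""
    if word_a == word_b:
        return False
    len_a, len_b = len(word_a), len(word_b)
    if abs(len_a - len_b) > 1:
        return False
    if len_a == len_b:
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(word_a, word_b)) == 1
    shorter, longer = (word_a, word_b) if len_a < len_b else (word_b, word_a)
    # Lengths differ by one: removing one phoneme from the longer word
    # must yield the shorter word (addition or deletion).
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def neighborhood_density(word, lexicon):
    """Number of word types in `lexicon` that are neighbors of `word`."""
    return sum(is_neighbor(word, other) for other in lexicon)


def normalized_density(word, lexicon):
    """Neighborhood density divided by the word's phonemic length."""
    return neighborhood_density(word, lexicon) / len(word)
```

In a toy lexicon containing cat, bat, cats, at, and dog, the word cat has three neighbors (bat, cats, at) and dog has none.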

Phonotactic Probability
As Figure 6 shows, child vocabularies are populated by words with low internal predictability (e.g., Storkel, 2009). All models were equally good at approximating child vocabularies, in line with evidence showing that both chunking and transitional probability models are sensitive to sublexical regularities in the speech input. However, the models' performance did not differ statistically from the baseline model (see Appendix S10 in the Supporting Information online), suggesting that this measure might not have provided sufficient sensitivity for evaluating segmentation models.

Discussion
We compared CLASSIC-UB, a word segmentation model that uses naturalistic input, to another chunking model (PUDDLE) as well as to nonchunking accounts of word segmentation. We broadened the assessment of model developmental plausibility by introducing new measures that related model performance to child corpus data. We found that CLASSIC-UB acquired a vocabulary that more closely captured child vocabularies than did all other models; for example, both children and CLASSIC-UB learned a higher proportion of long and low-neighborhood words compared to other models. We discuss each of these findings in turn.

Measures of Developmental Plausibility
In line with Larsen et al.'s (2017) study, we found that the results of traditional evaluation measures can be inconsistent with those of measures based on child speech. In fact, overall, CLASSIC-UB performed better than PUDDLE at predicting measures based on child speech despite segmenting approximately 30% fewer word tokens. One reason for this finding might be that traditional measures represent an adult benchmark. Infants might not segment speech into the same units as adults but might, at least initially, segment and store a protolexicon made of both word and frequent nonword units (Ngon et al., 2013). This is also consistent with different accounts (e.g., Cutler et al., 2012; Pinker, 1994) that have predicted that learners should commit segmentation errors based on the same cues that allow them to segment speech (e.g., rhythmic structure of the language, possible-word constraint, phonotactic constraints). Although researchers still do not know which specific errors (and, more importantly, in which proportion) infants make when segmenting naturalistic speech over the course of development, our findings nevertheless suggest that carrying out an in-depth examination of the kind of vocabulary built by models might be a first step toward assessing models' developmental plausibility.
In Larsen et al.'s (2017) study, transitional probability models explained a higher proportion of variance in age of acquisition than did chunking models. Using our adapted production measure, we showed that this result might depend on controlling for the role of word frequency. Namely, if one controls for frequency, transitional probability models do not actually perform above chance (see transitional probability models vs. the baseline model in Appendix S5 in the Supporting Information online). This means that the higher performance of transitional probability models might be largely driven by input frequency. This finding is not dependent on using a production measure; in a supplementary analysis (see CDI addendum in the project's OSF profile), we examined the models' ability to predict age of acquisition based on the UK CDI (a comprehension-based measure). When the comprehension measure was not frequency-weighted, we replicated Larsen et al.'s (2017) results. But importantly, when the measure was frequency-weighted, CLASSIC-UB again performed better than the other models (consistent with the production-based analyses reported here).

We suggest that our proposed set of word-level measures might provide a richer and more nuanced method for evaluating the developmental plausibility of segmentation models. First, findings from word-level measures were in line with the age of first production results, with chunking models outperforming transitional probability models and with models run on syllabified input performing at chance (see Appendices S7-S10 in the Supporting Information online). In line with previous findings capturing in-laboratory data (e.g., French et al., 2011; Kurumada et al., 2013), word-level measures also showed that, while both transitional probability and chunking models closely approximated child vocabularies at the sublexical level (phonotactic probability), chunking models performed better when lexical measures were considered (word length, word frequency, neighborhood density).
Second, word-level measures provided a more detailed test of the models' lexical characteristics, highlighting performance differences that might be attributed to architectural differences across models. Indeed, CLASSIC-UB's learning mechanism facilitated the discovery of words that overlap phonologically with previously discovered words. This allowed the model to approximate a greater proportion of children's long/low-neighborhood words than did competing models (see Figures 3 and 5). Therefore, uniquely relying on mechanisms that privilege highly probable sequences (e.g., PUDDLE, transitional probability models) makes it difficult to capture a portion of long/low-neighborhood words that are generally more difficult to learn but that children nevertheless learn and that CLASSIC-UB can learn by exploiting phonological overlap. Interestingly, this feature of CLASSIC's learning mechanism also means that the model can account for nonword repetition effects (Jones, 2016) that are due to phonological overlap across word and nonword sequences. Similarly, it is possible that CLASSIC-UB captures additional processes of storage and recall involved in word production (i.e., going beyond aspects of segmentation) and that this sensitivity explains its superior performance in approximating the characteristics of children's productions.
Although CLASSIC-UB more accurately represented the make-up of children's early lexicons, its accuracy in segmenting words was not quite as good as that of PUDDLE (i.e., PUDDLE has a larger vocabulary). One could therefore argue that, at earlier stages in PUDDLE's learning, word-level characteristics may match those of CLASSIC-UB and that it is only the subsequent increase in PUDDLE's vocabulary that skews the distribution of the word-level characteristics. We conducted additional analyses (see Appendix S12 in the Supporting Information online) to evaluate this possibility. These analyses showed that differences in vocabulary size did not explain the differences in word-level measures.
Finally, to support our claim regarding the role of overlapping phonological sequences in CLASSIC-UB, we conducted an additional exploratory analysis showing that CLASSIC-UB's ability to better approximate children's vocabulary in word length and neighborhood density increased as word frequency increased (see Appendix S12 in the Supporting Information online). This is in line with recent work showing that frequent words are more likely to share phonological material with previously learned words, therefore boosting child learning compared to learning less frequent words (Jones et al., 2023). Our result was also in line with evidence showing an effect of overlapping phonological sequences on vocabulary learning at around 2 years of age (e.g., Jones et al., 2023; Stokes, 2010; Storkel, 2009) but no effect at 12-15 months (Swingley & Humphrey, 2018), suggesting that children first build a diverse repertoire of phonological chunks that later boost word learning (for a computational test of this idea using CLASSIC, see Jones & Rowland, 2017).
Overall, our results speak in favor of models that exploit phonological overlap between sequences in word segmentation (e.g., French et al., 2011; Perruchet et al., 1998) and add to previous work which highlighted the significant role of the overlap between sequences in word processing and acquisition (Gathercole, 1995; Jones et al., 2021).

Limitations and Future Directions
We have shown that chunking might play a significant role in early word segmentation by comparing our new chunking-based segmentation model CLASSIC-UB to two other influential models: transitional probability and PUDDLE models. However, there are additional models that we did not consider. One important class of Bayesian models assumes that infants formulate hypotheses on the possible segmentations of utterances, ultimately preferring those segmentations that contain few frequent and short chunks (e.g., Goldwater et al., 2006, 2009). Another account is that infants form chunks based on both frequency and transitional probabilities (forward and backward) of syllable sequences, such as through mutual information-based clustering (Swingley, 2005). Given that these accounts are primarily driven by frequency information, future comparisons to CLASSIC-UB are important for supporting our conclusion that phonological overlap between sequences plays a role in the segmentation process in addition to frequency. Such comparisons would also be important because one influence does not exclude the other. As we argued above, CLASSIC-UB's encoding efficiency uniquely increased when items became connected to others, that is, the more opportunities to chunk sublexical items the faster lexical representations were formed. However, once CLASSIC-UB has extracted a word representation from the input, it could further benefit from tracking its frequency in the input (e.g., see Jones, Justice, et al., 2020, for how a frequency-tracking mechanism might improve CLASSIC's performance). Moreover, it is highly likely that early naturalistic segmentation involves the use of a combination of cues. Indeed, the results of this study indicate that chunking alone might not be enough to discover items that are very long (Figure 3), occur very infrequently (Figure 4), receive no facilitation from word neighbors (Figure 5), and are made up of improbable sequences of sounds (Figure 6). This suggests that CLASSIC-UB might need to have access to additional cues to word boundary to be able to account for children's ability to learn these words. We know that infants use a wide range of cues when segmenting speech such as prosodic salience of phrase edges (Gout et al., 2004), alternative ways to pronounce specific phonemes (i.e., allophonic variation; Hohne & Jusczyk, 1994), stress patterns (Jusczyk et al., 1999), degree of coarticulation of speech sounds (Johnson & Jusczyk, 2001), and others. Such cues could be considered in future work.
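The frequency-based statistics referred to in this section (forward and backward transitional probabilities, and mutual information between adjacent units) can be sketched as follows. This is a textbook formulation given for orientation only; it is not the implementation of any of the models compared here, and the function names and the toy corpus in the usage note are assumptions.

```python
import math
from collections import Counter


def bigram_statistics(utterances):
    """Count units and adjacent unit pairs; pairs never span utterances."""
    unigrams, bigrams = Counter(), Counter()
    for utterance in utterances:
        unigrams.update(utterance)
        bigrams.update(zip(utterance, utterance[1:]))
    return unigrams, bigrams


def forward_tp(a, b, unigrams, bigrams):
    """P(b | a) estimated as count(ab) / count(a).

    Simplification: count(a) includes utterance-final occurrences of a.
    """
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0


def backward_tp(a, b, unigrams, bigrams):
    """P(a | b) estimated as count(ab) / count(b)."""
    return bigrams[(a, b)] / unigrams[b] if unigrams[b] else 0.0


def pointwise_mi(a, b, unigrams, bigrams):
    """log2 of p(ab) / (p(a) * p(b)); minus infinity for unseen pairs."""
    if not bigrams[(a, b)]:
        return float("-inf")
    p_ab = bigrams[(a, b)] / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))
```

In a toy corpus such as "pre tty ba by", "pre tty dog", "ba by dog", the within-word pair (pre, tty) receives a forward transitional probability of 1.0, whereas the pair (tty, ba), which spans a word boundary, receives 0.5. A dip of this kind is the cue that transitional probability models treat as a likely boundary.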
An alternative (and not mutually exclusive) possibility is that long, infrequent items with few neighbors might be learned via generalization of linguistic structures at different levels, including the syntactic level (Lippeveld & Oshima-Takane, 2020). For example, in Abend et al.'s (2017) study, an ideal Bayesian learner performed one-shot learning (i.e., formation of new word representations from a single exposure) by leveraging the mapping of words to their syntactic categories. Examining the role of syntactic categories would be important in future work as infants' development of grammatical knowledge appears to start in parallel with the acquisition of phonology and the lexicon (e.g., Marino et al., 2020).
Aside from our focus on a single word segmentation cue, another limitation is that we did not consider the models' ability to capture the role of additional variables in word segmentation and learning. For example, Swingley and Humphrey (2018) showed that word concreteness, word frequency in isolation (i.e., frequency with which a word occurs in a single-word utterance), and syntactic category predict word learning at 12 and 15 months of age. These predictors could be included in the statistical models of age of acquisition/production alongside our word-level predictors to see how they moderate models' accuracy (i.e., number of correct word segmentations). Alternatively, our word-level evaluation measure could be extended to examine whether segmentation models can capture the distributions of these additional word-level features in children's vocabularies. We would expect models to better capture characteristics to which they are sensitive, for example, in the sense that chunking models would show sensitivity to word frequency in isolation (Kurumada et al., 2013). Moreover, including these additional variables would be important because they differently impacted word comprehension and production in Swingley and Humphrey's (2018) study; word concreteness only predicted word comprehension, and the effect of word frequency in isolation was moderated by syntactic category type only in word comprehension. Although we have highlighted limitations in using comprehension measures to investigate how well segmentation models perform, methods that look at comprehension and production should be considered complementary. Comparing comprehension and production would also allow researchers to test the extent to which CLASSIC-UB captures processes that are uniquely involved in production (such as recall and articulation).
We would also like to highlight limitations deriving from the use of phoneme-based input adopted in our study. The models did not have to deal with the complex problem of gradually abstracting phonological categories. Under an early phonetic learning approach (e.g., Werker, 2018), infants have to learn the relations between different realizations of phonemes based on contextual variation or lexical contrast (e.g., aspirated stops and unreleased stops are allophones of the phoneme /t/). Addressing this limitation in future work is important for increasing the developmental plausibility of the investigations. Alternatively, under more recent approaches, the goal of infant speech perception may not be to learn discrete phonetic categories but instead be to represent continuous dimensions of raw speech (e.g., spectral energy) that are relevant to the native language (i.e., perceptual space learning; Feldman et al., 2021; McMurray, 2022). This implies that future work would need to consider more gradient units of speech perception. For example, recent work by Schatz et al. (2021) showed that a distributional learner can learn to discriminate phonetic contrasts by clustering auditory features into categories that are significantly smaller and more variable than traditional phonetic categories. Finally, we acknowledge that the early phonetic learning approach used in our work was also in contrast to other accounts that do not assume phonemes as basic units of perception, for example, work that has argued for gradient units dependent on the temporal unfolding of speech (e.g., Browman & Goldstein, 1992; Bybee, 2001; Mowrey & Pagliuca, 1995; Port & Leary, 2005) or others that have argued for features or morphophonemic forms (e.g., Chomsky & Halle, 1965; Postal, 1968).

Conclusion
Our goal in this study was to test whether a chunking-based mechanism that has previously been successful in capturing early vocabulary learning might play a significant role in infant word segmentation. We then constructed CLASSIC-UB, which forms chunks of phonological and utterance-boundary material.
Our simulations make three important contributions: They provide evidence that (a) utterance boundaries carry useful information for word segmentation, (b) age of production and word-level measures can sensibly be used to evaluate model performance, and (c) CLASSIC can be augmented to form the segmentation model CLASSIC-UB, consistent with the hypothesis that chunking might be an important mechanism in early naturalistic word segmentation.
Final revised version accepted 9 December 2022

Open Research Badges
This article has earned Open Data and Open Materials badges for making publicly available the digitally-shareable data and the components of the research methods needed to reproduce the reported procedure and results. All data and materials that the authors have used and have the right to share are available at https://doi.org/10.17605/osf.io/kbnep. All proprietary materials have been precisely identified in the manuscript.

Notes
1 For ease of exposition, the example uses IPA phonetic transcription. However, in our simulations, we used a transcription based on the CMU Pronouncing Dictionary (Lenzo, 2007; see an example in Figure 1).
2 However, CLASSIC's encoding does not allow partial activation of chunks unlike in Baayen et al.'s (2011) study.
3 Interestingly, when Larsen et al.'s (2017) measure was used, transitional probability models performed better than chunking models despite their discovering fewer words in the input as we mentioned above. For example, a transitional probability model explained 19% of variance in age of acquisition (the highest performance in the study), while the chunking model PUDDLE explained only 7% (Larsen et al., 2017).
4 [...] present in the child-directed input that the segmentation models received (i.e., CDI words that the models had the opportunity to learn).
5 A discussion about the effect of sample size reduction when using the age of acquisition measure from the CDI can be found in the file CDI_addendum at the project's OSF page (https://doi.org/10.17605/osf.io/kbnep).
6 Adjusted R² estimates cannot typically be directly compared to R² estimates. However, because of our large sample size, adjusted R² and R² estimates and confidence intervals were identical, allowing us to compare our adjusted R² estimates to Larsen et al.'s (2017) R² estimates. In fact, as sample size increases, expected R² estimates become less biased and approach adjusted R² unbiased estimates of the population explained variance (Karch, 2020).

Figure 1 CLASSIC-UB generalization of utterance-boundary markers to utterance-medial position. Solid lines indicate grouping of adjacent items into single chunks and storage into the lexicon. Dashed lines indicate use of stored chunks to segment speech. Lines are shown only for the first utterance. Time indicates independent presentations of new child-directed utterances. All English phonemes were present in the lexicon but are not shown for reasons of space. The transcription used was based on the CMU Pronouncing Dictionary (Lenzo, 2007).

Figure 2 Mean precision and recall performance with phonemic (Panel A) and syllabic (Panel B) input. The figure shows the random baseline, backward transitional probability (BTP) and forward transitional probability (FTP), CLASSIC-UB with utterance-final and initial-final markers, and PUDDLE. Performance was averaged every 1,000 utterances (Stage). Only the first 120 stages are shown to better appreciate changes in performance and because the performance of the models was stable. Grey confidence bands indicate the 95% confidence interval around the mean.

Figure 3 Proportion of word types produced by children and discovered by each model by phonemic length when phonemic input was used.

Figure 4 Gaussian kernel density estimate of the distribution of unique words in children's speech (Children) and discovered by each model, by log10 word frequency (weighted by dividing a word frequency value by its phonemic length).Phonemic input was used.The area under each curve represents 100% of data points.Curve peaks represent the mode of each distribution.

Figure 5 Distribution of unique words in child speech (Children) and discovered by each model, by neighborhood density (weighted by dividing a word neighborhood density value by its phonemic length).Phonemic input was used.

Figure 6 Distribution of unique words in child speech (Children) and discovered by each model by phonotactic probability (weighted by dividing a word phonotactic probability value by its phonemic length). Phonemic input was used.