Sex Differences in the Classification of Conduct Problems: Implications for Treatment

Conduct problem behaviors are highly heterogeneous symptom clusters, creating many challenges in investigating etiology and planning treatment. The aim of this study was, first, to identify distinct subgroups of males and females with conduct problems using a data-driven approach and, second, to investigate whether these subgroups differed in treatment outcome after an evidence-based crime prevention program. We used a latent class analysis (LCA) in Mplus to classify 517 males and 354 females (age 6–11) into classes based on the presence of conduct disorder or oppositional defiant disorder items from the Child Behavior Checklist. All children were then enlisted into the 13-week group core component (children and parent groups) of the program Stop Now And Plan (SNAP®), a cognitive-behavioral, trauma-informed, and gender-specific program that teaches children (and their caregivers) emotion-regulation, self-control, and problem-solving skills. The LCA revealed four classes for males, which separated into (1) “rule-breaking,” (2) “aggressive,” (3) “mild,” and (4) “severe” conduct problems. While all four groups showed a significant improvement following the SNAP program, they differed in the type and magnitude of their improvements. For females, we observed two classes of conduct problems that were largely distinguishable based on severity of conduct problems. Participants in both female groups significantly improved with treatment, but did not differ in the type or magnitude of improvement. This study presents novel findings of sex differences in clustering of conduct problems and adds to the discussion of how to target treatment for individuals presenting with a variety of different problem behaviors.


Introduction
Conduct problems in childhood, characterized by defiant, rule-breaking, or violent behaviors, present a major burden for families and are associated with a myriad of negative consequences (Colman et al. 2009). Without successful intervention, these children are at a heightened risk for psychiatric problems later in life, leading to greater unemployment and economic difficulties, higher rates of alcohol and drug abuse, and increased risk of suicide (Baker 2013). Children who engage in violent or chronic behavioral problems from a young age are also more likely to engage in criminality later in life (Howell et al. 2014).
While several effective psycho-social interventions for conduct problems exist, unsurprisingly, some children are able to benefit more than others. One likely explanation is that conduct problems encompass a wide range of behaviors that in various combinations can make up a diagnosis of conduct disorder (CD) or oppositional defiant disorder (ODD) or otherwise lead to trouble at home, in school, or with the police. Although DSM-5 distinguishes CD and ODD as two separate disorders (the former is characterized by physical violence and delinquency and the latter by oppositionality and irritability), the two disorders greatly overlap (Lahey and Waldman 2012;Maughan et al. 2004). This overlap, in combination with the heterogeneity of both disorders, increases the challenges that clinicians and researchers face when planning for treatment and evaluating outcomes of treatment.
Several categorizations of conduct problems have been proposed (e.g., age of onset, callous and unemotional traits (Frick and White 2008), aggressive vs. non-aggressive (Barker et al. 2007; Burt 2013)); however, these categorizations have mostly not used data-driven approaches. Data-driven statistical methods, including latent class analysis (LCA; Muthén and Muthén 2005), are exploratory in nature and are able to identify clusters of behaviors (or symptoms) that have a high probability of occurring together (Hagenaars and McCutcheon 2002). These types of classifications are argued to have increased ecological validity relative to other, hypothesis-driven, approaches.
A number of epidemiological studies have previously used data-driven approaches to classify conduct problems. In their influential study, Nock et al. (2006) used this approach in a population-based sample of adults and demonstrated that conduct disorder could retrospectively be categorized into five distinct classes: (1) rule violations, (2) deceit/theft, (3) aggression, (4) severe covert behaviors, and (5) pervasive conduct problems. Similar classes were found in a second retrospective epidemiological study including over 20,000 adults (Breslau et al. 2012). In an adolescent sample, Lacourse et al. (2010) found three distinct subgroups of conduct problems that were characterized by (1) rule violation, (2) physical aggression, and (3) severe/mixed conduct problems. Finally, a study investigating categorical versus continuous conceptualizations found that children aged 6 years were characterized by (1) oppositional behavior, (2) aggression, and (3) irritability, while 10-year-old children were characterized by (1) disobedient behavior, (2) rule-breaking, (3) aggression, and (4) irritability (Bolhuis et al. 2017). While there is some overlap in themes between the different studies, clusters seem to be dependent on age, and replication using child samples is needed, particularly when attempting to apply the classifications to treatment.
There has been much debate as to whether conduct problems are best conceptualized based on severity or as categorized into distinct subgroups. Many have argued that a categorical conceptualization does not capture the complex nature of conduct problems (Bolhuis et al. 2017), and indeed, there is evidence to suggest that continuous models provide better fits for the data (Walton et al. 2011). However, the clinical utility of the dimensional approach has limitations (Coghill and Sonuga-Barke 2012). To the best of our knowledge, no study has investigated whether either of these conceptualizations makes any valuable contributions in terms of aiding treatment selection and planning, or predicting outcomes. Without investigation of the effect of evidence-based treatment on reducing various conduct problems, the clinical practicality of these different constructs remains intangible.
Interventions for conduct behavior problems are most effective when introduced in middle childhood (Piquero et al. 2016), include both child and caregiver components (Epstein et al. 2015), and are cognitive-behaviorally based (Lipsey et al. 2007). One such intervention is Stop Now And Plan (SNAP®), an extensively validated and cost-effective evidence-based program for middle years children, aged 6–11, with conduct problems (Augimeri et al. 2007; Farrington and Koegl 2015). SNAP was established in 1985 and became gender-specific in 1996 (Augimeri et al. 2017). Randomized controlled trials have demonstrated that the program significantly reduced scores on aggression, conduct behaviors, and internalizing problems for children both in the girls' program (Pepler et al. 2010) and the boys' program (Augimeri et al. 2007; Burke and Loeber 2015). In addition, SNAP is associated with improved emotion-regulation and problem-solving skills (Burke and Loeber 2016) and increased self-control (Augimeri et al. 2018) and has been found to increase recruitment of brain areas involved in self-regulation (Lewis et al. 2008; Woltering et al. 2015).
The SNAP Lab site, located at the Child Development Institute (CDI), a children's mental health center in Toronto, provides service to more than 120 children and their families per year, creating an ideal opportunity to investigate whether potential subgroups of children with conduct problems differ in terms of treatment outcome. In addition, most of the studies looking at classifications of conduct problems have not separated males and females in their models to allow for comparison. While males and females appear to share the underlying genetic and environmental influences on conduct problems (Burt 2009; Van Hulle et al. 2018), there is growing evidence to suggest that the two sexes partly differ in their neurobiological (González-Madruga et al. 2019; Smaragdi et al. 2017), neuropsychological (Sidlauskaite et al. 2018), and clinical (Ackermann et al. 2019) profiles. Throughout their development, these differences become more salient. Males are more likely to continue on an antisocial trajectory through adolescence and to develop antisocial personality disorder in adulthood, whereas females are more likely to diverge from an antisocial path and steer toward destructive and self-harming behavior in adulthood. The gender-specific program allows for a comparison between males and females who have undergone the equivalent treatment, with a reduced risk of introducing bias in the results.
The current study aimed, first, to identify distinct subgroups of males and females with conduct problems and, second, to investigate whether these subgroups differed in the magnitude or type of treatment outcome following SNAP. We predicted that both males and females would show behavior profiles similar to those previously observed in the literature (i.e., aggressive, rule-breaking, pervasive/severe). While we were not able to make specific predictions regarding treatment outcome prior to identification of the subgroups, based on the literature on differences between aggressive and non-aggressive conduct problems (Barker et al. 2007; Burt 2013; Burt and Donnellan 2008), we expected to see differences in the type and magnitude of the treatment response between the subgroups.

Methods
The data were drawn retrospectively from a database of children aged between 6 and 11 years who were enlisted into the SNAP program at CDI between 2001 and 2017. A subset of the sample has previously been described in Augimeri et al. (2018), Augimeri et al. (2012), Jiang et al. (2011), and Pepler et al. (2010). During this time period, 1019 participants completed the 13-week group component of SNAP. Only participants with complete CD and ODD subscales of the Child Behavior Checklist (CBCL; Achenbach and Rescorla 2001) were included, leaving 871 participants (354 females and 517 males) for the initial LCA analysis. For the second analysis, the treatment evaluation, only participants with both pre- and post-assessment measures were analyzed, which included 278 females and 382 males (see Fig. 1). The study was approved by the CDI Research Ethics Board, and informed consent was obtained from the primary caregiver at the first assessment.
The CBCL was used to identify emotional and behavioral problems of the participants before and after the core group component of treatment. The checklist is composed of 112 "problem items" that are rated as 0 (not true of the child), 1 (sometimes or somewhat true), or 2 (very true or often true), based on the child's behavior in the preceding 2 months. The items are grouped into three composite scales (externalizing problems, internalizing problems, and total problems). The externalizing composite scale is made up of the aggressive and rule-breaking behavior subscales. Selected items from these scales also make up the DSM-IV scales of ODD and CD. The overall inter-rater reliability of this tool is .96, and test-retest reliability has been found to be .90 and .92 for the externalizing and internalizing subscales, respectively (Achenbach and Rescorla 2001).
Consent and the CBCL pre-measure were completed by the primary caregiver at admission. The first assessment was conducted between 1 month and a few days prior to the first treatment session. CBCL post-measures were collected after the core SNAP group was completed (median time 3 months and 18 days). For consistency, only one informant per child was included in this analysis.
When there were two or more primary caregivers, priority was given to the caregiver who completed both pre- and post-measures. In cases where several caregivers had completed measures, the mother's information was selected. This decision was based on previous research showing only moderate agreement between mothers and fathers in rating their child's behavior. The differences in agreement were especially pronounced for female children (Davé et al. 2008). In order to reduce gender bias in reporting, which would hinder gender comparisons, mothers in this study were selected over fathers, as they constituted over 80% of respondents.
While SNAP is a comprehensive treatment approach, the core component of the program includes 13 concurrent child and parent group sessions. The group component uses well-established strategies such as role-playing, cognitive restructuring, and reinforcement learning to teach children how to improve their emotion regulation and self-control and to "make better choices in the moment." The concurrent parent education group is informed by parent management training strategies. These strategies focus on strengthening the caregiver-child relationship, as well as enhancing emotion-regulation and effective parenting strategies (Forgatch and Patterson 2010; Kazdin et al. 1992).

Analysis
LCA (Muthén and Muthén 2005) is a data-driven analysis method that provides distinct classes of symptoms (or problem behaviors) that have a high probability of occurring together. In addition, the model identifies the probability of each participant belonging to each class, allowing participants to be sorted into classes to use for group analysis. Models are run sequentially, starting with a 2-class model, and increasing in number until the best-fitting model is found, as assessed by the Bayesian and Akaike Information Criteria (BIC and AIC; Nylund et al. 2007) and entropy. We ran the LCA in Mplus v.8.1 using binary ratings of the CD and ODD subscales of the CBCL (22 items in total). A score of 0 on an item was entered as 0; a score of either 1 or 2 was converted to a 1. Models were run for males and females separately.
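The item recoding and the sorting of participants into classes described above can be sketched in a few lines of Python. This is only an illustration and not the Mplus syntax used in the study: the function names are hypothetical, the posterior probabilities are made-up values, and the relative entropy formula is the commonly used normalized form, which we assume matches the statistic reported by the software.

```python
import math


def dichotomize(score):
    """Recode a CBCL item rating: 0 stays 0; 1 or 2 becomes 1."""
    if score not in (0, 1, 2):
        raise ValueError(f"unexpected CBCL rating: {score!r}")
    return 0 if score == 0 else 1


def modal_class(posteriors):
    """Return the 1-based index of the most probable class for one participant."""
    best = max(range(len(posteriors)), key=lambda c: posteriors[c])
    return best + 1


def relative_entropy(posterior_rows):
    """Normalized entropy in [0, 1]; values near 1 indicate clear class separation.

    Assumed form: 1 - sum_i sum_c(-p_ic * ln p_ic) / (n * ln C).
    """
    n = len(posterior_rows)
    n_classes = len(posterior_rows[0])
    uncertainty = -sum(
        p * math.log(p) for row in posterior_rows for p in row if p > 0
    )
    return 1 - uncertainty / (n * math.log(n_classes))


# Illustrative values only (not study data).
raw_items = [0, 1, 2, 0]
binary_items = [dichotomize(s) for s in raw_items]
assigned = modal_class([0.10, 0.70, 0.15, 0.05])
```

Assigning each participant to the most probable class ignores classification uncertainty, which is one reason the entropy value is reported alongside the information criteria.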
Treatment outcome was assessed using the pre- and post-assessment raw scores on the aggressive, rule-breaking, total externalizing, and total internalizing subscales of the CBCL. These analyses were run in SPSS 24 for each class identified in the LCA. Differences in demographics and treatment outcome were analyzed using independent and mixed-model ANOVAs in SPSS 24.
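With only two time points, the class-by-time interaction in a mixed-model ANOVA is equivalent to a one-way ANOVA on the change scores (post minus pre). The following Python sketch illustrates that logic under this assumption; it is not the SPSS procedure used in the study, the function names are hypothetical, and the scores are toy values rather than study data.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA across groups."""
    values = [x for g in groups for x in g]
    n_total, n_groups = len(values), len(groups)
    grand_mean = sum(values) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = n_groups - 1, n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def change_scores(pre, post):
    """Post-treatment minus pre-treatment score for each participant."""
    return [after - before for before, after in zip(pre, post)]


# Toy (pre, post) raw scores for three hypothetical classes; not study data.
toy = {
    "class_a": ([10, 12, 14], [9, 10, 11]),
    "class_b": ([10, 12, 14], [8, 9, 10]),
    "class_c": ([10, 12, 14], [5, 6, 7]),
}
changes = [change_scores(pre, post) for pre, post in toy.values()]
f_stat, df_between, df_within = one_way_anova_f(changes)
```

A large F on the change scores indicates that the classes differ in how much they changed, which is the question of interest when comparing the magnitude of treatment response between subgroups.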

Latent Class Analysis
Males
Based on the BIC, AIC, and entropy, the best-fitting model for the data was a four-class model (BIC = 10,243.05, AIC = 9856.48, Entropy = .79, LMR-LRT p < .01). Prevalence and symptom probabilities for the four classes can be seen in Table 1. Class 1, "rule-breaking," was characterized by stealing both at home and outside of home, lying, and breaking rules. Class 2, "aggressive," was characterized by cruelty toward people, fighting, physically attacking, breaking rules, and threatening other people. Class 3, "mild," showed the lowest levels of conduct problems, mainly presenting with disobedience, rule-breaking, stubbornness, and temper tantrums. Finally, Class 4, "severe," was characterized by endorsing almost all problem items from the two subscales and had the highest endorsement of fire setting, vandalism, running away, and cruelty to animals. The differences in symptomology between the four groups were confirmed by the analysis at pre-assessment (rule-breaking F(3, 513) = 208.65, p < .001; aggressive F(3, 513) = 158.76, p < .001; externalizing F(3, 513) = 221.69, p < .001; internalizing F(3, 513) = 2.40, p < .001). The "mild" group had significantly lower levels of aggression, rule-breaking, externalizing, and internalizing relative to the other three classes (all p < .001). Similarly, the "severe" group scored significantly higher than the other three groups on these measures (all p < .001). As expected, the "rule-breaking" group scored significantly higher than the "aggressive" group in rule-breaking (p < .001), while the reverse was true for levels of aggression (p < .001). There was also a main effect of age (F(3, 513) = 3.83, p < .05), where participants in the "aggressive" group were significantly younger than in the "rule-breaking" group.
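As a consistency check on the reported fit indices, the gap between BIC and AIC can be compared with what the standard definitions imply (AIC = -2LL + 2k; BIC = -2LL + k ln n). The sketch below assumes an unrestricted LCA with binary items, so that the number of free parameters k is the number of class proportions minus one plus one endorsement probability per item per class. The function name is hypothetical and the implied log-likelihood is derived here, not reported in the study.

```python
import math


def lca_free_parameters(n_classes, n_items):
    """Free parameters in an unrestricted LCA with binary items (assumed model)."""
    return (n_classes - 1) + n_classes * n_items


n_males, n_items, n_classes = 517, 22, 4
reported_bic, reported_aic = 10243.05, 9856.48

k = lca_free_parameters(n_classes, n_items)

# BIC - AIC = k * (ln n - 2) under the standard definitions.
expected_gap = k * (math.log(n_males) - 2)
reported_gap = reported_bic - reported_aic

# Log-likelihood implied by the reported AIC (derived, not reported).
implied_loglik = -(reported_aic - 2 * k) / 2

# Degrees of freedom for a one-way comparison of four classes in 517 males.
df_between, df_within = n_classes - 1, n_males - n_classes
```

Under these assumptions the expected and reported gaps agree to two decimal places, and the degrees of freedom for the between-class comparisons are 3 and 513.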

Discussion
The aim of this study was to identify subgroups of males and females with conduct problems and investigate whether these subgroups differed in levels of change in aggressive, rule-breaking, externalizing, and internalizing problems following SNAP treatment. We identified four distinct classes of males, which were characterized by (1) "rule-breaking," (2) "aggressive," (3) "mild," and (4) "severe" conduct problems. The four groups differed in treatment outcome, such that the severe group improved on all measures, the mild group on none, the aggressive group specifically on aggressive and externalizing behavior, and the rule-breaking group specifically on rule-breaking and externalizing behavior. We did not observe similar groups for females; instead, the females formed two groups that were not distinguishable by type of behavior but instead were characterized by number of behaviors endorsed (i.e., severity). Consequently, both groups of females improved on all measures, but there were no differences in rate of change between groups.
The clear separation between aggressive and non-aggressive (rule-breaking) behavior classes identified in the male sample conforms with previous classifications of conduct problems (Barker et al. 2007; Burt 2013; Monuteaux et al. 2009; Niv et al. 2013). The identification of the two classes characterized by "mild" and "severe" conduct problems also lends support to the argument that conduct problems could be conceptualized on a severity dimension (Bezdjian et al. 2011; Krueger et al. 2005). The severe group identified in this study has a higher probability of endorsing the symptoms that are shared with other groups, such as bullying, as well as endorsing additional symptoms that are not present in other groups, such as cruelty to animals and vandalism. The opposite can be seen in the mild group, where both fewer symptoms and lower probability of symptoms can be seen. This demonstrates a range of severity within our sample.
Simultaneously, while there is some overlap in symptoms between aggressive and rule-breaking groups, they display distinct features that clearly distinguish the two groups from each other. On this basis, conceptualizing conduct problems as either categorical or dimensional may be too simplistic. Instead, the idea of a complex, multimodal conceptualization of conduct problems in males that considers the type of conduct problems as well as the severity, as previously demonstrated by Bolhuis et al. (2017), may be most appropriate. In this regard, we want to stress that the design and analysis of the current study are not optimally suited to test this theory, as the objective of the study was to compare subgroups, if any were identified. Future studies should be specifically designed to investigate how best to conceptualize conduct problems for males and females.
In this regard, the findings presented here have several practical implications. Firstly, they highlight how composition of the problem behaviors can aid treatment selection and planning, such that high levels of internalizing problems may be more apparent in those with high levels of aggression, but not necessarily in those with high levels of non-aggressive conduct problems. Secondly, they demonstrate that the composition of conduct problems can influence treatment outcome; thus, averaging over a large group risks biasing or diluting outcome results. This is not surprising from a life-course perspective. It is well known that children with severe conduct problems are likely to have an early onset of problems as well as a higher risk of a long-lived criminal career, relative to those with less severe problems. Similarly, aggressive and non-aggressive antisocial behaviors have at least partly distinct etiologies and divergent trajectories (Barker et al. 2007).
Thirdly, the observed sex differences in the composition of conduct problems likely reflect underlying differences in risk factors and etiology between males and females and support the need for gender-specific treatment programs. Lastly, they demonstrate the effectiveness of the SNAP program to identify and target key behavior problems, especially for treating males with high levels of aggressive and externalizing behaviors. These findings are in line with a randomized controlled trial that found that males with the most severe behavioral problems showed the greatest improvement following the SNAP treatment, while males with milder problems improved less (Burke and Loeber 2015). In a similar vein, intensive treatment has been found to be less effective for individuals presenting with low relative to high levels of risk, and in some cases the treatment may worsen the behavior and outcome of low-risk individuals (Lowenkamp and Latessa 2004). While data on the level of risk were not available for this study, risk and severity of conduct problems are highly associated (Enebrink et al. 2006), and it is feasible that a relatively intensive program such as SNAP may not address the risk and need of low-risk children. While this requires more examination, it may be that SNAP is the most beneficial for high-risk males with moderate to severe conduct problems. Indeed, the risk-need-responsivity principle stipulates that the type and intensity of treatment should match the level of risk and individual needs of the child (Bonta and Andrews 2007). As such, there may be more suitable interventions, specifically tailored to males with mild conduct problems. For example, a less intense version of SNAP, SNAP irritability (I-SNAP), has been created specifically to reduce oppositional and irritable behavioral problems (Derella et al. 2020). While the key facets of the original SNAP program remain, the I-SNAP program targets less severe behavioral problems with a specific focus on increasing frustration tolerance by means of emotion-focused coping strategies and relaxation.
There are several probable reasons as to why we did not observe similar categories for males and females. Methodologically, the differences in classifications are unlikely to be due to uneven sample sizes or a lack of variability of symptom presentation at pre-assessment. Both males and females presented with high numbers of aggressive and externalizing symptoms at pre-assessment, and the number of participants was sufficiently high to reliably identify a number of classes in both groups (Wurpts and Geiser 2014). Several studies have suggested that emotional closeness and quality of the caregiver-child relationship may play a greater role in females' internalizing and delinquent behavior relative to males (Hart et al. 2007; Lewis et al. 2015; Tisak et al. 2017). In this regard, it may be important to include additional variables such as relationship quality or internalizing symptoms when analyzing female behavior, rather than relying on conduct behaviors alone. Another explanation for the observed differences relates to the use of categorical models of conduct problems. The two female classes identified from the LCA were only distinguishable based on number of symptoms endorsed, suggesting that severity may be more informative than clusters for females, over and above what might be the case for males. Furthermore, as we did not have risk assessment data available for this study, we were unable to explore if, and to what extent, the level of risk might interact with behavior and influence treatment outcome in different ways for males and females.
The study has several important strengths: we used a self-referred community sample, adopted a data-driven approach to categorize our sample rather than pre-defined, inflexible categories, and ran separate models for males and females to avoid averaging out important effects of sex. However, the results need to be interpreted with a number of limitations in mind.
Firstly, by focusing on the CBCL subscales as the outcome measure, we excluded a number of important potential treatment outcomes such as increased problem-solving skills, self-control, or the caregiver-child relationship. While these factors are interesting and worthy of investigation, increasing the number of variables under investigation would have decreased power and added complexity to the interpretation of results. Secondly, this study was an analysis of archive data, and a control group was not available. While utilizing archive data allowed us to include a large group of representative participants, whose data were not confined to a brief time period of data collection, we were not able to manipulate or choose any of the variables ahead of the study, such as additional standardized measures of aggression and rule-breaking, which limited our choices of analysis and thus interpretation. The absence of a control group also curbs interpretation of the intervention outcome in its own right. A control group without any conduct problems would not lend interpretation to how the conduct problem groups differed from each other, which was a key research question in this study. In this regard, using the pre-treatment scores to control for the post-treatment scores within the same groups allowed for more control over extraneous variables and less risk of introducing noise to the data based on differences between individuals. A control group with similar levels of conduct problems who underwent alternative treatment would strengthen the interpretation of SNAP in general. Several such studies, using randomized controlled trials and wait-list designs, have been published previously (see Augimeri et al. 2007, 2018; Burke and Loeber 2015, 2016; Pepler et al. 2010). However, the evaluation of the SNAP program was not the primary aim of the current study.
Thirdly, it is possible that caregiver ratings of child behavior post-treatment were influenced by their own participation in the parent component of SNAP; if parenting skills and awareness of their child's behavior increased as a consequence of participating in the program, it is possible that the increased skills could influence their rating. Since parent data was not available for the current study, the possible effect of participating in the parent group remains speculative, but future prospective studies should keep this issue in mind and consider alternative assessments from teachers or other close adults.
Lastly, the authors want to stress the point that the classes identified in the current study, by definition, were dependent on the sample of children with conduct problems whose caregivers sought SNAP treatment, and classifications may not apply to other populations of children with conduct problems. Equally, the effects of treatment may be specific to the SNAP program and do not necessarily reflect outcome of other CBT-based treatment programs. Because of the heterogeneity and complexity of conduct problems, it would benefit future studies to conduct a data-driven analysis of their sample prior to investigating treatment outcomes for groups to account for the heterogeneity of the sample.
This lack of synergy between the sexes in our results suggests that while males and females with conduct problems may display similar types and frequency of problem behaviors, the compositions of these behaviors are different and, furthermore, may need to be conceptualized separately. Large-scale studies with the power to include numerous additional variables would be valuable in order to fully investigate the conceptualization of conduct problems in females, and how it relates to treatment selection, planning, and outcome.
Conclusion
Despite the fact that the males and females displayed similar types of conduct problems, the composition of these behaviors differed. The distinct groups of conduct problems seen in males could not be found in females. However, a dimensional aspect to conduct problems, based on number and types of symptoms, could be seen in both sexes. This clearly warrants further consideration. Future studies must be mindful of sex differences when investigating classifications of conduct problems, and investigations of the conceptualization of female conduct problems are particularly warranted. Furthermore, the composition, as well as the severity, of conduct problems should be taken into account when selecting and planning for treatment to ensure that the treatment is targeted specifically for the child's needs. Thus, targeted treatment is necessary based not only on the gender of the child but also on the unique combination of problem behaviors shown by individuals.