New Zealand Law Students' Journal

Kenworthy, Vincent --- "Predicting sentencing decisions of the New Zealand courts using support vector machines" [2023] NZLawStuJl 7; (2023) 4 NZLSJ 143

Last Updated: 7 April 2024

Predicting Sentencing Decisions of the New Zealand Courts Using Support Vector Machines

VINCENT KENWORTHY[*]

Abstract—This article builds upon the literature on judgment prediction using artificial intelligence and natural language processing. It constructs a dataset of 145 sentencing cases and uses support vector machines to predict the seriousness of each case. The unstructured nature of sentencing decisions in New Zealand, as well as the difficulties associated with accessing and processing large amounts of legal information from the New Zealand courts, limit the performance of the classifier. The article’s proposed solution is a modification to the law specifying a structure for sentencing decisions.

  I. INTRODUCTION

Consistency is vital in all areas of the law, including sentencing.[1] But the process of determining the length of time a person should spend in prison is arcane, even if legislation sets out maximum and minimum sentence terms.[2] Judges must weigh “innumerable factors affecting the nature of the offence, the circumstances of the offence, and the circumstances of the offender”.[3] The law in New Zealand and England and Wales attempts to achieve consistency in sentencing by using a staged process. Aggravating and mitigating factors are identified and then used to make uplifts or discounts to a “starting point” or “provisional sentence” fixed by reference to previous cases.[4] In Australia, the judiciary has rejected a staged process on the basis that mathematical or statistical analysis of previous sentencing decisions is unhelpful.[5]
Despite the assertions that statistics cannot assist sentencing,[6] there appears to be no reason why existing artificial intelligence (AI) tools developed for judgment prediction could not be applied to the sentencing process. Like all legal decisions, sentencing decisions are an exercise of applying the law to facts, which in common law jurisdictions entails reference to previously decided cases under the doctrine of stare decisis. Therefore, in principle, it should be possible to create a statistical model to represent the seriousness of offending. This article attempts to do so, using sentencing decisions for the offence of sexual violation by rape.[7]
This article’s hypothesis is that textual analysis of the facts of previous sentencing decisions can be used to predict outcomes of future sentencing decisions. To this end, this article presents two classifiers. The first classifier categorises cases of sexual violation by rape according to which “rape band” (derived from the tariff case, R v AM) they fall into.[8] The second classifier predicts whether these cases fall within the top, middle, or bottom range of each band. The classifiers use natural language processing to extract textual features from sentencing decisions, which are used to train support vector machines (SVMs). The process of using an algorithm to construct a model based on training data to make predictions is known as machine learning.[9]
The structure of this article is as follows. Part II discusses sentencing procedures in New Zealand, Australia, and England and Wales, and contrasts the rules on reasons and judgment structure in common law jurisdictions with the rules of international courts. Part II also surveys literature on judgment prediction, focusing on judgment prediction using natural language processing. In Part III, a dataset is formed from the text of the factual backgrounds of 145 cases of sexual violation by rape. This dataset is used to train two SVMs using machine learning. Part IV discusses the accuracy of those SVMs and proposes introducing a structure to New Zealand sentencing decisions that reflects the two-stage approach already in place.

  II. BACKGROUND

In designing sentencing classifiers, it was necessary to examine sentencing procedures, as well as the rules relating to judgment structure in common law courts in comparison to international courts. Sentencing decisions in New Zealand, which take a two-staged approach, are better suited to classification than sentencing decisions from Australia, which use an “instinctive synthesis” approach.[10] New Zealand’s two-staged approach separates two different sets of circumstances: the circumstances of the offending and the circumstances of the offender.[11] This means that the process of predicting a sentence can be split into two outcomes: finding the starting point and finding the end sentence. This article is concerned with finding the starting point.

  A. Sentencing procedure

The steps a court goes through when sentencing an offender differ across common law jurisdictions. This article examines sentencing procedures in New Zealand, Australia, and England and Wales. At one end of the structural spectrum, the Sentencing Council for England and Wales has outlined 10 steps to be followed when imposing a sentence (although not all steps are applicable in every case).[12] At the other end of the spectrum, Australia has rejected a staged process in favour of “instinctive synthesis” where all relevant factors are considered together and an end sentence is imposed that reflects them all.[13] New Zealand falls somewhere in the middle, with a two-staged approach that considers the offending first, followed by the offender.[14]

  1. New Zealand

The Court of Appeal in R v Taueki set out the two-stage methodology for sentencing in New Zealand.[15] First, the court determines the starting point based on an assessment of the seriousness of the offending.[16] The starting point depends on the existence and extent of any aggravating and mitigating factors identified in the offending, such as the degree of premeditation.[17] Secondly, the court determines whether uplifts or discounts should apply for any aggravating or mitigating factors personal to the offender, such as an early guilty plea.[18] At both stages of the inquiry, the court must consider the principles of sentencing in s 8 of the Sentencing Act 2002, as well as the aggravating and mitigating factors listed in s 9 of that Act,[19] to the extent that they are relevant to that stage of the inquiry. The two-stage approach is still the position in New Zealand;[20] although from 2009 to 2019, the discount for a guilty plea was treated as a separate third stage of the process.[21]
For certain offences, the Court of Appeal has issued a “tariff case” (also known as a “guideline judgment”) in which the Court sets out the aggravating and mitigating factors it considers relevant to a particular offence in addition to the factors in the Sentencing Act.[22] Tariff cases for serious offending include R v Taueki (grievous bodily harm offending),[23] R v AM (sexual violation),[24] and Zhang v R (methamphetamine offending).[25] Tariff cases also set “bands” that offending can fall into, and specify the range of starting points available within each band. Judges may deviate from the bands in exceptional circumstances.[26]
The Law Commission recommended the creation of a Sentencing Council to make sentencing guidelines,[27] and, with the help of four judges seconded to the Commission, created draft guidelines.[28] The Commission’s recommendation was enacted in the form of the Sentencing Council Act 2007. However, a new government was elected in November 2008, which indicated that it did not wish to proceed with the creation of sentencing guidelines.[29] Although the Sentencing Council Act did establish a Sentencing Council,[30] no members were ever appointed,[31] and the Sentencing Council Act was repealed in 2017.[32]

  2. Australia

The Australian courts have rejected a two-stage methodology in favour of an “instinctive synthesis” approach to sentencing.[33] In Wong v The Queen, Gaudron, Gummow and Hayne JJ (with whom Kirby J concurred in a separate judgment) considered it inappropriate to use the weight of a narcotic as the preeminent factor in sentencing a drug offender, and for a lower appellate court to have prescribed a table of sentences available for different narcotic weights.[34] Gaudron, Gummow, and Hayne JJ rejected a two-stage approach, holding that it is necessary to consider the relevant factors and impose a sentence which reflects them all.[35] The argument against a two-stage approach is that the partial mathematisation of the process in a two-stage approach (which involves applying uplifts and discounts, often as percentages of the starting point) can result in some relevant factors not being properly considered.[36]
The High Court of Australia once again rejected an arithmetical approach of uplifts and discounts in Markarian v The Queen.[37] The majority (Gleeson CJ, Gummow, Hayne, and Callinan JJ) followed Wong v The Queen.[38] But McHugh J, who wrote a concurrence, went further by discussing the argument between proponents of an instinctive synthesis approach and a two-stage approach more broadly. McHugh J explained that two-stage proponents assert that instinctive synthesis leads to arbitrariness; the judge effectively makes it up as he or she goes along.[39] The instinctive synthesis proponents, on the other hand, say that sentencing is not a process that can be distilled to a mathematical formula, and that “[a] sentence can only be the product of human judgment”.[40] McHugh J had previously made the same argument in favour of instinctive synthesis in AB v The Queen;[41] his Honour’s stance in Markarian v The Queen remained unchanged. McHugh J candidly acknowledged that instinctive synthesis is not perfect, and it does not pretend to be.[42] But his Honour’s justification for instinctive synthesis over two-stage sentencing was that two-stage sentencing merely creates an illusion of predictability and transparency.[43] The numbers involved in a two-stage process—the number of years for the starting point, the percentage values for uplifts and discounts, and any further adjustments for totality—do not actually offer any insight into whether the end sentence is just.[44] McHugh J held, and the majority agreed, that the numerical values assigned to uplifts and discounts do not actually make the process more transparent because the method by which the judge arrived at the quantities of those uplifts and discounts is still arbitrary.[45]
Kirby J was more sceptical. Although his Honour concurred as to the outcome, he disagreed that the lower court had erred in adopting a staged approach or that there was, in substance, any difference between the two.[46] Kirby J’s argument against the existence of a difference was that all the two-stage approach did was “put on paper a logical process of human reasoning”.[47] The implication is that instinctive synthesis engages in this process too, but makes it less transparent.
The High Court of Australia, in the later case of Muldrock v The Queen, adopted McHugh J’s position in Markarian v The Queen and unanimously agreed that the correct approach in Australia is that:[48]

[T]he judge identifies all the factors that are relevant to the sentence, discusses their significance and then makes a value judgment as to what is the appropriate sentence given all the facts of the case.
Accordingly, despite the reservations of Kirby J in Markarian v The Queen that there is no substantive difference between the two approaches, the two-stage approach in Australia is “dead in the Yarra river water”.[49]

  3. England and Wales

The Sentencing Council for England and Wales is a statutory body that makes sentencing guidelines following consultation with the judiciary.[50] The sentencing process in England and Wales is similar to New Zealand’s in that it is staged. However, instead of a two-stage process as prescribed in R v Taueki, the Sentencing Council sets out ten stages.[51] Not all ten stages are applicable in every case; for example, there are two stages which apply only to those convicted of specified serious offences.[52] Nonetheless, the main difference between the English and New Zealand processes is that, in England and Wales, offending-related factors are not considered separately from offender-related factors. Instead, the court first determines a “provisional sentence” with reference to the culpability of the offender and harm caused by the offending (stage 1). Then, the court makes uplifts and discounts to that provisional sentence, not only for factors which are personal to the offender, like remorse, but also factors related to the offending itself, like the use of a weapon (stage 2). Roberts describes the factors considered at stages 1 and 2 as “primary” and “secondary” factors, respectively, as those assessed at stage 1 have “the most important influence on sentence severity”.[53] Eight more stages follow, though some are only applicable in certain cases. This differs from the New Zealand system in that a “provisional sentence” is not directly analogous to a New Zealand “starting point”.
There are also guidelines for specific offences,[54] which come from the Sentencing Council.[55] These guidelines must be followed unless it is contrary to the interests of justice to do so.[56]

  B. Judgment structuring rules

Generally, the structure and content of judgments in common law jurisdictions are not prescribed by statute. However, in New Zealand, the Criminal Procedure Act 2011 requires courts to give reasons for making, varying or revoking suppression orders;[57] to give reasons for verdicts in judge-alone trials;[58] and, for courts exercising appellate jurisdiction, to give reasons for their determination of appeals.[59] Courts also have a duty under the Sentencing Act to provide reasons in sentencing decisions.[60] Yet, the relevant Criminal Procedure Act and Sentencing Act provisions do not specify any particular structure to be used when giving such reasons.
In Sena v New Zealand Police, the Supreme Court discussed the extent of the statutory duty to give reasons.[61] The Court held that reasons should “show an engagement with the case, identify the critical issues in the case, explain how and why those issues are resolved, and generally provide a rational and considered basis for the conclusion reached”.[62] Reasons must also “address the substance of the case advanced by the losing party”.[63] While the Supreme Court’s guidance sets a bare minimum for what can be considered “reasons”, the guidance lacks precision. There is no distinction between the facts and the law. At best, the Supreme Court says issues must be identified before they are resolved. This is not a criticism of the decision in Sena v New Zealand Police, as the issue before the Supreme Court was whether the trial court had provided sufficient reasons to enable an appellate court to perform its function under s 232 of the Criminal Procedure Act.[64] In contrast, the position of civil cases in New Zealand is the same as it was prior to the enactment of the Criminal Procedure Act (which did not affect civil cases): there is no duty to give reasons, although it is “good judicial practice” to do so.[65]
In other common law jurisdictions the position is much the same, regardless of whether the duty is a creature of case law or statute.[66] In Australia and England and Wales, a duty to provide reasons is recognised in case law.[67] Additionally, in England and Wales there is a statutory duty to give reasons “in ordinary language and in general terms” in sentencing decisions.[68] In Canada, there is a duty to give reasons where there is “confused and contradictory” evidence.[69] Cohen comments that in the United States, there is no duty for federal courts to provide reasons, but there is a “judicial culture” of giving reasons.[70] Given that the mere existence of the duty has been a matter for debate, it is unsurprising that the extent and specificity of the duty has not yet been particularised. At best, there is a statutory duty to give reasons which does not specify the format or structure those reasons should take; at worst, there is no duty at all.
However, this is not the case in international courts. The International Criminal Court has an internal document titled “Guidelines for ICC Judgment Structure”, but this does not appear to have been made publicly available (although it is referred to in the publicly available Chambers Practice Manual).[71] The European Court of Human Rights requires, inter alia, that judgments contain the procedural history, facts, submissions, reasons and decision in a prescribed order.[72] Similar rules exist for the International Court of Justice,[73] as well as the Court of Justice of the European Union.[74]

  C. Judgment prediction using machine learning

The applications of machine learning in the legal arena are diverse.[75] This article focuses on judgment prediction. Judgment prediction is not new. Indeed, the use of computers for judgment prediction was theorised almost sixty years ago,[76] and described in a mathematically rigorous manner more than forty years ago.[77] However, serious attempts to use AI for judgment prediction have only been made within the last decade. Medvedeva, Vols and Wieling recently reviewed the literature on judgment prediction, concluding that there are three general goals within the field: (i) extracting the outcomes of available judgments, (ii) discovering and analysing factors associated with particular outcomes, and (iii) forecasting future court decisions.[78] This article focuses on the third goal: forecasting future court decisions by predicting sentencing outcomes using a dataset of previous cases.

  1. Judgment prediction using natural language processing

As early as 1998, Taruffo noted that judicial decisions are best suited to prediction when they concern procedures that are applied with little variation between a large number of cases.[79] In 2016, Aletras and others used SVMs with a linear kernel to predict the decisions of the European Court of Human Rights.[80] SVMs are discussed in greater detail below. Aletras and others extracted the text from European Court of Human Rights decisions, and derived from that text the features of their model (n-grams, which are contiguous word sequences, and topics, which are semantically similar clusters of n-grams). The problem was framed as a binary classification task: given a summary of facts, predict whether a violation of the European Convention on Human Rights[81] (the Convention) has occurred.[82] Aletras and others achieved an average accuracy of 0.79 (the accuracy of a random guess being 0.50) across three classifiers, each of which was designed to predict whether a particular article of the Convention had been violated.[83] However, the results achieved by Aletras and others could not be replicated by Medvedeva, Vols and Wieling, who “consistently achieved lower results” with the same amount of data or more, and who used similar methodology.[84] Nevertheless, Medvedeva, Vols and Wieling achieved an average accuracy of 0.75 across nine classifiers (for nine different articles of the Convention), again with the accuracy of a random guess being 0.50.[85]
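The linear-kernel SVMs used in these studies minimise a hinge loss over weighted textual features. The idea can be sketched in a few dozen lines using Pegasos-style sub-gradient descent on toy numeric features; this is a generic illustration of a linear SVM, not the implementation used by Aletras and others or by Medvedeva, Vols and Wieling:

```python
def train_linear_svm(X, y, epochs=200, lam=0.01):
    """Linear SVM trained by Pegasos-style sub-gradient descent on the hinge loss.
    X: list of feature vectors; y: labels in {-1, +1} (e.g. violation / no violation)."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1:  # hinge loss active: step towards the misclassified margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    """Classify by the sign of the decision function w . x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy, linearly separable data standing in for weighted n-gram vectors
X = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, 1, -1, -1]
w = train_linear_svm(X, y)
```

In practice the feature vectors are high-dimensional n-gram counts or weights rather than two numbers, but the optimisation is the same.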
In the New Zealand context, Rodger, Lensen and Betkier used linear regression to predict sentences. Using a dataset of 302 cases of violent offending ranging from common assault, to wounding with intent, to grievous bodily harm,[86] they achieved a mean absolute error of 11.76 months.[87] Although Rodger, Lensen and Betkier removed the sentence length from each decision, they did not remove legal reasoning or the band the offending fell into.[88] Rodger, Lensen and Betkier state that:[89]

... the sentencing decision contains both the description of the facts of particular offending and judicial analysis based on the relevant sentencing judgments, which lead to the determination of the sentence. The input of the AI model will have to consider all those elements.
The effect of retaining legal reasoning or band identification is borne out by the fact that the single n-gram that is most predictive of a longer sentence is “band 4”.[90] The n-gram “band 3” is also predictive of a longer sentence.[91] But it is impossible to know, a priori, which band offending falls into. The band can only be known by analysis of the facts to determine the number of aggravating factors,[92] as well as the weight to be given to each aggravating factor.[93] By including the parts of the judgment that identify the band (even if the actual sentence is removed), the classifier is provided with information pertaining to the sentence, which cannot be known until after a human judge has determined the band.
Medvedeva, Vols and Wieling raised a similar objection in relation to the work of Aletras and others. Whereas Medvedeva, Vols and Wieling only used the “procedure” and “facts” sections of European Court of Human Rights judgments,[94] Aletras and others also used the “law” section.[95] Medvedeva, Vols and Wieling claim that using the “law” section creates bias by providing the model with information that explicitly relates to the result of the case.[96] Hence, despite resulting in a smaller quantity of usable data, the research in this article only uses the facts, as opposed to the entirety of the judgment text with the result of the judgment redacted.
This does not mean useful information cannot be gleaned from training a classifier on entire judgment texts as opposed to solely factual backgrounds. As Rodger, Lensen and Betkier suggest, such systems could be used to help judges cross-reference their judgments with similar cases, assist lawyers in advising their clients, or facilitate research into sentencing patterns.[97] Rodger, Lensen and Betkier’s research performs the second task identified in Medvedeva, Vols and Wieling’s recent article: the discovery and analysis of factors associated with specific outcomes.[98] But in light of the warning that references to the judgment outcome must be removed from the data, including legal reasoning in the data is counter-intuitive.[99] If the goal of judgment prediction is to imitate legal reasoning (applying the law to material facts to reach an outcome), then providing the classifier with text containing human legal reasoning defeats that purpose.
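A simple safeguard against this kind of leakage is to scan the extracted text for band-identifying phrases before training. The helper below is a hypothetical illustration, not the preprocessing actually used in this article or in the studies discussed:

```python
import re

# Matches explicit band references such as "band 3" or "band three"
BAND_RE = re.compile(r"\bband\s+(?:one|two|three|four|[1-4])\b", re.IGNORECASE)

def leaks_outcome(facts: str) -> bool:
    """True if the text still contains a band reference that would leak the result."""
    return bool(BAND_RE.search(facts))
```

Texts flagged by such a check can be excluded or trimmed so that the classifier sees only outcome-neutral facts.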

  2. Judgment prediction using non-textual features

Researchers have also made efforts to predict the decisions of courts by manually extracting features from judgments.[100] This is more labour-intensive than the automated processes of natural language processing, but it has seen some success. For example, Shaikh, Sahu and Anand manually extracted features from 86 Delhi District Court decisions in an effort to predict whether defendants charged with murder would be convicted or acquitted.[101] Similarly, Bagherian-Marandi, Ravanshadnia and Akbarzadeh-T manually extracted features from 100 construction contract dispute cases in Iran to predict whether a court would decide in favour of the plaintiff or defendant.[102]
Researchers have also performed judgment prediction using non-textual features on New Zealand cases involving sexual offending. Simester and Brodie used a linear regression algorithm (like Rodger, Lensen and Betkier) to fit a model to a dataset of 67 sexual violation sentencing decisions of the Court of Appeal.[103] At that point in time, the law did not prescribe sentencing “bands” for sexual offending like those which now exist following R v AM. Instead, the approach at the time, from R v Clark, was that:[104]

... for [a] rape committed by an adult without any aggravating or mitigating features a figure of five years should be taken as a starting point in a contested case. Aggravating features can include additional violence or indignities, acting in concert with other offenders, the youth or age of the victim, intrusion into a home, kidnapping, the use of weapons, prolonged abuse. That list is not meant to be exhaustive.
The R v Clark approach of a universal starting point of five years enabled Simester and Brodie to set five years as the intercept of the model, and then use linear regression to determine weights for aggravating and mitigating factors.[105] Simester and Brodie found that the victim being the offender’s wife or de facto partner, or even an acquaintance of the offender, was predictive of a lower sentence.[106] The same was true for cases involving male victims,[107] despite judicial dicta that the sexual violation of a man is not of itself less serious than the violation of a woman.[108] However, the identification of a male victim as a mitigating factor came with the caveat that the male victims in the dataset were victims of incest, and intra-family variables were predictive of lower sentences generally.[109] Simester and Brodie concluded that their research had the potential to assist lawyers in giving advice to their clients by predicting the likelihood of success on appeal.[110] They also considered that their findings contradicted the idea that sentencing cannot be captured by mathematical models.[111]
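With the intercept pinned at the five-year R v Clark starting point, fitting the model reduces to least squares without a free intercept. The sketch below shows the idea for a single aggravating-factor count; it is illustrative only, as Simester and Brodie’s actual model used many hand-coded factors:

```python
def fit_fixed_intercept(xs, ys, intercept_months=60.0):
    """Least-squares slope for sentence length (months) against a factor count,
    with the intercept fixed at 60 months (the five-year R v Clark starting point)."""
    num = sum(x * (y - intercept_months) for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

# Fabricated data: each aggravating factor adds 12 months over the 5-year base
counts = [0, 1, 2, 3]
months = [60.0, 72.0, 84.0, 96.0]
slope = fit_fixed_intercept(counts, months)  # 12.0
```

Regressing the deviation from the fixed starting point, rather than the raw sentence, is what lets the known five-year baseline be built into the model rather than estimated from the data.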
Alternatively, the metadata of cases have been used as features. Katz, Bommarito and Blackman developed a model to predict whether justices of the Supreme Court of the United States would vote to reverse or affirm lower courts’ decisions. Predictions were based on factors including the court in which the case originated, the reason for which certiorari was granted, and the relevant area of law.[112] Medvedeva, Vols and Wieling, in their study predicting whether a violation of the European Convention on Human Rights had occurred, also found that classifiers trained on datasets containing only the surnames of judges[113] had an average accuracy of 0.66 and achieved accuracies as high as 0.79 in the case of one particular article of the Convention.[114]

  3. Philosophical considerations and objections

The use of machine learning to predict sentencing decisions has also been discussed from theoretical and philosophical perspectives. Donohue urged caution, noting mandatory guidelines had resulted in a “mandated and narrow” sentencing range which was ultimately ruled unconstitutional in the United States.[115] However, Donohue proposed that machine learning, if used responsibly, could augment, rather than replace, the exercise of judicial discretion,[116] for example, to mitigate bias.[117] Zalnieriute, in contrast, warned that, if used irresponsibly, the “automation of judicial decision-making process [sic] may introduce bias and undermine judicial impartiality and independence”.[118]
Stobbs, Hunter and Bagaric discussed the possibility of AI-assisted sentencing and considered that sentencing should be particularly well-suited to prediction. This is because, by the time of sentencing, factual issues are normally resolved and the court does not have to consider credibility or reliability. Additionally, any factors relevant to sentencing are usually readily identifiable.[119] However, Stobbs, Hunter and Bagaric viewed instinctive synthesis as a stumbling block, as under instinctive synthesis, it is not possible to discern the weight accorded to each factor.[120] Nonetheless, they were hopeful that algorithms could still improve both consistency and efficiency in sentencing.[121]
Staples was less optimistic and considered that there were philosophical barriers to the creation of an AI system to compute a sentence.[122] Staples’ principal argument is that the sentencing exercise is irreducible to a set of formal rules and, as such, AI systems (which are rule-based systems) cannot perform the sentencing exercise.[123] Staples’ contention echoes the judgment of Gaudron, Gummow and Hayne JJ in Wong v The Queen:[124]

Further, to attempt some statistical analysis of sentences for an offence which encompasses a very wide range of conduct and criminality ... is fraught with danger, especially if the number of examples is small. It pretends to mathematical accuracy of analysis where accuracy is not possible ... The task of the sentencer is not merely one of interpolation in a graphical representation of sentences imposed in the past. Yet that is the assumption which underlies the contention that sentencing statistics give useful guidance to the sentencer.
In Markarian v The Queen, McHugh J took a similar position, suggesting that a judge would require “the statistical genius and mental agility of a Carl Friedrich Gauss” to be aided by the use of statistics.[125]
Overall, the High Court of Australia has shown hostility towards the use of statistical analysis or automation to what its justices perceive as the art of sentencing: a judge exercising his or her discretion to arrive at the result that is “right” in all the circumstances of the case.[126] But the assertion that sentencing is irreducible to statistical modelling is merely that—an assertion.
Certainly, there are practical problems associated with constructing a model which takes in values (say, real numbers between zero and one) for each possible aggravating and mitigating factor. First, the factors are numerous (Stobbs, Hunter and Bagaric suggest there are more than 200 of them),[127] and each would have to be manually identified before the model could be constructed. Secondly, in order to form a dataset, researchers would have to read each judgment and manually assign values to each aggravating and mitigating factor. This is likely unfeasible—although Simester and Brodie achieved good results with that approach for sexual offending specifically.
Nonetheless, there is more than one way to skin a cat. The article by Rodger, Lensen and Betkier introduced the concept of modelling sentencing based on an automated analysis of textual features, rather than on human divination of aggravating and mitigating factors.[128] That automated, text-based approach is also the one taken in this article. The textual summary of the facts of a case is an imperfect representation of what is actually contained within the judge’s brain. Still, the assumption underlying the literature on textual feature-based judgment prediction is that the stated facts are a sufficiently close representation of the judge’s mental process to enable an algorithm to recognise a pattern over a sufficient number of cases.[129]

  III. MATERIALS AND METHODS

The methodology used in this article has the following steps. First, as many sentencing cases for sexual violation by rape as could be found were sourced from a variety of legal databases. These were converted from .pdf files to .txt files for natural language processing.[130] Cases which did not mention R v AM or identify the band that the offending fell into were discarded, as were cases where the facts could not be cleanly separated from the rest of the judgment. The remaining cases were used to train an SVM using n-grams extracted from the text of the facts of those cases.
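The case-filtering step can be sketched as a predicate over the raw judgment text. The helper and patterns below are hypothetical; the actual pipeline also required that the facts be cleanly separable from the rest of the judgment:

```python
import re

# A usable case must cite the tariff decision and identify the applicable band
AM_RE = re.compile(r"\bR v AM\b")
BAND_RE = re.compile(r"\bband\s+(?:one|two|three|four|[1-4])\b", re.IGNORECASE)

def is_usable(judgment_text: str) -> bool:
    """Keep only judgments that cite R v AM and identify the band the offending fell into."""
    return bool(AM_RE.search(judgment_text)) and bool(BAND_RE.search(judgment_text))
```

A filter of this kind is what reduces the pool of collected decisions to the 145 cases that make up the dataset.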

  A. The offence of sexual violation by rape

The offence of sexual violation is committed when a person either rapes or has unlawful sexual connection with another person.[131] Rape is defined as the penetration of another person’s genitalia with one’s penis without either consent or a reasonable belief in consent.[132] Unlawful sexual connection is more broadly defined as penetrating another person’s genitalia or anus with a body part or an object, connecting one person’s mouth or tongue to the genitalia or anus of another person, or continuing such penetration or connection, without that person’s consent or a belief on reasonable grounds in that person’s consent.[133] The Court of Appeal in the tariff case of R v AM established two sets of four bands into which sexual violation may fall: one set for sexual violation by rape, sexual violation by unlawful sexual connection effected by penile penetration of the mouth or anus, or violation involving objects (“rape bands”), and another set for all other forms of sexual violation by unlawful sexual connection.[134] The bands overlap as shown in Table 1.

Band   Starting Point (years)
1      6–8
2      7–13
3      12–18
4      16–20
Table 1 Starting points for the four rape bands[135]

The offence of sexual violation by rape was selected for classification because it is an offence that has a tariff decision, R v AM, which prescribes four bands for offending. Moreover, R v AM identified specific (though not exhaustive) aggravating and mitigating features for sexual offending.[136] This makes cases of sexual violation suitable for multi-class classification because the bands can be used as classes. Furthermore, the existence of a tariff case implies that there is sufficient factual similarity between all or most cases of sexual violation, such that it is possible to make sensible remarks about the aggravating and mitigating features that commonly arise in them. As noted above, consistency between cases generally makes decisions easier to predict.[137] By that logic, increased factual similarity should result in more discernible patterns between cases that can be identified using a classifier.

  2. The classifiers and dataset

The classifiers were trained and the dataset was formed using broadly the same methodology as Aletras and others, and Medvedeva, Vols and Wieling,[138] with modifications where necessary. The classifiers use a dataset of 145 cases in which the lead offence was sexual violation by rape, including both first instance sentencing decisions and appeals against sentence.[139] This is because the classifier uses only the facts, and therefore the difference in legal reasoning at first instance and on appeal is irrelevant. In the dataset for the first classifier, each case is labelled according to which of the four bands it falls into (resulting in four classes in total: the four-class classifier). In the dataset for the second classifier, cases are labelled both according to band, and whether the case is at the top, middle, or bottom of the band (twelve classes in total: the twelve-class classifier).
Unlike European Court of Human Rights decisions,[140] the structure of sentencing decisions of New Zealand courts is not prescribed by statute or rules of court. However, the court will usually begin by setting out the facts of the case. The court will then consider the case via the two-stage approach: first, fixing a starting point, and then applying uplifts or discounts for personal features.
Four factors hindered formation of the dataset. First, not all sentencing decisions contain the entire factual background upon which the defendant is sentenced. Instead, some decisions will refer to an existing summary of facts agreed upon by the prosecution and the defence, or the submissions of the prosecutor.[141] Because those decisions do not contain the factual background (instead referring to an extraneous document), they are unusable here. Secondly, even though the public generally has the right to access sentencing decisions,[142] not all sentencing decisions appear in legal databases such as Westlaw NZ, Lexis Advance, Judicial Decisions Online and the New Zealand Legal Information Institute. The dearth of publicly available cases is particularly acute for District Court decisions, which do not appear at all in Judicial Decisions Online or in Westlaw NZ’s sentencing tracker.[143] Thirdly, although the general format adopted by sentencing judges is to first lay out the factual background and then move on to legal analysis, this structure is not always followed. In some decisions, the factual background is interspersed throughout the identification of aggravating features.[144] Therefore, it is not always possible to cleanly separate the factual background from the rest of the judgment. Fourthly, although R v AM is the tariff case for rape, not all sentencing cases for rape decided after R v AM refer to it. Ultimately, only 145 of more than 400 sentencing decisions and appeals sourced from various databases were suitable for use. The majority of the decisions were unsuitable either because: (i) they did not refer to R v AM (or identify the band the offending fell into); (ii) the facts were inextricable from the rest of the judgment; (iii) the facts were not present in the judgment; or (iv) the sentence was quashed on appeal.
Consequently, a limitation of this research is that 145 cases is a relatively small dataset, particularly for multi-class classification. It would have been far more desirable to have a larger dataset because even though SVMs have been shown to produce good results in small datasets,[145] there is a limit to exactly how small a dataset can be before a loss of accuracy occurs. As the results below demonstrate, a dataset of 145 cases is arguably below that limit in this particular application.

  3. Textual features

Once the cases were selected, the sections of the cases containing the facts of the offending were extracted. Those sections were pre-processed by lower-casing and removing paragraph numbers and stop words (words like “and” and “the” that appear frequently but have no semantic content).[146] Then, n-grams, which are contiguous word sequences,[147] were extracted from this pre-processed text. For example, where n = 2, the sentence “Who, then, in law is my neighbour?” can be broken into 2-grams (called bi-grams): “who then”, “then in”, “in law”, “law is”, “is my”, and “my neighbour”, or where n = 3: “who then in”, “then in law”, “in law is”, “law is my”, and “is my neighbour”.[148] From the dataset, the 2,000 most common n-grams were selected where n ∈ {1, 2, 3, 4}.
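The n-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `ngrams` and the punctuation-stripping tokeniser are this sketch’s own, not the article’s actual code:

```python
import re

def ngrams(text, n):
    """Break text into contiguous n-word sequences after
    lower-casing and stripping punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("Who, then, in law is my neighbour?", 2))
# → ['who then', 'then in', 'in law', 'law is', 'is my', 'my neighbour']
```

The output reproduces the bi-grams listed above; passing n = 3 yields the tri-grams.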
This vocabulary of 2,000 n-grams was vectorised based on occurrence counts, and then scaled on a term frequency–inverse document frequency (TF-IDF) basis.[149] Term frequency refers to the absolute number of times an n-gram appears in a document; inverse document frequency refers to the logarithm of the inverse of the fraction of documents in the corpus in which the n-gram appears. The result of TF-IDF scaling is that n-grams which appear in many cases have a lower weight than n-grams which are equally frequent but appear in fewer cases. TF-IDF scaling is used to ensure that common n-grams that appear in many cases (and that are therefore unlikely to affect the outcome) are not assigned higher weights by the model.[150]
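A bare-bones version of the TF-IDF weighting just described might look as follows. This is a sketch of the textbook formula only; scikit-learn’s TfidfVectorizer (part of the library used in this article) adds smoothing, so its exact values differ:

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by its frequency in the document times the
    logarithm of the inverse fraction of documents containing it."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)                # absolute term frequency
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Toy corpus: "band" appears in two of three documents, so it is
# down-weighted relative to terms unique to a single document.
w = tfidf([["rape", "band", "two"], ["rape", "band", "three"], ["appeal", "allowed"]])
```

Here w[2]["appeal"] equals log 3 ≈ 1.10, while w[0]["band"] equals log(3/2) ≈ 0.41, illustrating that rarer terms carry more weight.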

  4. The classifier

The classifier is a linear SVM. Linear SVMs have been shown to achieve high performance on text classification tasks.[151] An SVM is a supervised learning model: a model which finds (or tries to find) a function that conforms to labelled training data. Specifically, an SVM divides data points into two classes by finding a hyperplane in x-dimensional space (where x is the number of features in the model) that is separated from the nearest vectors by as wide a margin as possible.[152] Suppose x = 2. The SVM would then try to find a line (a line is a hyperplane in 2-dimensional space) dividing the datapoints by class (known as the “decision boundary”). The goal of the algorithm which finds the decision boundary is for all datapoints of both classes to be on opposite sides of the decision boundary, with as wide a margin on both sides of the decision boundary as possible.[153] The principle is the same where x > 2.
Where there are multiple classes, such as in this case, a “one-vs-rest” classification is used to create one model for each class, where each model compares one class with all the other classes. Recall that the classes used by the four-class classifier are the bands from R v AM, and the classes used by the twelve-class classifier are the bands from R v AM with the placement of the case in the band. The fact this article addresses a multi-class classification problem distinguishes it from earlier work, which has looked at binary (that is, two-class) classification problems.[154] Like Medvedeva, Vols and Wieling,[155] the scikit-learn library for Python[156] was used to create the classifier, using the sklearn.svm.LinearSVC class available within that library.
Another feature of the models presented in this article that distinguishes them from the work of Aletras and others and Medvedeva, Vols and Wieling is that, whereas those models used balanced datasets (that is, there were as many violation cases as non-violation cases), the dataset presented here is not balanced. For example, there are more cases from band 2 (62) than band 1 (17). The number of cases in each class is shown in Table 2. To offset the class imbalance, the weights for different classes are balanced so there is a greater penalty for misclassifying cases in less populated classes and a lesser penalty for misclassifying cases in more populated classes.[157] Balancing penalties for misclassifying datapoints from different classes counteracts the problem identified by Medvedeva, Vols and Wieling that, where imbalanced datasets are used, a model which learns to classify all datapoints as belonging to the majority class can have a high accuracy.[158] Although high accuracy is generally desirable, a model which achieves a high accuracy by always predicting the majority class provides no useful information to the user.[159] Accuracy is not the be-all and end-all of a model’s utility.
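The class weighting just described can be illustrated with the “balanced” heuristic implemented in scikit-learn, which weights each class as n_samples / (n_classes × class count). The sketch below applies that heuristic to the four-class counts from Table 2 (the assumption that this particular heuristic was used is the sketch’s own):

```python
def balanced_weights(class_counts):
    """Weight each class inversely to its size, following scikit-learn's
    class_weight='balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * k) for c, k in class_counts.items()}

weights = balanced_weights({"band 1": 17, "band 2": 62, "band 3": 35, "band 4": 31})
# Band 1 (17 cases) is weighted about 2.13; band 2 (62 cases) about 0.58,
# so misclassifying a case from the sparsest band costs the model more.
```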

Band     Cases   Location in band   Cases
Band 1   17      Bottom             7
                 Middle             6
                 Top                4
Band 2   62      Bottom             26
                 Middle             23
                 Top                13
Band 3   35      Bottom             16
                 Middle             9
                 Top                10
Band 4   31      Bottom             19
                 Middle             8
                 Top                4
Table 2 Metrics of the dataset

  5. Performance evaluation

For both the four-class classifier and twelve-class classifier,[160] the accuracy was computed using stratified k-fold cross-validation. That is, the data was partitioned into k equal samples, and k different models were created, each using one sample as test data and the other k − 1 samples as training data. The accuracies of those k models were then averaged to produce the reported accuracy. In stratified k-fold cross-validation, each subsample is constructed such that the ratio of cases in each class in each subsample is as close as possible to the ratio of cases in the dataset as a whole. For both models, k = 5.
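Putting the pieces together, the pipeline described in this Part (TF-IDF over uni- to 4-grams capped at 2,000 features, a linear SVM with balanced class weights, and stratified five-fold cross-validation) might be sketched with scikit-learn as follows. The miniature “facts” and band labels are invented stand-ins for the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the facts sections and R v AM band labels.
facts = [
    "a single incident with one offender and no aggravating features",
    "prolonged and repeated offending against a vulnerable victim",
] * 15
bands = [1, 3] * 15

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4), max_features=2000),
    LinearSVC(class_weight="balanced"),   # penalise errors on rare bands more
)

# Stratified 5-fold cross-validation: each fold preserves the class ratio.
scores = cross_val_score(model, facts, bands, cv=StratifiedKFold(n_splits=5))
print(round(scores.mean(), 2))
```

On real data, the accuracies of the five folds would be averaged exactly as in the study, via scores.mean().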
The accuracies of the two models (four-class classifier and twelve-class classifier) were compared to two naïve models: (1) a naïve model which simply classifies all cases as belonging to the most populous class (that is, band 2 in the case of the four-class classifier and the bottom of band 2 in the case of the twelve-class classifier); and (2) a naïve model comprising a weighted random guess that adjusts its likelihood of guessing a class based on how populous that class is (for example, where a class makes up 10 per cent of the dataset, it has a 10 per cent chance of being guessed). Two different naïve models are used for comparison in this article, but not in the work of Aletras and others or Medvedeva, Vols and Wieling, because the classes in this case are imbalanced. With imbalanced classes, it is possible for a strategy of simply guessing the most populous class to produce a higher accuracy than a weighted random guess. That is true in this particular case.
The performance of each naïve model is calculated according to the following methodology. Let the classes themselves be labelled 1, 2, ... , n, and the class sizes as fractions of the whole dataset be labelled x1, x2, ... , xn. Calculating the accuracy of the most populous class model is straightforward: it is xi, where i is the most populous class, because the model will correctly guess datapoints in i 100 per cent of the time and datapoints not in i zero per cent of the time. The average accuracy of a weighted random guess is:[161]

x1² + x2² + ... + xn²

This is because the probability of correctly guessing a datapoint of class i is xi² (the probability of i being the correct class to guess is xi, the probability of i actually being guessed is also xi, and when multiplied the result is xi²). The average accuracy is the sum of the probabilities of getting true positives across all classes.
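The two baseline accuracies follow directly from the class fractions; a short check using the four-class counts from Table 2 (17, 62, 35 and 31 cases) reproduces the naïve figures reported below in Table 3:

```python
def baseline_accuracies(class_counts):
    """Accuracy of the two naive models: always guessing the most
    populous class, and a frequency-weighted random guess (sum of xi^2)."""
    total = sum(class_counts)
    fractions = [c / total for c in class_counts]
    return max(fractions), sum(x * x for x in fractions)

most_populous, weighted = baseline_accuracies([17, 62, 35, 31])
# most_populous is roughly 0.43 and weighted roughly 0.30,
# matching the naive baselines for the four-class classifier.
```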

  3. RESULTS AND DISCUSSION

The performance of the classifiers is shown in Tables 3 and 4. The twelve-class classifier achieved an accuracy that was only marginally higher, by three percentage points, than the naïve model. The four-class classifier, on the other hand, achieved an accuracy that was seven percentage points higher than the naïve model. This improvement reinforces the conclusions made in previous research that mathematical models can identify correlations between the facts of cases and their results.[162] Nonetheless, Aletras and others and Medvedeva, Vols and Wieling achieved better results with their models.

Model                 Accuracy
Classifier            0.50
Most populous class   0.43
Weighted guess        0.30
Table 3 Performance of the four-class classifier

Model                 Accuracy
Classifier            0.21
Most populous class   0.18
Weighted guess        0.11
Table 4 Performance of the twelve-class classifier

There are two possible reasons for this lacklustre result. First, the amount of training data was limited. The smallest dataset used by Aletras and others had two classes of 40 cases each (a dataset of 80 cases in total). Here, the smallest class for the four-class classifier was 17 cases and the smallest class for the twelve-class classifier was only four cases. This is partially due to the rarity of prosecutions for extremely serious sexual offending, but it is also due to the challenges associated with unstructured judgments and access to judicial information. Another attempt at this analysis with access to more data would be worthwhile. As previously stated, the offence of sexual violation by rape was selected because R v AM, in addition to setting out four bands,[163] identified the aggravating and mitigating features relevant to sexual violation.[164] This enabled a multi-class classification approach, as there are four separate bands into which a classifier can sort offending. It was also hoped that the explicit listing of aggravating and mitigating features reflected the established patterns of conduct in sexual offending cases, which would in turn make it suitable for automated classification. However, given sexual violation by rape is a very serious offence, it naturally follows that there are fewer cases of it than less serious offending. Therefore, a future attempt to use classifiers to predict sentences may have more success with a less serious offence where there are more available cases, even if there is less factual commonality between them.
The second possible reason explaining the result is that the unstructured nature of sentencing decisions makes it difficult for an algorithm to discern a pattern across cases, even after discarding cases that are unusable by virtue of having no clearly discernible section for the factual background. The difficulties with pattern recognition are worsened by the fact that first instance sentencing decisions are typically delivered orally to the offender. Oral delivery requires the judge to balance using language which is legally accurate and language which the offender can understand. In doing so, the judge must bear in mind that the defendant may lack the vocabulary necessary to understand legal writing. The competing objectives of legal accuracy and accessibility to laypeople, coupled with the fact that the judge is speaking directly to the offender, mean that sentencing decisions often contain an unusual mixture of both formal and colloquial language.[165]
In all likelihood, both the limited training data and unstructured nature of judgments have contributed to the mediocre accuracy of the classifier. Fortunately, fixing the latter problem would go some way to alleviating the former problem, as imposing structure on sentencing decisions would result in more usable data. The following section considers how that might be done.

  4. PROPOSALS FOR REFORM

This article, as well as previous research, demonstrates that judgment prediction from textual features, including sentencing decisions, may be possible with sufficient data.[166] Unfortunately, accuracy is reduced when the quantity of data is limited. The quantity of data can be limited by cases not appearing in legal databases, including those databases which are managed by the courts or the Ministry of Justice, such as the District Court website and Judicial Decisions Online. It can also be limited by cases having no sharp division between facts and legal reasoning, resulting in the case being unsuitable for a textual factual analysis.
In light of the dual concerns of unstructured judgments and lack of access to judicial information, this article proposes the introduction of formal structural requirements for all sentencing decisions in New Zealand. Such a structure, in respect of first instance decisions, might comprise the following sections:

(1) Introductory matters: This section could include an acknowledgement of the victim or the victim’s family and an explanation to the offender of the process the sentencing will take.

(2) Procedural matters: If there are unresolved procedural matters, these should be dealt with at the start in a separate section and not interwoven into other sections.

(3) Disputed issues of fact: The court should identify any issues of fact which the prosecution and defence have been unable to agree upon, and should resolve those issues prior to setting out the factual background in detail.

(4) Factual background: This section should contain every fact the court relies upon in the decision. New facts should not be introduced in any subsequent section.

(5) Reasoning as to the starting point: The court should state the starting point it takes and its reasons for adopting it. This includes any aggravating or mitigating factors of the offending that the court has identified and, if applicable, any references to similar cases. If there is a tariff case which sets out bands for the offence for which the offender is being sentenced, the band which the offending falls into should be identified.

(6) Reasoning as to uplifts and discounts: This section should set out any personal aggravating or mitigating factors identified by the court, and the respective uplifts or discounts applied. As per Moses v R, a reduction for a guilty plea should be considered at this stage.

(7) Reasoning as to sentences less than imprisonment, minimum period of imprisonment, or preventive detention: This section sets out the court’s reasoning as to whether to impose a sentence less than imprisonment, a minimum period of imprisonment, or preventive detention. Imprisonment is the most restrictive option in the hierarchy of sentences available to a court under the Sentencing Act.[167] If the court determines that a sentence of imprisonment is inappropriate, it may impose a less restrictive sentence. Conversely, a court may impose a minimum period of imprisonment in relation to a determinate sentence of imprisonment if satisfied that the period after which the offender would normally become eligible for parole is insufficient.[168] Furthermore, for certain offences, the High Court may impose preventive detention, and if so, must also impose a minimum period of imprisonment of not less than five years.[169]

(8) Result: This section is for the end sentence.

(9) Additional orders: If any additional orders must be made (for example, an order to destroy a firearm used in the offending), they should be dealt with in a separate section.
Importantly, each decision should contain all nine sections, with each section having its own heading. Sections should not be merged. If some sections are inapplicable to a particular case (for example, if the court does not intend to make any additional orders), those sections should still be included, with a short statement such as, “I make no additional orders”. The reason for requiring headings and separate sections is to create a clear delineation between each section. Alternatively, sections 4, 5, 6 and 8 of the proposed structure should always be included, with the option of adding the other sections in cases where it is appropriate.
Judges sometimes “signpost” the beginning of the factual background, using a phrase like “the facts are that” or “turning then to the facts”.[170] Sometimes they do not.[171] There is, at present, no universal phrase which signals the start of the factual background, much less any way to reliably identify the end of it. It is therefore not currently possible to automate the extraction of the factual background from a case.
Consistent use of standardised sections and headings (such as “Factual background” and “Reasoning as to starting point”) would enhance researchers’ capacity to analyse judgments using natural language processing. This is because the process of extracting the factual background from decisions could be simplified by writing a program to simply read from the string “Factual background” to the string “Reasoning as to starting point”.
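Such an extraction routine could be very short. The sketch below uses the heading strings from the proposed structure; the judgment text is, of course, hypothetical:

```python
def extract_section(judgment, start_heading, end_heading):
    """Return the text between two standardised headings,
    or None if either heading is absent."""
    start = judgment.find(start_heading)
    if start == -1:
        return None
    start += len(start_heading)
    end = judgment.find(end_heading, start)
    if end == -1:
        return None
    return judgment[start:end].strip()

judgment = (
    "Introductory matters\n[...]\n"
    "Factual background\nThe offender entered the house at night.\n"
    "Reasoning as to the starting point\nI adopt a starting point of nine years."
)
facts = extract_section(judgment, "Factual background",
                        "Reasoning as to the starting point")
# → "The offender entered the house at night."
```

Under the current, unstructured practice, no such pair of reliable start and end strings exists, which is precisely the problem the proposed headings would solve.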
It might be argued that the requirement to use consistent sections and headings would be too onerous and place unnecessary pressure on an already overloaded criminal justice system. However, the Sentencing Council for England and Wales already prescribes 10 steps to be followed in sentencing.[172] The Sentencing Council’s guidelines do not state that every step must be included in a decision (and not all steps are applicable to all cases). Under the proposed structure, where a section is inapplicable in a particular case, it can simply be included with a short statement that it is inapplicable. That would hardly be a great administrative burden.
There will, of course, be certain offences for which this structure does not work. For example, murder has its own process whereby the presumptive sentence is life imprisonment[173] with a minimum period of imprisonment of at least 10 years.[174] In certain circumstances, the minimum period of imprisonment must not be less than 17 years.[175] Furthermore, in recent cases it has also been necessary to consider whether the sentence ought to be life imprisonment with a very long (but finite) minimum period of imprisonment or life imprisonment without parole.[176] Special structures would be required for offences which have their own statutory process to be followed. Ultimately, it does not matter which structure is used, so long as the facts are cleanly separated from the legal reasoning and the structure is consistent across cases concerning the same offence.

  5. CONCLUSION

It is possible, at least in theory, to use machine learning to predict the starting point for a sentence, although there are practical difficulties in doing so. In this respect, decisions of international courts (specifically, the European Court of Human Rights) are easier to predict than decisions of the domestic courts in common law countries. This is because the prescribed structure of international court decisions makes it easier to separate facts from law than the unstructured format typical of common law decisions. This article’s attempts to use existing machine learning techniques in the New Zealand sentencing context, where judgments are unstructured, have produced models that perform better than naïve baseline models, but not by much. To better facilitate further research on judgment prediction, as well as consistency and predictability of sentencing, New Zealand should adopt a uniform structure for all sentencing decisions.


[*] LLB, BSc Cant. Law Clerk, Plymouth Chambers, Christchurch. The author expresses his gratitude for the helpful comments of the anonymous reviewer.

[1] Sentencing Act 2002, s 8(e).

[2] See Law Commission Sentencing Guidelines and Parole Reform (NZLC R94, 2006) at [32], which criticises New Zealand’s sentencing system as “highly permissive”, and notes that maximum sentences are “of only indirect and sometimes marginal relevance to day-to-day sentencing”.

[3] Fisheries Inspector v Turner [1978] 2 NZLR 233 (CA) at 237 per Richardson J.

[4] See R v Taueki [2005] NZCA 174, [2005] 3 NZLR 372 for New Zealand; and Sentencing Council “General guideline: overarching principles” (1 October 2019) <www.sentencingcouncil.org.uk> for England and Wales.

[5] Muldrock v The Queen [2011] HCA 39, (2011) 244 CLR 120 at [29].

[6] Wong v The Queen [2001] HCA 64, (2001) 207 CLR 584 at [66] per Gaudron, Gummow and Hayne JJ; and Adrian Staples “Some Reservations about the Use of Artificial Intelligence in Sentencing Decisions” (2020) 94 ALJ 949 at 955.

[7] Crimes Act 1961, s 128B.

[8] R v AM [2010] NZCA 114, [2010] 2 NZLR 750.

[9] See Tom M Mitchell Machine Learning (McGraw-Hill, New York, 1997) at 2, where an oft-cited definition of machine learning is given: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

[10] Muldrock v The Queen, above n 5, at [26].

[11] R v Taueki, above n 4.

[12] Sentencing Council, above n 4.

[13] Muldrock v The Queen, above n 5, at [26].

[14] R v Taueki, above n 4.

[15] R v Taueki, above n 4. See also Jeremy Finn and Debra Wilson Sentencing Law in New Zealand (Thomson Reuters, Wellington, 2021) at [6.2].

[16] R v Taueki, above n 4, at [28].

[17] At [30].

[18] At [44].

[19] Sentencing Act, ss 8 and 9.

[20] Moses v R [2020] NZCA 296, [2020] 3 NZLR 583 at [46].

[21] Hessell v R [2009] NZCA 450, [2010] 2 NZLR 298 at [14]; aff’d [2010] NZSC 135, [2011] 1 NZLR 607 at [73].

[22] “Aggravating” factors make offending more serious and “mitigating” factors make it less serious. Aggravating factors for sexual violation include violence greater than that inherent in the act of sexual violation itself, vulnerability of the victim, and the presence of multiple offenders: see R v AM, above n 8, at [38]–[42] and [45]–[46].

[23] R v Taueki, above n 4; and Crimes Act, ss 188 and 191.

[24] R v AM, above n 8; and Crimes Act, s 128B.

[25] Zhang v R [2019] NZCA 507, [2019] 3 NZLR 648; and Misuse of Drugs Act 1975, s 6(2).

[26] R v AM, above n 8, at [83].

[27] Law Commission, above n 2, at 29 (Recommendation 1).

[28] Warren Young and Andrea King “Sentencing Practice and Guidance in New Zealand” (2010) 22 Fed Sentencing Rep 254 at 259; and Graham Panckhurst “A Sentencing Council: Enlightened or Folly?” [2008] CanterLawRw 7; (2008) 14 Canta LR 191 at 192.

[29] As Panckhurst, above n 28, at 204 notes, shortly before a National-led government was elected, the National Party’s justice spokesman said: “National believes that we already have a body that tells judges how offenders should be sentenced. It’s called Parliament. So I’m announcing today that under a National Government there will be no Sentencing Council. There will be no extra layer of bureaucracy that is not needed.”

[30] Section 5.

[31] Young and King, above n 28, at 259.

[32] Statutes Repeal Act 2017, s 3(1).

[33] Muldrock v The Queen, above n 5, at [26].

[34] Wong v The Queen, above n 6, at [74] per Gaudron, Gummow and Hayne JJ and [129]–[131] per Kirby J, although Kirby J at [101]–[102] reserved his position on the two-stage versus instinctive synthesis argument.

[35] R v Williscroft [1975] VicRp 27; [1975] VR 292 at 300 as cited in Wong v The Queen, above n 6, at [75], n 131 per Gaudron, Gummow and Hayne JJ; and R v Thomson [2000] NSWCCA 309, (2000) 49 NSWLR 383 as cited in Wong v The Queen, above n 6, at [75], n 131 per Gaudron, Gummow and Hayne JJ.

[36] Wong v The Queen, above n 6, at [72]–[75] and [78] per Gaudron, Gummow and Hayne JJ.

[37] Markarian v The Queen [2005] HCA 25, (2005) 228 CLR 357.

[38] At [39].

[39] R v Markarian [2003] NSWCCA 8, (2003) 137 A Crim R 497 at [33] as cited in Markarian v The Queen, above n 37, at [52], n 59.

[40] Markarian v The Queen, above n 37, at [52].

[41] AB v The Queen [1999] HCA 46, (1999) 198 CLR 111 at [13]–[19].

[42] Markarian v The Queen, above n 37, at [72].

[43] At [71]–[72]. See also Wong v The Queen, above n 6, at [74].

[44] Markarian v The Queen, above n 37, at [71].

[45] At [39] per Gleeson CJ, Gummow, Hayne and Callinan JJ and [71]–[72] per McHugh J.

[46] At [132].

[47] At [112].

[48] At [51] per McHugh J as cited in Muldrock v The Queen, above n 5, at [26], n 72.

[49] At [117] per Kirby J.

[50] Sentencing Council “About the Sentencing Council” <sentencingcouncil.org.uk>.

[51] Sentencing Council, above n 4.

[52] See Sentencing Council, above n 4, specifically, step 5 (“Dangerousness”) and step 6 (“Special custodial sentence for certain offenders of particular concern”).

[53] Julian V Roberts “Sentencing Guidelines in England and Wales: Recent Developments and Emerging Issues” (2013) 76 Law & Contemp Probs 1 at 8–9.

[54] See for example Sentencing Council “Common assault / Racially or religiously aggravated common assault / Common assault on emergency worker” (1 July 2021) <www.sentencingcouncil.org.uk>, which provides guidelines for assault.

[55] Coroners and Justice Act 2009 (UK), s 120.

[56] Sentencing Act 2020 (UK), s 59(1).

[57] Criminal Procedure Act 2011, s 207.

[58] Section 106(2).

[59] Section 340.

[60] Sentencing Act, s 31.

[61] Sena v New Zealand Police [2019] NZSC 55, [2019] 1 NZLR 575 at [36].

[62] R v Connell [1985] NZCA 34; [1985] 2 NZLR 233 (CA) as cited in Sena v New Zealand Police, above n 61, at [36]; and R v Eide [2004] NZCA 215; [2005] 2 NZLR 504 (CA) as cited in Sena v New Zealand Police, above n 61, at [36].

[63] At [37].

[64] At [47].

[65] R v Awatere [1982] NZCA 91; [1982] 1 NZLR 644 (CA) at 647. But see Lewis v Wilson & Horton Ltd [2000] NZCA 175; [2000] 3 NZLR 546 (CA) at [85], where the Court held that while Lewis was not an appropriate case to revisit the rule, the Court wished to consider it “at an early opportunity”.

[66] See H L Ho “The judicial duty to give reasons” (2000) 20 JLS 42; and Michael Taggart “Administrative Law” [2000] NZ L Rev 439.

[67] Australian Timber Workers’ Union v Monaro Sawmills Pty Ltd [1980] FCA 43; (1980) 42 FLR 369 (FCA) at 373–374; and Flannery v Halifax Estate Agencies Ltd [1999] EWCA Civ 811; [2000] 1 WLR 377 (CA) at 381.

[68] Sentencing Act (UK), s 52; and Sentencing Council, above n 4, at step 9.

[69] R v R (D) [1996] 2 SCR 291 at [55].

[70] Mathilde Cohen “When Judges Have Reasons Not to Give Reasons: A Comparative Law Approach” (2015) 72 Wash & Lee L Rev 483 at 531.

[71] International Criminal Court Chambers Practice Manual (5th ed, 2022) at [86]; and Douglas Guilfoyle “The International Criminal Court Independent Expert Review: reforming the Court: Part III” (10 February 2020) EJIL Talk <www.ejiltalk.org>.

[72] European Court of Human Rights Rules of Court (adopted 23 June 2023), r 74.

[73] International Court of Justice Rules of Court (adopted 14 April 1978), art 95(1).

[74] Court of Justice of the European Union Rules of Procedure of the Court of Justice [2012] OJ L 265/1, art 87.

[75] Kevin D Ashley “Automatically Extracting Meaning from Legal Texts: Opportunities and Challenges” (2019) 35 Ga St U L Rev 1117 at 1118–1119.

[76] See Reed C Lawlor “What Computers Can Do: Analysis and Prediction of Judicial Decisions” (1963) 49 ABA J 337.

[77] See R Keown “Mathematical Models for Legal Prediction” (1980) 2 Computer LJ 829.

[78] Masha Medvedeva, Michel Vols and Martijn Wieling “Rethinking the field of automatic prediction of court decisions” (2023) 31 Artif Intell Law 195 at 208.

[79] Michele Taruffo “Judicial Decisions and Artificial Intelligence” (1998) 6 Artif Intell Law 311 at 318.

[80] Nikolaos Aletras and others “Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective” (2016) PeerJ Comput Sci 2:e93.

[81] Convention for the Protection of Human Rights and Fundamental Freedoms 213 UNTS 221 (opened for signature 4 November 1950, entered into force 3 September 1953) as amended by Protocol No 11.

[82] Aletras and others, above n 80, at 10.

[83] At 10.

[84] Masha Medvedeva, Michel Vols and Martijn Wieling “Using machine learning to predict decisions of the European Court of Human Rights” (2020) 28 Artif Intell Law 237 at 242.

[85] At 253.

[86] Harry Rodger, Andrew Lensen and Marcin Betkier “Explainable artificial intelligence for assault sentence prediction in New Zealand” (2023) 53 J R Soc NZ 133.

[87] That is, the predicted sentence differed from the actual sentence imposed by an average of 11.76 months: see Rodger, Lensen and Betkier, above n 86, at 139.

[88] At 136.

[89] At 136.

[90] At 141. Violent offending, which uses the tariff cases of R v Taueki, above n 4, at [34]; and Nuku v R [2012] NZCA 584, [2013] 2 NZLR 39 at [38], only has three bands. Sexual violation has four bands: R v AM, above n 8, at [90] and [113]. Therefore, the existence of the n-gram “band 4” in a judgment tells the classifier that, not only is this a sexual case, but it is one of the most serious sexual cases (as band 4 is the highest band).

[91] Rodger, Lensen and Betkier, above n 86, at 141.

[92] R v Taueki, above n 4, at [28].

[93] At [30].

[94] Medvedeva, Vols and Wieling, above n 84, at 249.

[95] Aletras and others, above n 80, at 10.

[96] Medvedeva, Vols and Wieling, above n 84, at 249.

[97] Rodger, Lensen and Betkier, above n 86, at 144.

[98] Medvedeva, Vols and Wieling, above n 78, at 208.

[99] At 207.

[100] A “feature”, in the context of machine learning, is an attribute associated with an input to the model.

[101] Rafe Athar Shaikh, Tirath Prasad Sahu and Veena Anand “Predicting Outcomes of Legal Cases based on Legal Factors using Classifiers” (2020) 167 Procedia Comput Sci 2393 at 2396–2397.

[102] Navid Bagherian-Marandi, Mehdi Ravanshadnia and Mohammad R Akbarzadeh T “Two-layered fuzzy logic-based model for predicting court decisions in construction contract disputes” (2021) 29 Artif Intell Law 453 at 461–464.

[103] Duncan I Simester and Roderick J Brodie “Forecasting criminal sentencing decisions” (1993) 9 Int J Forecast 49.

[104] R v Clark [1987] 1 NZLR 380 (CA) at 383.

[105] Simester and Brodie, above n 103, at 50.

[106] At 52.

[107] At 53.

[108] R v Ngawhika (1987) 2 CRNZ 433 (CA) at 435.

[109] Simester and Brodie, above n 103, at 53.

[110] At 58.

[111] At 58–59.

[112] Daniel Martin Katz, Michael J Bommarito II and Josh Blackman “A general approach for predicting the behavior of the Supreme Court of the United States” (2017) 12(4) PLoS One 1 at 5.

[113] That is, given an article of the Convention and the names of the judges who decided a case about that article, predict whether the judges found that a violation occurred.

[114] Medvedeva, Vols and Wieling, above n 84, at 260.

[115] Sentencing Reform Act of 1984 Pub L No 98–473, § 212(a)(2), 98 Stat 1987 at 1987 (1984) (under which the Federal Sentencing Guidelines were produced); and United States v Booker [2005] USSC 593; 543 US 220 (2005) (which reduced the status of the guidelines from mandatory to advisory) as cited in Michael E Donohue “A Replacement for Justitia’s Scales?: Machine Learning’s Role in Sentencing” (2019) 32 Harv J L & Tech 657 at 670, n 94.

[116] At 672.

[117] At 677.

[118] Monika Zalnieriute Technology and the Courts: Artificial Intelligence and Judicial Impartiality (submission to Australian Law Reform Commission Review of Judicial Impartiality, 16 June 2021) at 4.

[119] Nigel Stobbs, Dan Hunter and Mirko Bagaric “Can Sentencing Be Enhanced by the Use of Artificial Intelligence?” (2017) 41 Crim LJ 261 at 261–262.

[120] At 262.

[121] At 272.

[122] Staples, above n 6.

[123] At 955.

[124] Wong v The Queen, above n 6, at [66] per Gaudron, Gummow and Hayne JJ.

[125] Markarian v The Queen, above n 37, at [56].

[126] Compare Markarian v The Queen, above n 37, at [130], where Kirby J critiqued that notion as “the thought that there descends upon a judicial officer, following appointment, a mystical ‘instinct’ or ‘intuition’ that ensures that he or she will get the sentence right ‘instinctively’”. Kirby J continued: “That approach discourages explanation of the logical and rational process that led to the sentence, so far as it can reasonably be given and is useful.”

[127] Stobbs, Hunter and Bagaric, above n 119, at 264. This reference to more than 200 aggravating and mitigating factors refers to Australian law and spans all offences. If the scope were narrowed to a single offence at New Zealand law for which there is a tariff case (eg sexual violation, where R v AM, above n 8, at [34]–[64] identified nine aggravating and three mitigating factors), human calculation might come within the realm of possibility.

[128] Rodger, Lensen and Betkier, above n 86.

[129] Aletras and others, above n 80, at 4.

[130] A .txt file contains plain text only, whereas a .pdf file contains text as well as formatting (font, size, line spacing, etc). Formatting is irrelevant for this particular natural language processing task.

[131] Crimes Act, s 128(1).

[132] Section 128(2).

[133] Section 2 definition of “sexual connection”.

[134] R v AM, above n 8, at [65].

[135] At [90].

[136] At [34]–[87].

[137] Taruffo, above n 79, at 318.

[138] Aletras and others, above n 80; and Medvedeva, Vols and Wieling, above n 84.

[139] Although the “rape bands” defined in R v AM, above n 8, are broader than rape, including sexual violation by penile penetration of the mouth or anus, and violation with objects, this research only examines cases of rape as defined in s 128(2) of the Crimes Act.

[140] European Court of Human Rights, above n 72, r 74.

[141] See for example R v T [2015] NZHC 3057 at [3]: “I will not go through the factual background in detail. It is summarised in the Crown’s submissions, which are accepted by [defence counsel] as accurate.” Brewer J then gave what his Honour described as “an overall picture” in his sentencing notes.

[142] Senior Courts (Access to Court Documents) Rules 2017, r 8(3)(d); and District Court (Access to Court Documents) Rules 2017, r 8(2)(d).

[143] Some District Court cases do appear in the Westlaw NZ database generally—just not in the sentencing tracker.

[144] R v Clark [2017] NZDC 14873 at [5]–[12].

[145] Sida Wang and Christopher D Manning “Baselines and Bigrams: Simple, Good Sentiment and Topic Classification” in Haizhou Li and others Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Volume 2: Short Papers (Association for Computational Linguistics, Jeju Island (South Korea), 2012) 90 as cited in Aletras and others, above n 80, at 9.

[146] The list of stop words comes from the Natural Language Toolkit library for Python: see Steven Bird, Ewan Klein and Edward Loper Natural Language Processing with Python (O’Reilly, Sebastopol (Calif), 2009) at [2.4.1].
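A minimal, self-contained sketch of the stop-word filtering described at n 146. The handful of stop words below is hardcoded from NLTK's English list so that no corpus download is needed, and the sentence is Lord Atkin's (per n 148); neither reflects the article's actual preprocessing pipeline.

```python
# A few entries from NLTK's English stop-word list, hardcoded here
# so the sketch runs without downloading the NLTK corpora.
stop_words = {"the", "that", "you", "are", "to", "your", "a", "is", "in"}

sentence = "the rule that you are to love your neighbour"
tokens = sentence.lower().split()

# Stop words carry little topical meaning, so they are filtered out
# before the remaining tokens are vectorised.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['rule', 'love', 'neighbour']
```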

[147] Christopher D Manning and Hinrich Schütze Foundations of Statistical Natural Language Processing (MIT Press, Cambridge (Mass), 1999) at [6.1.2].

[148] The example sentence is taken from the seminal speech of Lord Atkin in Donoghue v Stevenson [1932] AC 562 (HL) at 580.

[149] TF-IDF is calculated according to the formula given in Scikit-Learn “sklearn.feature_extraction.text.TfidfTransformer” <www.scikit-learn.org>: tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, t is a term, tf(t, d) is the frequency of t in document d, n is the total number of documents, and df(t) is the number of documents that contain t.

[150] For example, the unigram “rape” appears very frequently throughout the dataset. But, since it also appears in every single case in the dataset, its presence is not indicative of a particular band. Using TF-IDF, it is assigned a lower weight than it would be assigned if TF-IDF were not used.
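The effect described in n 150 can be reproduced directly from the smoothed idf formula that scikit-learn's TfidfTransformer uses by default. A toy corpus (not the article's dataset) is enough to show that a term appearing in every document receives the minimum weight:

```python
import math

# Toy corpus of tokenised "documents". "common" appears in every
# document; "rare" appears in only one.
docs = [
    ["common", "rare", "term"],
    ["common", "other", "words"],
    ["common", "more", "words"],
]
n = len(docs)  # total number of documents

def idf(term):
    # Smoothed inverse document frequency, as in scikit-learn's
    # TfidfTransformer: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

# A term present in every document gets the minimum idf weight (1.0),
# so its tf-idf score is discounted relative to distinctive terms.
print(round(idf("common"), 3))  # 1.0
print(round(idf("rare"), 3))    # 1.693
```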

[151] Neli Kalcheva, Milena Karova and Ivaylo Penev “Comparison of the accuracy of SVM kernel functions in text classification” (paper presented at the International Conference on Biomedical Innovations and Applications, Varna, Bulgaria, 2020) 141 at 143.

[152] The datapoints close to the hyperplane are called support vectors. Hence, the model is called a support vector machine.

[153] Vladimir N Vapnik Statistical Learning Theory (Wiley, New York, 1998) at 401–403 and 421–425. For a more layperson-friendly introduction to SVMs, see Andreas C Müller and Sarah Guido Introduction to Machine Learning with Python: A Guide for Data Scientists (O’Reilly, Sebastopol (Calif), 2016) at 92–97.
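For readers following nn 152–153, a minimal scikit-learn sketch on toy two-dimensional data (not the article's text features) showing that the fitted model retains only the datapoints nearest the hyperplane as support vectors:

```python
from sklearn.svm import SVC

# Two linearly separable classes in two dimensions.
X = [[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Only the training points closest to the hyperplane are retained;
# these are the "support vectors" that give the model its name.
print(len(clf.support_vectors_), "of", len(X), "points are support vectors")
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # [0 1]
```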

[154] See Aletras and others, above n 80, at 9; and Medvedeva, Vols and Wieling, above n 84, at 247.

[155] Medvedeva, Vols and Wieling, above n 84, at 250 and 252.

[156] Fabian Pedregosa and others “Scikit-learn: Machine Learning in Python” (2011) 12 J Mach Learn Res 2825.

[157] See Xulei Yang, Qing Song and Yue Wang “A Weighted Support Vector Machine for Data Classification” (2007) 21 Int J Pattern Recognit Artif Intell 961, where this technique was first presented.

[158] Medvedeva, Vols and Wieling, above n 84, at 247.

[159] See Müller and Guido, above n 153, at 277–278. Suppose one has a binary classification task with a dataset that is imbalanced 99:1 (so, the dataset contains 99 instances of the majority class for every instance of the minority class). Then a model which always predicts the majority class will have a high accuracy (99 per cent), despite not identifying any instances of the minority class. Now suppose the task is to screen patients for cancer. The majority class is patients without cancer and the minority class is patients with cancer. The consequences of a false negative (letting cancer go undetected) are more serious than a false positive (flagging the patient as having cancer when in fact the patient does not). Thus, in this hypothetical example, although the majority class classifier has an accuracy of 99 per cent, it is wholly unfit for purpose.
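The 99:1 hypothetical in n 159 can be verified with a few lines of arithmetic. This is a sketch of the degenerate majority-class baseline only, not of any model used in the article:

```python
# Imbalanced binary dataset: 99 instances of the majority class (0)
# for every instance of the minority class (1).
labels = [0] * 99 + [1]

# A degenerate baseline that always predicts the majority class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: what fraction of true positives
# does the classifier actually detect?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
minority_recall = true_positives / sum(labels)

print(accuracy)         # 0.99
print(minority_recall)  # 0.0
```

High accuracy coexists with zero recall on the class that matters, which is why the imbalance-aware measures discussed in the surrounding notes are preferred.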

[160] The bands are those defined in R v AM, above n 8. Recall that the four-class classifier predicts into which band (1–4) the offending falls, while the twelve-class classifier predicts both the band and the location within it (top, middle or bottom).

[161] José A Sáez, Bartosz Krawczyk and Michał Woźniak “Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets” (2016) 57 Pattern Recognit 164 at 170; and C Ferri, J Hernández-Orallo and R Modroiu “An experimental comparison of performance measures for classification” (2009) 30 Pattern Recognit Lett 27 at 29.

[162] Aletras and others, above n 80; Rodger, Lensen and Betkier, above n 86; and Simester and Brodie, above n 103.

[163] R v AM, above n 8, at [90].

[164] At [34]–[64].

[165] See for example R v Vaughan [2017] NZDC 23583 at [12], where Judge Edwards explained the tariff case R v AM in the following terms: “The Court of Appeal outlined a number of what are called culpability assessment factors or factors which make the offending worse.” In the context of grievous bodily harm offending, Downs J in R v Hambly [2023] NZHC 2506 at [25] recently described mitigating factors to a lay litigant: “I now turn to mitigating factors—things that make your offending less serious.”

[166] Aletras and others, above n 80; Medvedeva, Vols and Wieling, above n 84; and Rodger, Lensen and Betkier, above n 86.

[167] Sentencing Act, s 10A(2).

[168] Section 86.

[169] Sections 87 and 89.

[170] See for example R v Te Huna [2017] NZDC 23813 at [3] (“The facts are that many years ago ...”); and R v Akuhata [2017] NZDC 1388 at [3] (“Turning then to the facts of this matter ...”).

[171] See for example R v Taylor DC New Plymouth CRI-2010-043-3525, 13 February 2012 at [2], where the exposition of the facts began with “In broad terms you struck up a relationship with your victim at the Icons bar ...” without signposting.

[172] Sentencing Council, above n 4.

[173] Sentencing Act, s 102.

[174] Section 103(2).

[175] Section 104; and see R v Williams [2004] NZCA 328; [2005] 2 NZLR 506 (CA) at [52]–[54] which prescribed a two-stage process for cases involving s 104.

[176] See for example R v Lothian [2019] NZHC 2938; R v Tarrant [2020] NZHC 2192, [2020] 3 NZLR 15; and R v Epiha [2021] NZHC 3394.


URL: http://www.nzlii.org/nz/journals/NZLawStuJl/2023/7.html