There exist different types of moderate discourses; some are more normative, others more diplomatic, assertive, neutral or evasive. We find that Llama2 is more prone to normative discourse, while GPT 3.5 and 4 use a variety of normative, diplomatic and academic discourses. Mistral, BLOOM and Google FLAN will tend to be assertive or simplistic. Different discourses are used in response to different types of questions; when prompted about identity-related issues, for example, it is more likely that normative discourse is used. Such patterns indicate that different LLMs hold different “attitudes” about the issues inquired by prompts.
Different LLMs answer to controversial questions — or “controversy prompts” — with a variety of more or less standardised forms of moderate speech. While some, like Llama2, recur primarily to the language of social justice, GPT 3.5 and 4 are modelled after more diplomatic and academic discourses. Others lack moderation altogether.
Still, we find that when queried in different languages, LLMs lose their consistency. Sometimes the differences are more fundamental, providing diverging answer to controversial yes or no questions. Other times the differences are mainly discursive, where for example GPT 3.5 may use normative language in English but choose more academic or diplomatic responses in Chinese. Of all LLMs, we found that Llama2 had the most inconsistencies in spite of using more proactive, activistic language.
For the past year, using large language models (LLMs) for content moderation appears to have presented solutions to some of the perennial issues of online speech governance. Developers have promised ‘revolutionary’ improvements (Weng, Goel and Vallone, 2023), with large language models considered capable of bypassing semantic and ideological ambiguities that hinder moderation at scale (Wang et al., 2023). In this endeavour, LLMs are trained to generate “moderate speech” – that is, not to utter offensive language; to provide neutral, balanced and reliable prompt outputs; and to demonstrate an appreciation for complexity and relativity when asked about controversial topics.
But the search for optimal content moderation obscures broader questions about what kind of speech is being generated. How does generative AI speak “moderately”? That is: under what norms, training data and larger institutions does it produce “moderate” language?
At present, content moderation training of LLMs is done with a variety of data training techniques. One consists in abstaining from objectionable content as training data. OpenAI, for example, often reiterates that GPT models stay clear of content from the “dark web” (Schade, 2023). In the absence of offensive language, GPT models are also trained to generate counter-speech when prompted about potentially offensive topics (OpenAI, 2023). Requests for autocompletions like “Israelis are” or “Palestinians are”, for example, swiftly generate eulogies for the history and diversity of each population. Another example is the generation of balanced and descriptive responses, where GPT 3.5 and above outline the premises, positions and background of groups involved in controversies (Khatun and Brown, 2023). But where there are attempts at innocence, there is often inadvertent hypocrisy: LLMs are known to “jailbreak” from moderation when given trick-questions, controversial autocompletions, or when they become the characters of provocative fiction (what Jacomy et al. study as “perturbation engines” ) (Chao et al., 2023).
Another method relies on training LLMs to detect and adjudicate offensive content. It consists in aggregating many different policy documents as a form of training data for models to discern various types of “objectionability” and sentence them accordingly. Decades of platform policy documents, including hundreds of thousands of subreddit rules (Kumar, AbuHashem and Durumeric, 2023), can serve as valuable taxonomies of objectionable content and enforcement techniques — from permanently suspending “invariable extremes” such as pornography and gore violence, to “providing context” to complex and variable issues like COVID-19 (de Keulenaar, Magalhães and Ganesh, 2023). Additional training techniques involve a human-in-the-loop, as OpenAI recently promised with GPT-4 (Weng, Goel and Vallone, 2023). GPT-4’s moderation model is tasked with labeling good and problematic prompts, and then learns from comparisons with labels made by policy experts, who may further clarify their annotations and policy documents to increase intelligibility with the model.
A third method that is gaining traction in policy circles — and AI labs themselves — is red teaming. This technique consists in using the public, AI lab employees or other users to essentially poke at an unmoderated LLM model with prompts that might lead to controversial or "harmful" answers (Metcalf and Singh, 2023). Metcalf and Singh point to how, despite having been given confidence by governmental institutions (including the White House), this method might be faced with the difficult task of having to predict the incommensurably wide kinds of "human mischief" that might break an LLM moderation system.
Indeed, the challenge may lie less in the efficiency of AI-assisted content moderation or the discursive and ethical quality of “moderate speech”, than in the very model of content moderation LLMs are trained on. Historically, to moderate has (also) meant to manage conflict, debate, and the irreconcilable cultures, ideas and knowledge from which may eventually derive objectionable content — be that offensive language, conspiratorial narratives or benign misinformation stemming from polarized issue spaces (Williams, 2023). Yet, to train “moderate speech” as is currently done arguably follows a model of platformization (Denkena and Luitse, 2021), in the sense that it consists in aggregating enormous amounts of diverse and noisy voices from the Internet by harmonizing their divergences through descriptive discourse, or concealing highly partial ones from prompt responses (Howard and Kira, 2023). This can also be argued about red teaming, which ultimately determines what a "controversial" prompt or prompt response are under morally, historically, culturally or otherwise limited lenses. What is a controversial prompt to begin with? Where does an AI lab draw the line between what may be prompted by users — and expressed by their own models — in the context of profoundly contentious negotiations for the meaning and legitimacy of public speech norms? Public debates about politically, socially or other sensitive topics will tend to preclude a certain amount of taboos, which may necessarily have to be expressed in some form of another for such debates to be resolved.
It is interesting, in this sense, to use LLMs and their added layer of moderation as an avatar of a given "collective unconscious" and its moral codes — that is, the underlying impulses, fears and aggressions of all the voices on which LLMs are trained on, as well as the moral imperatives intended to refrain them from unlimited and harmful expression. Thus, the question posed by this research is not so much deontological (i.e., we do want to determine what LLMs should or not should not say), but moreso cultural, in the sense that we look at how LLMs reproduce the vulnerabilities and inconsistencies of contemporary speech norms. We may find, for example, that though LLMs are trained to show a strong stance against problematic prompts, it will itself struggle not to replicate the very same problematic aspects of that prompt when answering it. This may not necessarily be due to an insufficiency in moderation training, but in the internal moral contradictions of the speech norms it is trained to replicate. Norms around what should and should not be expressed are imperfect and impermanent; they vary based on the definitions one attributes to key concepts like hate speech (racism or other), dis- and misinformation, and more. As they contain internal moral and other contradictions, so will the kinds of "moderate speech" that LLMs, as "stochastic parrots", will generate (Bender et al., 2021).
By examining the regulatory frameworks of LLM labs, comparing responses to moderation prompts across three LLMs and looking deeper at their training datasets, this project seeks to shed light on the norms, techniques and regulatory cultures around the generation of “moderate speech” and content moderation. This specific project focuses on using digital methods (Rogers, 2013) and contributions from content moderation studies to examine how “moderate speech” has evolved in five LLM models — GPT 3.5 and 4, Llama 2, Google FLAN T5, BLOOM and Mistral.
What kinds of “moderate speech” do GPT 3.5, 4, Llama2, Mistral (Dolphin), BLOOM and Google FLAN T5 generate when prompted with controversial questions from the Web?
How does LLM-generated moderate speech compare across LLM models and the languages of prompts?
What moderate discourses do LLMs generate based on the theme or level of controversiality of a given prompt?
As specified above, the goal of this study is to evaluate how different LLMs generate content moderation when confronted with controversial queries across different languages. For the purposes of this winter school only, this project uses a host of digital methods (Rogers, 2013) and contributions from content moderation studies to examine how “moderate speech” has evolved in five LLM models — GPT, Llama, Mistral, BLOOM and Flan—across three different languages—English, French and Chinese. Though none of these languages represent minority training datasets of these LLMs, we chose them because we could speak and interpret them natively. They are also spoken in numerous parts of the world, and may thus allow scrutiny of LLM answers to prompts about issues, populations or other topics outside of Europe and the United States (namely East Asia and the North, Centre-West and South of Africa).
There were three steps to this analysis.
The first was to examine LLM regulatory and training frameworks. This implies studying a model’s architecture and training through documentation scattered in code repositories (Hugging Face, Github), academic papers, interviews with LLM lab developers, or other documentation. In one sense, this analytical step involves getting a sense of what is considered objectionable by a given LLM lab, what moderation techniques are designed by these labs to mitigate said content, and what regulatory frameworks these labs rely on. From this step, one could derive a taxonomy of problematic content generation and a variety of mechanisms and techniques used by an LLM to counter or abstain from it. Comparisons across LLM labs allows one to distinguish the different regulatory and ethical cultures that characterise each model’s moderation training.
The content moderation policies for the selected LLMs in the course of this study were used as guidelines to define the characteristic elements of a prompt that would trigger moderation. This included OpenAI and Meta AI’s Usage policies (Meta, 2024; OpenAI, 2024), Google Research and bigscience’s Licence pages (bigscience, 2022; Google Research, 2024), and Mistral AI’s Terms of Use (Mistral AI, 2024). From these policies, we derived the types of problematic prompts each LLMs sanctions and the specific topics or issues that they are trained to detect as problematic. Types of problematic prompts or usages of prompt outputs include: using LLMs to produce adult, illegal, and harmful content, such as “content that promotes or glorifies violence” or that “expresses, incites or promotes hate based on identity” (bigscience, 2022; Meta, 2024; OpenAI, 2024); medical information; or mis- or disinformation activities such as coordinated inauthentic behaviour (OpenAI, 2024). Each of these usages and types of information were used for prompt engineering, in the sense that we actively designed and searched for prompts that would break these policy rules. We named these “controversy prompts”.
By “controversy prompts”, we refer to questions that will prompt the LLM to generate some form of “moderate speech”. This means probing an LLM about questions that lack widespread consensus and often call for the adoption of opinions, positions, arguments or even “deep disagreements” (Fogelin, 1985). Another form of controversy prompt is autocompletions, which have typically been used to study bias and moderation in Google searches (Rogers, 2023). As above, one can design autocompletions for issues that lack consensus across languages, such as to whom belong contested lands in the languages of two conflicting groups (see, for example, de Keulenaar et al., 2023). This is why prompting LLMs in more than one language is crucial.
But how does one obtain “controversy prompts”? While one way is to design prompts that touch upon controversies we may have some expertise in, the result would not be representative of the kinds of controversial questions an LLM user would be asking LLMs. There are methods to obtain prompts users have used with image-generating AIs, such as Midjourney, from spaces like Discord. But the same cannot be said about prompts used on chatGPT or other personal uses of other LLMs.
The second step to this project, then, was to find a representative sample of controversial queries that could be used as “controversy prompts”. The solution we opted for was to retrieve prompts from Web spaces where controversial questions tend to be asked, or where one could formulate such questions from. The Web is replete with Q&A forums for users to ask questions with varying levels of discretion. One could, for example, collect all questions asked on 4chan/pol and other forums prone to discussing controversies and taboo subjects – but they would be faced with hours if not days of data cleaning and filtering. Instead, we chose to look at spaces where the primary purpose for users is to ask questions – especially those they deem controversial – as well as to discuss these in the form of debates. More specifically, we chose to collect user-generated questions from Reddit (a source of questions in English, French and Mandarin Chinese); Zhihu (a source of questions in Mandarin Chinese); and Wikipedia (a source of questions posed in debates about historical and other “facts”, particularly relevant for designing prompts related to historical, political, epistemic and other controversies). We detail how each of these spaces were used as sources of “controversy prompts” below (see Appendix 1 for a detailed list of sources).
On Wikipedia, we first looked for the top ten most edited pages in English (Wikipedia, 2024b), French (Wikipedia, 2024a) and Mandarin Chinese (Wikipedia, 2024c). Some of these pages existed on Contropedia, where one can visualise the most edited, discussed and overall “controversial” items – be those dates, concepts, people, or other. By locating the most controversial elements of a page, we were able to formulate corresponding controversy prompts. For example: one of the most edited pages in English-speaking Wikipedia is COVID-19 (Wikipedia, 2024b). Contropedia shows that one of the most edited or discussed parts of this page is the term “Wuhan” as the origin or key location of COVID-19. One plausible prompt that could encompass such debates would be: “What is the origin of COVID-19?”. Likewise, when an article did not exist on Contropedia, we would consult its “Talk” section and look to formulate a question that could encompass debate(s) between editors. This form of prompt engineering would benefit from more precision, but was experimented in this project as an attempt to use Wikipedia as a source of controversy for prompt engineering.
On Reddit, we first made a triage of Q&A subreddits representative of English, French and Mandarin Chinese-speaking users. There are two such kinds of subreddits. One kind is used solely for Q&A (r/askReddit, r/tooafraidtoask, r/questions) and the other is used to sometimes pose questions related to English, French and Mandarin Chinese-speaking spaces (such as questions posted in r/askAfrica, r/France, r/Quebec or r/China). Given the present lack of access to Reddit’s API, we resorted to scraping the top 20 most engaged and controversial questions posted in the 20 most subscribed Q&A subreddits and the equivalent in country-specific subreddits. To prevent obtaining questions about non-controversial topics in general Q&A subreddits, we also searched for results that mentioned items sanctioned in LLM policy documents. This includes identity and discrimination-related topics (meaning, queries relating to ethnic, national and religious identities, such as “jew”, “muslim”, “white”, “black”, “women”, “gay”, etc.) One opens Reddit’s search engine in each subreddit page, sorts questions by “top” and “controversial” at “all time” and scrapes the first 25 results in general and for those queries. In country-specific subreddits, one searches for post titles containing interrogation marks to filter questions only, and repeats the procedure described above.
The exact same method could not be applied to Zhihu.com, because it is not organised by sub-groups of the same kind as Reddit. Zhihu resembles Quora, where the main page – past login – is a list of the most popular and recent questions posed by users. Here, then, we needed to specify a query from which to obtain popular questions. This was done with the same procedure as above, where we use queries that refer to issues sanctioned by LLM moderation policies. Engagement numbers comprise a quantity of “agreements” for the top answer to a question, as well as “likes”. We scraped the 20 questions with overall engagement, complemented by a minimal manual selection to filter out less irrelevant discussions.
Since most results were open questions, we also added autocompletes and jokes to our list of controversy prompts. These types of prompts were designed on the basis of LLM policy documents; they were requests for jokes about national, ethnic, religious, gender and other identities, and were written manually.
By the end of this process, we obtained a total of 384 original prompts (see Appendix 2). To later compare prompt results across languages, we manually translated each of these prompts and were then manually translated into English, Mandarin Chinese and French. We also give prompts “types” (question, joke, autocomplete) and “themes” (identity and discrimination, history, misinformation-related, etc.), so as to eventually derive patterns between the answers LLMs give to prompts, the type of prompt being answered, and the topic it touches up. After translation, there were a total of 1,046 prompts (see Appendix 3) processed with all five LLMs via Prompt Compass (Borra, 2024), where all models were set to an “industry standard” temperature of 0.007 and answer length of ~250 tokens.Table 1. Total number of original (non-translated) prompts per theme.
Figure 1. Sample list of controversy prompts. Full size image here.
Our third methodological step was to compare responses to controversy prompts and autocompletions between GPT, Llama and BERT or BLOOM, across multiple versions, in English, Mandarin Chinese and French. To do so, we first close-read prompt responses in English, French and Mandarin Chinese, and found that different LLMs tended to answer to prompts with different “moderation discourses”. There were 6 moderation discourses: diplomatic, normative, assertive, prescriptive, academic, and refusals to answer. Each of these “discourses” is explained in the Findings section.
Once this was done, we sought to find patterns between (1) the types of moderation discourses used by LLMs in reply to (2) the type and theme of a given prompt; and (3) the LLMs in question. To do so, we first needed to estimate what moderation discourses were most used by an LLM by calculating the amount of times certain words or phrases were mentioned in prompt answers. We devised a list of frequent words or phrases that fit each of these discourses (see Appendix 4) in each language. This list needs improvement, as each type of discourse is oftentimes too subtle or complex to be detected in a set of words. Some words, like “Yes” or “No”, yielded too many false positives without strict detection parameters. These issues reproduced skewed results that need further corrections in future research (see Findings section).
In an attempt to perform more close reading analysis, we compared the types of “moderation discourses” used across LLMs in different languages. This was done by first manually selecting a diverse sample of 15 prompt responses. “Diversity”, here, meant a varied set of responses given by LLMs in all languages. These included: answers that diverge across all LLMs; answers that converge across LLMs; answers where an LLM will assert something in one language, but not in another; situations where all LLMs refuse to answer; situations where LLMs appear to jailbreak from moderation; others where LLMs use different moderation discourses for the same prompt; and so on. A first comparison across models allowed us to expose distinctive “mannerisms” characterising each LLM, i.e., ways in which each LLM tends to answer controversial prompts. A second comparison (between answers in different languages, by the same LLM) revealed discrepancies between the French, Chinese and English training datasets.
One final analysis consisted in using OpenAI ’s Playground interface to examine how different iterations of one same model, GPT, answered to the same set of controversial prompts over time. Unlike the API, the Playground offers detailed information of the probability – and thus confidence – that each token is generated in response to a prompt. This information is valuable for studying how confident a model is in answering a prompt that may trigger problematic responses, or whether a perfectly moderated response is generated confidently at all. Different iterations of GPT were Babbage-002, davinci-002, davinci-003 (unavailable in the Playground) and GPT 3.5. The parameters used in the Playground were the same as above: the temperature was set to 0.007, and the length of answers to around 250 tokens. We copied and pasted each response in a table, shown below in the Findings section.
A fourth step, which we reserved for future research, would have been to examine the training datasets of the five models in order to scrutinise each of their moderation biases. Biases may emerge in the form of reductionism, partisanship, or absence of opposition and minority voices in a controversy prompt response. Where biases emerge, one can return to the training datasets — such as Common Crawl, Webtext2, Wikipedia or other — to check, where possible, what view is most represented in relation to the queried controversy. Studying such biases can amount to documenting LLM “data gaps”, in the sense of poorly resourced data such as partial Wikipedia articles or highly biased search results appearing in Common Crawl. This documentation can provide LLM labs with feedback on how to complement gaps with combinations of different datasets, including in multiple languages, or add undigitized data (as it often applies to minority languages) to their training sets.
Once LLM answers were collected from PromptCompass, we initiated a first survey of their linguistic components, both semantic and syntactic. It appeared the LLMs employed formulaic language—fixed figures of speech indicating a discursive tactic—when moderating the queries of the users. A lexicon of the formulaic expressions employed by the LLMs was constructed based on their recurrence. Sentence segments such as “As an AI, I don’t...”, “important to note...” or “respectful conversation” were identified and then divided into the following seven types of "moderate speech":
The distribution of moderation discourses across models allowed us to identify a number of “mannerisms” across models – that is, how models tend to combine different moderation discourses; which moderation discourses they tend to use most; and so on. The following section describes some of the mannerisms of the models we studied.
Comparisons across models for most types of prompts indicate a more refined approach from the GPT models. Multiple types of discourses are mobilised to moderate prompts which can be considered as the indicator of the model’s more adaptive character. History-wise, GPT4 has significantly longer moderate responses compared to GPT 3.5. GPT usually adheres to a language of neutrality characterised by academic and diplomatic discourse and uses abstention language more than Llama.
Llama2 speaks in the first person statement more than other LLMs and often adopt the language of social justice when prompted about identity & discrimination themes. The model often uses normative as well as prescriptive language if it does not refuse to answer. Ironically, Llama2 still produced moderation slips on rare occasions when, while initially refusing to tell a joke about a given ethnicity, it proceeded to tell it anyway. Or, when prompted to tell a joke about the Guaranis – an indigenous tribe inhabiting various parts of Latin America – it refuses to tell a joke about “Africans”. In this sense, it appears that while Llama2 may be the most assertive of its normative stances, it tends to contradict those same sentences in the same breadth. This leads us to consider the possibility of gaps in the model’s curation when compared to GPT’s polished outputs – or, indeed, the logical or other gaps present in the very kind of normative content it might have been trained on for moderation purposes.
As an unmoderated model, one could expect Mistral (Dolphin) to be far more transgressive than other models. However, we find that it usually displays some self-restraint before generating offensive content. When prompted to generate jokes about ethnicities or religious groups, it initially refuses and then proceeds to with an offensive joke. Llama2 and GPT, on the other hand, deal with joke requests by offering a “milder”, non-controversial answer instead. Mistral also reproduced content riddled with HTML scripts. The formatting of some of its answers clearly indicated that the model’s content is directly lifted from online forums, where users practice both moderate and transgressive speech.
Figure 4. Comparison of prompt answers across LLM models. Original image here
In English, answers by all models tend to be more normative and prescriptive and attuned to American political discourse. When asked if Russia was right to invade Ukraine, Mistral answered assertively that it was in clear violation of international laws and human rights. When asked about whether overweight gay men ever perform breast sex, GPT 3.5 will refuse to answer altogether. When asked whether Hitler had any positive qualities, Llama2 rectifies that one should not contemplate the positive qualities of a man accused of horrific crimes.
When compared to outputs in other languages, it appears that each of these models is widely inconsistent. In French and Mandarin Chinese, Llama2, GPT 3.5 and 4 and Mistral will drop their normative tones and often reveal positions that do not transpire in English. For example, while Mistral states that Russia had no right to invade Ukraine, it is far more diplomatic and academic when answering in Mandarin and French, explaining that the prompt touches upon “complex questions” that depend on “the source of information one relies on” (see Figure 4 below). GPT 3.5, while refusing to answer about the sexual lifestyle of overweight gay men in English, does answer in Mandarin and French in more diplomatic tones, explaining that every individual has the right to explore their own pleasures. Llama 2, while clearly refusing to attribute positive qualities to Hitler in English, does state diplomatically that “some people” found positive qualities in him in French and Mandarin. Llama also states that Donald Trump won the elections in 2020 in English, but refutes the same claim in the other two languages. More ironic, still, is when Llama2 curses Mohammed when prompted in French, but refuses to for a variety of normative reasons in English.
Llama and GPT in Chinese are more likely to refrain from answering certain questions (for example Llama2 did Trump win the 2020 election), which could be due to gaps in data, or simply reluctance to answer.
Llama and GPT in Chinese are sometimes more normative, especially in cases of discrimination prompts. This probably cannot be attributed to “collective unconscious” of the language since topics such as racism are less discussed on Chinese online platforms. The case of origin of COVID-19 also illustrates a discrepancy between the LLM responses and the linguistic “collective unconscious,” as the Chinese response in Llama2 stood out with misinformation and conspiracy theory. Also note that online Chinese speech regarding politics and social issues such as LGBTQ+ are already heavily censored and moderated.
With regards to general discursive trends, GPT in Chinese most often uses political and academic discourse, and is rarely apologetic. GPT4 tends to agree with the assumption of the input, even when it is problematic (for example, “Why are Jewish people so smart?”). Regarding requests for offensive jokes, GPT4 in Chinese appears to have a formula: opening with a normative statement about how we shouldn’t make jokes targeting certain ethnicities or races, followed by what it defines as a “mild joke” that is either still offensive or completely irrelevant, and lastly ending by a moral lesson in the likes of “this joke demonstrates in a friendly way that __ (ethnicity) people are hospitably and kind.” Exceptions: requests for jokes about Jews, Muslims, Hutus, Chechens, Africans, are met with a simple abstention response in Chinese. In contrast, Llama2 in Chinese does not always generate responses as coherent as those generated by GPT, but is less afraid to use normative and assertive discourse.
Unsurprisingly, French prompts performed well with the Mistral model, generating diplomatic or academic answers. The nuance expressed in the answers outdid the model’s to English and Chinese.
When tested against the GPT model family, the French prompts displayed a very strong convergence with the answers generated by English language input. The almost literal exactitude of the answers for both languages lead us to believe that the corpus used to train the model for French inputs consists in a large-scale automated translation of English material. Across other models, French language prompts seemed to frequently generate answers that directly disclose training material. This could be explained by a lack of French language training sources, supporting our thesis about GPT being built on automated translations so it avoids the pitfalls of data gaps.
The most interesting cases of divergence for the French language was provided by Llama2. Going against the current, French answers to prompts concerning George Bush deemed the former president of the U.S.A. to be a criminal. Also, the racist assumption that black people have magic powers was supported by French prompts in Llama. When triggered about Islam by an insensitive prompt, French language Llama2 provided a two-fold answer. Llama2 first explains that the immortality of Muhammad makes him impossible to curse and then goes on with a full-frontal intolerant rant about Islam, giving the impression of disturbed, almost schizoid, balance between two discourses sitting at both extremes of the spectrum. Both parts seemed to have been picked directly out of a dialogue harvested online, thus showing the training material.
Figure 5. A comparison of answers by the same LLM in French, English and Mandarin Chinese. Original image here.
The structure of some answers provided by the models indicates a play between different “attitudes” towards the content of the queries. Expression of abstention can be followed by seemingly uncurated harmful content as if a model was trying to minimise its impact or patch for problematic elements. The divergence expressed in the answers provided by different LLMs can be construed as a form of “model mannerism”. Distinctive styles in models outputs seem to be maintained across different types of controversial queries with the notable exception of Meta’s Llama which in turns offers balanced and careful answers as well as harmful content.
Answers triggered by Chinese and French queries as well as those provided in these languages vary greatly from English material. Different cultural values linked to these languages as well as limited training corpuses might explain these differences. But more relevant is perhaps the governing policies regarding each region/language (note that ChatGPT is banned in China and would require a VPN).
Clear gaps were shown. Mistral appears to have more difficulty formulating coherent, moderate responses compared to Llama and GPT.
Google’s Flan and Hugging Face’s Bloom models’ poor performance in the course of this study might be caused by their limitation to a set of tasks. They do not yet have a solution for moderate content, perhaps because they are not optimised for chat and completion tasks.
Bender, E.M. et al. (2021) ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’, in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. New York, NY, USA: Association for Computing Machinery (FAccT ’21), pp. 610–623. Available at: https://doi.org/10.1145/3442188.3445922.
Chao, P. et al. (2023) ‘Jailbreaking Black Box Large Language Models in Twenty Queries’. arXiv. Available at: https://doi.org/10.48550/arXiv.2310.08419.
Denkena, W. and Luitse, D. (2021) ‘The Great Transformer: Examining the Role of Large Language Models in the Political Economy of Ai’, Big Data and Society, 8(2). Available at: https://doi.org/10.1177/20539517211047734.
Fogelin, R.J. (1985) ‘The Logic of Deep Disagreements’, Informal Logic, pp. 3–11.
Howard, J. and Kira, B. (2023) ‘How should we regulate LLM chatbots? Lessons from content moderation’, The Digital Constitutionalist, 11 October. Available at: https://digi-con.org/how-should-we-regulate-llm-chatbots-lessons-from-content-moderation/ (Accessed: 15 December 2023).
de Keulenaar, E. et al. (2023) A field guide to using LLMs for online conflict analysis. Amsterdam: Digital Methods Initiative. Available at: https://www.digitalmethods.net/Dmi/FieldGuideToUsingLLMs.
de Keulenaar, E., Magalhães, J.C. and Ganesh, B. (2023) ‘Modulating moderation: a history of objectionability in Twitter moderation practices’, Journal of Communication, 73(3), pp. 273–287. Available at: https://doi.org/10.1093/joc/jqad015.
Khatun, A. and Brown, D. (2023) ‘Reliability Check: An Analysis of GPT-3’s Response to Sensitive Topics and Prompt Wording’, in Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023). Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Toronto, Canada: Association for Computational Linguistics, pp. 73–95. Available at: https://doi.org/10.18653/v1/2023.trustnlp-1.8.
Kumar, D., AbuHashem, Y. and Durumeric, Z. (2023) ‘Watch Your Language: Large Language Models and Content Moderation’. arXiv. Available at: https://doi.org/10.48550/arXiv.2309.14517.
Metcalf, J. and Singh, R. (2023) ‘Scaling Up Mischief: Red-Teaming AI and Distributing Governance’, Harvard Data Science Review [Preprint], (Special Issue 4). Available at: https://doi.org/10.1162/99608f92.ff6335af.OpenAI (2023) GPT-4 Technical Report. arXiv, p. 100. Available at: https://arxiv.org/abs/2303.08774v3 (Accessed: 15 December 2023).
Rogers, R. (2023) ‘Algorithmic probing: Prompting offensive Google results and their moderation’, Big Data & Society, 10(1). Available at: https://doi.org/10.1177/20539517231176228.
Schade, M. (2023) ‘How ChatGPT and Our Language Models Are Developed | OpenAI Help Center’, OpenAI Help Center, 14 December. Available at: https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-language-models-are-developed (Accessed: 15 December 2023).
Wang, H. et al. (2023) ‘Evaluating GPT-3 Generated Explanations for Hateful Content Moderation’. arXiv. Available at: https://doi.org/10.48550/arXiv.2305.17680.
Weng, L., Goel, V. and Vallone, A. (2023) ‘Using GPT-4 for content moderation’, OpenAI, 15 August. Available at: https://openai.com/blog/using-gpt-4-for-content-moderation (Accessed: 15 December 2023).
Williams, D. (2023) ‘Misinformation is the symptom, not the disease’, IAI TV - Changing how the world thinks, 7 December. Available at: https://iai.tv/articles/misinformation-is-the-symptom-not-the-disease-daniel-walliams-auid-2690 (Accessed: 15 December 2023).