Skip to Main Content
Skip Nav Destination
Purpose

This study aims to enhance supply chain risk identification by introducing an AI-driven framework “SCREWS” leveraging Large Language Models (LLMs). It explores the optimization of operational parameters, particularly Temperature and Top P, to improve classification accuracy.

Design/methodology/approach

A Design of Experiments approach is used to systematically evaluate the influence of operational LLM parameters on binary classification performance. Experiments with varied configurations were conducted using data from a randomized set of news articles to identify the optimal parameter settings.

Findings

Temperature significantly impacts classification precision, with optimal values identified in the range of 0.4–0.7. Conversely, the Top P parameter showed limited influence. The study establishes a robust methodology for balancing randomness and determinism in LLM outputs to achieve reliable classifications.

Research limitations/implications

The study focuses on binary classification and a fixed model, limiting generalizability to multi-class scenarios. The fixed LLM architecture used restricts insights into the effects of model variability. Future research should explore broader outputs than binary classification and diverse model architectures.

Practical implications

This research provides actionable insights for deploying and improving LLMs in dynamic supply chain environments. The findings emphasize precise parameter tuning as critical for effective risk identification, enabling practitioners to improve operational resilience and decision-making accuracy.

Social implications

Calibrating LLM parameters strengthens early warning in supply chains. Better tuned classification reduces false alarms and missed incidents, enabling faster, targeted interventions. For critical goods such as food and medicines, this means steadier availability, fewer stockouts, and more efficient allocation of scarce resources.

Originality/value

This study addresses a key gap in AI-driven supply chain risk identification research by systematically optimizing operational LLM parameters. It offers a scalable, replicable methodology applicable to various domains requiring high-stakes decision-making.

Efficient and effective Supply Chain Management (SCM) means the assurance of optimal flow of products and materials throughout a business’s value chain while maintaining a holistic profitability and quality. This quest however, has become increasingly challenging in recent years due to an unprecedented range of risks and disruptions impacting supply chains on a broad scale. While supply chains have stood and arguably always stand under a general level of pressure because of common market volatility and uncertainty, studies show that the list of disruption-inducing drivers and risks has been growing significantly (KMPG & ASCM, 2024). These new drivers and risks range from structural rather long-term and often foreseeable but hard-to-solve aspects such as labor scarcity to ad hoc events such as the Covid-19 pandemic or the war between Russia and Ukraine which were unexpected for many. Due to the nowadays globalized nature of many supply chains, the impacts of such events were and still are felt on a world-wide scale (Gurtu and Johny, 2021). For example, the Covid-19 Pandemic largely contributed to health risks, workforce deficiencies, and substantial shifts in demand in many markets (Handfield et al., 2020). But also smaller scale disruptions such as man-made events like the Ever Given Grounding in 2021 or the damaged railway tunnel in Rastatt, Germany in 2017 showed the strong influences within bottlenecks in transportation systems pushing companies into a more reactive rather than proactive state.

As a result, striving for increased supply chain resilience, the discipline of Supply Chain Risk Management (SCRM) has moved to the centre of interest and attention throughout the research and practical landscape (Deiva Ganesh and Kalpana, 2022a). SCRM encompasses the systematic and phased identification, evaluation, mitigation, and monitoring of risks that could potentially affect a business’s supply chain (Aqlan and Lam, 2016). Here, the accuracy and timeliness of risk identification, representing the foundation and first step within the process, plays a pivotal role (Deiva Ganesh and Kalpana, 2022b).

In this context, the recent rise of Artificial Intelligence (AI) has led to new ways and opportunities to cope with huge amounts of structured and unstructured data potentially indicating supply chain risks (Deiva Ganesh and Kalpana, 2022a). Initial research exploring the integration of AI into SCRM and thus going beyond conventional approaches highlights first promising use cases such as the prediction of uncertain events (Schroeder and Lodemann, 2021), continuous monitoring (Chu et al., 2020), or extraction of unstructured data (Modgil et al., 2022). Yet, due to the novelty and ongoing advances of AI technologies and their capabilities, this field of research is still considered underrepresented and is lacking formal approaches (Richey et al., 2023). Furthermore, in the realm of AI-based risk classification utilizing Large Language Models (LLMs), prevailing research predominantly concentrates on classifying risks within overarching categories, such as environmental risks and financial risks (Chu et al., 2020). The utilization of LLMs for risk classification however, holds the potential to go far beyond broad, rather static classification by dynamically classifying and thus identifying individual supply chain risks in real time based on extensive sources, such as social media or newspaper articles. This approach has the capacity to augment existing supply chain risk identification practices by offering a more granular and timely assessment. Here, due to the sensitivity of the task and the necessity of high accuracy, optimization of the LLM set up by means of operational parameters such as Temperature and Top P, is imperative (Xu et al., 2022).

In this regard, this paper aims to further the current field of research of AI-driven supply chain risk management, or more particularly, risk identification by introducing an LLM-based supply chain risk identification framework named SCREWS (Supply Chains Risk Early Warning System) and exploring and analyzing the LLM operational parameter space for the use case of accurate risk classification.

To reach the given objective, the remainder of this article is structured as follows: Section 2 overviews the current research landscape by means of a comprehensive systematic literature review (SLR) aiding to further sharpen the research agenda and to derive the research questions. In Section 3, the followed research methodology, a Design of Experiments Approach (Doe) as defined by Kleppmann (2020) is presented. The Doe is subsequently operationalized throughout Section 4 including the initial introduction of the risk identification framework SCREWS and the outline of the optimal set up for the use case of supply chain risk classification within the showcased framework using up-to-date sample data from different digital news outlets. Section 5 then highlights the practical implications after which limitations are discussed and a conclusion is given in Section 6. This study focuses on the operational parameter space of LLMs for binary supply-chain risk classification under a fixed model architecture.

To systematically and transparently assess the current state of the field, the SLR framework of vom Brocke et al. (2009) is applied. Ensuring rigor in documentation and target-orientation, the SLR framework employs five distinct phases: definition of review scope (1), conceptualization of topic (2), literature search (3), literature analysis and synthesis (4), and research agenda (5) (see Figure 1).

Figure 1
A horizontal process diagram shows five phases labeled from Roman Numeral 1 to Roman Numeral 5 with rightward arrows.The image shows a horizontal process diagram with the label “Phase” on the left followed by five connected arrow-shaped sections arranged from left to right, the first section labeled “Roman Numeral 1. Definition of Review Scope”, the second section labeled “Roman Numeral 2. Conceptualization of Topic”, the third section labeled “Roman Numeral 3. Literature Search”, the fourth section labeled “Roman Numeral 4. Literature Analysis and Synthesis”, and the fifth section labeled “Roman Numeral 5. Research Agenda”, with each section connected by rightward arrows that show a sequential progression from the first phase to the final phase.

SLR framework following vom Brocke et al. (2009). Source: Own illustration

Figure 1
A horizontal process diagram shows five phases labeled from Roman Numeral 1 to Roman Numeral 5 with rightward arrows.The image shows a horizontal process diagram with the label “Phase” on the left followed by five connected arrow-shaped sections arranged from left to right, the first section labeled “Roman Numeral 1. Definition of Review Scope”, the second section labeled “Roman Numeral 2. Conceptualization of Topic”, the third section labeled “Roman Numeral 3. Literature Search”, the fourth section labeled “Roman Numeral 4. Literature Analysis and Synthesis”, and the fifth section labeled “Roman Numeral 5. Research Agenda”, with each section connected by rightward arrows that show a sequential progression from the first phase to the final phase.

SLR framework following vom Brocke et al. (2009). Source: Own illustration

Close modal

The following sections address the conduction of the phases against the background of the present research.

To precisely delineate the review scope and thus ensure coherence and purpose-driven execution, the established taxonomy of literature reviews by Cooper (1988) consisting of six characteristics is leveraged as suggested by vom Brocke et al. (2009).

As summarized in Figure 2, the present SLR focus (1) is versatile: It concentrates on research outcomes as well as the methods applied and their application allowing to generate an overview of the extent of already conducted LLM parameter optimization research and its application areas. The main goal (2) is to consolidate and integrate the findings. Furthermore, the SLR is organized (3) on a conceptual level (that is grouping of similar ideas), while a neutral position (4) is taken. As this review aims to assess the current field of academic research regarding the given topic (cf. Section 1), scholars are defined as the main audience (5). The present article is rooted in the domain of Supply Chain Management. Therefore, the audience is further narrowed down to scholars specializing in this field. Additionally, the increased transparency over the LLM applicability in the context of supply chain risk management and classification is also considered relevant for practitioners, namely supply chain managers, particularly risk managers, as it holds the potential to serve as basis for the practical implementation. Lastly, in terms of coverage (6), an exhaustive coverage with selective citation is followed.

Figure 2
A table presents characteristics and corresponding categories across six labeled rows.The table contains 7 rows and 2 columns. The first row contains the column headers. From left to right, the column headers are as follows: Column 1: Characteristic, Column 2: Categories. The row-wise entries in the table are as follows: Row 2: Characteristic: a. focus; Categories: research outcomes, research methods, theories, application. Row 3: Characteristic: b. goal; Categories: integration, criticism, central issues. Row 4: Characteristic: c. organisation; Categories: historical, conceptual, methodological. Row 5: Characteristic: d. perspective; Categories: neutral representation, espousal of position. Row 6: Characteristic: e. audience; Categories: specialised scholars, general scholars, practitioners or politicians, general or public. Row 7: Characteristic: f. coverage; Categories: exhaustive, exhaustive and selective, representative, central or pivotal.

Applied research taxonomy. Source: Own illustration based on Cooper (1988) 

Figure 2
A table presents characteristics and corresponding categories across six labeled rows.The table contains 7 rows and 2 columns. The first row contains the column headers. From left to right, the column headers are as follows: Column 1: Characteristic, Column 2: Categories. The row-wise entries in the table are as follows: Row 2: Characteristic: a. focus; Categories: research outcomes, research methods, theories, application. Row 3: Characteristic: b. goal; Categories: integration, criticism, central issues. Row 4: Characteristic: c. organisation; Categories: historical, conceptual, methodological. Row 5: Characteristic: d. perspective; Categories: neutral representation, espousal of position. Row 6: Characteristic: e. audience; Categories: specialised scholars, general scholars, practitioners or politicians, general or public. Row 7: Characteristic: f. coverage; Categories: exhaustive, exhaustive and selective, representative, central or pivotal.

Applied research taxonomy. Source: Own illustration based on Cooper (1988) 

Close modal

Serving as basis for both a unified understanding and groundwork for the subsequent literature search, terms central to the given topic, namely Supply Chain Risks, Supply Chain Risk Management, Artificial Intelligence, and Large Language Models are briefly outlined in the following.

2.2.1 Supply chain risks

While the term risk is commonly used and easily understood in everyday language, the scientific concept of risk has been subject to various definitions, interpretations, and change also depending on the context (Heckmann et al., 2015). Ivanov (2021) describes risk as “a measure of the set of possible (negative) outcomes from a single rational decision and their probabilistic values”. In simpler words, risk can be summarized as the product of an event’s probability of occurrence and the corresponding consequences (Deiva Ganesh and Kalpana, 2022a). In contrast to this, the term uncertainty, which is often used in relation to risk, refers to a broader perspective, where also events causing positive deviations from an expected outcome are included (Ivanov, 2021).

In the domain of supply chains, risks can be defined as the combination of an event’s likelihood of occurrence and its consequences on any part of a supply chain resulting in operational, tactical, or strategical disruptions or irregularities (Ho et al., 2015).

Similar to the definition of the term itself, the classification of supply chain risks does not follow a unified understanding, either. For example, Truong Quang and Hara (2018) distinguish seven different risk categories (external risks, time risks, information risks, financial risks, supply risks, operational risks, and demand risks), while Tang (2006) narrows it down to two categories (operational risks and disruption risks).

Combining the research of Chopra and Sodhi (2004), Tang (2006), Tang and Nurmaya Musa (2011), Ho et al. (2015), and Truong Quang and Hara (2018) this study follows the supply chain risk model as proposed by Ivanov (2021) depicted in Figure 3.

Figure 3
A conceptual diagram shows supply chain risks classified into seven categories with example risk factors.The image shows a conceptual diagram with a central circle labeled “Supply Chain Risks” connected to seven surrounding rectangular sections, where the top section is labeled “Process Risks”, which lists “Production Capacity Breakdowns”, “Facility Disruptions”, “Logistics Risks”, and “Strikes”. The upper left section is labeled “Supply Risks”, which lists “Delivery Delays”, “Product Quality Risks”, and “Supplier Disruptions”. The middle-left section is labeled “Financial Risks”, which lists “Liquidity Risks”, “Financial Crisis”, and “Credit Risks”. The bottom left section is labeled “Natural Risks”, which lists “Climate Change and Natural Disasters”, “Natural Resource Shortages”, and “Epidemics or Pandemics”. The bottom-right section is labeled “Law and Cultural Risks”, which lists “Legal Risks”, “Cultural Risks”, and “Trust and Image Risks”. The middle-right section is labeled “Information Risks”, which lists “Information Distortion” and “Cyber Attacks”. The upper-right section is labeled “Demand Risks”, which lists “Price Risks”, “Demand Fluctuations”, and “Market Disruptions”.

Classification of supply chain risks. Source: Own illustration based on Ivanov (2021) 

Figure 3
A conceptual diagram shows supply chain risks classified into seven categories with example risk factors.The image shows a conceptual diagram with a central circle labeled “Supply Chain Risks” connected to seven surrounding rectangular sections, where the top section is labeled “Process Risks”, which lists “Production Capacity Breakdowns”, “Facility Disruptions”, “Logistics Risks”, and “Strikes”. The upper left section is labeled “Supply Risks”, which lists “Delivery Delays”, “Product Quality Risks”, and “Supplier Disruptions”. The middle-left section is labeled “Financial Risks”, which lists “Liquidity Risks”, “Financial Crisis”, and “Credit Risks”. The bottom left section is labeled “Natural Risks”, which lists “Climate Change and Natural Disasters”, “Natural Resource Shortages”, and “Epidemics or Pandemics”. The bottom-right section is labeled “Law and Cultural Risks”, which lists “Legal Risks”, “Cultural Risks”, and “Trust and Image Risks”. The middle-right section is labeled “Information Risks”, which lists “Information Distortion” and “Cyber Attacks”. The upper-right section is labeled “Demand Risks”, which lists “Price Risks”, “Demand Fluctuations”, and “Market Disruptions”.

Classification of supply chain risks. Source: Own illustration based on Ivanov (2021) 

Close modal

2.2.2 Supply chain risk management

Supply Chain Risk Management (SCRM) signifies a subfield of Supply Chain Management (SCM). As already mentioned in Section 1, it represents the approach of the systematic and phased identification, evaluation, mitigation, and monitoring of the above-highlighted risks that could potentially affect a business’s supply chain (Aqlan and Lam, 2016). Due to so-called cascading effects or ripple effects, which can occur up- and/or downstream the supply chain or within a whole supply network, the initial risk identification portrays a highly challenging endeavor (Cigolini and Rossi, 2010; Ivanov, 2021). Overall, effective SCRM aims to reduce vulnerabilities and improve resilience and robustness of the supply chain (Deiva Ganesh and Kalpana, 2022a).

2.2.3 Artificial Intelligence

Due to the recent AI hype within research and practise and the resulting rapid technological advancements, the lines between what can be considered an AI and what cannot are blurred. Overall, Artificial Intelligence (AI) can be described as an umbrella term for technologies and techniques that enable machines to fulfill complex tasks typically associated with intelligent beings (Sheikh et al., 2023). A particularly relevant topic in this domain is Machine Learning (ML). ML, being the underlying technique of many modern AI applications, enables computers to learn patterns and make decisions or predictions based on structured and unstructured data without being explicitly programmed (Rebala et al., 2019). It uses algorithms to self-improve its performance over time by adapting to new data, mimicking how humans learn from experience (Rebala et al., 2019).

2.2.4 Large Language Models

Leveraging Natural Language Processing (NLP) and Generation (NLG), Large Language Models (LLMs) are characterized by their ability to perform a wide range of tasks (e.g. summarization, translation, classification, or sentiment analysis) using qualitative often unstructured data, namely text. From a technical perspective, a so-called transformer serves as backbone for many modern, high performing LLMs such as ChatGPT or Gemini. As introduced by Vaswani et al. (2017), transformers are deep-learning architectures designed to handle sequential data efficiently by using a self-attention mechanism. The output quality of an LLM is highly dependent on the underlying training data, the given prompt as well as the applied operational parameters (e.g. Temperature, Top P, Max Tokens) against the background of a given use case. Especially the detailed finetuning of the model’s parameters is considered crucial as already outlined in Section 1.

Phase 3 encompasses data base selection and the according keyword, backward, and forward search as well as a parallel evaluation of sources (vom Brocke et al., 2009). The keyword combinations used aim to uncover publications related to Supply Chain Risk Classification including the methods and approaches addressed (cf. Figure 4). Based on this, the extent to which LLM-based classification and optimization is covered can be assessed and discussed serving as groundwork for the remainder of this paper.

Figure 4
A keyword selection diagram shows three keyword blocks linked by “AND” with listed terms under each block.The image shows a keyword selection diagram with three rectangular keyword blocks arranged horizontally and separated by the text “AND”, where the first block is labeled “Keyword 1 (Domain: Supply Chain) in title” and contains the terms “Supply Chain” and “Logistics”, the second block is labeled “Keyword 2 (Refinement 1: Risk) in title” and contains the terms “Risk”, “Threat”, “Disruption”, and “Uncertainty”, and the third block is labeled “Keyword 3 (Refinement 2: Discipline) in all metadata” and contains the terms “Identification”, “Identify”, “Classification”, “Classify”, and “Detect”.

Applied keyword search combinations. Source: Own illustration

Figure 4
A keyword selection diagram shows three keyword blocks linked by “AND” with listed terms under each block.The image shows a keyword selection diagram with three rectangular keyword blocks arranged horizontally and separated by the text “AND”, where the first block is labeled “Keyword 1 (Domain: Supply Chain) in title” and contains the terms “Supply Chain” and “Logistics”, the second block is labeled “Keyword 2 (Refinement 1: Risk) in title” and contains the terms “Risk”, “Threat”, “Disruption”, and “Uncertainty”, and the third block is labeled “Keyword 3 (Refinement 2: Discipline) in all metadata” and contains the terms “Identification”, “Identify”, “Classification”, “Classify”, and “Detect”.

Applied keyword search combinations. Source: Own illustration

Close modal

For increased currency, relevance, authority, accuracy, and purpose amongst the identified publication items, the following inclusion and exclusion criteria are applied:

Inclusion criteria:

  1. Keywords 1 and 2 are present in title (cf. Figure 4)

  2. Articles published in scholarly journals or proceedings of renowned conferences, to increase the likelihood of high quality due to peer-review processes

  3. Publication after 2019

Exclusion criteria:

  1. Bachelor and master theses, patents, citations

  2. Full texts not available

  3. Publication not written in English language

  4. Duplications

As depicted in Figure 5, the according search procedure within 4 distinct academic databases resulted in 2.905 identified articles of which 38 were determined relevant based on initial analysis.

Figure 5
A flow diagram shows databases, keyword combinations, and article selection steps with article counts.The image shows a flow diagram that presents databases, keyword combinations, and article selection steps. The left column is labeled “Databases” and lists “S C O P U S n equals 1,052”, “Web of Science n equals 538”, “Emerald Insight n equals 1,056”, and “I E E E Xplore n equals 259”. At the top, on the right, three keyword blocks are arranged horizontally. The first block is labeled “Keyword 1 (Domain: Supply Chain)” and contains “Supply Chain” and “Risk”. The second block is labeled “Keyword 2 (Refinement 1: Risk)” and contains “Risk”, “Threat”, “Disruption”, and “Uncertainty”. The third block is labeled “Keyword 3 (Refinement 2: Discipline)” and contains “Identification”, “Identify”, “Classification”, “Classify”, and “Detect”. The text “AND” is present between the boxes “Keyword 1” and “Keyword 2”. Another text “AND” is present between the boxes “Keyword 2” and “Keyword 3”. Dashed connector lines form a group of all three keyword blocks. Dashed connector lines link the databases, and the group of keyword blocks and point to a box present below labeled “Identified articles, n equals 2.905”. Below this, a box labeled “Selected articles through examination of titles and abstracts, n equals 37” appears, connecting the above box with a dashed line. At the bottom, a final box labeled “Selected full texts n equals 38” is present, connecting the above box with a dashed line. The boxes “Selected articles through examination of titles and abstracts, n equals 37” and “Selected full texts n equals 38” connect to the box on the right labeled “Additional articles through forward and backward search n equals 1” with dotted lines.

Results of the literature search. Source: Own illustration

Figure 5
A flow diagram shows databases, keyword combinations, and article selection steps with article counts.The image shows a flow diagram that presents databases, keyword combinations, and article selection steps. The left column is labeled “Databases” and lists “S C O P U S n equals 1,052”, “Web of Science n equals 538”, “Emerald Insight n equals 1,056”, and “I E E E Xplore n equals 259”. At the top, on the right, three keyword blocks are arranged horizontally. The first block is labeled “Keyword 1 (Domain: Supply Chain)” and contains “Supply Chain” and “Risk”. The second block is labeled “Keyword 2 (Refinement 1: Risk)” and contains “Risk”, “Threat”, “Disruption”, and “Uncertainty”. The third block is labeled “Keyword 3 (Refinement 2: Discipline)” and contains “Identification”, “Identify”, “Classification”, “Classify”, and “Detect”. The text “AND” is present between the boxes “Keyword 1” and “Keyword 2”. Another text “AND” is present between the boxes “Keyword 2” and “Keyword 3”. Dashed connector lines form a group of all three keyword blocks. Dashed connector lines link the databases, and the group of keyword blocks and point to a box present below labeled “Identified articles, n equals 2.905”. Below this, a box labeled “Selected articles through examination of titles and abstracts, n equals 37” appears, connecting the above box with a dashed line. At the bottom, a final box labeled “Selected full texts n equals 38” is present, connecting the above box with a dashed line. The boxes “Selected articles through examination of titles and abstracts, n equals 37” and “Selected full texts n equals 38” connect to the box on the right labeled “Additional articles through forward and backward search n equals 1” with dotted lines.

Results of the literature search. Source: Own illustration

Close modal

To assess the current research landscape, a concept matrix, as proposed by vom Brocke et al. (2009), is developed based on the analysis of the 38 selected articles (see  Appendix). The matrix reveals that existing studies encompass a broad range of topics within the domain of supply chain risk identification. Overall, in terms of the research scope, the literature can be divided into two different categories: (1) a static, generalized, and often one-time identification of supply chain risks and (2) a dynamic, targeted identification of risks, frequently at the level of individual focal firms, which represents the primary focus of this paper.

Static risk identification is predominantly conducted leveraging systematic literature reviews sometimes complemented by expert interviews and surveys (e.g. Rosales et al., 2019; Kusrini et al., 2021; Ramiah et al., 2022). Additionally, on the basis of this approach, deep dives into particular focus areas such as certain industries, supply chain operations, or risk types are given: For example, Zhao et al. (2024a) and Rosales et al. (2019) examine risks associated with the agri food industry while further studies explore the solar industry (Ramiah et al., 2022), or the energy industry (Zhang and He, 2024). In the context of supply chain operation-specific analyses, Mismar et al. (2022) highlight risks within last mile-processes, while Panjehfouladgaran and Lim (2020) outline risks related to reverse logistics activities. Furthermore, Pandey et al. (2020) outline particular insights into risks grounded in cyber security.

With regards to the dynamic, more individual risk identification studies, aiming for an increased degree of information coverage, accuracy and proactivity, a wide range of advanced methods and models have been employed, often introducing innovative tools and approaches. For instance, Liu et al. (2024) utilize Big Data Analytics (BDA) and Machine Learning (ML) in a two-step procedure that integrates a generative adversarial network, a stacked auto-encoder, and a deep neural network to achieve precise risk classification. Similarly, predictive approaches are emphasized by Rezki and Mansouri (2024) and Nagy et al. (2022), enabling the identification of potential risks before they materialize. Notably, Aboutorab et al. (2022) introduce a “Reinforcement Learning approach for Proactive Risk Identification” (RL-PRI), showcasing the potential of adaptive AI-driven methods. Additional frameworks include the Fuzzy Inference Decision Support System (FIDDS) proposed by Salamai et al. (2019), a dynamic voting classifier developed by Salamai et al. (2021), and a hybrid concept integrating network theory with a cascade failure model, as outlined by Wang and Zhou (2024).

Moreover, due to the pressing need for handling unstructured data as already mentioned in Section 1, a few studies explore opportunities for LLM-integrating solutions. Shahsavari et al. (2024) present a solution by means of their CERIA framework (Contributing Event-based Risk Identification and Assessment), which further integrates their LUEI framework (Lightweight Unsupervised Event Identification) within its modular setup. Here, LLM capabilities are leveraged for analyzing news article content, extracting relevant event data such as location and date, and outlining cause-and-effect relationships. Next to this, Shishehgarkhaneh et al. (2024) introduce LLM-based recognition of entities related to supply chain risks in news articles. The work of Zhao et al. (2024b) depicts a particularly high relation to the present research. Their introduced framework incorporates LLM-based risk identification in news articles and subsequent risk classification using the Cambridge Taxonomy of Business Risks (CTBR). This framework is demonstrated by means of a software prototype named LARD-SC (LLMs for Automated Risk Detection in Supply Chains). To enhance output accuracy and relevance, sophisticated prompt engineering is applied.

However, it becomes evident that none of the above-mentioned LLM-based approaches address operational parameter optimization as discussed in Section 1 portraying room for potential improvement and a clear research gap (cf.  Appendix).

Aiming to make an initial step into filling the above-highlighted research gap, the following research questions are derived:

RQ1.

Which operational LLM parameters have the highest impact on the accuracy of supply chain risk classification based on external data?

RQ2.

Which operational parameter configurations maximize the predictive accuracy of LLMs in supply chain risk classification, while ensuring robustness and adaptability across diverse operational scenarios?

As mentioned, the Doe approach as defined by Kleppmann (2020) is leveraged for a systematic exploration of the parameter space by varying those in a controlled and structured manner. This systematic variation supports in efficiently allocating resources by minimizing the number of experiments to be performed while providing meaningful insights into effects.

Through Doe, the impact of individual parameters and their interactions on classification precision for supply chain risk identification is analyzed on a quantitative basis. Doe aims for achieving a comprehensive understanding of the system under study, enhancing the reliability and robustness of the study results and potential allowing for the knowledge transfer to other application cases. Due to the missing research on the impact of LLM parameters in the light of the presented application in supply chain risk identification this approach has the potential to help resolve the knowledge gap and also to further the understanding regarding the impact of LLM-Parameters in other cases.

In Figure 6 the Doe process is outlined. The process starts by creating an understanding of the initial situation (1). This is essential to clearly define what knowledge is to be gained through the research. The research objectives can then be derived on this basis (2). Specific hypotheses may be drawn upon. These are to be placed in a targeted manner in the context of supply chain risk identification. The next step is to define how success can be measured in the present use case (3). As this is a field with substantial room for interpretation, this step is particularly relevant. In order to better understand the relationships between the input and target value(s) developed in the previous step, an analytical research plan is then developed (4). This step concludes the planning part of the Doe process. This is followed by the implementation of the research (5). The data is obtained by carrying out experiments and subsequently being analyzed. Based on this evaluation, the results are interpreted and measures are determined as to how they can be translated back into practice and lead to added value (6). The effectiveness of these developed measures is then validated to delineate potential anomalies in the test data for instance (7). Based on this reconciliation and validation, the data obtained via the process, as shown in the figure, can then be reintegrated back into the process and further investigations can be carried out on this new foundation (8).

Figure 6
A stepwise research process diagram lists 8 steps from understanding the initial situation to verification of results.The diagram is divided into two vertically labeled sections: the top section, labeled Roman numeral 1: Research Planning, and the bottom section, labeled Roman numeral 2: Research Execution. Steps 1 to 4 appear in the Research Planning section, and Steps 5 to 8 appear in the Research Execution section. Step 1, labeled “Understanding the initial Situation”, shows an icon of a magnifying glass on the left and includes the text “Establishes the baseline context and identifies relevant variables and constraints”. Step 2, labeled “Determine the research objective”, shows an icon of a target symbol on the left and includes the text “Formulates specific hypotheses and research questions to guide the investigation”. Step 3, labeled “Define target values and factors”, shows an icon of a growth arrow on the left and includes the text “Identifies the response variables of interest and the factors hypothesized to influence them”. Step 4, labeled “Design an experimental Plan”, shows an icon of interconnected nodes on the left, and includes the text “Structured framework for systematically varying factors and controlling experimental conditions”. Step 5, labeled “Conduct Experiment(s) and collect Data”, shows an icon of laboratory equipment on the left, and includes the text “Implements the experimental design to generate empirical observations under controlled conditions”. Step 6, labeled “Evaluate experimental results”, shows an icon of a bar chart with a magnifying glass on the left, and includes the text “Analyzes data using statistical methods to assess the effects of factors on response variables”. Step 7, labeled “Interpret results and derive measures”, shows an icon of a monitor displaying analytical graphics on the left, and includes the text “Draws conclusions from the data analysis, interpreting observed patterns and trends”. Step 8, labeled “Verification of the predicted results”, shows an icon of a check mark on a screen on the left, and includes the text “Validation through theoretical expectations and prior research, for reliability plus generalizability”. Step 1 connects to Step 2 with a downward arrow, Step 2 connects to Step 3 with a downward arrow, Step 3 connects to Step 4 with a downward arrow, Step 4 connects to Step 5 with a downward arrow, Step 5 connects to Step 6 with a downward arrow, Step 6 connects to Step 7 with a downward arrow, Step 7 connects to Step 8 with a downward arrow, and an upward arrow emerges from Step 8 and points back to Step 1.

Process illustration of the application of the DoE-concept. Source: Own illustration

Figure 6
A stepwise research process diagram lists 8 steps from understanding the initial situation to verification of results.The diagram is divided into two vertically labeled sections: the top section, labeled Roman numeral 1: Research Planning, and the bottom section, labeled Roman numeral 2: Research Execution. Steps 1 to 4 appear in the Research Planning section, and Steps 5 to 8 appear in the Research Execution section. Step 1, labeled “Understanding the initial Situation”, shows an icon of a magnifying glass on the left and includes the text “Establishes the baseline context and identifies relevant variables and constraints”. Step 2, labeled “Determine the research objective”, shows an icon of a target symbol on the left and includes the text “Formulates specific hypotheses and research questions to guide the investigation”. Step 3, labeled “Define target values and factors”, shows an icon of a growth arrow on the left and includes the text “Identifies the response variables of interest and the factors hypothesized to influence them”. Step 4, labeled “Design an experimental Plan”, shows an icon of interconnected nodes on the left, and includes the text “Structured framework for systematically varying factors and controlling experimental conditions”. Step 5, labeled “Conduct Experiment(s) and collect Data”, shows an icon of laboratory equipment on the left, and includes the text “Implements the experimental design to generate empirical observations under controlled conditions”. Step 6, labeled “Evaluate experimental results”, shows an icon of a bar chart with a magnifying glass on the left, and includes the text “Analyzes data using statistical methods to assess the effects of factors on response variables”. Step 7, labeled “Interpret results and derive measures”, shows an icon of a monitor displaying analytical graphics on the left, and includes the text “Draws conclusions from the data analysis, interpreting observed patterns and trends”. Step 8, labeled “Verification of the predicted results”, shows an icon of a check mark on a screen on the left, and includes the text “Validation through theoretical expectations and prior research, for reliability plus generalizability”. Step 1 connects to Step 2 with a downward arrow, Step 2 connects to Step 3 with a downward arrow, Step 3 connects to Step 4 with a downward arrow, Step 4 connects to Step 5 with a downward arrow, Step 5 connects to Step 6 with a downward arrow, Step 6 connects to Step 7 with a downward arrow, Step 7 connects to Step 8 with a downward arrow, and an upward arrow emerges from Step 8 and points back to Step 1.

Process illustration of the application of the DoE-concept. Source: Own illustration

Close modal

The Doe approach itself requires that the LLM architecture be kept constant, and the focus on binary outputs controls potential unknown effects and enables a more isolated analysis of the causal effects of temperature and Top P, increasing the reproducibility of the Doe approach.

The analysis of the initial situation is the first phase of a DoE-Process and serves to precisely formulate the problem and to identify relevant variables and potential restrictions. Without a sound knowledge of the research context, there is a risk of inadequate operationalization or neglect of confounding factors. This step ensures that the research design is based on realistic assumptions and is methodologically and conceptually coherent with the existing body of knowledge in the light of the application of Doe.

This study is conducted within an overarching research project “WiReSt” [1]. The focus lays on improving supply chain resilience through a robust early warning system for risks (SCREWS). Here, the utilization of novel technologies such as LLMs for real time risk classification represents a core part as it enables to acquire and process vast amounts of quantitative and qualitative data that could unveil latent risks in early stages. Previous studies within this research project have already outlined first concepts and an initial framework for AI-based real time global risk assessment into which the outcomes of the present study will be assimilated (Eschenbächer et al.). Figure 7 displays the conceptual process of SCREWS from which real-world events and company specific information leads to different reports on potential sources of supply chain risks. The research on the early warning system is focused on the steps I–IV (marked blue). The last but most essential step lies in the data classification. The classification is carried out on a case-by-case basis, taking into consideration the individual circumstances of the company for which the evaluation is performed. Due to the integration of this way of evaluating the individual news the value of each alert is increased. This is also where the results of this work are applied.

Figure 7
A process flow diagram shows event sources, data steps, reporting outputs, and icons connected by arrows.The image shows a process flow diagram that begins with a top horizontal box labeled “Event occurs”, displaying an icon of a target with an arrow on the left. Downward arrows from this box point to three boxes labeled “News Outlets”, “Press Agencies”, and “Social Media”, where the News Outlets box shows a newspaper icon, the Press Agencies box shows a grid and globe icon, and the Social Media box shows an icon of connected nodes. The News Outlets box connects downward to a box labeled “Google News” displaying the Google News icon, and this box connects with a rightward arrow to a box labeled “G D E L T”, while the Social Media box also connects downward to the same G D E L T box. On the right side, a vertical box labeled “Company Parameters” displays a factory icon and includes the text “Due to differences in geographical exposure, sector, and also individual risk appetite and company size, various risks may or may not be material. In order to take account of this fact, company specific factors are also taken into consideration”, and this box connects downward to later stages. Below the sources, four horizontally arranged boxes labeled “Roman numeral 1. Data Ingestion”, “Roman numeral 2. Data Aggregation”, “Roman Numeral 3. Data Categorization”, and “Roman Numeral 4. Data Classification” appear with respective icons and are connected by rightward arrows. The Data Ingestion box shows a cloud icon and includes the text “Google News articles can be retrieved geo-regionally as an R S S feed and used to rank results by relevance and attention to a topic”. The Data Aggregation box shows a funnel icon and includes the text “By using Microsoft Power Automate, the U R L provided by the R S S element can be used to aggregate the content of the article with the help of an L L M”. The Data Categorization box shows an icon of connected shapes and includes the text “The articles are divided into different categories to enable a more targeted analysis on a quantitative level and to improve the qualitative analysis”. The Data Classification box shows a dashboard icon and includes the text “The articles are divided into relevant and non-relevant articles in order to assess whether a particular reported topic should be emphasized or discarded”. A downward arrow from Data Categorization leads to a box labeled “Generic Reporting” showing a bar chart icon and including the text “General reporting that quantifies major, overarching developments in the general economic climate”. A downward arrow from Data Classification leads to a box labeled “Specific Reporting” showing a bar chart icon and including the text “Specific reporting that presents concrete, material risks in relation to the individual company specifics and is focused on triggering alerts”. The event occurs box connects with three downward arrows to News Outlets, Press Agencies, and Social Media, respectively. News Outlets and Press Agencies are grouped and connected with a downward arrow to Google News. Google News connects with a rightward arrow to G D E L T and also connects with a downward arrow to Data Ingestion. Social Media connects with a downward arrow to G D E L T. Data Ingestion connects to Data Aggregation with a rightward arrow. Data Aggregation connects to Data Categorization with a rightward arrow. Data Categorization connects to Data Classification with a rightward arrow. Company Parameters connects to Data Classification with a downward arrow. G D E L T connects to Generic Reporting with a downward arrow. Generic Reporting connects to Specific Reporting with a rightward arrow.

High level illustration of the SCREWS framework. Note: Operational parameter optimization is carried out in IV. Data classification, highlighted in red. Source: Own illustration

Figure 7
A process flow diagram shows event sources, data steps, reporting outputs, and icons connected by arrows.The image shows a process flow diagram that begins with a top horizontal box labeled “Event occurs”, displaying an icon of a target with an arrow on the left. Downward arrows from this box point to three boxes labeled “News Outlets”, “Press Agencies”, and “Social Media”, where the News Outlets box shows a newspaper icon, the Press Agencies box shows a grid and globe icon, and the Social Media box shows an icon of connected nodes. The News Outlets box connects downward to a box labeled “Google News” displaying the Google News icon, and this box connects with a rightward arrow to a box labeled “G D E L T”, while the Social Media box also connects downward to the same G D E L T box. On the right side, a vertical box labeled “Company Parameters” displays a factory icon and includes the text “Due to differences in geographical exposure, sector, and also individual risk appetite and company size, various risks may or may not be material. In order to take account of this fact, company specific factors are also taken into consideration”, and this box connects downward to later stages. Below the sources, four horizontally arranged boxes labeled “Roman numeral 1. Data Ingestion”, “Roman numeral 2. Data Aggregation”, “Roman Numeral 3. Data Categorization”, and “Roman Numeral 4. Data Classification” appear with respective icons and are connected by rightward arrows. The Data Ingestion box shows a cloud icon and includes the text “Google News articles can be retrieved geo-regionally as an R S S feed and used to rank results by relevance and attention to a topic”. The Data Aggregation box shows a funnel icon and includes the text “By using Microsoft Power Automate, the U R L provided by the R S S element can be used to aggregate the content of the article with the help of an L L M”. The Data Categorization box shows an icon of connected shapes and includes the text “The articles are divided into different categories to enable a more targeted analysis on a quantitative level and to improve the qualitative analysis”. The Data Classification box shows a dashboard icon and includes the text “The articles are divided into relevant and non-relevant articles in order to assess whether a particular reported topic should be emphasized or discarded”. A downward arrow from Data Categorization leads to a box labeled “Generic Reporting” showing a bar chart icon and including the text “General reporting that quantifies major, overarching developments in the general economic climate”. A downward arrow from Data Classification leads to a box labeled “Specific Reporting” showing a bar chart icon and including the text “Specific reporting that presents concrete, material risks in relation to the individual company specifics and is focused on triggering alerts”. The event occurs box connects with three downward arrows to News Outlets, Press Agencies, and Social Media, respectively. News Outlets and Press Agencies are grouped and connected with a downward arrow to Google News. Google News connects with a rightward arrow to G D E L T and also connects with a downward arrow to Data Ingestion. Social Media connects with a downward arrow to G D E L T. Data Ingestion connects to Data Aggregation with a rightward arrow. Data Aggregation connects to Data Categorization with a rightward arrow. Data Categorization connects to Data Classification with a rightward arrow. Company Parameters connects to Data Classification with a downward arrow. G D E L T connects to Generic Reporting with a downward arrow. Generic Reporting connects to Specific Reporting with a rightward arrow.

High level illustration of the SCREWS framework. Note: Operational parameter optimization is carried out in IV. Data classification, highlighted in red. Source: Own illustration

Close modal

Deriving precise research objectives and hypotheses is essential to focus on clear knowledge objectives. This step enables a hypothesis-driven approach that allows a structured examination of specific cause-and-effect relationships. In this context in particular, the formulation of research objectives helps to maximize the validity and relevance of the results by directly targeting existing research gaps.

The aim of this study is to increase the knowledge on how to adjust operational LLM parameters for supply chain risk identification (in this case correct classification weather the content of a news article is to be assessed as critical or not). There are three modifiable elements in the overall process at the meta-level. These are the input itself (e.g. a news article in combination with a prompt), the set variables and the LLM itself. The objective of the study, to evaluate the parameters, emerges from the tabular representation of the two factors consistency and workability (in the context of this study) in Figure 8. These two conditions have to be met for an examination to be relevant in the context of this study [2]. The objective of this study therefore includes investigating the effects that modifications to the variables have on supply chain risk identification. This aligns with the determined research gap in this context.

Figure 8
A workflow diagram shows text and variable inputs, L L M data processing, evaluation, and a comparison table.The image shows a workflow diagram in which a box labeled “Change in monitored newsfeed” is shown with a news icon. A large highlighted area labeled “Potential Scope of Work” is present below. Inside this area, a box labeled “Text-Input (Instructions)” with a document upload icon marked with the number 1 appears on the left. The box “Change in monitored newsfeed” connects by a downward arrow to “Text-Input (Instructions)”. A box labeled “Variables-Input (as A P I-Param.)” with a system screen icon marked with the number 2 appears on the right. Both boxes connect by arrows to a central box labeled “L L M-Data Processing” with a network icon marked with the number 3. A downward arrow from this box leads to a box labeled “Result Evaluation” with a person, a check, and a cross icon. On the right side, a table with three columns labeled “Step”, “Consistency”, and “Modifiability” lists Step 1 with a cross under Consistency and a check under Modifiability. Step 2 shows a check under Consistency and a check under Modifiability. Step 3 shows a check under Consistency and a cross under Modifiability. A downward arrow below the table points to a box labeled “Analysis of the Variables for the L L M as research objective” with a system screen icon.

Determining the research objective based on the potential scope of work. Source: Own illustration

Figure 8
A workflow diagram shows text and variable inputs, L L M data processing, evaluation, and a comparison table.The image shows a workflow diagram in which a box labeled “Change in monitored newsfeed” is shown with a news icon. A large highlighted area labeled “Potential Scope of Work” is present below. Inside this area, a box labeled “Text-Input (Instructions)” with a document upload icon marked with the number 1 appears on the left. The box “Change in monitored newsfeed” connects by a downward arrow to “Text-Input (Instructions)”. A box labeled “Variables-Input (as A P I-Param.)” with a system screen icon marked with the number 2 appears on the right. Both boxes connect by arrows to a central box labeled “L L M-Data Processing” with a network icon marked with the number 3. A downward arrow from this box leads to a box labeled “Result Evaluation” with a person, a check, and a cross icon. On the right side, a table with three columns labeled “Step”, “Consistency”, and “Modifiability” lists Step 1 with a cross under Consistency and a check under Modifiability. Step 2 shows a check under Consistency and a check under Modifiability. Step 3 shows a check under Consistency and a cross under Modifiability. A downward arrow below the table points to a box labeled “Analysis of the Variables for the L L M as research objective” with a system screen icon.

Determining the research objective based on the potential scope of work. Source: Own illustration

Close modal

The identification of the target values (response variables) and the experimental factors is the central aspect of Doe. By systematically selecting the factors, main effects and interactions can be investigated, enabling a differentiated analysis. Insufficient specification of the variables, on the other hand, can lead to erroneous conclusions.

Next, the target values and the factors with a potential impact on these values are to be determined. As highlighted in the chapter on the methodological approach, the definition of the target variable is of particular interest. How the target variable is assessed in the context of this study is shown on the right-hand side of Figure 9. In the first instance, a distinction is made between two categories based on the output of the LLM. Firstly, there are responses that are in the expected format (either “0” for a non-critical or “1” for a potentially critical event) or those that are not (other outputs, e.g. words or other numbers). If the response is in the expected format, the correspondence with the target evaluation of the corresponding information is to be assessed as shown in Figure 9. For each article that is later used as input, the researchers assess whether it should be rated as critical or uncritical. This rating represents the value that the model should output.

Figure 9
A flow diagram shows model factors, output formats, and assessment outcomes mapped to target values.The image shows a flow diagram divided into three vertical sections labeled “Factors” on the left and “Target Values” on the right. The left section lists stacked boxes with icons and labels, including “Model to use (Choice)”. “Temperature (Parameter)”. “Max. Length (Parameter)”. “Top P (Parameter)”. “Frequency penalty (parm.)”. “Presence penalty (parm.)”. And “Best of (param.) (deprec.)”. These boxes connect by a rightward arrow to a connecting nodes icon. The connecting nodes icon connects to a box labeled “Expected format” with a funnel icon and also connects to a box labeled “Unexpected format” with a crossed funnel icon. The expected format path branches rightward to four stacked boxes showing user icons and text pairs labeled “Target: 0 Output: 0”. “Target: 1 Output: 1”. “Target: 0 Output: 1”. And “Target: 1 Output: 0”. These output boxes connect into a large rectangular Target Values area that contains three text boxes labeled “Right Assessment” with a check icon, “Wrong Assessment” with a cross icon, and “No Assessment” with a warning icon. The first two boxes connect to a box “Right Assessment”. The latter two boxes connect to a box “Wrong Assessment”. The unexpected format path connects directly to the “No Assessment” text.

Definition of target values and factors. Note: Target = 0 equals an assessment as uncritical, 1 equals an assessment as critical. Source: Own illustration

Figure 9
A flow diagram shows model factors, output formats, and assessment outcomes mapped to target values.The image shows a flow diagram divided into three vertical sections labeled “Factors” on the left and “Target Values” on the right. The left section lists stacked boxes with icons and labels, including “Model to use (Choice)”. “Temperature (Parameter)”. “Max. Length (Parameter)”. “Top P (Parameter)”. “Frequency penalty (parm.)”. “Presence penalty (parm.)”. And “Best of (param.) (deprec.)”. These boxes connect by a rightward arrow to a connecting nodes icon. The connecting nodes icon connects to a box labeled “Expected format” with a funnel icon and also connects to a box labeled “Unexpected format” with a crossed funnel icon. The expected format path branches rightward to four stacked boxes showing user icons and text pairs labeled “Target: 0 Output: 0”. “Target: 1 Output: 1”. “Target: 0 Output: 1”. And “Target: 1 Output: 0”. These output boxes connect into a large rectangular Target Values area that contains three text boxes labeled “Right Assessment” with a check icon, “Wrong Assessment” with a cross icon, and “No Assessment” with a warning icon. The first two boxes connect to a box “Right Assessment”. The latter two boxes connect to a box “Wrong Assessment”. The unexpected format path connects directly to the “No Assessment” text.

Definition of target values and factors. Note: Target = 0 equals an assessment as uncritical, 1 equals an assessment as critical. Source: Own illustration

Close modal

In terms of factors, there are potentially seven operational parameters that could be examined as part of this study. To focus on the parameters with the greatest potential added value in the context of the use case at hand, a pre-selection was performed by the researchers. An exclusionary procedure was used, whereby variables that could be problematic in practical application were excluded and the remaining variables were further researched. The variables that are relevant to the research are marked with an “R” in the figure (Temperature and top P). The temperature parameter regulates the randomness of the model’s output, with higher values promoting increased diversity in generated responses (La Vega, 2023). On the other hand, the top P parameter controls the probability of distribution of potential next tokens by considering only the most probable ones, based on their cumulative probability mass (La Vega, 2023). These parameters play a crucial role in shaping the coherence, diversity, and quality of language generation in LLMs, influencing their overall efficacy in various natural language processing tasks (La Vega, 2023). The model selection was excluded due to the frequent version changes in the LLMs made available. The parameters of maximum length, frequency penalty and presence penalty were excluded from the analysis due to the expected output. As the evaluation should be either “0” or “1”, parameters that focus on the effect of values that have already been output are not expedient. Using the maximum length is not practicable due to the varying length of the input in the form of news articles. The use of the “best of …” parameter is potentially interesting, as various results are generated here and only the most suitable [3] result is output. However, this parameter is only available for legacy models.

The design of an experimental plan ensures that statistical efficiency is maximized, and experimental effort is minimized. In a scientific context, this step serves to systematically control disturbance variables and reduce measurement uncertainty, enabling more precise modelling and interpretation.

Figure 10 shows the experiment plan to be followed. The temperature parameter is changed first, ceteris paribus. This is due to the fact that it is likely to have a stronger influence on the results due to the direct impact on the next token; the results from this procedure are then subjected to analysis. Based on the potential optimum value determined from the analysis, the top P parameter is now adjusted ceteris paribus and the results are evaluated. In the third and final step, the effect of changing both parameters is to be examined based on the findings from test 1 and test 2. This is carried out using random samples to ensure stability of the previously determined optima.

Figure 10
A three-panel test diagram compares constant values, modified parameters, and evaluation of results across tests.The image shows three parallel diagrams arranged horizontally and labeled at the bottom as “Test 1”, “Test 2”, and “Test 3”, with rightward arrows connecting Test 1 to Test 2 and Test 2 to Test 3. The Test 1 diagram contains a top section labeled “Constant Values” with a lock icon, which includes stacked rectangular boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, “Best of”, and “Top P (Parameter)”. Below this, a section labeled “Modified Parameter” shows a rectangular box labeled “Temperature (Parameter)” with a light bulb icon. On the right side of Test 1, a vertical section labeled “Evaluation of Results” displays an icon of a circular target arrow. The Test 2 diagram follows the same structure, where the top section labeled “Constant Values” with a lock icon includes stacked boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, “Best of”, and “Temperature (Parameter)”. The “Modified Parameter” section in Test 2 shows a rectangular box labeled “Top P (Parameter)” with a cloud icon. On the right, the “Evaluation of Results” section again shows the circular target arrow icon. The Test 3 diagram also follows the same layout, where stacked boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, and “Best of (param.) (deprec.)”. The “Modified Parameter” section in Test 3 contains two rectangular boxes labeled “Top P” and “Temperature (Parameter)” with respective icons. On the right side of Test 3, the “Evaluation of Results” section again shows the circular target arrow icon.

Experiment plan. Source: Own illustration

Figure 10
A three-panel test diagram compares constant values, modified parameters, and evaluation of results across tests.The image shows three parallel diagrams arranged horizontally and labeled at the bottom as “Test 1”, “Test 2”, and “Test 3”, with rightward arrows connecting Test 1 to Test 2 and Test 2 to Test 3. The Test 1 diagram contains a top section labeled “Constant Values” with a lock icon, which includes stacked rectangular boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, “Best of”, and “Top P (Parameter)”. Below this, a section labeled “Modified Parameter” shows a rectangular box labeled “Temperature (Parameter)” with a light bulb icon. On the right side of Test 1, a vertical section labeled “Evaluation of Results” displays an icon of a circular target arrow. The Test 2 diagram follows the same structure, where the top section labeled “Constant Values” with a lock icon includes stacked boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, “Best of”, and “Temperature (Parameter)”. The “Modified Parameter” section in Test 2 shows a rectangular box labeled “Top P (Parameter)” with a cloud icon. On the right, the “Evaluation of Results” section again shows the circular target arrow icon. The Test 3 diagram also follows the same layout, where stacked boxes labeled “Model to use”, “Max. Length”, “Frequency”, “Presence”, and “Best of (param.) (deprec.)”. The “Modified Parameter” section in Test 3 contains two rectangular boxes labeled “Top P” and “Temperature (Parameter)” with respective icons. On the right side of Test 3, the “Evaluation of Results” section again shows the circular target arrow icon.

Experiment plan. Source: Own illustration

Close modal

The conduct of the experience according to the designed plan under controlled conditions is a critical step that ensures that the collected data is valid and reliable. The control of experimental conditions is necessary to ensure internal validity. It is essential to consider both replication and randomization to neutralize confounding variables and maximize the generalizability of the results.

To increase the reliability and validity of the findings, a large sample size is expedient, as, for instance, the temperature parameter inherently functions to amplify the randomness of responses with increasing values. 21 assessment objects are used, each of which occurs 41 times in each processing batch, to reduce the number of random effects. The examinations with a direct focus on the temperature parameter are carried out 5 times with the same values, resulting in a total number of 4,305 evaluations as basis for the calculation of mean values. As depicted in Figure 11, the modifiable variables are stored directly in a Microsoft Excel table used for the study and the evaluation results are then automatically transferred back into this table so that the table can be evaluated [4].

Figure 11
A flowchart shows an automated workflow connected to spreadsheets that display parameters, content, and evaluated results.The image presents a combined process view where the left side shows a vertical workflow and the right side shows two related spreadsheets. On the left side, rectangular plus icon blocks are arranged vertically starting from the top with a rectangle labeled “H T T P Trigger”. A downward arrow with a plus icon emerges from “H T T P Trigger” and points to the next rectangle labeled “Bearer”. Another downward arrow with a plus icon emerges from “Bearer” and connects to a rectangle labeled “Get all Row I D’s”. From this rectangle, a downward arrow with a plus icon connects to a larger rectangle that contains a sub-rectangular heading labeled “For each”. A downward arrow with a plus icon emerges from this “For each” block and points to another enclosed rectangle labeled “Get and Convert Values”, which contains the text “2 Actions”. From this rectangle, a downward arrow with a plus icon points to another rectangle labeled “Execute A P I Request and write results to Excel”. A downward arrow with a plus icon emerges from this rectangle and points to a rectangular block labeled “Trigger H T T P A P I-Request”, followed by another downward arrow with a plus icon pointing to a rectangle labeled “J S O N”. Below this, a smaller rectangle labeled “For each” appears with the text “1 Action” is present. A downward arrow with a plus icon emerges from “J S O N” and points to “For each” appears with the text “1 Action”. A plus is present below it. Outside the rectangle, another plus sign is present. On the right side, the top spreadsheet is titled “Parameters and Content” and contains seven columns with a single header row labeled “Model”, “Prompt”, “Temp”, “max”, “top underscore p”, “freque”, and “presen”. In the first column, repeated entries read “gpt-3.5-tu”. In the “Prompt” column, repeated text begins with “You are an artificial intelligence that has the purpose of evaluating whether news can be used to conclude wh”. The remaining columns display numeric values such as 2.00, 1000, 1, and 0 across multiple rows. A large downward arrow connects this spreadsheet to a second spreadsheet titled “Results and Evaluation”. This lower spreadsheet shows three columns labeled “Reply”, “Result”, and “Body”. The “Reply” column contains numeric values and short text entries. The “Result” column shows numeric values and error text such as “hash WERT!”, with several cells highlighted green containing the value 0 and one cell in the last row highlighted yellow containing the value 1. The “Body” column displays long text strings that include links or identifiers beginning with “id”: “cmpl-” followed by alphanumeric characters.

Presentation of the operational process for test execution for large(r) scale automated testing. Source: Own illustration

Figure 11
A flowchart shows an automated workflow connected to spreadsheets that display parameters, content, and evaluated results.The image presents a combined process view where the left side shows a vertical workflow and the right side shows two related spreadsheets. On the left side, rectangular plus icon blocks are arranged vertically starting from the top with a rectangle labeled “H T T P Trigger”. A downward arrow with a plus icon emerges from “H T T P Trigger” and points to the next rectangle labeled “Bearer”. Another downward arrow with a plus icon emerges from “Bearer” and connects to a rectangle labeled “Get all Row I D’s”. From this rectangle, a downward arrow with a plus icon connects to a larger rectangle that contains a sub-rectangular heading labeled “For each”. A downward arrow with a plus icon emerges from this “For each” block and points to another enclosed rectangle labeled “Get and Convert Values”, which contains the text “2 Actions”. From this rectangle, a downward arrow with a plus icon points to another rectangle labeled “Execute A P I Request and write results to Excel”. A downward arrow with a plus icon emerges from this rectangle and points to a rectangular block labeled “Trigger H T T P A P I-Request”, followed by another downward arrow with a plus icon pointing to a rectangle labeled “J S O N”. Below this, a smaller rectangle labeled “For each” appears with the text “1 Action” is present. A downward arrow with a plus icon emerges from “J S O N” and points to “For each” appears with the text “1 Action”. A plus is present below it. Outside the rectangle, another plus sign is present. On the right side, the top spreadsheet is titled “Parameters and Content” and contains seven columns with a single header row labeled “Model”, “Prompt”, “Temp”, “max”, “top underscore p”, “freque”, and “presen”. In the first column, repeated entries read “gpt-3.5-tu”. In the “Prompt” column, repeated text begins with “You are an artificial intelligence that has the purpose of evaluating whether news can be used to conclude wh”. The remaining columns display numeric values such as 2.00, 1000, 1, and 0 across multiple rows. A large downward arrow connects this spreadsheet to a second spreadsheet titled “Results and Evaluation”. This lower spreadsheet shows three columns labeled “Reply”, “Result”, and “Body”. The “Reply” column contains numeric values and short text entries. The “Result” column shows numeric values and error text such as “hash WERT!”, with several cells highlighted green containing the value 0 and one cell in the last row highlighted yellow containing the value 1. The “Body” column displays long text strings that include links or identifiers beginning with “id”: “cmpl-” followed by alphanumeric characters.

Presentation of the operational process for test execution for large(r) scale automated testing. Source: Own illustration

Close modal

The evaluation of the experimental results is an integral part of Doe as it forms the basis for hypothesis testing and causal inference. Thus, making it possible to quantify main effects and interactions.

During this step, the data obtained in the previous steps is analyzed. This is carried out using Microsoft Power BI to enable automated evaluation of new data and analysis of individual effects in a large data set. The interim results are presented in the following sub-chapters.

4.6.1 Impact-evaluation for the temperature parameter

Figure 12 shows the result for 5 tests carried out, each with 861 samples consisting of 21 questions in 0.05 increments for the temperature parameter in an interval of 0.00–2.00. The deviations show that the relative stability of the results begins to decrease for the range from a temperature value of 1.40 onwards. Optima in relation to desired results are present at both 0.25 and 0.60. Due to the greater distance to more unstable results, 0.25 is defined as the potential optimum value for the temperature parameter for fixed other variables.

Figure 12
A stacked vertical bar chart showing evaluation outcomes across temperature parameters.The vertical clustered bar chart is labeled “Results for Temperature Variation (total n equals 4305)”. The legend at the top labeled “Benutzerdefiniert” shows the categories “No underscore valid underscore Input”, “Right underscore Evaluation”, and “Wrong underscore Evaluation”, and shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 100 in increments of 20 units. Each vertical stacked bar has the value “105” written at the top, representing the total number of results. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 60, Wrong underscore evaluation: 45. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 64, Wrong underscore evaluation: 41. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 65, Wrong underscore evaluation: 40. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 57, Wrong underscore evaluation: 48. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 56, Wrong underscore evaluation: 48. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 4, Right underscore evaluation: 54, Wrong underscore evaluation: 47. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 58, Wrong underscore evaluation: 47. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 64, Wrong underscore evaluation: 41. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 65, Wrong underscore evaluation: 40. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 4, Right underscore evaluation: 61, Wrong underscore evaluation: 40. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 2, Right underscore evaluation: 56, Wrong underscore evaluation: 47. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 8, Right underscore evaluation: 54, Wrong underscore evaluation: 43. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 8, Right underscore evaluation: 48, Wrong underscore evaluation: 49. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 13, Right underscore evaluation: 42, Wrong underscore evaluation: 50. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 9, Right underscore evaluation: 53, Wrong underscore evaluation: 43. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 9, Right underscore evaluation: 58, Wrong underscore evaluation: 38. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 16, Right underscore evaluation: 50, Wrong underscore evaluation: 39. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 17, Right underscore evaluation: 51, Wrong underscore evaluation: 37. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 25, Right underscore evaluation: 44, Wrong underscore evaluation: 36. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 28, Right underscore evaluation: 43, Wrong underscore evaluation: 34. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 31, Right underscore evaluation: 44, Wrong underscore evaluation: 30. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 47, Right underscore evaluation: 32, Wrong underscore evaluation: 26. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 47, Right underscore evaluation: 32, Wrong underscore evaluation: 26. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 62, Right underscore evaluation: 22, Wrong underscore evaluation: 21. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 57, Right underscore evaluation: 32, Wrong underscore evaluation: 16. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 69, Right underscore evaluation: 21, Wrong underscore evaluation: 15. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 77, Right underscore evaluation: 21, Wrong underscore evaluation: 7. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 75, Right underscore evaluation: 18, Wrong underscore evaluation: 12. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 76, Right underscore evaluation: 12, Wrong underscore evaluation: 17. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 92, Right underscore evaluation: 6, Wrong underscore evaluation: 7. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 88, Right underscore evaluation: 7, Wrong underscore evaluation: 10. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 92, Right underscore evaluation: 6, Wrong underscore evaluation: 7. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 95, Right underscore evaluation: 6, Wrong underscore evaluation: 4.

Evaluation of the data when adjusting the temperature parameter. Source: Own illustration

Figure 12
A stacked vertical bar chart showing evaluation outcomes across temperature parameters.The vertical clustered bar chart is labeled “Results for Temperature Variation (total n equals 4305)”. The legend at the top labeled “Benutzerdefiniert” shows the categories “No underscore valid underscore Input”, “Right underscore Evaluation”, and “Wrong underscore Evaluation”, and shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 100 in increments of 20 units. Each vertical stacked bar has the value “105” written at the top, representing the total number of results. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 60, Wrong underscore evaluation: 45. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 64, Wrong underscore evaluation: 41. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 65, Wrong underscore evaluation: 40. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 57, Wrong underscore evaluation: 48. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 56, Wrong underscore evaluation: 48. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 59, Wrong underscore evaluation: 46. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 62, Wrong underscore evaluation: 43. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 4, Right underscore evaluation: 54, Wrong underscore evaluation: 47. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 58, Wrong underscore evaluation: 47. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 64, Wrong underscore evaluation: 41. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore evaluation: 65, Wrong underscore evaluation: 40. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 4, Right underscore evaluation: 61, Wrong underscore evaluation: 40. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 2, Right underscore evaluation: 56, Wrong underscore evaluation: 47. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 8, Right underscore evaluation: 54, Wrong underscore evaluation: 43. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 8, Right underscore evaluation: 48, Wrong underscore evaluation: 49. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 13, Right underscore evaluation: 42, Wrong underscore evaluation: 50. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 9, Right underscore evaluation: 53, Wrong underscore evaluation: 43. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 9, Right underscore evaluation: 58, Wrong underscore evaluation: 38. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 16, Right underscore evaluation: 50, Wrong underscore evaluation: 39. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 17, Right underscore evaluation: 51, Wrong underscore evaluation: 37. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 25, Right underscore evaluation: 44, Wrong underscore evaluation: 36. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 28, Right underscore evaluation: 43, Wrong underscore evaluation: 34. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 31, Right underscore evaluation: 44, Wrong underscore evaluation: 30. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 47, Right underscore evaluation: 32, Wrong underscore evaluation: 26. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 47, Right underscore evaluation: 32, Wrong underscore evaluation: 26. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 62, Right underscore evaluation: 22, Wrong underscore evaluation: 21. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 57, Right underscore evaluation: 32, Wrong underscore evaluation: 16. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 69, Right underscore evaluation: 21, Wrong underscore evaluation: 15. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 77, Right underscore evaluation: 21, Wrong underscore evaluation: 7. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 75, Right underscore evaluation: 18, Wrong underscore evaluation: 12. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 76, Right underscore evaluation: 12, Wrong underscore evaluation: 17. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 92, Right underscore evaluation: 6, Wrong underscore evaluation: 7. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 88, Right underscore evaluation: 7, Wrong underscore evaluation: 10. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 92, Right underscore evaluation: 6, Wrong underscore evaluation: 7. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 95, Right underscore evaluation: 6, Wrong underscore evaluation: 4.

Evaluation of the data when adjusting the temperature parameter. Source: Own illustration

Close modal

Figure 13 shows an example of the data for the execution at two different prompts with variation of the temperature parameter in 5 test runs. This shows the high tendency of the LLM to output changing results measured against a deterministic output.

Figure 13
A paired stacked bar chart labeled “I D 005” and “I D 006” showing evaluation results across temperature parameters.The vertical clustered bar chart shows two panels. The left clustered bar chart is labeled “I D 005”. The legend at the top shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 4 in increments of 2 units. Each vertical stacked bar displays numeric values inside the segments representing the number of results per category. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 4, Wrong underscore Evaluation: 0. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 2, Wrong underscore Evaluation: 2. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 3, Wrong underscore Evaluation: 0. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. The right clustered bar chart is labeled “I D 006”. The legend at the top shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 5 in increments of 1 unit. Each vertical stacked bar displays numeric values inside the segments representing the number of results per category. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 3, Wrong underscore Evaluation: 1. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 1, Wrong underscore Evaluation: 3. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 3, Wrong underscore Evaluation: 1. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0.

Exemplary presentation of the data for two data points to be evaluated. Source: Own illustration

Figure 13
A paired stacked bar chart labeled “I D 005” and “I D 006” showing evaluation results across temperature parameters.The vertical clustered bar chart shows two panels. The left clustered bar chart is labeled “I D 005”. The legend at the top shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 4 in increments of 2 units. Each vertical stacked bar displays numeric values inside the segments representing the number of results per category. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 4, Wrong underscore Evaluation: 0. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 5, Wrong underscore Evaluation: 0. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 2, Wrong underscore Evaluation: 2. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 3, Wrong underscore Evaluation: 0. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 1, Wrong underscore Evaluation: 0. The right clustered bar chart is labeled “I D 006”. The legend at the top shows that the light blue bars represent “No underscore valid underscore Input”, the dark blue bars represent “Right underscore Evaluation”, and the orange bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Temperature Parameter (0.05-Step-Increments)” ranging from 0.00 to 2.00 in increments of 0.05, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 5 in increments of 1 unit. Each vertical stacked bar displays numeric values inside the segments representing the number of results per category. The data for the bars per category is shown as follows: Temperature: 0.00, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.10, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.30, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.35, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.40, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.45, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 0.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.55, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.60, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.65, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.70, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 0, Wrong underscore Evaluation: 5. Temperature: 0.75, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.80, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 0.85, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 0.90, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 0.95, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 1, Wrong underscore Evaluation: 4. Temperature: 1.00, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.05, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.10, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 3, Wrong underscore Evaluation: 1. Temperature: 1.15, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.20, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 2, Wrong underscore Evaluation: 3. Temperature: 1.25, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 3, Wrong underscore Evaluation: 2. Temperature: 1.30, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 1, Wrong underscore Evaluation: 3. Temperature: 1.35, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.40, Number of results per category: No underscore valid underscore Input: 1, Right underscore Evaluation: 3, Wrong underscore Evaluation: 1. Temperature: 1.45, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 1, Wrong underscore Evaluation: 2. Temperature: 1.50, Number of results per category: No underscore valid underscore Input: 0, Right underscore Evaluation: 4, Wrong underscore Evaluation: 1. Temperature: 1.55, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.60, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 1.65, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.70, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.75, Number of results per category: No underscore valid underscore Input: 2, Right underscore Evaluation: 2, Wrong underscore Evaluation: 1. Temperature: 1.80, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 2, Wrong underscore Evaluation: 0. Temperature: 1.85, Number of results per category: No underscore valid underscore Input: 4, Right underscore Evaluation: 0, Wrong underscore Evaluation: 1. Temperature: 1.90, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0. Temperature: 1.95, Number of results per category: No underscore valid underscore Input: 3, Right underscore Evaluation: 1, Wrong underscore Evaluation: 1. Temperature: 2.00, Number of results per category: No underscore valid underscore Input: 5, Right underscore Evaluation: 0, Wrong underscore Evaluation: 0.

Exemplary presentation of the data for two data points to be evaluated. Source: Own illustration

Close modal

4.6.2 Impact-evaluation for the top P parameter

The influence of the top P parameter is evaluated in the same way as already described in Section 4.6.1. The evaluation shows that there are no significant fluctuations in the results up to a threshold value of around 0.5 (see Figure 14). After the threshold value is exceeded, the fluctuation between correct and incorrect categorizations increases. There are no non-valid inputs. Due to the fluctuations in both positive and negative directions, no reliable optimum can be determined for the top P value at this point.

Figure 14
A stacked vertical bar chart titled “Results for Top P-Variation” showing right and wrong evaluation counts.The vertical clustered bar chart is labeled “Results for Top P-Variation (total n equals 861)”. The legend at the top labeled “Benutzerdefiniert” shows the categories “Right underscore Evaluation” and “Wrong underscore Evaluation”, and shows that the light blue bars represent “Right underscore Evaluation” and the dark blue bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Top P-Parameter (0.025-Step-Increments)” ranging from 0.00 to 1.00 in increments of 0.025, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 20 in increments of 5 units. Each vertical stacked bar has the total number of results represented by the combined height of the blue and orange segments, with numeric values written inside each segment. The data for the bars per category is shown as follows: Top P value 0.00, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9; Top P value: 0.025, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.050, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.075, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.100, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.125, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.150, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.175, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.200, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.225, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.250, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.275, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.300, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.325, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.350, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.375, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.400, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.425, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.450, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.475, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.500, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.525, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.550, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.575, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.600, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.625, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.650, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.675, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.700, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.725, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.750, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.775, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.800, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 0.825, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.850, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.875, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.900, Number of results per category: Right underscore Evaluation: 14, Wrong underscore Evaluation: 7. Top P value: 0.925, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.950, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 0.975, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 1.000, Number of results per category: Right underscore Evaluation: 14, Wrong underscore Evaluation: 7. Each vertical stacked bar has the total number of results as 21 on top.

Results for top P-variation (total n = 861). Source: Own illustration

Figure 14
A stacked vertical bar chart titled “Results for Top P-Variation” showing right and wrong evaluation counts.The vertical clustered bar chart is labeled “Results for Top P-Variation (total n equals 861)”. The legend at the top labeled “Benutzerdefiniert” shows the categories “Right underscore Evaluation” and “Wrong underscore Evaluation”, and shows that the light blue bars represent “Right underscore Evaluation” and the dark blue bars represent “Wrong underscore Evaluation”. The bar chart shows the horizontal axis labeled “Top P-Parameter (0.025-Step-Increments)” ranging from 0.00 to 1.00 in increments of 0.025, and the vertical axis labeled “Number of Results per Category” ranging from 0 to 20 in increments of 5 units. Each vertical stacked bar has the total number of results represented by the combined height of the blue and orange segments, with numeric values written inside each segment. The data for the bars per category is shown as follows: Top P value 0.00, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9; Top P value: 0.025, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.050, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.075, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.100, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.125, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.150, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.175, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.200, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.225, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.250, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.275, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.300, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.325, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.350, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.375, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.400, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.425, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.450, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.475, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.500, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.525, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.550, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.575, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.600, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.625, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.650, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.675, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.700, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.725, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.750, Number of results per category: Right underscore Evaluation: 11, Wrong underscore Evaluation: 10. Top P value: 0.775, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.800, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 0.825, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.850, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.875, Number of results per category: Right underscore Evaluation: 12, Wrong underscore Evaluation: 9. Top P value: 0.900, Number of results per category: Right underscore Evaluation: 14, Wrong underscore Evaluation: 7. Top P value: 0.925, Number of results per category: Right underscore Evaluation: 13, Wrong underscore Evaluation: 8. Top P value: 0.950, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 0.975, Number of results per category: Right underscore Evaluation: 10, Wrong underscore Evaluation: 11. Top P value: 1.000, Number of results per category: Right underscore Evaluation: 14, Wrong underscore Evaluation: 7. Each vertical stacked bar has the total number of results as 21 on top.

Results for top P-variation (total n = 861). Source: Own illustration

Close modal

4.6.3 Impact-evaluation for the temperature and top P parameter

In this case, the procedure is as described in 4.6.1 and 4.6.2. 41 combinations of random values in the ranges of 0.00–2.00 (temperature) and 0.00–1.00 (top P) are generated and fed into the process. To make the parameter–performance relationship easier to read, a simple naive Bayes classifier was fit to the experiment outcomes, using Temperature and Top P as features and the binary label correct vs. incorrect as the target. The curves plot the model’s estimated class-conditional likelihoods for correct vs. incorrect predictions over the alterations explored in the Doe. Regions where the correct-class curve sits above the incorrect-class curve indicate empirically more reliable settings. Figure 15 shows that the optimal value for temperature to support in leading to a valid assessment by the LLM is at around 1.1., the biggest positive distinction from wrong prediction is until the lines intersect at approx. 0.8. The graph compares the probabilities estimated by the model for correct (red) and incorrect predictions across parameter ranges. The increased probability for the correct class at medium temperatures (≈0.4–0.7; decreasing by ≈ 0.8) visualizes the stability band identified in the Doe, while Top P shows weaker, threshold-like effects.

Figure 15
A two-panel line chart showing probability curves for right and wrong evaluation across temperature and top p values.The image shows two panels displayed side by side. In the left panel, the horizontal axis is labeled “temperature” and ranges from approximately negative 2.0 to 4.0, and the vertical axis is labeled “probability” and ranges from 0.00 to 0.70. The legend at the top left shows that the blue line represents “Wrong Evaluation” and the red line represents “Right Evaluation”. The blue curve starts at approximately (negative 1.86, 0.002), increases smoothly, reaches its peak at (1.179, 0.693), and then decreases symmetrically to end near (4.103, 0.002). The red curve follows a similar path, starting at approximately (negative 1.0, 0.002), rising to a peak near (1.146, 0.67), and then declining to end around (4.0, 0.002). In the right panel, the horizontal axis is labeled “top underscore p” and ranges from negative 0.8 to 2.0 in increments of 0.2, and the vertical axis is labeled “probability” and ranges from 0.0 to 1.5. The legend is the same as in the left panel, indicating the blue line as “Wrong Evaluation” and the red line as “Right Evaluation”. The blue curve in the right panel starts near (negative 0.849, 0.002), rises steadily, reaches a maximum at (0.496, 1.465), and then decreases to end near (1.853, 0.002). The red curve similarly starts from (negative 0.138, 0.103), peaks at (0.481, 1.49), and declines to end around (1.363, 0.002). Note: All numerical data points are approximated.

Results of naive Bayes model to predict the model performance. Note: Target class (correct prediction) is shown in red. Source: Own illustration

Figure 15
A two-panel line chart showing probability curves for right and wrong evaluation across temperature and top p values.The image shows two panels displayed side by side. In the left panel, the horizontal axis is labeled “temperature” and ranges from approximately negative 2.0 to 4.0, and the vertical axis is labeled “probability” and ranges from 0.00 to 0.70. The legend at the top left shows that the blue line represents “Wrong Evaluation” and the red line represents “Right Evaluation”. The blue curve starts at approximately (negative 1.86, 0.002), increases smoothly, reaches its peak at (1.179, 0.693), and then decreases symmetrically to end near (4.103, 0.002). The red curve follows a similar path, starting at approximately (negative 1.0, 0.002), rising to a peak near (1.146, 0.67), and then declining to end around (4.0, 0.002). In the right panel, the horizontal axis is labeled “top underscore p” and ranges from negative 0.8 to 2.0 in increments of 0.2, and the vertical axis is labeled “probability” and ranges from 0.0 to 1.5. The legend is the same as in the left panel, indicating the blue line as “Wrong Evaluation” and the red line as “Right Evaluation”. The blue curve in the right panel starts near (negative 0.849, 0.002), rises steadily, reaches a maximum at (0.496, 1.465), and then decreases to end near (1.853, 0.002). The red curve similarly starts from (negative 0.138, 0.103), peaks at (0.481, 1.49), and declines to end around (1.363, 0.002). Note: All numerical data points are approximated.

Results of naive Bayes model to predict the model performance. Note: Target class (correct prediction) is shown in red. Source: Own illustration

Close modal

Interpreting the results aids in laying the foundation to verify the results and also to derive practical implications from the gained knowledge later on. The interaction between the operational parameters, particularly Temperature and Top P, and the target variables provided critical insights into the configurations required for optimal classification performance. The Temperature parameter demonstrated a dominant influence on the model’s behavior, directly modulating the randomness inherent in the generated outputs. Configurations within the range of 0.4–0.7 (on a scale from 0 to 2, 0 meaning deterministic and no randomness while 2 is the maximum amount of randomness possible) allowed for an equilibrium between diversity and determinism, thereby facilitating consistent binary classification outcomes aligned with the study’s objectives. However, when the Temperature exceeded 0.8, the outputs became erratic, often deviating from the intended binary classifications, which suggests that excessive randomness compromises the coherence of the predictions and conflicting with a binary classification task.

In contrast, the Top P parameter exhibited a comparatively limited effect within the tested ranges. Accuracy variations became more pronounced only beyond a threshold of 0.5, but its overall contribution to classification precision remained marginal. This finding underscores the dominant role of the Temperature parameter in shaping model performance, particularly in applications requiring robust and high-fidelity outputs. When analyzed in combination, the interplay between Temperature and Top P reaffirmed the preeminence of Temperature as the critical determinant of output quality. Optimal results were achieved with a Temperature setting of 0.6, accompanied by a Top P value of 0.5, although the latter did not consistently enhance performance.

The verification process was undertaken to confirm the robustness, consistency, and validity of the experimental findings. Additional experiments using randomly selected input samples were conducted, reinforcing the observed parameter effects and validating the reproducibility of the results. These supplementary evaluations confirmed that the Temperature parameter exerts a substantial and predictable influence on classification accuracy, while the Top P parameter’s role is secondary and context-dependent within this specific use case.

The incorporation of randomly selected samples ensured that the observed parameter effects were not artifacts of specific input sets but instead reflected generalized patterns across diverse data contexts. By repeatedly testing these random samples under controlled experimental conditions, also with varying prompts and content and parameters, the observed results fit in the context of the derived conclusions of this work.

The outcomes of this research present practical implications for the deployment of Large Language Models (LLMs) in supply chain risk management, as well as for the broader domain of operational parameter optimization in AI-driven decision-making systems. For practitioners, particularly those addressing the complexities of risk classification, the findings underscore the critical importance of precise parameter calibration. Achieving reliable and interpretable outputs in dynamic and high-stakes scenarios necessitates a deep understanding of parameter behavior. The study’s systematic delineation of the optimal range for the Temperature parameter provides a robust framework for enhancing model efficacy by balancing the inherent trade-offs between randomness and determinism. This balance is essential for risk identification tasks, where even minor inaccuracies in classification can lead to significant operational or financial consequences. In public-interest supply chains (food, pharmaceuticals, medical devices) calibrated early warnings and the potential to reduce false alarms and missed incidents, stabilizing upstream planning and last-mile availability would yield a positive societal impact.

The identification of the Top P parameter’s limited influence within specific ranges further refines the approach to parameter tuning. This insight allows practitioners to prioritize configurations that yield the most substantial performance gains, streamlining the optimization process. By reducing computational demands and simplifying implementation, these findings make advanced LLM configuration strategies more accessible, even to those with limited expertise in machine learning.

Beyond the immediate application to supply chain risk management, this study offers a replicable and generalizable methodology for systematically optimizing LLM performance in diverse contexts. The Design of Experiments framework used herein ensures that the findings are not only statistically robust but also practically adaptable to a wide array of classification tasks, including fraud detection, customer sentiment analysis, and crisis management. This methodological adaptability highlights the scalability of the approach, underscoring its relevance to numerous domains where high accuracy and reliability are indispensable.

The study’s emphasis on iterative validation, with a particular focus on random sampling, further solidifies its practical contributions. Random sampling proved instrumental in verifying the consistency and robustness of the proposed parameter configurations, ensuring that the observed effects are generalizable across varied data conditions. This iterative and data-driven process minimizes the risks of overfitting while enhancing the external validity of the findings. The integration of such rigorous testing frameworks establishes a benchmark for deploying LLMs in environments characterized by variability and uncertainty.

In conclusion, the insights derived from this research extend far beyond the specific case of supply chain risk management as outlined in SCREWS. They provide a foundational approach for optimizing LLM configurations that is both systematic and adaptable, paving the way for further innovation in the deployment of AI-driven systems across complex and evolving operational landscapes. By coupling theoretical rigor with practical applicability, this study bridges the gap between academic research and real-world implementation, offering valuable tools for practitioners and researchers alike.

The aim of this research to increase the knowledge related to the impact of individual parameters and their interactions on classification precision for supply chain risk identification is to be considered achieved. By using the Doe process, a research design was set up in a structured manner which enabled findings against the background of the present use case and led to new insights on the performance of LLMs for supply chain risk identification. Finally, based on these findings, the initially derived research questions can be answered accordingly. Providing an answer to RQ1 (Which operational LLM parameters have the highest impact on the accuracy of supply chain risk classification based on external data?), it was found that the Temperature signifies the highest impact on the classification accuracy against the background of the given use case.

With regards to RQ2 (Which operational parameter configurations maximize the predictive accuracy of LLMs in supply chain risk classification, while ensuring robustness and adaptability across diverse operational scenarios?), it was found that the Temperature parameter significantly influences the model’s behavior, directly controlling the randomness of the generated outputs. Settings between 0.4 and 0.7 achieved a balance between diversity and determinism, enabling reliable binary classification results that aligned with the study’s goals. Marginal, positive accuracy variations through Top P became noticeable when the threshold exceeded 0.5. In combination, optimal results were achieved with a Temperature setting of 0.6, accompanied by a Top P value of 0.5.

In conclusion, this study underscores the pivotal role of operational parameter optimization in enhancing the efficacy of LLMs within the domain of supply chain risk classification. The findings emphasize the necessity of fine-tuning the Temperature and Top P to balance classification accuracy and output stability effectively. By leveraging a systematic Design of Experiments framework, this research has not only made an initial step into filling a critical gap but has also established a replicable, scalable methodology for optimizing LLM performance in this and related contexts.

However, certain limitations apply. It is to note, that a “randomness” variable is inherently not the intended use-case for the application of Doe. The randomness makes it more difficult to test the changes of other variables while maintaining a constant “randomness”. A high randomness is the same parameter but causes different outputs each time the process is initialized. This was mitigated by the high number of tests done.

Another key limitation lies in the binary nature of the classification task. While this approach allowed for focused evaluation of the input variables, it inherently limits the generalizability of the results to multi-class classification scenarios or more nuanced tasks that may involve multiple risk categories. Expanding the framework to encompass such complexities would enhance the practical relevance and applicability of the findings. Multi-class taxonomies for risk could also pose a viable alternative.

Finally, the experimental setup employed fixed LLM architectures without exploring the impact of different model architectures or sizes. Given the rapid evolution of LLM technologies, comparative analyses involving diverse architectures would add significant depth to the findings and extend their applicability.

Future research should seek to expand upon these contributions by addressing the identified limitations. Incorporating multi-class classification schemes, integrating more heterogeneous data sources, exploring a broader set of parameters, and systematically testing different LLM architectures will be essential for ensuring that the optimization framework remains both robust and versatile. This iterative process will further enable the refinement of AI-driven approaches, equipping practitioners and researchers with tools to navigate increasingly complex and dynamic operational environments with greater confidence and precision.

The paper is related to the project “Economic resilience in the Steinfurt district (WiReSt).” “The ‘WiReSt’ project is being implemented within the ‘Region gestalten’ program of the Federal Ministry of Housing, Urban Development, and Construction in cooperation with the Federal Institute for Research on Building, Urban, and Spatial Research. Das Vorhaben, WiReSt” wird innerhalb des Programms Region gestalten des Bundesministeriums für Wohnen, Stadtentwicklung und Bauwesen in Zusammenarbeit mit dem Bundesinstitut für Bau-, Stadt- und Raumforschung gefördert.

Table A1

Concept matrix

Scope of risk identificationMethod/approach/model applied for risk identificationLLM integrationData used
SourceStatic/one-timeDynamicReal rimeCompany individualLiterature reviewInterviews/surveysCase studyData mining/text miningDev of own approachLLM/NLP supportedParameter optimizationInternal data (e.g. supplier data)External data (e.g. news articles)Focus
Aboutorab et al. (2022)           
Aboutorab et al. (2023)            
Shishehgarkhaneh et al. (2024)          
Alamdari et al. (2021)             Green Construction 
Chukwuka et al. (2023)            Emergency Supply Chains 
de Sousa Jabbour et al. (2024)             Firm Reputation 
Deiva Ganesh and Kalpana (2022b)           
Hong and Kolios (2020)            Manufacturing 
Hou and Zao (2020)             
Karmaker et al. (2023)             Emerging Markets 
Krstić et al. (2024)             Circular SC 
Kusrini et al. (2021)            Organic Farming 
Li et al. (2022)             
Liu et al. (2024)          SC Finance 
Meziani et al. (2022)              
Mismar et al. (2022)            Last Mile 
Nagy et al. (2022)         
Pandey et al. (2020)           Cyber Security Risk 
Panjehfouladgaran and Lim (2020)            Reverse Logistics 
Ramiah et al. (2022)            Solar Photovolatic 
Rathor et al. (2024)           
Rezki and Mansouri (2024)           Delivery Delay Risk 
Rosales et al. (2019)            Brazil/Agri Food/ 
Salamai et al. (2019)        
Salamai et al. (2021)            
Shahsavari et al. (2024)          
Shahsavari et al. (2024)          
Shahsavari et al. (2024)          
Shishehgarkhaneh et al. (2024)          Construction SC 
Tama et al. (2019)            Chip SC 
Wang and Hu (2023)           Energy/Power 
Wang and Zhou (2024)            Construction SC 
Yadav et al. (2023)            
Zhang and He (2024)             Energy/Power 
Zhao et al. (2024a, b)           
Zhao et al. (2024a, b)             Agri Food 
Zhao et al. (2024a, b)           
Zhu et al. (2019)             
Scope of risk identificationMethod/approach/model applied for risk identificationLLM integrationData used
SourceStatic/one-timeDynamicReal rimeCompany individualLiterature reviewInterviews/surveysCase studyData mining/text miningDev of own approachLLM/NLP supportedParameter optimizationInternal data (e.g. supplier data)External data (e.g. news articles)Focus
Aboutorab et al. (2022)           
Aboutorab et al. (2023)            
Shishehgarkhaneh et al. (2024)          
Alamdari et al. (2021)             Green Construction 
Chukwuka et al. (2023)            Emergency Supply Chains 
de Sousa Jabbour et al. (2024)             Firm Reputation 
Deiva Ganesh and Kalpana (2022b)           
Hong and Kolios (2020)            Manufacturing 
Hou and Zao (2020)             
Karmaker et al. (2023)             Emerging Markets 
Krstić et al. (2024)             Circular SC 
Kusrini et al. (2021)            Organic Farming 
Li et al. (2022)             
Liu et al. (2024)          SC Finance 
Meziani et al. (2022)              
Mismar et al. (2022)            Last Mile 
Nagy et al. (2022)         
Pandey et al. (2020)           Cyber Security Risk 
Panjehfouladgaran and Lim (2020)            Reverse Logistics 
Ramiah et al. (2022)            Solar Photovolatic 
Rathor et al. (2024)           
Rezki and Mansouri (2024)           Delivery Delay Risk 
Rosales et al. (2019)            Brazil/Agri Food/ 
Salamai et al. (2019)        
Salamai et al. (2021)            
Shahsavari et al. (2024)          
Shahsavari et al. (2024)          
Shahsavari et al. (2024)          
Shishehgarkhaneh et al. (2024)          Construction SC 
Tama et al. (2019)            Chip SC 
Wang and Hu (2023)           Energy/Power 
Wang and Zhou (2024)            Construction SC 
Yadav et al. (2023)            
Zhang and He (2024)             Energy/Power 
Zhao et al. (2024a, b)           
Zhao et al. (2024a, b)             Agri Food 
Zhao et al. (2024a, b)           
Zhu et al. (2019)             
Source(s): Own illustration

1.

WiReSt is a multi-year project funded by the German Federal Ministry of Housing, Urban Development and Building and the Federal Institute for Research on Building, Urban Affairs and Spatial Development. The project is led by Münster University of Applied Sciences and the WESt mbH (https://westmbh.de/wirest/).

2.

It has to be noted that a customization/alignment of the LLM itself is also possible in theory. However, this is outside the application-oriented focus of this work.

3.

According to the documentation of OpenAI – However, there is no specification on how the selection is conducted.

4.

Due to the operational nature of this step, no further details on operationalization are presented at this point. Both raw data and the configuration of the set-up for automated testing will be made available by the researchers on request.

Aboutorab
,
H.
,
Hussain
,
O.K.
,
Saberi
,
M.
and
Hussain
,
F.K.
(
2022
), “
A reinforcement learning-based framework for disruption risk identification in supply chains
”,
Future Generation Computer Systems
, Vol. 
126
, pp. 
110
-
122
, doi: .
Aboutorab
,
H.
,
Saberi
,
M.
,
Hussain
,
O.K.
and
Hussain
,
F.K.
(
2023
), “
POSSUM: PrOactive diSruption riSk identification for sUpply chain Management
”,
2023 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
, pp.
319
-
321
. doi: .
Alamdari
,
A.
,
Jabarzadeh
,
Y.
,
Samson
,
D.
and
Sanoubar
,
N.
(
2021
), “
Supply chain risk factors in green construction of residential mega projects – interactions and categorization
”,
Engineering, Construction and Architectural Management
, Vol. 
30
No. 
2
, pp.
568
-
597
, doi: .
Aqlan
,
F.
and
Lam
,
S.S.
(
2016
), “
Supply chain optimization under risk and uncertainty: a case study for high-end server manufacturing
”,
Computers and Industrial Engineering
, Vol. 
93
, pp. 
78
-
87
, doi: .
Chopra
,
S.
and
Sodhi
,
M.S.
(
2004
), “
Managing risk to avoid supply-chain breakdown
”,
MIT Sloan Management Review
, Vol. 
46
, pp. 
52
-
61
.
Chu
,
C.-Y.
,
Park
,
K.
and
Kremer
,
G.E.
(
2020
), “
A global supply chain risk management framework: an application of text-mining to identify region-specific supply chain risks
”,
Advanced Engineering Informatics
, Vol. 
45
, 101053, doi: .
Chukwuka
,
O.J.
,
Ren
,
J.
,
Wang
,
J.
and
Paraskevadakis
,
D.
(
2023
), “
A comprehensive research on analyzing risk factors in emergency supply chains
”,
Journal of Humanitarian Logistics and Supply Chain Management
, Vol. 
13
No. 
3
, pp.
249
-
292
, doi: .
Cigolini
,
R.
and
Rossi
,
T.
(
2010
), “
Managing operational risks along the oil supply chain
”,
Production Planning and Control
, Vol. 
21
No. 
5
, pp. 
452
-
467
, doi: .
Cooper
,
H.M.
(
1988
), “
Organizing knowledge syntheses: a taxonomy of literature reviews
”,
Knowledge in Society
, Vol. 
1
No. 
1
, pp. 
104
-
126
, doi: .
de Sousa Jabbour
,
A.B.
,
Fiorini
,
P.D.C.
,
Latan
,
H.
,
Laguir
,
I.
and
Chiappetta Jabbour
,
C.J.
(
2024
), “
Supply chain risk identification: signaling companies’ social sustainability reputation
”,
Journal of Cleaner Production
, Vol. 
478
 
N/A
, 143817, doi: .
Deiva Ganesh
,
A.
and
Kalpana
,
P.
(
2022a
), “
Future of artificial intelligence and its influence on supply chain risk management – a systematic review
”,
Computers and Industrial Engineering
, Vol. 
169
, 108206, doi: .
Deiva Ganesh
,
A.
and
Kalpana
,
P.
(
2022b
), “
Supply chain risk identification: a real-time data-mining approach
”,
Industrial Management and Data Systems
, Vol. 
122
No. 
5
, pp. 
1333
-
1354
, doi: .
Eschenbächer
,
J.
,
Dircksen
,
M.
,
Kühl
,
L.
and
Wiethölter
,
J.
, “
Initial approach for AI based real time global risk assessment in SCM
”,
Proceedings of the 27th International Symposium on Logistics
, pp. 
75
-
76
,
available at:
 https://www.islconf.org/wp-content/uploads/2023/07/ISL_2023_Final_Proceedings.pdf
Gurtu
,
A.
and
Johny
,
J.
(
2021
), “
Supply chain risk management: literature review
”,
Risks
, Vol. 
9
No. 
1
, p.
16
, doi: .
Handfield
,
R.B.
,
Graham
,
G.
and
Burns
,
L.
(
2020
), “
Corona virus, tariffs, trade wars and supply chain evolutionary design
”,
International Journal of Operations and Production Management
, Vol. 
40
No. 
10
, pp. 
1649
-
1660
, doi: .
Heckmann
,
I.
,
Comes
,
T.
and
Nickel
,
S.
(
2015
), “
A critical review on supply chain risk – definition, measure and modeling
”,
Omega
, Vol. 
52
, pp. 
119
-
132
, doi: .
Ho
,
W.
,
Zheng
,
T.
,
Yildiz
,
H.
and
Talluri
,
S.
(
2015
), “
Supply chain risk management: a literature review
”,
International Journal of Production Research
, Vol. 
53
No. 
16
, pp. 
5031
-
5069
, doi: .
Hong
,
T.
and
Kolios
,
A.
(
2020
), “
A framework for risk management of large-scale organisation supply chains
”,
2020 International Conference on Decision Aid Sciences and Application (DASA)
, pp.
948
-
953
, doi: .
Hou
,
J.
and
Zao
,
X.
(
2020
), “
Toward a supply chain risk identification and filtering framework using systems theory
”,
Asia Pacific Journal of Marketing and Logistics
, Vol. 
33
No. 
6
, pp.
1482
-
1497
, doi: .
Ivanov
,
D.
(
2021
), “Supply chain risks, disruptions, and ripple effect”, in
Ivanov
,
D.
(Ed.),
Introduction to Supply Chain Resilience
,
Springer International Publishing
,
Cham
, pp. 
1
-
28
.
Karmaker
,
C.L.
,
Aziz
,
R.A.
,
Palit
,
T.
and
Bari
,
A.B.M.M.
(
2023
), “
Analyzing supply chain risk factors in the small and medium enterprises under fuzzy environment: Implications towards sustainability for emerging economies
”,
Sustainable Technology and Entrepreneurship
, Vol. 
2
No. 
1
, 100032, doi: .
Kleppmann
,
W.
(
2020
),
Versuchsplanung: Produkte und Prozesse optimieren
, (10th ed.) ,
Hanser
,
München
.
KMPG & ASCM
(
2024
), “
Navigating supply chain volatility: actionable insights from the 2023 index
”,
available at:
 https://kpmg.com/kpmg-us/content/dam/kpmg/pdf/2024/navigating-supply-chain-volatility.pdf
Krstić
,
M.
,
Agnusdei
,
L.
,
Palmi
,
P.
and
Baležentis
,
T.
(
2024
), “
Enabling organizations to strategically manage risks in circular supply chains
”,
Business Strategy and the Environment
, Vol. 
33
No. 
6
, pp.
5996
-
6009
, doi: .
Kusrini
,
E.
,
Aini
,
N.
,
Putri
,
A.R.
and
Syufrian
,
B.
(
2021
), “
Risk mitigation strategy using the house of risk (HOR) method for organic farming supplier in sustainable supply chain
”,
2021 International Conference on Data Analytics for Business and Industry (ICDABI)
,
Sakheer, Bahrain
,
25-26 October
,
IEEE
, pp. 
486
-
492
.
La Vega
,
M.
(
2023
), “
Understanding open AI’s ‘Temperature’, and ‘Top P’ parameters in language models
”.
Li
,
Y.
,
Ma
,
X.
,
Liu
,
Y.
and
Hu
,
L.
(
2022
), “Research on risk analysis of supply chain system based on SML control model”,
Proceedings of the 2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI)
, pp.
277
-
281
, doi: .
Liu
,
Y.
,
Li
,
S.
,
Yu
,
C.
and
Lv
,
M.
(
2024
), “
Research on green supply chain finance risk identification based on two-stage deep learning
”,
Operations Research Perspectives
, Vol. 
13
, 100311, doi: .
Meziani
,
A.
,
Bourouis
,
A.
and
Chebout
,
M.S.
(
2022
), “
Neutrosophic data analytic hierarchy process for multi criteria decision making: applied to supply chain risk managment
”,
2022 International Conference on Advanced Aspects of Software Engineering (ICAASE)
, pp.
1
-
6
, doi: .
Mismar
,
H.
,
Shamayleh
,
A.
and
Qazi
,
A.
(
2022
), “
Prioritizing risks in last mile delivery: a Bayesian belief network approach
”,
IEEE Access
, Vol. 
10
, pp. 
118551
-
118562
, doi: .
Modgil
,
S.
,
Singh
,
R.K.
and
Hannibal
,
C.
(
2022
), “
Artificial intelligence for supply chain resilience: learning from Covid-19
”,
The International Journal of Logistics Management
, Vol. 
33
No. 
4
, pp. 
1246
-
1268
, doi: .
Nagy
,
J.
,
Foltin
,
P.
and
Ondryhal
,
V.
(
2022
), “
Use of big data analysis to identify possible sources of supply chain disruption through the DOTMLPFI method
”,
Logforum
, Vol. 
18
No. 
3
, pp. 
309
-
319
, doi: .
Pandey
,
S.
,
Singh
,
R.K.
,
Gunasekaran
,
A.
and
Kaushik
,
A.
(
2020
), “
Cyber security risks in globalized supply chains: conceptual framework
”,
Journal of Global Operations and Strategic Sourcing
, Vol. 
13
No. 
1
, pp. 
103
-
128
, doi: .
Panjehfouladgaran
,
H.
and
Lim
,
S.F.W.
(
2020
), “
Reverse logistics risk management: identification, clustering and risk mitigation strategies
”,
Management Decision
, Vol. 
58
No. 
7
, pp. 
1449
-
1474
, doi: .
Ramiah
,
C.
,
Dookhun
,
V.
,
Ramgolam
,
Y.K.
and
Sultan
,
R.
(
2022
), “
Risk identification of the solar PV value chain in Mauritius
”,
2022 7th International Conference on Environment Friendly Energies and Applications (EFEA)
,
Bagatelle Moka MU, Mauritius
,
14-16 December
,
IEEE
, pp. 
1
-
5
.
Rathor
,
K.
,
BV
,
D.
,
D
,
A.J.
,
Pal
,
S.
,
P
,
B.
and
Nishant
,
N.
(
2024
), “
Temporal threat recognition in supply chains: integrating hidden Markov models for proactive security with AI-driven automated threat hunting
”, in
2024 International Conference on Advances in Modern Age Technologies for Health and Engineering Science (AMATHE)
,
Springer
, pp.
1
-
6
. doi: .
Rebala
,
G.
,
Ravi
,
A.
and
Churiwala
,
S.
(
2019
), “Machine learning definition and basics”, in
Rebala
,
G.
,
Ravi
,
A.
and
Churiwala
,
S.
(Eds),
An Introduction to Machine Learning
,
Springer International Publishing
,
Cham
, pp. 
1
-
17
.
Rezki
,
N.
and
Mansouri
,
M.
(
2024
), “
Machine learning for proactive supply chain risk management: predicting delays and enhancing operational efficiency
”,
Management Systems in Production Engineering
, Vol. 
32
No. 
3
, pp. 
345
-
356
, doi: .
Richey
,
R.G.
,
Chowdhury
,
S.
,
Davis‐Sramek
,
B.
,
Giannakis
,
M.
and
Dwivedi
,
Y.K.
(
2023
), “
Artificial intelligence in logistics and supply chain management: a primer and roadmap for research
”,
Journal of Business Logistics
, Vol. 
44
No. 
4
, pp. 
532
-
549
, doi: .
Rosales
,
F.P.
,
Oprime
,
P.C.
,
Royer
,
A.
and
Batalha
,
M.O.
(
2019
), “
Supply chain risks: findings from Brazilian slaughterhouses
”,
Supply Chain Management: An International Journal
, Vol. 
25
No. 
3
, pp. 
343
-
357
, doi: .
Salamai
,
A.
,
El-Kenawy
,
E.S.M.
and
Abdelhameed
,
I.
(
2021
 
In this issue
), “
Dynamic Voting Classifier for Risk Identification in Supply Chain 4.0
”,
Computers, Materials & Continua
, Vol. 
69
No. 
3
, pp.
3749
-
3766
, doi: .
Salamai
,
A.
,
Hussain
,
O.
and
Saberi
,
M.
(
2019
 
In this issue
), “
Decision support system for risk assessment using fuzzy inference in supply chain big data
”,
International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS)
, pp.
248
-
253
, doi: .
Schroeder
,
M.
and
Lodemann
,
S.
(
2021
), “
A systematic investigation of the integration of machine learning into supply chain risk management
”,
Logistics
, Vol. 
5
No. 
3
, p.
62
, doi: .
Shahsavari
,
M.
,
Hussain
,
O.K.
,
Saberi
,
M.
and
Sharma
,
P.
(
2024
), “
Event identification for supply chain risk management through news analysis by using large language models
”,
The Review of Socionetwork Strategies
, Vol. 
18
No. 
2
, pp. 
255
-
278
, doi: .
Sheikh
,
H.
,
Prins
,
C.
and
Schrijvers
,
E.
(
2023
), “Artificial intelligence: definition and background”, in
Sheikh
,
H.
,
Prins
,
C.
and
Schrijvers
,
E.
(Eds),
Mission AI
,
Springer International Publishing
,
Cham
, pp. 
15
-
41
.
Shishehgarkhaneh
,
M.B.
,
Moehler
,
R.C.
,
Fang
,
Y.
,
Hijazi
,
A.A.
and
Aboutorab
,
H.
(
2024
), “
Transformer-based named entity recognition in construction supply chain risk management in Australia
”,
IEEE Access
, Vol. 
12
, pp. 
41829
-
41851
, doi: .
Shishehgarkhaneh
,
M.B.
,
Moehler
,
R.C.
,
Fang
,
Y.
,
Hijazi
,
A.A.
and
Aboutorab
,
H.
(
2024
), “
Transformer-based named entity recognition in construction supply chain risk management in Australia
”,
IEEE Access
, Vol. 
12
, pp.
41829
-
41851
, doi: .
Tama
,
I.
,
Yuniarti
,
R.
,
Eunike
,
A.
,
Hamdala
,
I.
and
Azlia
,
W.
(
2019
), “
Risk identification in cassava chip supply chain using SCOR (Supply Chain Operation Reference)
”,
IOP Conference Series: Materials Science and Engineering
, Vol. 
494
, 012050, doi: .
Tang
,
C.S.
(
2006
), “
Perspectives in supply chain risk management
”,
International Journal of Production Economics
, Vol. 
103
No. 
2
, pp. 
451
-
488
, doi: .
Tang
,
O.
and
Nurmaya Musa
,
S.
(
2011
), “
Identifying risk issues and research advancements in supply chain risk management
”,
International Journal of Production Economics
, Vol. 
133
No. 
1
, pp. 
25
-
34
, doi: .
Truong Quang
,
H.
and
Hara
,
Y.
(
2018
), “
Risks and performance in supply chain: the push effect
”,
International Journal of Production Research
, Vol. 
56
No. 
4
, pp. 
1369
-
1388
, doi: .
Vaswani
,
A.
,
Shazeer
,
N.
,
Parmar
,
N.
,
Uszkoreit
,
J.
,
Jones
,
L.
,
Gomez
,
A.N.
,
Kaiser
,
L.
and
Polosukhin
,
I.
(
2017
), “
Attention is all you need
”, doi: .
vom Brocke
,
J.
,
Simons
,
A.
,
Niehaves
,
B.
,
Riemer
,
K.
,
Plattfaut
,
R.
and
Cleven
,
A.
(
2009
), “
Reconstructing the giant: on the importance of rigour in documenting the literature search process
”,
ECIS 2009 Proceedings
.
Wang
,
L.
and
Hu
,
X.
(
2023
), “Research on power supply chain risk early warning system based on ACO-SVM algorithm”, in
2023 International Symposium on Intelligent Robotics and Systems (ISoIRS)
, pp.
101
-
105
.
Wang
,
H.
and
Zhou
,
Z.
(
2024
), “
Identification of key risk nodes and invulnerability analysis of construction supply chain networks
”,
Buildings
, Vol. 
14
No. 
7
, p.
1997
, doi: .
Xu
,
F.F.
,
Alon
,
U.
,
Neubig
,
G.
and
Hellendoorn
,
V.J.
(
2022
), “
A systematic evaluation of large language models of code
”,
in Chaudhuri, S. and Sutton, C. (Eds), Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, MAPS '22: 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego CA USA
,
13 06 2022 13 06 2022
,
ACM
,
New York, NY
, pp. 
1
-
10
.
Yadav
,
S.
,
Pilli
,
D.
,
Senthil Kumar
,
M.K.
,
Kaushal
,
D.
,
Kaliappan
,
S.
and
Maranan
,
R.
(
2023
), “Managing and assessing the risk management of supply chain using the A-BiGRU-CNN approach”, in
2023 7th International Conference on Electronics, Communication and Aerospace Technology (ICECA)
, pp.
553
-
558
.
Zhang
,
Q.
and
He
,
Y.
(
2024
), “
The application of cluster analysis algorithm in supply chain risk identification
”,
Scalable Computing: Practice and Experience
, Vol. 
25
No. 
5
, pp. 
3580
-
3586
, doi: .
Zhao
,
G.
,
Olan
,
F.
,
Liu
,
S.
,
Hormazabal
,
J.H.
,
Lopez
,
C.
,
Zubairu
,
N.
,
Zhang
,
J.
and
Chen
,
X.
(
2024a
), “
Links between risk source identification and resilience capability building in agri-food supply chains: a comprehensive analysis
”,
IEEE Transactions on Engineering Management
, Vol. 
71
, pp. 
13362
-
13379
, doi: .
Zhao
,
M.
,
Hussain
,
O.
,
Zhang
,
Y.
,
Saberi
,
M.
and
Leshob
,
A.
(
2024b
), “
Enhancing supply chain risk management with large language models: software prototyping and interactive visualization
”,
2024 IEEE International Conference on e-Business Engineering (ICEBE)
,
Shanghai, China
,
11-13 October
,
IEEE
, pp. 
284
-
291
.
Zhu
,
Q.
,
Liu
,
L.
and
He
,
Y.
(
2019
), “
Application of process analysis based on value objective improvement in risk identification of supply chain
”, in
2019
 
Chinese Automation Congress (CAC)
, doi:
Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at Link to the terms of the CC BY 4.0 licence.

or Create an Account

Close Modal
Close Modal