
Humans and machines: Ethical collaborations in evaluation
Ethical collaborations between humans and AI in EU policy evaluations at ICF enhance data collection, analysis, and dissemination, while addressing challenges through robust governance and transparency.
As public policy evaluators at ICF, predominantly working for the European Commission and other EU and international institutions, we have been eagerly following the rise of AI. Our collaboration with data scientists has yielded some very promising results, yet also brought some challenges and dilemmas to the fore, which we share in this article.
The unique features of EU evaluations
EU evaluations usually require collecting data in all EU Member States (and beyond, depending on the evaluation object), which means covering 27 different national contexts and nearly as many languages. While EU evaluations are usually extremely data rich (think, for example, of the large databases compiling details on all projects under EU programmes like Horizon or Erasmus+), there are significant data comparability and quality issues. Evaluations of EU policies, programmes, and legislation also always require a mixed-method, multi-source approach. Evaluating EU policies, programmes, and legislation thus comes with some rather distinctive challenges that, at least in part, AI can help tackle successfully. EU evaluations also often address pressing social, economic, and environmental challenges, and their timely completion is crucial to serving the public interest. Incorporating AI can help meet that public interest more efficiently.
Applying AI in evaluations
Although AI is not a new phenomenon, it gained significant public attention with the advent of technologies like ChatGPT. The term “machine learning” was coined as early as 1959 by Arthur Samuel, a computer gaming and AI pioneer at IBM, and the first natural language processing (NLP) techniques date back to the late 1950s as well.
At ICF, we have been doing evaluations and research for many years and have been using artificial intelligence technologies in our work since the early 2000s. Starting with text mining software that involved some degree of machine learning, and progressing to the use of sentiment analysis to collect and analyse data in evaluations by the mid-2000s, we have learned the strengths and weaknesses of these technologies and how to employ them advantageously.
The most recent AI technologies offer a raft of new and exciting possibilities for conducting evaluations: not only can AI process vast amounts of data in a very short time, it can also support data collection, assist with more complex triangulation and analysis, and even help with dissemination. On the other hand, its use also raises new ethical dilemmas and comes with new risks.
From everyday to transformative AI solutions
To date, at ICF we have mostly applied AI tools and technologies in evaluations in ways we would categorise as “everyday AI use”: supporting particular types of data collection and analysis, including surveys, interview write-ups, and social media analysis, using off-the-shelf large language models (LLMs). These uses yield important benefits, allowing us to analyse large volumes of data and deliver quick insights that would not otherwise have been achievable within the available time and resources.
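To make the idea of “everyday AI use” concrete, the sketch below classifies the sentiment of open-text survey responses with an off-the-shelf model. It is a minimal illustration only: the model name and the example responses are placeholders, not the specific tools or data we use at ICF.

```python
# Minimal sketch: sentiment-classify open-text survey responses with an
# off-the-shelf model (illustrative only; not ICF's actual tooling).
from transformers import pipeline

# A generic, publicly available sentiment model; any comparable model would do.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

responses = [
    "The programme gave our organisation access to partners we could not have reached alone.",
    "Reporting requirements were disproportionate to the size of the grant.",
]

for response, result in zip(responses, classifier(responses)):
    # Each result contains a label (POSITIVE/NEGATIVE) and a confidence score.
    print(f"{result['label']:>8}  {result['score']:.2f}  {response}")
```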
Increasingly, we have also been rolling out more “transformative AI uses”, which offer important advantages because the models can be trained on extensive factual information about the wider context, so that the LLM understands the evaluation object better. This deeper understanding is achieved through tailor-made AI software, as opposed to the more generic, off-the-shelf models used in everyday applications. Tailor-made AI systems are specifically trained and fine-tuned to address particular evaluation questions, which enhances their relevance and accuracy. By adapting the algorithms to the unique context and objectives of each evaluation, these bespoke solutions can offer more precise and insightful analyses than their off-the-shelf counterparts. This helps the model focus on retrieving highly relevant information, avoid so-called “hallucinations,” and provide better, more complete responses to queries. This approach has allowed us, for example, to obtain a highly reliable analysis, per evaluation question, of hundreds of programme reports. It has also enabled us to formulate specific follow-up queries to obtain more in-depth information on trends and developments that looked interesting in the initially generated analysis.
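One common way to give a model this kind of grounding in the evaluation object is to retrieve the most relevant passages from the programme reports before the model answers each evaluation question. The sketch below shows that retrieval step in its simplest form; the model name, passages, and question are illustrative assumptions, and a production pipeline would involve considerably more than this.

```python
# Simplified sketch of grounding analysis in programme reports via retrieval:
# for each evaluation question, find the most relevant report passages before
# asking a language model to summarise them. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model

# In practice these passages would come from chunking hundreds of reports.
passages = [
    "Project X trained 1,200 teachers in digital skills across three regions.",
    "Administrative delays reduced the budget absorption rate in year one.",
    "Stakeholders reported improved cross-border cooperation after 2021.",
]
question = "To what extent did the programme improve cross-border cooperation?"

passage_vecs = model.encode(passages, normalize_embeddings=True)
question_vec = model.encode([question], normalize_embeddings=True)[0]

# Cosine similarity (vectors are normalised, so a dot product suffices).
scores = passage_vecs @ question_vec
top = np.argsort(scores)[::-1][:2]

# The retrieved passages would then be passed, together with the question,
# to an LLM so its answer is anchored in the source reports rather than free recall.
for i in top:
    print(f"{scores[i]:.2f}  {passages[i]}")
```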
AI technologies offer tremendous opportunities: not only can they support data collection and initial analysis; if trained well, these tools can also help with more sophisticated analytical techniques such as trend analysis, predictive modelling and foresight, and real-time evaluation. In the U.S. we have even been developing (though not as part of an evaluation) a chatbot able to interact with users and answer queries on HIV by retrieving documentation relevant to their query in a fully private and anonymous way. Such technology could possibly also be considered in the future for surveying people about sensitive subjects.
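As the simplest possible stand-in for the trend analysis mentioned above, the sketch below fits a linear trend to a yearly programme indicator and projects it forward. The figures are invented for illustration; real predictive modelling and foresight work would use richer models and proper uncertainty estimates.

```python
# Minimal stand-in for trend analysis: fit a simple linear trend to a yearly
# programme indicator and project it forward. Data are invented for illustration.
import numpy as np

years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
participants = np.array([4100, 4450, 3900, 4800, 5200, 5600])

slope, intercept = np.polyfit(years, participants, deg=1)

for year in (2024, 2025):
    projection = slope * year + intercept
    print(f"{year}: projected participants approx. {projection:,.0f}")
```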
Ethical considerations and risk mitigation strategies
Using AI, whether for professional or personal purposes, carries well-documented risks, including data privacy and security breaches, and these risks are just as relevant in evaluations.
Based on our experience, here are some risks of using AI in evaluations and our strategies for mitigating them.
1. Ensuring human oversight to mitigate hallucination risks
One of the most pressing issues is the risk of unbalanced and unfair judgements in evaluations, and of analyses that do not reflect contextual subtleties, e.g., in terms of political and cultural understanding. Another related risk to avoid is AI producing inaccurate findings (also called ‘hallucinations’).
It is crucial to have a deep technical understanding of AI capabilities and limitations, to make sure that the data used to train AI systems are representative, and to monitor and audit those systems continuously if they are used over a longer period. It is equally important to ensure human oversight and involvement in all stages of data collection, analysis, triangulation, and synthesis. While the efficiency gains of AI are significant, it lacks the subtle understanding of context that only humans can provide. For example, analyses produced by AI should be double-checked and, where possible, human test analyses should be undertaken to compare results, as sketched below.
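A simple version of such a comparison is to have a human re-code a sample of the items the AI has already coded and measure how closely the two agree. The snippet below is a minimal sketch of that check; the codes and the 0.7 threshold are illustrative assumptions rather than a fixed ICF standard.

```python
# Sketch of one oversight check: compare AI-generated codes with a human
# re-coding of a sample of the same items and flag low agreement for review.
from sklearn.metrics import cohen_kappa_score

ai_codes    = ["relevant", "relevant", "not_relevant", "relevant", "not_relevant"]
human_codes = ["relevant", "not_relevant", "not_relevant", "relevant", "not_relevant"]

kappa = cohen_kappa_score(ai_codes, human_codes)
print(f"Cohen's kappa (AI vs. human sample coding): {kappa:.2f}")

# An illustrative project-specific threshold; below it, the AI-assisted
# analysis would be re-run or corrected with closer human involvement.
if kappa < 0.7:
    print("Agreement below threshold: escalate for full human review.")
```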
A collaborative approach between evaluators, data scientists, policy specialists, and ethics experts throughout the evaluation cycle works best to mitigate hallucination risks.
2. Establishing a robust governance framework for data privacy and security
Any evaluation that includes stakeholder consultation, whether it uses AI or not, should make sure that everyone interviewed, surveyed, or participating in workshops provides informed consent for their participation. However, AI has brought to light a few new challenges, such as whether consent is needed from social media users when AI is used for sentiment analysis on platforms like X, Facebook, or LinkedIn, or whether evaluation stakeholders should be made aware that the information they provide will be analysed by an AI tool.
At ICF, we rely on responsible-use principles that set out privacy information notices and consent forms to ensure that the data we use comply with the licence terms of social media and web platforms and with intellectual property rights. This governance framework is overseen by an Internal AI Review Board that reviews and approves the use of AI in evaluation projects (or parts of projects) that pose a high risk (e.g., AI chatbots that interact with vulnerable groups).
3. Fostering trust through transparency and accountability
In evaluations, it is essential that the evaluator can explain the link between the judgement and the evidence base, but this can be a challenge with off-the-shelf tools that are designed for everyday AI use.
This is why the choice of AI tool is very important: some models offer interfaces that allow the interpretive process to be traced. Increasingly, however, we prefer working with our own fine-tuned models, so that our data scientists, when developing AI solutions, can make sure that all algorithms and analytical steps are well documented and explained in terms a non-expert reader can follow.
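One practical way to keep the link between judgement and evidence traceable is to record, for every AI-assisted finding, the question asked, the source excerpts the model was shown, the model version, and the human reviewer. The sketch below shows one possible structure for such a record; the field names and values are illustrative, not a description of our actual systems.

```python
# Sketch of recording provenance for each AI-assisted finding, so the link
# between judgement and evidence can be traced later. Fields are illustrative.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FindingRecord:
    evaluation_question: str
    source_excerpts: list      # the passages the model was shown
    model_version: str
    generated_finding: str
    reviewed_by: str           # the human evaluator who validated the output
    timestamp: str

record = FindingRecord(
    evaluation_question="To what extent did the programme improve cross-border cooperation?",
    source_excerpts=["Stakeholders reported improved cross-border cooperation after 2021."],
    model_version="in-house-evaluation-model-v2",
    generated_finding="Evidence points to improved cooperation from 2021 onwards.",
    reviewed_by="lead evaluator",
    timestamp=datetime.now(timezone.utc).isoformat(),
)

# One JSON line per finding keeps the audit trail easy to store and query.
print(json.dumps(asdict(record)))
```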
An ethical approach to AI
In summary, our comprehensive approach integrates ethical considerations, rigorous technical oversight, and a strong commitment to transparency. By fostering continuous learning and improvement, we ensure that our AI systems enhance human judgment while maintaining ethical integrity and inclusivity.