Abstract
Generative Artificial Intelligence (GenAI) tools are increasingly used in educational contexts for tasks such as content generation, feedback support and tutoring. However, adoption often outpaces the availability of evidence, guidance and shared standards. This article argues that systematic evaluation is essential to ensure pedagogical value, protect learners and teachers, and support compliant, human-centred implementation. Building on international guidance and emerging regulatory expectations, it outlines key dimensions for evaluating GenAI tools in secondary education and highlights how the GenAI4ED project contributes to evidence-informed decision-making.
Key takeaways
- GenAI adoption in education is accelerating; governance and evidence are evolving more slowly.
- Evaluation must go beyond technical performance to include pedagogy, fairness, privacy, and wellbeing.
- Because GenAI systems change rapidly, evaluation should be continuous rather than a one-off procurement step.
- The GenAI4ED project develops a structured, human-centred assessment approach for secondary education.
Introduction
Generative Artificial Intelligence (GenAI), including large language models (LLMs) capable of producing text, images and other media, has moved rapidly from research to mainstream use. In education, GenAI tools are already being used for lesson planning, drafting learning materials, generating exercises and supporting feedback. International organisations note that the speed of adoption is creating new demands for governance, guidance and capacity-building across education systems (Organisation for Economic Co-operation and Development [OECD], 2023; UNESCO, 2023).
Why “try it and see” is not enough
In many schools and institutions, decisions about GenAI tools may be driven by availability, popularity or perceived inevitability rather than robust evidence of educational benefit. Research highlights both the opportunities of LLMs for education and their limitations and risks, including biased outputs, overreliance, misuse and the need for new competencies to interpret and validate outputs responsibly (Kasneci et al., 2023). Consequently, implementation requires more than experimentation: it requires structured evaluation and human oversight aligned with educational aims and values (UNESCO, 2023).
What evaluation should mean in education
Evaluating GenAI tools for educational use should extend beyond technical fluency or surface-level correctness. A credible evaluation approach should examine whether a tool supports learning objectives, protects fundamental rights, and contributes to equitable and safe learning environments. International guidance emphasises the need for guardrails, including data governance and privacy protections (OECD, 2023). In the European context, emerging regulation introduces a risk-based approach to AI governance; certain education-related uses of AI systems, such as systems intended to determine access or admission or to evaluate learning outcomes, are treated as high-risk, implying heightened expectations for risk management, transparency and oversight (European Union, 2024).
Pedagogical value and learning alignment
Evaluation should examine whether a tool supports curriculum-aligned understanding and meaningful learning processes, rather than encouraging superficial completion. Relevant questions include whether the tool’s feedback is actionable and age-appropriate, and whether it supports critical thinking rather than dependency (OECD, 2023).
Reliability, robustness and limitations
GenAI systems can generate plausible but incorrect or inconsistent outputs. Evaluation should assess output stability across prompts, common error patterns for relevant subject domains, and conditions under which the tool is least reliable. Responsible deployment therefore requires strategies for verification and appropriate human oversight (Kasneci et al., 2023).
Fairness, bias and inclusion
Because schools serve diverse learners, evaluation should consider whether outputs reproduce stereotypes or systematically disadvantage certain groups. A human-centred approach emphasises fairness, accessibility and inclusion as core considerations for responsible implementation (UNESCO, 2023).
Data protection, privacy and compliance readiness
Educational settings involve sensitive personal data. Evaluations should examine what data the tool processes, how data are stored and used, and what safeguards exist for privacy and data protection. International policy work highlights privacy and data governance as central issues raised by GenAI in education (OECD, 2023).
Human oversight and learner wellbeing
Evaluation should address whether the tool preserves teacher autonomy and professional judgement, supports safe classroom dynamics, and avoids harmful dependency or negative wellbeing impacts. UNESCO’s guidance emphasises human-centred implementation and capacity building for educators and institutions (UNESCO, 2023).
Why evaluation must be continuous
GenAI tools evolve rapidly: models, interfaces and usage patterns change over time. As a result, evaluation should not be treated as a one-off procurement step. Instead, it should be embedded into a continuous quality assurance process combining monitoring, classroom feedback and periodic reassessment. This is consistent with broader approaches to trustworthy AI that highlight robustness, transparency and accountability across the AI lifecycle (OECD, 2024).
GenAI4ED’s contribution
GenAI4ED supports evidence-informed decision-making by developing an AI-enhanced platform and structured assessment approach for GenAI tools in secondary education. Through research, co-design and pilot activities, the project aims to translate high-level principles—such as transparency, fairness and human oversight—into practical evaluation pathways that can inform classroom practice and organisational decision-making. By grounding evaluation in stakeholder perspectives and real-world use, GenAI4ED contributes to responsible integration where technology complements human educational expertise (OECD, 2023; UNESCO, 2023).
Conclusion
As GenAI becomes embedded in educational practice, evaluation is essential for distinguishing tools that genuinely support learning from those that introduce hidden pedagogical, ethical, legal or psychological risks. International guidance and emerging regulation converge on the need for evidence, governance and human-centred design. Systematic and continuous evaluation is therefore a prerequisite for responsible adoption in secondary education.
Author: Theodora Giatagana (Found.ation)
References
European Union. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L (12 July 2024).
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. doi:10.1016/j.lindif.2023.102274
Organisation for Economic Co-operation and Development. (2023). OECD Digital Education Outlook 2023: Towards an effective digital education ecosystem. OECD Publishing. doi:10.1787/c74f03de-en
Organisation for Economic Co-operation and Development. (2024). OECD AI Principles (updated 2024). OECD.AI.
UNESCO. (2023). Guidance for generative AI in education and research. UNESCO.