Beyond Price and Praise: A school leader’s checklist for evaluating AI tutoring claims

Daniel Mercer
2026-05-04
21 min read

A school-leader checklist for evaluating AI tutoring claims on alignment, evidence, safeguarding, scalability, and value for money.

Beyond Price and Praise: why AI tutoring procurement needs a school-leader lens

School leaders are being asked to make a harder decision than ever: not just whether an AI tutor is affordable, but whether it will actually improve learning in real classrooms. That is why a procurement checklist matters. Marketing claims can sound persuasive because they often combine the language of personalisation, time savings, and measurable impact, yet those claims rarely answer the questions governors, heads, and MAT leaders must ask about curriculum fit, safeguarding, workload, and value for money. If you are comparing AI tutoring options, the right question is not “Does it sound innovative?” but “What evidence shows this will help our pupils, in our context, safely and at scale?”

This guide is written for leaders evaluating AI tutoring in the same disciplined way they would assess any major intervention. It draws on the current UK tutoring market, where online delivery is now the norm and schools continue to scrutinise impact after the National Tutoring Programme era. In that market, products such as Skye have raised the bar for fixed-price, unlimited tutoring models, but strong branding should never replace due diligence. For a broader view of how schools are comparing platforms, it is worth reading our guide to the best online tutoring websites for UK schools and our analysis of how investors evaluate AI EdTech startups for real learning outcomes.

1. Start with the curriculum, not the sales deck

1.1 Check whether the tutor teaches your curriculum, not just “maths” or “English”

The first test of any AI tutoring product is simple: does it align with what your teachers actually teach? A platform can be impressive in a demo yet still be a poor fit if it follows an incompatible sequence of concepts, uses different terminology, or skips the exact curriculum milestones your pupils need for exams. In practice, school leaders should ask for mapped examples showing how the tutor supports specific year groups, topics, and assessment objectives. If a vendor cannot explain how its sessions match your schemes of work, then the product may be optimised for generic engagement rather than classroom usefulness.

This is where you should push beyond broad claims about “personalised learning.” Personalisation is only valuable when it is constrained by a curriculum logic that teachers recognise and trust. Ask how the system diagnoses misconceptions, how it determines next steps, and whether it can explain why it selected a particular task or prompt. If you are considering a maths tool like Skye, insist on seeing the relationship between the tutoring sequence and your school’s own teaching progression.

1.2 Look for evidence of subject-specific expertise

AI tutors are not all built the same. Some are broad chat-style tools that can discuss many topics, while others are designed around a single subject or phase, which can be an advantage if the underlying pedagogy is genuinely strong. A focused product often has a clearer instructional model and better quality control because it is solving a narrower problem. But a narrower scope only helps if the company can show subject-matter expertise, editorial review, and careful instructional design.

As a leader, you should request sample lesson flows, worked examples, error analysis, and teacher guidance notes. You are looking for signs that the product team understands how pupils typically go wrong, not just what the right answer is. It can help to compare the vendor’s academic claims with the kind of evidence you would expect from other education choices, such as the thinking behind our guide to keeping students engaged in test prep, which shows how engagement only matters when it supports deliberate practice and retention.

1.3 Insist on compatibility with school systems and workflows

Even a well-aligned tutor can fail if it does not fit the way staff already work. Ask whether the product integrates with MIS, LMS, Google Classroom, Microsoft Teams, or your assessment platform. Leaders should also consider whether teachers can assign work, review usage, and export data without extra admin burden. If using the platform requires a separate login maze or duplicating student records manually, implementation costs may silently erode value for money.

For leaders responsible for digital estates, the lesson is similar to other technology purchases: interoperability is not a luxury. Our guide on mixing quality accessories with mobile devices makes the same point in a different context: the best tool is the one that works with the rest of the ecosystem. In schools, that ecosystem includes safeguarding processes, assessment routines, and teacher planning time.

2. Demand evidence that is usable, not just impressive

2.1 Separate pilot enthusiasm from sustained impact

Many products look powerful in pilots because novelty drives usage. The real question is whether the effect persists after the first few weeks, when staff enthusiasm fades and pupils encounter harder content. Ask vendors for evidence from multiple cohorts, not one showcase school, and request enough detail to judge whether the results are replicable. You should want to know who was included, how outcomes were measured, how long the intervention ran, and what happened after the pilot period ended.

This is where many school leaders need a stronger procurement discipline. If a provider cannot explain the control group, baseline attainment, or the intervention dosage, then the headline impact figure is not enough. Strong evidence should resemble the standards used in other risk-managed domains. For a useful analogy, see how our article on clinical validation for AI-enabled medical devices treats product claims as something to test, not assume. Schools do not need medical-grade regulation, but they do need the same seriousness about proof.
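
To see why those details matter, consider a toy comparison. The sketch below uses invented scores to show how a headline gain shrinks once you subtract what a comparable control group achieved without the tool, a crude difference-in-differences check rather than a formal study design.

```python
# Why baseline and control matter: a toy comparison with invented scores.
# A headline "pupils gained 8 points" means little until you subtract what
# similar pupils gained over the same period without the tool.

treatment = {"baseline": 48.0, "endpoint": 56.0}   # pupils using the AI tutor
control   = {"baseline": 47.5, "endpoint": 53.0}   # similar pupils without it

raw_gain      = treatment["endpoint"] - treatment["baseline"]  # 8.0 points
control_gain  = control["endpoint"] - control["baseline"]      # 5.5 points
adjusted_gain = raw_gain - control_gain                        # 2.5 points

print(f"Headline gain:  {raw_gain:.1f} points")
print(f"Adjusted gain:  {adjusted_gain:.1f} points over comparable pupils")
```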

2.2 Ask for tutor evidence, not just product metrics

When schools buy AI tutoring, they are often buying the promise of tutoring quality at scale. That makes tutor evidence essential. If the platform uses human tutors, ask how tutors are selected, trained, supervised, and rechecked. If the platform is fully AI-driven, ask what evidence exists that the pedagogical interactions themselves produce learning, not merely completion rates or happy-user scores. Leaders should want to see examples of how the system responds to common misconceptions, how it explains answers, and how it avoids simply giving away solutions.

A useful procurement habit is to ask for three kinds of evidence: attainment evidence, behaviour evidence, and teacher evidence. Attainment evidence shows whether scores improved. Behaviour evidence shows whether pupils actually used the tool consistently. Teacher evidence shows whether staff found the tool useful enough to keep using it without pressure. For a broader strategy on turning research into practical authority, our piece on turning analyst insights into content series offers a good model for how to convert raw information into decision-ready insight.

2.3 Ask what the vendor has not yet proven

Good school leaders ask about the limits of evidence, not just the highlights. Has the product been tested across different pupil groups, including disadvantaged pupils, SEND learners, or those with low prior attainment? Does the provider distinguish between short-term motivation and long-term mastery? Are impact claims based on independent research, internal studies, or both? A trustworthy vendor will explain what is still being tested and where the evidence base is thinner.

This is especially important with AI tutoring because the market is evolving quickly. A strong product today may still have an immature evidence base for particular age ranges or use cases. Leaders should treat this the same way they would treat other fast-moving categories, where early promise can outpace verification. The lesson from our article on using AI for PESTLE with a verification checklist applies neatly here: use AI as a tool, but verify its outputs before operationalising them.

3. Safeguarding and governance are not add-ons

3.1 Test data protection like you would test any sensitive system

AI tutoring platforms process student data, interaction data, and often voice or writing samples. That means leaders should ask exactly where data is stored, who can access it, whether it is used for model training, and how long it is retained. The school’s DPO or data protection lead should be involved early, not after the demo. If the provider offers vague statements about being “GDPR compliant,” that is not enough; request a data processing agreement, subprocessors list, breach procedures, and retention policy.

School procurement teams can borrow a lot from broader AI vendor governance. Our guide to AI vendor contracts and must-have clauses is useful because it shows how contractual language turns abstract risk into enforceable responsibility. The same logic applies to schools: define who owns the data, who deletes it, what happens at contract end, and how incidents are reported.

3.2 Ask how the system protects pupils in live interactions

Safeguarding is not just about data security. It is also about the risks created by dynamic conversation, hallucinated explanations, inappropriate prompts, and unsupervised messaging. Leaders should ask whether the tutor can engage in open chat or whether it is tightly constrained to curriculum content. The safest systems are usually not the most open-ended ones, especially for younger pupils, because limiting the conversation reduces the chance of unsafe or misleading interactions. If the tool includes human support, you also need to know how moderation and escalation work.

Schools should expect clear answers on content filtering, prompt restrictions, abuse detection, escalation routes, and audit logging. These controls should be documented, not assumed. If the vendor claims to have strong safeguards, ask for examples of how the system behaves when a pupil enters inappropriate content, asks for help outside the curriculum, or attempts to circumvent rules. In other words, test the system as a child might, not as a salesperson would.

3.3 Align procurement with your safeguarding culture

The best AI tutoring tools fit into existing safeguarding routines rather than replacing them. That means your DSL, SENCO, curriculum leads, and digital lead should all have a voice in the decision. You should also consider staff training: if teachers do not understand what the platform can and cannot do, they will either over-trust it or ignore it. Both outcomes reduce value and increase risk.

For a broader perspective on school technology safety, it helps to read about how systems are controlled at scale in our article on blocking harmful sites at scale. The principle is the same: safety is built through layered controls, clear permissions, and ongoing monitoring, not by a single policy statement.

4. Scalability: can it work for one class and a whole trust?

4.1 Estimate operational capacity before you commit

A tutoring tool that works with one enthusiastic department may fall apart when rolled out to twenty schools. Leaders should ask how many pupils can be supported concurrently, how onboarding is managed, and how technical support scales during peak periods. If the vendor is promising “unlimited tutoring,” define what unlimited means in practice: unlimited usage per pupil, unlimited concurrent sessions, or simply a fixed fee with fair-use limits? These details determine whether the offer is genuinely scalable or just cleverly marketed.
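
A quick back-of-envelope calculation can turn "unlimited" into a concrete question for the vendor. The sketch below uses invented numbers for a hypothetical trust to estimate peak concurrent sessions, which is the figure the vendor's infrastructure and support model actually have to handle.

```python
# Back-of-envelope concurrency estimate. Every number here is an invented
# assumption for a hypothetical trust; the point is the shape of the question.

schools = 20
pupils_per_school = 150
usage_window_hours = 2      # assume most sessions land in a 2-hour after-school window
session_minutes = 30

total_pupils = schools * pupils_per_school                    # 3,000 pupils
slots_in_window = usage_window_hours * 60 // session_minutes  # 4 session slots
peak_concurrent = total_pupils / slots_in_window              # assumes even spread

print(f"Rough peak concurrent sessions: {peak_concurrent:.0f}")  # 750
```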

Scale also means staff capacity. If teachers need to spend hours reviewing outputs or manually correcting recommendations, then the platform may transfer work rather than reduce it. By contrast, products with strong dashboards, assignment automation, and reporting can support wider adoption without overwhelming staff. That is one reason schools are watching fixed-price models like Skye closely: predictable pricing matters when MAT leaders must forecast budgets across many schools.

4.2 Check whether the vendor can support different phases and contexts

A trust-wide purchase should be evaluated across multiple contexts: primary, secondary, SEND settings, intervention groups, and exam classes. A vendor’s strongest case study may come from a highly coached school with a mature digital culture, but that does not prove the product will work equally well in a resource-constrained setting. Ask for implementation stories from a range of schools, including those with different attainment profiles and staffing pressures. Leaders should also ask what happens if usage falls below expectation: will the vendor help diagnose issues, or will they blame implementation?

This is where a well-run trial matters. It should include one or two high-readiness schools and one more typical school, so leaders can compare adoption conditions. The best evaluation process resembles the kind of staged experimentation recommended in our guide to a small-experiment framework: test quickly, gather meaningful signals, and scale only after confirming the intervention works under real constraints.

4.3 Consider the hidden labour of rollout

New technology rarely fails because of one dramatic problem. More often, it fails because of hidden work: account setup, timetable changes, staff briefing, monitoring, parental communication, and post-launch troubleshooting. When leaders assess value for money, they should include this labour in the full cost picture. A platform that appears cheap on paper may cost more overall if it requires frequent manual intervention or specialist staff oversight.

It can help to compare tutoring procurement with other decisions where the sticker price does not tell the full story. Our article on expert brokers and deal hunting explains why experienced buyers look beyond headline prices to terms, flexibility, and risk transfer. School leaders should do the same with AI tutoring contracts, especially if the vendor offers volume pricing, service tiers, or implementation support that changes the real economics of adoption.

5. A practical school-leader procurement checklist

5.1 The core questions every head or MAT lead should ask

Below is a practical checklist you can use in vendor meetings, steering groups, or trust procurement panels. It is designed to surface classroom value rather than surface-level polish. Leaders should insist that vendors answer each item with evidence, examples, and contract terms where relevant. If a response is vague, that usually means the issue has not been fully solved.

Procurement question | What a strong answer looks like | Red flags
Does it align with our curriculum? | Clear mapping to year groups, topics, and assessment objectives | Generic subject claims with no curriculum references
What evidence supports impact? | Independent or well-designed studies with sample sizes and methodology | Percent claims with no context or baseline
How is safeguarding managed? | Documented controls, moderation, escalation, and logging | "AI is safe by design" without specifics
How does it scale? | Defined onboarding, support model, and concurrent user capacity | Unclear fair-use limits or hidden extra fees
What is the true cost? | Transparent pricing, implementation costs, and contract exit terms | Low entry price but high labour or add-on costs

When you are comparing providers, it can also help to benchmark the market using a broader set of products. Our guide to the best online tutoring websites for UK schools gives a helpful snapshot of how different suppliers position themselves on subjects, delivery, and safeguarding. That comparison is especially useful when you need to explain a shortlist to governors or trustees.

5.2 Ask for a pilot design before you approve a full rollout

A pilot should not be a vague trial where everyone “has a go” and then reports back with impressions. A proper pilot has a hypothesis, a small set of success measures, a defined cohort, and a date for review. For example, you might test whether a Year 8 intervention group improves weekly quiz scores, whether teachers save time on planning, and whether attendance to sessions remains above a threshold. That kind of structure makes it much easier to judge whether the product deserves wider rollout.

Strong pilots also include failure criteria. If usage falls below a certain level, if teachers report excessive admin, or if pupils do not move on key misconceptions, the project should be paused rather than expanded. This disciplined approach reflects the thinking in our article on comparing appraisal systems, where the point is to evaluate tools by their real operational outputs, not their presentation.
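
To make this concrete, here is a minimal sketch of a pilot review. Every metric name and threshold is an illustrative assumption; the real values should be agreed with the steering group before the pilot launches, not fitted afterwards.

```python
# Illustrative pilot review sketch. The metric names and thresholds are
# hypothetical placeholders; real values should be agreed with the
# steering group before the pilot starts.

from dataclasses import dataclass

@dataclass
class PilotResults:
    quiz_score_gain: float      # mean change in weekly quiz scores, percentage points
    weekly_usage_rate: float    # share of the cohort completing a session each week
    teacher_admin_hours: float  # extra admin per teacher per week, on average
    session_attendance: float   # share of scheduled sessions attended

def review_pilot(r: PilotResults) -> str:
    """Apply pre-agreed success and failure criteria to pilot results."""
    # Failure criteria: breaching any one pauses the rollout.
    if r.weekly_usage_rate < 0.50 or r.teacher_admin_hours > 2.0:
        return "stop: usage or workload outside agreed limits"
    # Success criteria: all must hold before scaling.
    if r.quiz_score_gain >= 5.0 and r.session_attendance >= 0.80:
        return "scale: success criteria met"
    return "adapt: mixed results, gather more data before expanding"

print(review_pilot(PilotResults(6.2, 0.72, 1.1, 0.85)))  # -> scale: success criteria met
```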

5.3 Make value for money a total-cost conversation

Value for money in AI tutoring is not just annual licence cost. It includes staff time, training, implementation, reporting, curriculum mapping, support, and the cost of poor fit if the tool is underused. School leaders should ask vendors to price the full service, including what happens if usage expands, if onboarding requires extra consultancy, or if the platform needs integration work. A low quote that later demands add-ons is not good procurement, even if it appears budget-friendly at first glance.

To sharpen the conversation, compare AI tutoring with other technology purchases where the lowest price is not always the best deal. The logic in our piece on whether a discounted device is truly better value applies in schools too: the cheapest option can become expensive if it lacks durability, support, or compatibility. Leaders should therefore compare unit cost, expected learning gain, and implementation effort together.
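
A simple model makes the point. The sketch below uses invented figures for two hypothetical tools; the staff rate, weekly hours, and 39-week year are all assumptions to replace with your own context.

```python
# A minimal total-cost comparison, using invented figures. The staff rate,
# hours, and 39-week year are assumptions to adjust for your own context.

def total_cost(licence: float, setup: float, staff_hours_per_week: float,
               weeks: int = 39, staff_hourly_rate: float = 30.0) -> float:
    """Annual total cost: licence + one-off setup + ongoing staff labour."""
    return licence + setup + staff_hours_per_week * weeks * staff_hourly_rate

low_licence  = total_cost(licence=3_000, setup=500,   staff_hours_per_week=5)
high_licence = total_cost(licence=6_000, setup=1_500, staff_hours_per_week=1)

print(f"Low-licence tool:  £{low_licence:,.0f}")   # £9,350
print(f"High-licence tool: £{high_licence:,.0f}")  # £8,670
```

With these assumed numbers, the "cheaper" licence costs more over a year once staff labour is counted, which is exactly the conversation the quote alone will never start.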

6. What good implementation looks like in the first 90 days

6.1 Week 1 to 3: set the conditions for success

The first month determines whether an AI tutor becomes part of practice or a forgotten subscription. Leaders should start with clear ownership: who approves usage, who monitors data, who supports teachers, and who reviews impact. Teachers need a concise guide explaining when to assign the tutor, which pupils are most likely to benefit, and how to interpret the reports. Without that structure, even the strongest product can be used inconsistently.

Communication matters too. Pupils should understand that the tool is there to support learning, not replace effort or professional teaching. Parents may also need a plain-language explanation of the purpose, safeguards, and expected benefits. The clearer the expectations, the more likely the intervention is to be taken seriously rather than treated as a novelty.

6.2 Week 4 to 8: monitor usage and instructional quality

After launch, review not only login numbers but also the quality of interaction. Are pupils completing sessions? Are they getting stuck at the same point? Are teachers seeing misconceptions addressed, or is the system simply generating extra practice? This is the period when leaders can tell whether the tutor is genuinely adapting or merely delivering a fixed sequence with a modern interface.

For teams used to data dashboards, it may help to think in terms of friction points rather than vanity metrics. A high completion rate is useful only if the completed work leads to measurable mastery. In the same way that our guide to building a citation-ready content library values reliable sourcing over volume, school leaders should prioritise reliable instructional output over large numbers of interactions.
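
One way to operationalise this is to put completion and mastery side by side. The sketch below uses made-up per-pupil records to flag the pattern leaders should worry about: lots of finished sessions with little movement in scores.

```python
# A rough signal check on hypothetical per-pupil records: completion tells
# you the tool was used; pre/post scores tell you whether it taught anything.

pupils = [
    {"name": "A", "assigned": 10, "done": 10, "pre": 42, "post": 44},
    {"name": "B", "assigned": 10, "done": 9,  "pre": 55, "post": 67},
    {"name": "C", "assigned": 10, "done": 8,  "pre": 38, "post": 39},
]

completion = sum(p["done"] for p in pupils) / sum(p["assigned"] for p in pupils)
mastery_gain = sum(p["post"] - p["pre"] for p in pupils) / len(pupils)

print(f"Completion rate: {completion:.0%}")          # 90%
print(f"Mean mastery gain: {mastery_gain:.1f} pts")  # 5.0 pts

# The pattern to watch for: high completion, flat scores.
if completion > 0.80 and mastery_gain < 2.0:
    print("Flag: high completion but flat mastery, review task quality")
```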

6.3 Week 9 to 12: decide whether to scale, adapt, or stop

By the end of the 90-day period, the school should be able to answer three questions: Did pupils benefit? Did staff find it workable? Did the cost match the value delivered? If the answer is yes, scale carefully with the lessons learned from the pilot. If the answer is mixed, adapt the rollout and gather more data. If the answer is no, stop early and redeploy the budget elsewhere.

This discipline helps schools avoid the trap of sunk-cost thinking. Once a vendor is procured, there is often pressure to justify the purchase even if the fit is poor. But strong leaders know that stopping a weak intervention is part of responsible stewardship. That mindset is similar to the practical judgement in our guide to vendor checklists for AI tools, where the goal is to protect data and avoid long-term lock-in.

7. Building a procurement scorecard that governors will trust

7.1 Turn subjective impressions into scored criteria

To make decisions defensible, leaders should convert the evaluation into a scorecard with weighted criteria. A sensible model might include curriculum alignment, evidence quality, safeguarding, scalability, implementation effort, reporting, and price. Each criterion should be scored against agreed descriptors so that “good” means the same thing across the panel. This reduces the risk that the most persuasive salesperson wins instead of the most effective product.
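
The weighting itself can be tiny and transparent. The sketch below is illustrative only: the criteria mirror this section, but the weights, score bands, and the sample vendor ratings are assumptions a panel would replace with its own agreed values.

```python
# A minimal weighted-scorecard sketch. The criteria mirror this section,
# but the weights, score bands, and the sample ratings are assumptions
# for a panel to replace with its own agreed values.

weights = {
    "curriculum_alignment": 0.25,
    "evidence_quality":     0.20,
    "safeguarding":         0.20,
    "scalability":          0.10,
    "implementation":       0.10,
    "reporting":            0.05,
    "price":                0.10,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must total 100%

def weighted_score(scores: dict[str, int]) -> float:
    """Combine panel ratings (1 = weak to 5 = strong) into one comparable number."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)

vendor_a = {"curriculum_alignment": 5, "evidence_quality": 4, "safeguarding": 5,
            "scalability": 3, "implementation": 3, "reporting": 4, "price": 2}

print(f"Vendor A: {weighted_score(vendor_a):.2f} / 5")  # 4.05 / 5
```

Whatever weights you choose, record them before the demos begin, so every vendor is judged against the same yardstick rather than the most recent presentation.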

Scorecards are especially useful in MAT settings because they make comparisons transparent across schools. They also help governors understand why a more expensive option may still be the better value if it has stronger evidence or lower labour costs. The important thing is not to chase a perfect score, but to make sure the reasons for selection are explicit, documented, and reviewable later.

7.2 Keep the focus on measurable educational outcomes

When school leaders assess AI tutoring claims, they should resist drifting into vague language about innovation. The measures that matter are the ones connected to learning: improved retention, fewer misconceptions, better confidence, higher independence, and more efficient intervention delivery. If a platform cannot plausibly affect one of these outcomes, it is probably not worth buying. That does not mean every intervention needs a randomised trial, but it does mean leaders should require a clear theory of change.

If you want a useful analogue for outcome-led thinking, our article on AI agents and supply chain chaos shows how technologies are judged by whether they improve real operational flow, not just whether they automate activity. School procurement should follow the same principle: automation is only valuable if it improves teaching and learning.

7.3 Use peer evidence, but verify fit

One school’s success can be informative without being decisive. Ask for references from schools with similar demographics, attainment profiles, and staffing capacity. Then verify whether the implementation conditions match yours. A product that works beautifully in a well-resourced academy with a large intervention team may struggle in a small primary school or a trust where digital maturity is uneven. Peer evidence is most helpful when it is context-rich.

This is why school leaders should triangulate vendor claims with independent evidence, peer references, and their own pilot results. That combination is stronger than any one source alone. It also makes the final procurement decision easier to defend when stakeholders ask hard questions about price, outcomes, and safeguarding.

8. The bottom line for heads and MAT leads

8.1 Don’t buy AI tutoring because it sounds modern

The most persuasive AI tutoring products are not necessarily the most powerful; they are the ones that prove they can help the right pupils, in the right context, safely and repeatedly. School leaders should focus on curriculum alignment, evidence, safeguarding, scalability, and total cost rather than being swayed by polished demos or broad claims of transformation. If the vendor cannot answer detailed questions clearly, that is itself a useful answer.

Used well, AI tutoring can extend capacity, support intervention, and improve access to high-quality practice. Used poorly, it becomes another subscription that adds complexity without improving outcomes. The difference lies in procurement discipline. A rigorous checklist does not slow innovation; it makes innovation usable.

8.2 A smart buying decision is one you can explain a year later

Good procurement survives contact with reality. Twelve months after purchase, you should still be able to explain why the product was selected, what evidence supported the decision, how safeguarding was handled, and what impact it had on pupils. If you cannot tell that story clearly, the purchase probably relied too heavily on price or praise. The best leaders create decisions that are both ambitious and defensible.

As AI tutoring continues to evolve, the schools that benefit most will be those that ask the sharpest questions before they buy. Keep the checklist in front of you, involve the right safeguarding and curriculum leads, and treat every claim as something to verify. That is how you turn an AI promise into classroom value.

Pro tip: If a vendor cannot show curriculum mapping, safeguarding documentation, and a credible impact method in the same meeting, pause the procurement. The strongest products can withstand scrutiny.

FAQ: Evaluating AI tutoring claims in schools

What should school leaders ask first when reviewing an AI tutor?

Start with curriculum alignment. Ask exactly which year groups, topics, and assessment objectives the tutor supports, and request examples tied to your schemes of work. If the product cannot show this clearly, it is not ready for school procurement.

How do we judge whether the evidence is trustworthy?

Look for methodology, sample size, duration, and context. Prefer evidence that explains how outcomes were measured and what the baseline was. Be cautious with headline percentages that do not show whether the gains were sustained or replicable.

What safeguarding issues matter most?

Data handling, pupil-facing conversation controls, moderation, escalation, and logging are all essential. Leaders should confirm where data is stored, whether it is used for training, and how inappropriate interactions are blocked or reported.

How can we tell if an AI tutor is good value for money?

Compare total cost, not just licence price. Include implementation, staff time, reporting, support, and any integration work. A slightly higher-priced tool can be better value if it reduces workload and shows stronger impact evidence.

Should we pilot before purchasing trust-wide?

Yes. A short, structured pilot with defined success criteria is the safest way to test classroom fit, safeguarding, and adoption. Use a small group of schools or classes, collect both quantitative and qualitative data, and decide whether to scale only after reviewing the results.

Where does Skye fit in this market?

Skye is notable for its fixed-price, unlimited AI maths tutoring model, which makes it attractive for schools and MATs seeking scalable intervention. As with any product, leaders should still test curriculum fit, evidence, implementation effort, and safeguarding before committing.



Daniel Mercer

Senior EdTech Editor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
