The business problem
US colleges and universities lose $16.5 billion annually in revenue from student dropout. The average 6-year graduation rate for 4-year institutions is only 62%. For every student who drops out, the institution loses tuition revenue and the student accumulates debt without a credential. Early identification of at-risk students (before midterms, ideally within the first 3 weeks) gives advisors time to intervene effectively.
Why flat ML fails
- No course context: A C+ in Organic Chemistry is very different from a C+ in Introduction to Art. The course graph captures difficulty, prerequisites, and instructor effectiveness.
- No peer effects: Students in strong study groups and cohorts outperform isolated students with identical grades. The social/academic graph captures these peer effects.
- No pathway awareness: A student who skipped a prerequisite and is now struggling in the advanced course needs different support than one who completed the full prerequisite chain. The course graph encodes these pathways.
- No instructor signal: Some instructors have consistently better student outcomes. The student-course-instructor graph captures this quality signal and its interaction with student backgrounds.
The relational schema
Node types:
Student (id, gpa, credits_earned, financial_aid, first_gen)
Course (id, department, level, avg_grade, dfw_rate)
Instructor (id, department, tenure, avg_eval, years_exp)
Program (id, department, degree_type, avg_completion_rate)
Edge types:
Student --[enrolled_in]--> Course (grade, semester)
Course --[taught_by]--> Instructor (semester, section)
Course --[prerequisite]--> Course
Student --[in_program]--> Program
Student --[study_with]--> Student (shared_courses)

Students, courses, instructors, and programs form the enrollment graph. Study-with edges capture peer cohort effects.
PyG architecture: HeteroConv for enrollment data
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, HeteroConv, Linear

class StudentOutcomeGNN(torch.nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        # Lazy Linear(-1, ...) layers project each node type's raw
        # features (different widths per type) into a shared hidden size.
        self.student_lin = Linear(-1, hidden_dim)
        self.course_lin = Linear(-1, hidden_dim)
        self.instructor_lin = Linear(-1, hidden_dim)
        self.program_lin = Linear(-1, hidden_dim)
        # Messages flow along the edge directions listed here; to let
        # course and instructor signal reach the student embeddings, add
        # reverse edges in preprocessing (e.g. with
        # torch_geometric.transforms.ToUndirected).
        self.conv1 = HeteroConv({
            ('student', 'enrolled_in', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('course', 'taught_by', 'instructor'): SAGEConv(hidden_dim, hidden_dim),
            ('course', 'prerequisite', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'in_program', 'program'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='mean')
        self.conv2 = HeteroConv({
            ('student', 'enrolled_in', 'course'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'study_with', 'student'): SAGEConv(hidden_dim, hidden_dim),
            ('student', 'in_program', 'program'): SAGEConv(hidden_dim, hidden_dim),
        }, aggr='mean')
        self.classifier = Linear(hidden_dim, 1)

    def forward(self, x_dict, edge_index_dict):
        # Per-type input projections.
        x_dict['student'] = self.student_lin(x_dict['student'])
        x_dict['course'] = self.course_lin(x_dict['course'])
        x_dict['instructor'] = self.instructor_lin(x_dict['instructor'])
        x_dict['program'] = self.program_lin(x_dict['program'])
        # Two rounds of heterogeneous message passing.
        x_dict = {k: F.relu(v) for k, v in
                  self.conv1(x_dict, edge_index_dict).items()}
        x_dict = self.conv2(x_dict, edge_index_dict)
        # One persistence probability per student.
        return torch.sigmoid(self.classifier(x_dict['student']).squeeze(-1))

HeteroConv aggregates course difficulty, instructor quality, peer performance, and program context. Two hops capture prerequisite-chain effects.
Expected performance
- GPA-based heuristic: ~58 AUROC
- LightGBM (flat features): 62.44 AUROC
- GNN (enrollment graph): 75.83 AUROC
- KumoRFM (zero-shot): 76.71 AUROC
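AUROC figures like those above can be checked from raw scores in a few lines. A minimal rank-based sketch (equivalent to the Mann-Whitney statistic; for simplicity it ignores tied scores):

```python
import torch

def auroc(scores: torch.Tensor, labels: torch.Tensor) -> float:
    """AUROC as the probability that a random positive example outranks
    a random negative one; assumes no tied scores."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    wins = (pos.unsqueeze(1) > neg.unsqueeze(0)).float().sum()
    return (wins / (len(pos) * len(neg))).item()

scores = torch.tensor([0.9, 0.8, 0.3, 0.2])
labels = torch.tensor([1, 0, 1, 0])
print(auroc(scores, labels))  # 0.75
```

The pairwise comparison is O(n_pos * n_neg); for large cohorts, a sorted-rank implementation (or sklearn's roc_auc_score) is the usual choice.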
Or use KumoRFM in one line
PREDICT will_persist FOR student
USING student, enrollment, course, instructor, program

One PQL query. KumoRFM constructs the enrollment graph from your SIS data and predicts student persistence.