FERPA, De-identification, and AI
FERPA only protects PII contained in student education records. This means that FERPA allows de-identified student data to be shared with any party, including parents, researchers, and the general public, for any purpose without consent (34 CFR § 99.31(b)(1)). There are many social benefits that can come from sharing de-identified student data. For example, de-identified student data reported at the aggregate level is often helpful for institutional leaders and policymakers engaged in evidence-based decision making. However, there is also a serious risk that unfamiliarity with FERPA’s de-identification requirements could lead schools to improperly share pseudonymized student data that can be combined with other information and used to re-identify a student.
The Department of Education’s Privacy Technical Assistance Center (PTAC), which provides guidance and resources for stakeholders about protecting student privacy, defines de-identification as the “process of removing or obscuring any personally identifiable information from student records in a way that minimizes the risk of unintended disclosure of the identity of individuals and information about them.” This may sound (mostly) straightforward, but de-identification is a confusing process, and it is becoming more difficult in light of new technologies. Many try to meet the de-identification standard by pseudonymizing data: removing or obscuring direct identifiers while retaining indirect identifiers. However, pseudonymization does not meet FERPA’s de-identification standard, and actors who rely on it, however well-intentioned, will fail to protect students to the standard that FERPA requires.
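As a rough illustration (using hypothetical field names, not any particular system’s schema), the sketch below shows what pseudonymization amounts to in practice: direct identifiers are dropped, but the indirect identifiers remain in the data untouched.

```python
# Hypothetical student record containing both direct and indirect identifiers.
record = {
    "name": "Jane Doe",            # direct identifier
    "student_id": "S1234567",      # direct identifier
    "library_visits": 12,          # indirect identifier
    "campus_events_attended": 4,   # indirect identifier
    "assignment_grade": "B+",      # indirect identifier
}

DIRECT_IDENTIFIERS = {"name", "student_id"}

# Pseudonymization as described above: strip direct identifiers, keep everything else.
# The remaining indirect identifiers are unchanged, which is why this step alone
# does not satisfy FERPA's de-identification standard.
pseudonymized = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
print(pseudonymized)
```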
Figure: A Visual Guide to Practical De-Identification, Future of Privacy Forum, April 2016
The FERPA standard for de-identification assesses whether a “reasonable person in the school community who does not have personal knowledge of the relevant circumstances” could identify individual students in a dataset based on reasonably available information, such as previously published data, a widely known event, communications, or other similar factors. More information on the de-identification standard in FERPA is available in this resource from the Department of Education.
Consider a scenario in which a higher education institution uses an AI tool to analyze student data to predict academic performance and then publishes, without consent, an aggregate dataset of the information the tool collected about students. The AI tool analyzes only indirect identifiers, such as the frequency of library visits, the number of online resources accessed, participation in campus events, and assignment grades. Individually, these data points are unlikely to reveal a student’s identity. However, even though names and student IDs are not used, these elements can form a unique pattern that identifies a specific student. If a reporter could look up on the university’s website that only one or two students ran track while also participating in a school play, and then use that information to find how a specific student performed on an assignment, the published dataset is not sufficiently de-identified under FERPA. However, the fact that an individual in the dataset can be re-identified by another person based solely on personal knowledge (for example, a student who knows how many times their best friend has gone to the library) does not necessarily mean that the information is insufficiently de-identified under FERPA.
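The sketch below illustrates the mechanism in this scenario with a small, entirely fictional set of records: even without names or student IDs, a combination of indirect identifiers can be unique to a single row, so anyone with the right background knowledge can link that row to a person.

```python
# Hypothetical published records: no names or student IDs, only indirect identifiers.
records = [
    {"library_visits": 12, "ran_track": True,  "in_school_play": True,  "assignment_grade": "B+"},
    {"library_visits": 30, "ran_track": False, "in_school_play": True,  "assignment_grade": "A"},
    {"library_visits": 12, "ran_track": False, "in_school_play": False, "assignment_grade": "C"},
    {"library_visits": 5,  "ran_track": True,  "in_school_play": False, "assignment_grade": "A-"},
]

# Background knowledge a reporter might find on the university's website:
# only one student both ran track and appeared in the school play.
matches = [r for r in records if r["ran_track"] and r["in_school_play"]]

if len(matches) == 1:
    # The "anonymous" record is now linked to a specific, identifiable student,
    # exposing that student's assignment grade.
    print("Re-identified record:", matches[0])
```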
PTAC recommends that schools consider other public information to determine whether students may be re-identified by combining information from other sources. For example, routinely disclosed directory information could be combined with de-identified data in a way that allows for re-identification. A student’s date and place of birth may be disclosed as directory information (e.g., May 3, 2000, Seattle, WA), and de-identified data could be released that groups students by area of birth. If only one student was born in the Pacific Northwest, then that student can be re-identified by combining the de-identified data with directory information. Because PII often includes both direct and indirect identifiers, removing direct identifiers alone is not sufficient to de-identify data. Rather, proper de-identification “involves removing or obscuring all identifiable information until all data that can lead to individual identification have been expunged or masked.” De-identification, however, is not foolproof: studies have shown that anonymized data can be matched back to an individual (“re-identification”). The Department of Education has specifically noted that “the risk of re-identification may be greater for student data than other information” because of the large amount of student data that is disclosed, including de-identified data and directory information. Re-identifying a student may reveal seemingly innocuous information, such as their grades; however, it can also reveal deeply sensitive information, such as disability status or citizenship status.
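To make the directory-information example concrete, the sketch below (with invented names, dates, and field labels) shows how two releases that each look harmless on their own can be joined to attach a name to a “de-identified” record when only one student falls into a reported group.

```python
# Hypothetical directory information routinely disclosed by the institution.
directory = [
    {"name": "Student A", "birth_date": "2000-05-03", "birth_place": "Seattle, WA"},
    {"name": "Student B", "birth_date": "2001-11-20", "birth_place": "Austin, TX"},
    {"name": "Student C", "birth_date": "1999-02-14", "birth_place": "Miami, FL"},
]

# Hypothetical "de-identified" release that groups students by region of birth.
deidentified = [
    {"birth_region": "Pacific Northwest", "gpa": 3.9},
    {"birth_region": "Southwest",         "gpa": 3.1},
    {"birth_region": "Southeast",         "gpa": 2.8},
]

PACIFIC_NW_PLACES = {"Seattle, WA", "Portland, OR", "Spokane, WA"}

# If only one student in the directory was born in the Pacific Northwest,
# the two releases can be joined to link a name to a "de-identified" GPA.
pnw_students = [d for d in directory if d["birth_place"] in PACIFIC_NW_PLACES]
pnw_rows = [r for r in deidentified if r["birth_region"] == "Pacific Northwest"]

if len(pnw_students) == 1 and len(pnw_rows) == 1:
    print(f"{pnw_students[0]['name']} re-identified with GPA {pnw_rows[0]['gpa']}")
```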
In the context of AI systems that collect large amounts of student data, the risk of re-identification grows as the amount of data released, whether identifiable or de-identified, increases. Because AI systems are trained on massive amounts of data, they are especially capable of recognizing patterns and establishing connections among seemingly disconnected data points. Additionally, anonymization methods that are effective today may not be robust enough to withstand future advancements, making it challenging to ensure long-term data privacy and protection under FERPA. This increased risk of re-identification requires institutions to carefully consider whether their processes for de-identification still meet FERPA’s standards.