Constructing benchmark datasets for supervised learning requires assigning a label or class to each datapoint. When the label is not taken directly from a reference source, the dataset constructor must assign it. In transporter substrate prediction, each protein is assigned a class during dataset construction that reflects the substrate it transports across the biological membrane. This substrate class assignment is typically carried out through a manual curation process whose details are not explained. Biological databases are continually growing and many entries are updated; automating the data collection stage is therefore desirable. This work aims to automate the collection of transporter substrate data in a consistent and reproducible manner and to eliminate reliance on the judgment of an external dataset curator. To achieve this, we propose an automated tool that assigns a substrate class using available annotations and delegates the broader class assignment to previously established ontologies. We use two case studies to evaluate the tool and to analyze the number of substrates available in current biological databases.