Similar exercise retrieval aims to find, in a database, exercises whose testing goals are similar to those of a given query exercise. With the continuing growth of online education, exercise databases are becoming ever larger, and the professional nature of exercise data makes annotating relevance very difficult; an efficient, annotation-free similar exercise retrieval model is therefore needed. Unsupervised semantic hashing can map high-dimensional data to compact, efficient binary representations without supervised signals. However, a semantic hashing model cannot simply be applied to similar exercise retrieval, because exercise data carries rich semantic information while the representation space of binary vectors is limited. To this end, a similar exercise retrieval model that acquires and retains key information is proposed. Firstly, a key information acquisition module is designed to extract the key information of exercise data, and a de-redundancy objective is introduced to remove redundant information. Secondly, a time-varying activation function is introduced in the encoding process to reduce encoding information loss. Thirdly, to make maximal use of the Hamming space, a bit balance objective and a bit independence objective are introduced during optimization to improve the distribution of the binary representations. Experimental results on the MATH and HISTORY datasets show that, compared with the best-performing text semantic hashing model DHIM (Deep Hash InfoMax), the proposed model achieves average improvements of about 54% and 23% on the two datasets across three recall settings; in terms of retrieval efficiency, the proposed model has a clear advantage over the best similar exercise retrieval model, QuesCo.
Finding similar exercises aims to retrieve, from an exercise database, exercises whose testing goals are similar to those of a given query exercise. As online education evolves, exercise databases grow ever larger, and the professional nature of exercise data makes annotating relevance difficult; online education systems therefore require an efficient, unsupervised model for finding similar exercises. Unsupervised semantic hashing can map high-dimensional data to compact, efficient binary representations without supervised signals. However, a semantic hashing model cannot simply be applied to similar exercise retrieval, because exercise data contains rich semantic information while the representation space of binary vectors is limited. To address this issue, a similar exercise retrieval model that acquires and retains crucial information was introduced. Firstly, a crucial information acquisition module was designed to extract critical information from exercise data, and a de-redundancy objective loss was proposed to eliminate redundant information. Secondly, a time-aware activation function was introduced to reduce encoding information loss. Thirdly, to maximize utilization of the Hamming space, a bit balance loss and a bit independence loss were introduced to optimize the distribution of the binary representations during training. Experimental results on the MATH and HISTORY datasets demonstrate that the proposed model outperforms the state-of-the-art text semantic hashing model Deep Hash InfoMax (DHIM), with average improvements of approximately 54% and 23% on the two datasets respectively across three recall settings. Moreover, compared with QuesCo, the best-performing similar exercise retrieval model, the proposed model shows a clear advantage in retrieval efficiency.
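The abstract names three ingredients without giving their formulas: a time-varying activation that anneals toward the sign function, a bit balance objective, and a bit independence objective. The sketch below shows one common formulation of these ideas in plain NumPy; the function names, the scaled-tanh schedule, and the exact loss expressions are illustrative assumptions, not the paper's actual definitions.

```python
import numpy as np

def scaled_tanh(x, beta):
    """Time-varying activation: tanh(beta * x).

    As training proceeds, beta is typically increased so the smooth
    activation approaches the discrete sign(x) used at retrieval time,
    which reduces the quantization gap between training and hashing.
    """
    return np.tanh(beta * x)

def bit_balance_loss(codes):
    """Bit balance: each bit should be +1 and -1 equally often.

    codes: (n, k) batch of relaxed binary codes in [-1, 1].
    Penalizes the squared per-bit mean; zero when every bit is balanced,
    so each bit carries up to one full bit of information.
    """
    return float(np.sum(codes.mean(axis=0) ** 2))

def bit_independence_loss(codes):
    """Bit independence: different bits should be uncorrelated.

    Penalizes the deviation of the (normalized) Gram matrix from the
    identity, discouraging redundant bits and spreading codes over
    the Hamming space.
    """
    n, k = codes.shape
    gram = codes.T @ codes / n
    return float(np.sum((gram - np.eye(k)) ** 2))

# Tiny example: 4 codes of 2 bits that are perfectly balanced
# and decorrelated, so both penalties vanish.
codes = np.array([[ 1., -1.],
                  [-1.,  1.],
                  [ 1.,  1.],
                  [-1., -1.]])
```

With such objectives the final binary code is usually obtained as `sign(h)` of the pre-activation `h`; the annealed `scaled_tanh` keeps training differentiable while converging toward that discrete output.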