Deep learning (DL) methods accurately predict gene expression levels from genomic DNA, promising to serve as an important tool for interpreting the full spectrum of genetic variation in personal genomes. However, systematic benchmarking is needed to assess the gap in their utility as personal DNA interpreters. Using paired whole-genome sequencing and gene expression data, we evaluate DL sequence-to-expression models and find that they fail to make correct predictions at a substantial number of genomic loci because they cannot correctly determine the direction of variant effects. This failure highlights the limits of the current model training paradigm.