Seismic event detection and phase picking form the basis of many seismological workflows. In recent years, several publications have demonstrated that deep learning approaches significantly outperform classical approaches, achieving human‐like performance under certain circumstances. However, as studies differ in their datasets and evaluation tasks, it is unclear how the different approaches compare to each other. Furthermore, there are no systematic studies of model performance in cross‐domain scenarios, that is, when models are applied to data with different characteristics. Here, we address these questions by conducting a large‐scale benchmark. We compare six previously published deep learning models on eight datasets covering local to teleseismic distances and on three tasks: event detection, phase identification and onset time picking. We also compare the results to a classical Baer‐Kradolfer picker. Overall, we observe the best performance for EQTransformer, GPD and PhaseNet, with a small advantage for EQTransformer on teleseismic data. In addition, we conduct a cross‐domain study, analyzing model performance on datasets the models were not trained on. We show that trained models can be transferred between regions with only mild performance degradation, but models trained on regional data do not transfer well to teleseismic data. As deep learning for detection and picking is a rapidly evolving field, we ensured extensibility of our benchmark by building our code on standardized frameworks and making it openly accessible. This allows model developers to easily evaluate new models or performance on new datasets. Finally, we make all trained models available through the SeisBench framework, giving end‐users an easy way to apply these models.

Plain Language Summary: The first step in many seismological workflows is identifying whether a signal contains an earthquake, and which type of seismic wave arrived at which time.
These steps are known as event detection, phase identification and phase picking. In recent years, machine learning methods, in particular deep learning methods, have been developed that show promising performance on these tasks. However, these models have so far not been compared systematically and quantitatively. Here, we evaluate the performance of six deep learning models on eight datasets. Additionally, we compare them to a traditional picking algorithm that does not use machine learning. Our results show that the models EQTransformer, GPD and PhaseNet perform best. As in many use cases no picker trained on the target region will be available, we further evaluated how well models transfer across regions. We found that transfer across regions works well as long as the distance ranges stay similar. To foster application of the results, we make all our trained models available through the SeisBench framework.

Key Points:
We conducted a large‐scale benchmark of machine learning pickers using six models and eight datasets.
Best overall performance is observed for EQTransformer, GPD and PhaseNet, with advantages for EQTransformer on teleseismic distances.
Models transfer well between different regions with similar distances, but not between regional and teleseismic distances.

[ABSTRACT FROM AUTHOR]
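
To make the three benchmark tasks (event detection, phase identification and onset time picking) more concrete, the following minimal Python sketch shows one way such tasks could be scored. The abstract does not state which metrics the benchmark uses, so the choices here (precision, recall and F1 for detection, accuracy for phase identification, residual statistics for picking) are assumptions for illustration only. The function names are hypothetical and are not part of SeisBench.

```python
import math


def detection_scores(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = event, 0 = noise).

    Assumed metric choice; the benchmark itself may score detection differently.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # events found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # noise flagged as event
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # events missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def phase_accuracy(labels_true, labels_pred):
    """Fraction of phase labels (e.g. "P" or "S") that were identified correctly."""
    correct = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    return correct / len(labels_true)


def pick_residual_stats(t_true, t_pred):
    """Summary statistics of onset-time residuals (predicted minus reference), in seconds."""
    residuals = [p - t for t, p in zip(t_true, t_pred)]
    n = len(residuals)
    return {
        "mean": sum(residuals) / n,                         # systematic bias
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }
```

In a cross‐domain comparison of the kind described above, the same functions would be applied to the output of a model trained on one dataset and evaluated on another, so that the scores can be arranged in a training‐dataset by test‐dataset table.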