We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set $s_1,\dots,s_n \in \mathbb{R}^p$ with corresponding responses $t_1,\dots,t_n \in \mathbb{R}^q$, fitting a $k$-layer neural network $\nu_\theta : \mathbb{R}^p \to \mathbb{R}^q$ involves estimation of the weights $\theta \in \mathbb{R}^m$ via an ERM:
\[
\inf_{\theta \in \mathbb{R}^m} \sum_{i=1}^n \lVert t_i - \nu_\theta(s_i) \rVert_2^2.
\]
We show that even for $k = 2$, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. In addition, we deduce that if one attempts to minimize such a loss function in the event that its infimum is not attainable, it necessarily results in values of $\theta$ diverging to $\pm\infty$. We will show that for the smooth activations $\sigma(x) = 1/(1 + \exp(-x))$ and $\sigma(x) = \tanh(x)$, such failure to attain an infimum can happen on a positive-measured subset of responses. For the ReLU activation $\sigma(x) = \max(0, x)$, we completely classify cases where the ERM for a best two-layer neural network approximation attains its infimum. In recent applications of neural networks, where overfitting is commonplace, the failure to attain an infimum is avoided by ensuring that the system of equations $t_i = \nu_\theta(s_i)$, $i = 1,\dots,n$, has a solution. For a two-layer ReLU-activated network, we will show when such a system of equations has a solution generically, i.e., when such a neural network can be fitted perfectly with probability one.
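The divergence phenomenon described above can be illustrated numerically. The following is a minimal sketch, not a result from the paper but a standard construction consistent with its claims: a two-layer sigmoid network with two hidden neurons, evaluated along an explicit parameter curve that realizes the divided difference $j\bigl(\sigma(s + 1/j) - \sigma(s)\bigr) \to \sigma'(s)$. With responses chosen as $t_i = \sigma'(s_i)$, the ERM loss along this curve tends to zero while $\lVert\theta\rVert$ tends to infinity; this does not by itself prove that the infimum is unattained for this particular data set, it only exhibits the mechanism by which driving the loss down forces the weights to diverge.

```python
# Hedged numerical sketch (illustrative construction, not taken from the paper).
# Two-layer sigmoid network with two hidden neurons:
#     nu_theta(s) = a1*sigma(w1*s + b1) + a2*sigma(w2*s + b2) + c,
# evaluated along the explicit curve
#     a1 = j, w1 = 1, b1 = 1/j,  a2 = -j, w2 = 1, b2 = 0,  c = 0,
# which realizes j*(sigma(s + 1/j) - sigma(s)) -> sigma'(s) as j -> infinity.

import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigma(x):
    return sigma(x) * (1.0 - sigma(x))  # derivative of the sigmoid

# training inputs and responses t_i = sigma'(s_i)
s = np.array([-1.0, 0.0, 0.5, 2.0])
t = d_sigma(s)

for j in [1e1, 1e2, 1e3, 1e4]:
    # theta_j = (a1, w1, b1, a2, w2, b2, c)
    theta = np.array([j, 1.0, 1.0 / j, -j, 1.0, 0.0, 0.0])
    a1, w1, b1, a2, w2, b2, c = theta
    nu = a1 * sigma(w1 * s + b1) + a2 * sigma(w2 * s + b2) + c
    loss = np.sum((t - nu) ** 2)
    # loss decreases like O(1/j^2) while ||theta_j|| grows like sqrt(2)*j
    print(f"j = {j:8.0f}   loss = {loss:.3e}   ||theta|| = {np.linalg.norm(theta):.3e}")
```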