Transformer regression model overfits on single sample but fails to further reduce loss on a 50-sample dataset

📰 Reddit r/deeplearning

My task is to forecast the number of upvotes a Reddit post has at time t after posting (t = hours since it was posted), based on the post's text, title, and t. The current architecture is basically a stack of transformer encoders taking the text as input, followed by a linear network that takes 'how long ago it was posted' together with the encoder's outputs and produces the regression value. This architecture worked fine for a small dataset (n=2, 1 for training).
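The described architecture (transformer encoder over the text, then a linear head on the pooled features plus the elapsed-time input) can be sketched in PyTorch roughly as follows. All names, layer sizes, and the mean-pooling choice are assumptions for illustration, not the OP's actual code:

```python
import torch
import torch.nn as nn

class UpvoteRegressor(nn.Module):
    """Hypothetical sketch: transformer encoder over token IDs, then a
    linear head on [pooled text features, hours-since-posting]."""
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # +1 input feature for the elapsed time t (hours since posting)
        self.head = nn.Linear(d_model + 1, 1)

    def forward(self, token_ids, hours_ago):
        # token_ids: (batch, seq_len); hours_ago: (batch, 1)
        h = self.encoder(self.embed(token_ids))  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                   # mean-pool over tokens
        # concatenate the time feature with the text representation
        return self.head(torch.cat([pooled, hours_ago], dim=1))  # (batch, 1)

model = UpvoteRegressor()
tokens = torch.randint(0, 1000, (8, 16))  # a batch of 8 token sequences
t = torch.rand(8, 1) * 24.0               # hours since posting
pred = model(tokens, t)
print(pred.shape)  # torch.Size([8, 1])
```

With a setup like this, overfitting a single sample but stalling on 50 samples is the classic sanity-check outcome: the model has enough capacity to memorize one point, so the plateau on 50 usually points at optimization or input-representation issues rather than capacity.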

Published 14 Apr 2026