Transformer regression model overfits on single sample but fails to further reduce loss on a 50-sample dataset
📰 Reddit r/deeplearning
My task is to forecast the number of upvotes a Reddit post will have at time t after posting (t = hours since it was posted), based on the post's text/title and t. The current architecture is a transformer encoder taking the text as input, followed by a linear network that takes the encoder's outputs together with 'how long ago it was posted' and outputs the regression value. This architecture worked fine for a tiny dataset (n=2, 1 for training).
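For reference, a minimal sketch of the architecture described above (module names, vocabulary size, and dimensions are illustrative assumptions, not the OP's actual code):

```python
import torch
import torch.nn as nn

class UpvoteRegressor(nn.Module):
    """Sketch: transformer encoder over token embeddings, mean-pooled,
    concatenated with hours-since-posting, fed through a linear head.
    All hyperparameters here are placeholder assumptions."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Linear head sees the pooled text representation plus the scalar t.
        self.head = nn.Sequential(
            nn.Linear(d_model + 1, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
        )

    def forward(self, tokens, hours):
        # tokens: (batch, seq_len) token ids; hours: (batch, 1) float
        x = self.encoder(self.embed(tokens))   # (batch, seq_len, d_model)
        pooled = x.mean(dim=1)                 # (batch, d_model)
        return self.head(torch.cat([pooled, hours], dim=-1)).squeeze(-1)

model = UpvoteRegressor()
out = model(torch.randint(0, 1000, (2, 16)), torch.rand(2, 1))
print(out.shape)  # one scalar prediction per post in the batch
```

A model like this can memorize a single training sample easily, which is consistent with the single-sample result described; failing to fit 50 samples usually points to capacity, learning rate, or input-representation issues rather than the overall wiring.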