This is a small generative pre-trained transformer (GPT) that generates new text in the style of the dataset it was trained on (Shakespeare, in this case). It is based on the transformer architecture from the paper "Attention Is All You Need".
I couldn't get as far as showing the final loss or producing the more.txt file with the GPT's generated text, because I didn't have access to a good GPU and was running this on a MacBook.
This is basically an implementation of the self-attention, batch norm/layer norm, and feed-forward mechanisms of the decoder part of the transformer architecture. You can use it with any other text dataset.
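To make the decoder pieces named above more concrete, here is a minimal sketch of one decoder block: causal self-attention, layer norm, and a feed-forward layer, joined by residual connections. It is an illustration, not the code in this repository. It makes several assumptions: it uses plain Python lists instead of a tensor library so it runs anywhere, a single attention head, layer norm without learnable scale and shift, a pre-norm layout (normalizing before each sub-layer), and hypothetical names such as `decoder_block`, `wq`, `wk`, `wv`, `w1`, and `w2`.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), with matrices as lists of rows.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    # Element-wise sum, used for the residual connections.
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_self_attention(x, wq, wk, wv):
    # x holds one embedding (row) per token; wq, wk, wv are projection matrices.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for t, q_t in enumerate(q):
        # Causal mask: token t only attends to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance.
    # Assumption: no learnable gain or bias, to keep the sketch short.
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def feed_forward(x, w1, w2):
    # Position-wise two-layer network with a ReLU in between.
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def decoder_block(x, wq, wk, wv, w1, w2):
    # Assumption: pre-norm residual layout (norm, sub-layer, then add).
    x = add(x, causal_self_attention(layer_norm(x), wq, wk, wv))
    x = add(x, feed_forward(layer_norm(x), w1, w2))
    return x
```

Calling `decoder_block(x, wq, wk, wv, w1, w2)` with `x` as a list of token embeddings returns a list of the same shape. Because of the causal mask, the output at each position depends only on that token and the ones before it, which is what lets the model generate text one token at a time.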