The original Transformer paper. Self-attention, no recurrence, parallelizable. The architecture that enabled GPT, BERT, and everything since.
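
As a minimal sketch of the core mechanism, here is scaled dot-product self-attention, softmax(QK^T / sqrt(d_k)) V, per the paper's definition. The toy dimensions, random weights, and function name are illustrative, not from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output is a weighted sum of values

# Self-attention: Q, K, V are projections of the same sequence X,
# so every position attends to every other in one parallel step.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # hypothetical toy sizes
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8): one context-mixed vector per position
```

Note that nothing here is sequential over positions: the whole attention map is one matrix product, which is what makes the architecture parallelizable where RNNs are not.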