Трансформер (модель машинного навчання) (Ukrainian Wikipedia)

Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (серпень 2019). What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (англ.). Florence, Italy: Association for Computational Linguistics: 276—286. arXiv:1906.04341. doi:10.18653/v1/W19-4828. Архів оригіналу за 21 жовтня 2020. Процитовано 20 травня 2020.

acm.org

dl.acm.org

Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (1 січня 2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research (англ.). 21 (1): 140:5485–140:5551. arXiv:1910.10683. ISSN 1532-4435.

archive.org

Schmidhuber, Jürgen (1992). Learning to control fast-weight memories: an alternative to recurrent nets. Neural Computation (англ.). 4 (1): 131—139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.

arxiv.org

Bahdanau; Cho, Kyunghyun; Bengio, Yoshua (1 вересня 2014). Neural Machine Translation by Jointly Learning to Align and Translate (англ.). arXiv:1409.0473 [cs.CL].
Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (17 серпня 2015). Effective Approaches to Attention-based Neural Machine Translation (англ.). arXiv:1508.04025 [cs.CL].
Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). Robust Speech Recognition via Large-Scale Weak Supervision (англ.). arXiv:2212.04356 [eess.AS].
Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz; Kaiser, Lukasz; Belanger, David; Colwell, Lucy; Weller, Adrian (2020). Rethinking Attention with Performers (англ.). arXiv:2009.14794 [cs.CL].
Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems. Curran Associates, Inc. 27. arXiv:1409.3215.
Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (англ.). Stroudsburg, PA, USA: Association for Computational Linguistics: 103—111. arXiv:1409.1259. doi:10.3115/v1/w14-4012. S2CID 11336213.
Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (англ.). arXiv:1412.3555 [cs.NE].
Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (1 вересня 2014). Neural Machine Translation by Jointly Learning to Align and Translate (англ.). arXiv:1409.0473 [cs.CL].
Wu, Yonghui та ін. (1 вересня 2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (англ.). arXiv:1609.08144 [cs.CL].
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 жовтня 2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (англ.). arXiv:1810.04805v2 [cs.CL].
Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (3 червня 2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (англ.). arXiv:2010.11929 [cs.CV].
Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). Conformer: Convolution-augmented Transformer for Speech Recognition (англ.). arXiv:2005.08100 [eess.AS].
Xiong, Ruibin; Yang, Yunchang; He, Di; Zheng, Kai; Zheng, Shuxin; Xing, Chen; Zhang, Huishuai; Lan, Yanyan; Wang, Liwei; Liu, Tie-Yan (29 червня 2020). On Layer Normalization in the Transformer Architecture (англ.). arXiv:2002.04745 [cs.LG].
Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (1 січня 2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research (англ.). 21 (1): 140:5485–140:5551. arXiv:1910.10683. ISSN 1532-4435.
Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (серпень 2019). What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (англ.). Florence, Italy: Association for Computational Linguistics: 276—286. arXiv:1906.04341. doi:10.18653/v1/W19-4828. Архів оригіналу за 21 жовтня 2020. Процитовано 20 травня 2020.
Shazeer, Noam (1 лютого 2020). GLU Variants Improve Transformer (англ.). arXiv:2002.05202 [cs.LG].
Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (6 червня 2022). Position Information in Transformers: An Overview. Computational Linguistics (англ.). 48 (3): 733—763. arXiv:2102.11090. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.
Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng (1 квітня 2021). RoFormer: Enhanced Transformer with Rotary Position Embedding (англ.). arXiv:2104.09864 [cs.CL].
Press, Ofir; Smith, Noah A.; Lewis, Mike (1 серпня 2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (англ.). arXiv:2108.12409 [cs.CL].
Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018). Self-Attention with Relative Position Representations (англ.). arXiv:1803.02155 [cs.CL].
Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (6 грудня 2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems (англ.). 35: 16344—16359. arXiv:2205.14135.
Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; Bosma, Maarten; Mishra, Gaurav; Roberts, Adam; Barham, Paul; Chung, Hyung Won; Sutton, Charles; Gehrmann, Sebastian; Schuh, Parker; Shi, Kensen; Tsvyashchenko, Sasha; Maynez, Joshua; Rao, Abhishek (1 квітня 2022). PaLM: Scaling Language Modeling with Pathways (англ.). arXiv:2204.02311 [cs.CL].
Leviathan, Yaniv; Kalman, Matan; Matias, Yossi (18 травня 2023), Fast Inference from Transformers via Speculative Decoding (англ.), arXiv:2211.17192
Chen, Charlie; Borgeaud, Sebastian; Irving, Geoffrey; Lespiau, Jean-Baptiste; Sifre, Laurent; Jumper, John (2 лютого 2023), Accelerating Large Language Model Decoding with Speculative Sampling (англ.), arXiv:2302.01318
Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). Reformer: The Efficient Transformer (англ.). arXiv:2001.04451 [cs.LG].
Zhai, Shuangfei; Talbott, Walter; Srivastava, Nitish; Huang, Chen; Goh, Hanlin; Zhang, Ruixiang; Susskind, Josh (21 вересня 2021). An Attention Free Transformer (англ.). arXiv:2105.14103 [cs.LG].
Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (8 листопада 2020). Long Range Arena: A Benchmark for Efficient Transformers (англ.). arXiv:2011.04006 [cs.LG].
Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (19 березня 2021). Random Feature Attention (англ.). arXiv:2103.02143 [cs.CL].
Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Belanger, David; Colwell, Lucy; Weller, Adrian (30 вересня 2020). Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (англ.). arXiv:2006.03555 [cs.LG].
Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). Robust Speech Recognition via Large-Scale Weak Supervision (англ.). arXiv:2212.04356 [eess.AS].
Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (22 червня 2021). Perceiver: General Perception with Iterative Attention (англ.). arXiv:2103.03206 [cs.CV].
Jaegle, Andrew; Borgeaud, Sebastian; Alayrac, Jean-Baptiste; Doersch, Carl; Ionescu, Catalin; Ding, David; Koppula, Skanda; Zoran, Daniel; Brock, Andrew; Shelhamer, Evan; Hénaff, Olivier (2 серпня 2021). Perceiver IO: A General Architecture for Structured Inputs & Outputs (англ.). arXiv:2107.14795 [cs.LG].
Peebles, William; Xie, Saining (2 березня 2023). Scalable Diffusion Models with Transformers (англ.). arXiv:2212.09748 [cs.CV].

coursera.org

Tasks with Long Sequences – Chatbot. Coursera (англ.). Архів оригіналу за 26 жовтня 2020. Процитовано 22 жовтня 2020.

doi.org

Hochreiter, Sepp; Schmidhuber, Jürgen (1 листопада 1997). Long Short-Term Memory. Neural Computation (англ.). 9 (8): 1735—1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
Schmidhuber, Jürgen (1992). Learning to control fast-weight memories: an alternative to recurrent nets. Neural Computation (англ.). 4 (1): 131—139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (англ.). с. 38—45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
Elman, Jeffrey L. (березень 1990). Finding Structure in Time. Cognitive Science (англ.). 14 (2): 179—211. doi:10.1207/s15516709cog1402_1. S2CID 2763403.
Banko, Michele; Brill, Eric (2001). Scaling to very very large corpora for natural language disambiguation. Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 (англ.). Morristown, NJ, USA: Association for Computational Linguistics: 26—33. doi:10.3115/1073012.1073017. S2CID 6645623.
Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (англ.). Stroudsburg, PA, USA: Association for Computational Linguistics: 103—111. arXiv:1409.1259. doi:10.3115/v1/w14-4012. S2CID 11336213.
Gruber, N.; Jockisch, A. (2020), Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?, Frontiers in Artificial Intelligence (англ.), 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
Chen, Jia; Chen, Tao; Shen, Mengqi; Shi, Yunhai; Wang, Dongjing; Zhang, Xin (1 вересня 2022). Gated three-tower transformer for text-driven stock market prediction. Multimedia Tools and Applications (англ.). 81 (21): 30093—30119. doi:10.1007/s11042-022-11908-1. ISSN 1573-7721. S2CID 247987240.
Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). Precision information extraction for rare disease epidemiology at scale. Journal of Translational Medicine (англ.). 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMC 9972634. PMID 36855134.
Assael, Yannis; Sommerschield, Thea; Shillingford, Brendan; Bordbar, Mahyar; Pavlopoulos, John; Chatzipanagiotou, Marita; Androutsopoulos, Ion; Prag, Jonathan; de Freitas, Nando (березень 2022). Restoring and attributing ancient texts using deep neural networks. Nature (англ.). 603 (7900): 280—283. Bibcode:2022Natur.603..280A. doi:10.1038/s41586-022-04448-z. ISSN 1476-4687. PMC 8907065. PMID 35264762.
Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (серпень 2019). What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (англ.). Florence, Italy: Association for Computational Linguistics: 276—286. arXiv:1906.04341. doi:10.18653/v1/W19-4828. Архів оригіналу за 21 жовтня 2020. Процитовано 20 травня 2020.
Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (6 червня 2022). Position Information in Transformers: An Overview. Computational Linguistics (англ.). 48 (3): 733—763. arXiv:2102.11090. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.

dx.doi.org

Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (англ.). Stroudsburg, PA, USA: Association for Computational Linguistics: 103—111. arXiv:1409.1259. doi:10.3115/v1/w14-4012. S2CID 11336213.

github.com

finetune-transformer-lm (англ.), OpenAI, 11 червня 2018, процитовано 1 травня 2023

googleblog.com

ai.googleblog.com

Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog (англ.). 2 листопада 2018. Архів оригіналу за 13 січня 2021. Процитовано 25 серпня 2019.
Constructing Transformers For Longer Sequences with Sparse Attention Methods. Google AI Blog (англ.). 25 березня 2021. Архів оригіналу за 18 вересня 2021. Процитовано 28 травня 2021.
Reformer: The Efficient Transformer. Google AI Blog (англ.). 16 січня 2020. Архів оригіналу за 22 жовтня 2020. Процитовано 22 жовтня 2020.

harvard.edu

ui.adsabs.harvard.edu

Assael, Yannis; Sommerschield, Thea; Shillingford, Brendan; Bordbar, Mahyar; Pavlopoulos, John; Chatzipanagiotou, Marita; Androutsopoulos, Ion; Prag, Jonathan; de Freitas, Nando (березень 2022). Restoring and attributing ancient texts using deep neural networks. Nature (англ.). 603 (7900): 280—283. Bibcode:2022Natur.603..280A. doi:10.1038/s41586-022-04448-z. ISSN 1476-4687. PMC 8907065. PMID 35264762.

huggingface.co

Masked language modeling. huggingface.co (англ.). Процитовано 5 жовтня 2023.
Causal language modeling. huggingface.co (англ.). Процитовано 5 жовтня 2023.

idsia.ch

people.idsia.ch

Schmidhuber, Juergen (26 березня 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights—the first Transformer variants. 2021-: New stuff! (англ.). IDSIA, Switzerland. Архів оригіналу за 5 грудня 2023. Процитовано 29 грудня 2023.

indico.io

Sequence Modeling with Neural Networks (Part 2): Attention Models. Indico (англ.). 18 квітня 2016. Архів оригіналу за 21 жовтня 2020. Процитовано 15 жовтня 2019.

infoq.com

Google AI Unveils Muse, a New Text-to-Image Transformer Model. InfoQ (англ.).

jalammar.github.io

Alammar, Jay. The Illustrated Transformer. jalammar.github.io (англ.). Архів оригіналу за 18 жовтня 2020. Процитовано 15 жовтня 2019.

marktechpost.com

Islam, Arham (14 листопада 2022). How Do DALL·E 2, Stable Diffusion, and Midjourney Work? (англ.).

metaphysic.ai

blog.metaphysic.ai

Using Diffusion Models to Create Superior NeRF Avatars (англ.). 5 січня 2023.

neurips.cc

proceedings.neurips.cc

Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). Attention is All you Need (PDF). Advances in Neural Information Processing Systems (англ.). Curran Associates, Inc. 30.
Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems. Curran Associates, Inc. 27. arXiv:1409.3215.
Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (6 грудня 2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems (англ.). 35: 16344—16359. arXiv:2205.14135.

nih.gov

pubmed.ncbi.nlm.nih.gov

Hochreiter, Sepp; Schmidhuber, Jürgen (1 листопада 1997). Long Short-Term Memory. Neural Computation (англ.). 9 (8): 1735—1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
Gruber, N.; Jockisch, A. (2020), Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?, Frontiers in Artificial Intelligence (англ.), 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). Precision information extraction for rare disease epidemiology at scale. Journal of Translational Medicine (англ.). 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMC 9972634. PMID 36855134.
Assael, Yannis; Sommerschield, Thea; Shillingford, Brendan; Bordbar, Mahyar; Pavlopoulos, John; Chatzipanagiotou, Marita; Androutsopoulos, Ion; Prag, Jonathan; de Freitas, Nando (березень 2022). Restoring and attributing ancient texts using deep neural networks. Nature (англ.). 603 (7900): 280—283. Bibcode:2022Natur.603..280A. doi:10.1038/s41586-022-04448-z. ISSN 1476-4687. PMC 8907065. PMID 35264762.

ncbi.nlm.nih.gov

Gruber, N.; Jockisch, A. (2020), Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?, Frontiers in Artificial Intelligence (англ.), 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). Precision information extraction for rare disease epidemiology at scale. Journal of Translational Medicine (англ.). 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMC 9972634. PMID 36855134.
Assael, Yannis; Sommerschield, Thea; Shillingford, Brendan; Bordbar, Mahyar; Pavlopoulos, John; Chatzipanagiotou, Marita; Androutsopoulos, Ion; Prag, Jonathan; de Freitas, Nando (березень 2022). Restoring and attributing ancient texts using deep neural networks. Nature (англ.). 603 (7900): 280—283. Bibcode:2022Natur.603..280A. doi:10.1038/s41586-022-04448-z. ISSN 1476-4687. PMC 8907065. PMID 35264762.

notion.site

yaofu.notion.site

Fu, Yao (13 грудня 2023). Towards 100x Speedup: Full Stack Transformer Inference Optimization (англ.).

nytimes.com

Lewis-Kraus, Gideon (14 грудня 2016). The Great A.I. Awakening. The New York Times (амер.). ISSN 0362-4331. Архів оригіналу за 24 травня 2023. Процитовано 22 червня 2023.

openai.com

Better Language Models and Their Implications. OpenAI (англ.). 14 лютого 2019. Архів оригіналу за 19 грудня 2020. Процитовано 25 серпня 2019.
Improving language understanding with unsupervised learning. openai.com (амер.). 11 червня 2018. Архів оригіналу за 18 березня 2023. Процитовано 18 березня 2023.

paperswithcode.com

Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). Transformers are RNNs: Fast autoregressive Transformers with linear attention. ICML 2020 (англ.). PMLR. с. 5156—5165.
Papers with Code – A Decomposable Attention Model for Natural Language Inference. paperswithcode.com (англ.).

princeton-nlp.github.io

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Princeton NLP (англ.). 17 червня 2023. Процитовано 18 липня 2023.

scholar.google.com

Google Scholar. scholar.google.com (англ.). Процитовано 13 серпня 2023.

semanticscholar.org

api.semanticscholar.org

Hochreiter, Sepp; Schmidhuber, Jürgen (1 листопада 1997). Long Short-Term Memory. Neural Computation (англ.). 9 (8): 1735—1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
Schmidhuber, Jürgen (1992). Learning to control fast-weight memories: an alternative to recurrent nets. Neural Computation (англ.). 4 (1): 131—139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (англ.). с. 38—45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
Elman, Jeffrey L. (березень 1990). Finding Structure in Time. Cognitive Science (англ.). 14 (2): 179—211. doi:10.1207/s15516709cog1402_1. S2CID 2763403.
Banko, Michele; Brill, Eric (2001). Scaling to very very large corpora for natural language disambiguation. Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 (англ.). Morristown, NJ, USA: Association for Computational Linguistics: 26—33. doi:10.3115/1073012.1073017. S2CID 6645623.
Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (англ.). Stroudsburg, PA, USA: Association for Computational Linguistics: 103—111. arXiv:1409.1259. doi:10.3115/v1/w14-4012. S2CID 11336213.
Gruber, N.; Jockisch, A. (2020), Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?, Frontiers in Artificial Intelligence (англ.), 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
Chen, Jia; Chen, Tao; Shen, Mengqi; Shi, Yunhai; Wang, Dongjing; Zhang, Xin (1 вересня 2022). Gated three-tower transformer for text-driven stock market prediction. Multimedia Tools and Applications (англ.). 81 (21): 30093—30119. doi:10.1007/s11042-022-11908-1. ISSN 1573-7721. S2CID 247987240.
Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (6 червня 2022). Position Information in Transformers: An Overview. Computational Linguistics (англ.). 48 (3): 733—763. arXiv:2102.11090. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.

stanford.edu

crfm.stanford.edu

Stanford CRFM. crfm.stanford.edu (англ.). Процитовано 18 липня 2023.

together.ai

Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference. TOGETHER (амер.). Процитовано 18 липня 2023.

towardsdatascience.com

He, Cheng (31 грудня 2021). Transformer in CV. Transformer in CV (англ.). Towards Data Science. Архів оригіналу за 16 квітня 2023. Процитовано 19 червня 2021.

twitter.com

LeCun, Yann (28 квітня 2023). A survey of LLMs with a practical guide and evolutionary tree. Twitter (англ.). Архів оригіналу за 23 червня 2023. Процитовано 23 червня 2023.

web.archive.org

Better Language Models and Their Implications. OpenAI (англ.). 14 лютого 2019. Архів оригіналу за 19 грудня 2020. Процитовано 25 серпня 2019.
He, Cheng (31 грудня 2021). Transformer in CV. Transformer in CV (англ.). Towards Data Science. Архів оригіналу за 16 квітня 2023. Процитовано 19 червня 2021.
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google AI Blog (англ.). 2 листопада 2018. Архів оригіналу за 13 січня 2021. Процитовано 25 серпня 2019.
Schmidhuber, Juergen (26 березня 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights—the first Transformer variants. 2021-: New stuff! (англ.). IDSIA, Switzerland. Архів оригіналу за 5 грудня 2023. Процитовано 29 грудня 2023.
Lewis-Kraus, Gideon (14 грудня 2016). The Great A.I. Awakening. The New York Times (амер.). ISSN 0362-4331. Архів оригіналу за 24 травня 2023. Процитовано 22 червня 2023.
Improving language understanding with unsupervised learning. openai.com (амер.). 11 червня 2018. Архів оригіналу за 18 березня 2023. Процитовано 18 березня 2023.
Sequence Modeling with Neural Networks (Part 2): Attention Models. Indico (англ.). 18 квітня 2016. Архів оригіналу за 21 жовтня 2020. Процитовано 15 жовтня 2019.
Alammar, Jay. The Illustrated Transformer. jalammar.github.io (англ.). Архів оригіналу за 18 жовтня 2020. Процитовано 15 жовтня 2019.
Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (серпень 2019). What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP (англ.). Florence, Italy: Association for Computational Linguistics: 276—286. arXiv:1906.04341. doi:10.18653/v1/W19-4828. Архів оригіналу за 21 жовтня 2020. Процитовано 20 травня 2020.
LeCun, Yann (28 квітня 2023). A survey of LLMs with a practical guide and evolutionary tree. Twitter (англ.). Архів оригіналу за 23 червня 2023. Процитовано 23 червня 2023.
Constructing Transformers For Longer Sequences with Sparse Attention Methods. Google AI Blog (англ.). 25 березня 2021. Архів оригіналу за 18 вересня 2021. Процитовано 28 травня 2021.
Tasks with Long Sequences – Chatbot. Coursera (англ.). Архів оригіналу за 26 жовтня 2020. Процитовано 22 жовтня 2020.
Reformer: The Efficient Transformer. Google AI Blog (англ.). 16 січня 2020. Архів оригіналу за 22 жовтня 2020. Процитовано 22 жовтня 2020.

wiley.com

doi.wiley.com

Elman, Jeffrey L. (березень 1990). Finding Structure in Time. Cognitive Science (англ.). 14 (2): 179—211. doi:10.1207/s15516709cog1402_1. S2CID 2763403.

worldcat.org

Hochreiter, Sepp; Schmidhuber, Jürgen (1 листопада 1997). Long Short-Term Memory. Neural Computation (англ.). 9 (8): 1735—1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
Lewis-Kraus, Gideon (14 грудня 2016). The Great A.I. Awakening. The New York Times (амер.). ISSN 0362-4331. Архів оригіналу за 24 травня 2023. Процитовано 22 червня 2023.
Chen, Jia; Chen, Tao; Shen, Mengqi; Shi, Yunhai; Wang, Dongjing; Zhang, Xin (1 вересня 2022). Gated three-tower transformer for text-driven stock market prediction. Multimedia Tools and Applications (англ.). 81 (21): 30093—30119. doi:10.1007/s11042-022-11908-1. ISSN 1573-7721. S2CID 247987240.
Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (1 січня 2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research (англ.). 21 (1): 140:5485–140:5551. arXiv:1910.10683. ISSN 1532-4435.
Assael, Yannis; Sommerschield, Thea; Shillingford, Brendan; Bordbar, Mahyar; Pavlopoulos, John; Chatzipanagiotou, Marita; Androutsopoulos, Ion; Prag, Jonathan; de Freitas, Nando (березень 2022). Restoring and attributing ancient texts using deep neural networks. Nature (англ.). 603 (7900): 280—283. Bibcode:2022Natur.603..280A. doi:10.1038/s41586-022-04448-z. ISSN 1476-4687. PMC 8907065. PMID 35264762.
Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (6 червня 2022). Position Information in Transformers: An Overview. Computational Linguistics (англ.). 48 (3): 733—763. arXiv:2102.11090. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.