Mixture of experts (English Wikipedia)

Analysis of information sources in references of the Wikipedia article "Mixture of experts" in English language version.

refsWebsite

Global rank English rank

26arxiv.org

69^th place

59^th place

12doi.org

2^nd place

8worldcat.org

5^th place

7neurips.cc

low place

3nih.gov

4^th place

3sciencedirect.com

149^th place

178^th place

2mit.edu

415^th place

327^th place

2semanticscholar.org

11^th place

8^th place

2handle.net

102^nd place

76^th place

1harvard.edu

18^th place

17^th place

1nii.ac.jp

304^th place

1,952^nd place

1kit.edu

6,505^th place

9,776^th place

1taylorfrancis.com

3,142^nd place

2,072^nd place

1acm.org

1,185^th place

840^th place

1mlr.press

low place

1orenleung.com

low place

1web.archive.org

1^st place

1facebook.com

77^th place

111^th place

1mistral.ai

low place

1databricks.com

low place

1wired.com

193^rd place

152^nd place

acm.org

dl.acm.org

Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv:2101.03961. ISSN 1532-4435.

arxiv.org

Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
Yang, Zhilin; Dai, Zihang; Salakhutdinov, Ruslan; Cohen, William W. (2017-11-10). "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model". arXiv:1711.03953 [cs.CL].
Narang, Sharan; Chung, Hyung Won; Tay, Yi; Fedus, William; Fevry, Thibault; Matena, Michael; Malkan, Karishma; Fiedel, Noah; Shazeer, Noam (2021-02-23). "Do Transformer Modifications Transfer Across Implementations and Applications?". arXiv:2102.11972 [cs.LG].
Bengio, Yoshua; Léonard, Nicholas; Courville, Aaron (2013). "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation". arXiv:1308.3432 [cs.LG].
Eigen, David; Ranzato, Marc'Aurelio; Sutskever, Ilya (2013). "Learning Factored Representations in a Deep Mixture of Experts". arXiv:1312.4314 [cs.LG].
Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer". arXiv:1701.06538 [cs.LG].
Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv:2101.03961. ISSN 1532-4435.
Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus; Klingner, Jeff; Shah, Apurva; Johnson, Melvin; Liu, Xiaobing; Kaiser, Łukasz (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144 [cs.CL].
DeepSeek-AI; Liu, Aixin; Feng, Bei; Wang, Bin; Wang, Bingxuan; Liu, Bo; Zhao, Chenggang; Dengr, Chengqi; Ruan, Chong (19 June 2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model". arXiv:2405.04434 [cs.CL]..
Dai, Damai; Deng, Chengqi; Zhao, Chenggang; Xu, R. X.; Gao, Huazuo; Chen, Deli; Li, Jiashi; Zeng, Wangding; Yu, Xingkai (11 January 2024). "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models". arXiv:2401.06066 [cs.CL].
DeepSeek-AI; Liu, Aixin; Feng, Bei; Xue, Bing; Wang, Bingxuan; Wu, Bochao; Lu, Chengda; Zhao, Chenggang; Deng, Chengqi (2024-12-27). "DeepSeek-V3 Technical Report". arXiv:2412.19437 [cs.CL].
Zoph, Barret; Bello, Irwan; Kumar, Sameer; Du, Nan; Huang, Yanping; Dean, Jeff; Shazeer, Noam; Fedus, William (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models". arXiv:2202.08906 [cs.CL].
Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew M.; Chen, Zhifeng; Le, Quoc V.; Laudon, James (2022-12-06). "Mixture-of-Experts with Expert Choice Routing". Advances in Neural Information Processing Systems. 35: 7103–7114. arXiv:2202.09368.
Fedus, William; Dean, Jeff; Zoph, Barret (2022-09-04). "A Review of Sparse Expert Models in Deep Learning". arXiv:2209.01667 [cs.LG].
Lewis, Mike; Bhosale, Shruti; Dettmers, Tim; Goyal, Naman; Zettlemoyer, Luke (2021-07-01). "BASE Layers: Simplifying Training of Large, Sparse Models". Proceedings of the 38th International Conference on Machine Learning. PMLR: 6265–6274. arXiv:2103.16716.
Bengio, Emmanuel; Bacon, Pierre-Luc; Pineau, Joelle; Precup, Doina (2015). "Conditional Computation in Neural Networks for faster models". arXiv:1511.06297 [cs.LG].
Zuo, Simiao; Liu, Xiaodong; Jiao, Jian; Kim, Young Jin; Hassan, Hany; Zhang, Ruofei; Zhao, Tuo; Gao, Jianfeng (2022-02-03). "Taming Sparsely Activated Transformer with Stochastic Experts". arXiv:2110.04260 [cs.CL].
Komatsuzaki, Aran; Puigcerver, Joan; Lee-Thorp, James; Ruiz, Carlos Riquelme; Mustafa, Basil; Ainslie, Joshua; Tay, Yi; Dehghani, Mostafa; Houlsby, Neil (2023-02-17). "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints". arXiv:2212.05055 [cs.LG].
Muennighoff, Niklas; Soldaini, Luca; Groeneveld, Dirk; Lo, Kyle; Morrison, Jacob; Min, Sewon; Shi, Weijia; Walsh, Pete; Tafjord, Oyvind (2024-09-03). "OLMoE: Open Mixture-of-Experts Language Models". arXiv:2409.02060 [cs.CL].
Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, André; Keysers, Daniel; Houlsby, Neil (2021). "Scaling Vision with Sparse Mixture of Experts". Advances in Neural Information Processing Systems. 34: 8583–8595. arXiv:2106.05974.
Fei, Zhengcong; Fan, Mingyuan; Yu, Changqian; Li, Debang; Huang, Junshi (2024-07-16). "Scaling Diffusion Transformers to 16 Billion Parameters". arXiv:2407.11633 [cs.CV].
Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". arXiv:2006.16668 [cs.CL].
Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv:2112.06905 [cs.CL].
NLLB Team; Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04672 [cs.CL].
Shen, Sheng; Hou, Le; Zhou, Yanqi; Du, Nan; Longpre, Shayne; Wei, Jason; Chung, Hyung Won; Zoph, Barret; Fedus, William; Chen, Xinyun; Vu, Tu; Wu, Yuexin; Chen, Wuyang; Webson, Albert; Li, Yunxuan (2023). "Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for Large Language Models". arXiv:2305.14705 [cs.CL].
Jiang, Albert Q.; Sablayrolles, Alexandre; Roux, Antoine; Mensch, Arthur; Savary, Blanche; Bamford, Chris; Chaplot, Devendra Singh; Casas, Diego de las; Hanna, Emma Bou (2024-01-08). "Mixtral of Experts". arXiv:2401.04088 [cs.LG].

databricks.com

"Introducing DBRX: A New State-of-the-Art Open LLM". Databricks. 2024-03-27. Retrieved 2024-03-28.

doi.org

Baldacchino, Tara; Cross, Elizabeth J.; Worden, Keith; Rowson, Jennifer (2016). "Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems". Mechanical Systems and Signal Processing. 66–67: 178–200. Bibcode:2016MSSP...66..178B. doi:10.1016/j.ymssp.2015.05.009.
Rokach, Lior (November 2009). Pattern Classification Using Ensemble Methods. Series in Machine Perception and Artificial Intelligence. Vol. 75. WORLD SCIENTIFIC. p. 142. doi:10.1142/7238. ISBN 978-981-4271-06-6. Retrieved 14 November 2024.
TRESP, V. (2001). "Committee Machines". Handbook of Neural Network Signal Processing. Electrical Engineering & Applied Signal Processing Series. Vol. 5. doi:10.1201/9781420038613.ch5 (inactive 7 June 2025). ISBN 978-0-8493-2359-1.{{cite book}}: CS1 maint: DOI inactive as of June 2025 (link)
Hampshire, J.B.; Waibel, A. (July 1992). "The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 14 (7): 751–769. doi:10.1109/34.142911.
Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, Kevin J. Lang (1995). "Phoneme Recognition Using Time-Delay Neural Networks*". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation. Psychology Press. doi:10.4324/9780203763247. ISBN 978-0-203-76324-7.{{cite book}}: CS1 maint: multiple names: authors list (link)
Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl:1721.1/7206. ISSN 0899-7667.
Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.
Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN 0167-9473.
Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.

dx.doi.org

Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.

facebook.com

ai.facebook.com

"200 languages within a single AI model: A breakthrough in high-quality machine translation". ai.facebook.com. 2022-06-19. Archived from the original on 2023-01-09.

handle.net

hdl.handle.net

Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl:1721.1/7206. ISSN 0899-7667.
Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.

harvard.edu

ui.adsabs.harvard.edu

Baldacchino, Tara; Cross, Elizabeth J.; Worden, Keith; Rowson, Jennifer (2016). "Variational Bayesian mixture of experts models and sensitivity analysis for nonlinear dynamical systems". Mechanical Systems and Signal Processing. 66–67: 178–200. Bibcode:2016MSSP...66..178B. doi:10.1016/j.ymssp.2015.05.009.

kit.edu

isl.anthropomatik.kit.edu

Hampshire, J.B.; Waibel, A. (July 1992). "The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 14 (7): 751–769. doi:10.1109/34.142911.

mistral.ai

AI, Mistral (2023-12-11). "Mixtral of experts". mistral.ai. Retrieved 2024-02-04.

mit.edu

direct.mit.edu

Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl:1721.1/7206. ISSN 0899-7667.

mlr.press

proceedings.mlr.press

Lewis, Mike; Bhosale, Shruti; Dettmers, Tim; Goyal, Naman; Zettlemoyer, Luke (2021-07-01). "BASE Layers: Simplifying Training of Large, Sparse Models". Proceedings of the 38th International Conference on Machine Learning. PMLR: 6265–6274. arXiv:2103.16716.

neurips.cc

proceedings.neurips.cc

Nowlan, Steven; Hinton, Geoffrey E (1990). "Evaluation of Adaptive Mixtures of Competing Experts". Advances in Neural Information Processing Systems. 3. Morgan-Kaufmann.
Jordan, Michael; Jacobs, Robert (1991). "Hierarchies of adaptive experts". Advances in Neural Information Processing Systems. 4. Morgan-Kaufmann.
Xu, Lei; Jordan, Michael; Hinton, Geoffrey E (1994). "An Alternative Model for Mixtures of Experts". Advances in Neural Information Processing Systems. 7. MIT Press.
Collobert, Ronan; Bengio, Samy; Bengio, Yoshua (2001). "A Parallel Mixture of SVMs for Very Large Scale Problems". Advances in Neural Information Processing Systems. 14. MIT Press.
Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew M.; Chen, Zhifeng; Le, Quoc V.; Laudon, James (2022-12-06). "Mixture-of-Experts with Expert Choice Routing". Advances in Neural Information Processing Systems. 35: 7103–7114. arXiv:2202.09368.
Roller, Stephen; Sukhbaatar, Sainbayar; szlam, arthur; Weston, Jason (2021). "Hash Layers For Large Sparse Models". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 17555–17566.
Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, André; Keysers, Daniel; Houlsby, Neil (2021). "Scaling Vision with Sparse Mixture of Experts". Advances in Neural Information Processing Systems. 34: 8583–8595. arXiv:2106.05974.

nih.gov

pubmed.ncbi.nlm.nih.gov

Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.

nii.ac.jp

cir.nii.ac.jp

TRESP, V. (2001). "Committee Machines". Handbook of Neural Network Signal Processing. Electrical Engineering & Applied Signal Processing Series. Vol. 5. doi:10.1201/9781420038613.ch5 (inactive 7 June 2025). ISBN 978-0-8493-2359-1.{{cite book}}: CS1 maint: DOI inactive as of June 2025 (link)

orenleung.com

"Transformer Deep Dive: Parameter Counting". Transformer Deep Dive: Parameter Counting. Retrieved 2023-10-10.

sciencedirect.com

Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN 0167-9473.
Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.

semanticscholar.org

api.semanticscholar.org

Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.

taylorfrancis.com

Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, Kevin J. Lang (1995). "Phoneme Recognition Using Time-Delay Neural Networks*". In Chauvin, Yves; Rumelhart, David E. (eds.). Backpropagation. Psychology Press. doi:10.4324/9780203763247. ISBN 978-0-203-76324-7.{{cite book}}: CS1 maint: multiple names: authors list (link)

web.archive.org

"200 languages within a single AI model: A breakthrough in high-quality machine translation". ai.facebook.com. 2022-06-19. Archived from the original on 2023-01-09.

wired.com

Knight, Will. "Inside the Creation of the World's Most Powerful Open Source AI Model". Wired. ISSN 1059-1028. Retrieved 2024-03-28.

worldcat.org

search.worldcat.org

Jacobs, Robert A.; Jordan, Michael I.; Nowlan, Steven J.; Hinton, Geoffrey E. (February 1991). "Adaptive Mixtures of Local Experts". Neural Computation. 3 (1): 79–87. doi:10.1162/neco.1991.3.1.79. ISSN 0899-7667. PMID 31141872. S2CID 572361.
Jordan, Michael I.; Jacobs, Robert A. (March 1994). "Hierarchical Mixtures of Experts and the EM Algorithm". Neural Computation. 6 (2): 181–214. doi:10.1162/neco.1994.6.2.181. hdl:1721.1/7206. ISSN 0899-7667.
Jordan, Michael I.; Xu, Lei (1995-01-01). "Convergence results for the EM approach to mixtures of experts architectures". Neural Networks. 8 (9): 1409–1431. doi:10.1016/0893-6080(95)00014-3. hdl:1721.1/6620. ISSN 0893-6080.
Nguyen, Hien D.; McLachlan, Geoffrey J. (2016-01-01). "Laplace mixture of linear experts". Computational Statistics & Data Analysis. 93: 177–191. doi:10.1016/j.csda.2014.10.016. ISSN 0167-9473.
Chamroukhi, F. (2016-07-01). "Robust mixture of experts modeling using the t distribution". Neural Networks. 79: 20–36. arXiv:1701.07429. doi:10.1016/j.neunet.2016.03.002. ISSN 0893-6080. PMID 27093693. S2CID 3171144.
Chen, K.; Xu, L.; Chi, H. (1999-11-01). "Improved learning algorithms for mixture of experts in multiclass classification". Neural Networks. 12 (9): 1229–1252. doi:10.1016/S0893-6080(99)00043-X. ISSN 0893-6080. PMID 12662629.
Fedus, William; Zoph, Barret; Shazeer, Noam (2022-01-01). "Switch transformers: scaling to trillion parameter models with simple and efficient sparsity". The Journal of Machine Learning Research. 23 (1): 5232–5270. arXiv:2101.03961. ISSN 1532-4435.
Knight, Will. "Inside the Creation of the World's Most Powerful Open Source AI Model". Wired. ISSN 1059-1028. Retrieved 2024-03-28.