LaBSE
Developer	Google Research
Initial release	July 15, 2020
Written in	Python, TensorFlow
Operating system	Cross-platform
Type	Open-source machine learning / Natural language processing
License	Apache License 2.0
Repository	tfhub.dev/google/LaBSE

LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.^[1]

It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing sentence vectors that can be compared semantically across more than one hundred languages.^[2]

LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.^[3]^[4]

Overview

LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.^[5]^[6]
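Closeness in such a shared space is usually measured with cosine similarity. The following minimal sketch illustrates the idea with hand-made 4-dimensional toy vectors standing in for real 768-dimensional embeddings; the vectors and the names `cosine_similarity`, `en_vec`, `de_vec`, and `unrelated_vec` are illustrative assumptions, not model outputs.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for sentence embeddings (real LaBSE vectors have 768 dimensions).
en_vec = [0.9, 0.1, 0.3, 0.0]           # hypothetical: an English sentence
de_vec = [0.8, 0.2, 0.4, 0.1]           # hypothetical: its German translation
unrelated_vec = [0.0, 0.9, -0.2, 0.7]   # hypothetical: an unrelated sentence

similar = cosine_similarity(en_vec, de_vec)
dissimilar = cosine_similarity(en_vec, unrelated_vec)
```

In this toy setup the translation pair scores close to 1, while the unrelated pair scores close to 0, which is the behaviour a language-agnostic embedding space is meant to exhibit.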

Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.^[1]

Architecture

The system follows the structure of BERT-base (12 transformer layers, 12 attention heads) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.^[7]^[8]

Each sentence is tokenized using a joint multilingual WordPiece vocabulary covering 109 languages. The sentence embedding is taken from the L2-normalized [CLS] token representation of the final transformer layer, which yields a fixed-size sentence representation. Training uses a translation ranking loss that raises the cosine similarity of parallel sentences and lowers it for unrelated pairs.^[9]^[10]
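The step from per-token hidden states to a single fixed-size vector can be sketched as follows. This is a simplified illustration, not the released implementation: it shows two common pooling choices, first-token ([CLS]) pooling and mean pooling, each followed by L2 normalization. The helper names `pool` and `l2_normalize` and the tiny 3-token, 2-dimensional input are assumptions made for the example.

```python
import math


def pool(hidden_states, mode="cls"):
    """Collapse per-token vectors (one list per token) into one sentence vector.

    mode="cls"  -> take the first token's vector (the [CLS] position in BERT)
    mode="mean" -> average all token vectors element-wise
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    if mode == "cls":
        return list(hidden_states[0])
    if mode == "mean":
        n = len(hidden_states)
        return [sum(col) / n for col in zip(*hidden_states)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def l2_normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


# Toy final-layer output: 3 tokens, hidden size 2 (BERT-base would use 768).
toy_states = [[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]]

cls_embedding = l2_normalize(pool(toy_states, "cls"))    # first token [3, 4] scaled to unit length
mean_embedding = l2_normalize(pool(toy_states, "mean"))  # element-wise mean [2, 2] scaled to unit length
```

Either way the output has the same length as the hidden size, regardless of how many tokens the sentence contains.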

Training

LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.^[11]^[12]

The model was optimized with Adam, using in-batch negatives and a temperature-scaled cross-entropy objective. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.^[1]
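A ranking objective with in-batch negatives and temperature-scaled cross-entropy can be sketched in a few lines. The code below is a generic illustration of that family of losses, not the authors' training code: the function name `in_batch_ranking_loss`, the temperature value 0.05, and the two-sentence toy batch are all assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_ranking_loss(src, tgt, temperature=0.05):
    """Mean cross-entropy over a batch where tgt[i] is the translation of src[i].

    Every other target in the batch acts as a negative for src[i].
    Inputs are assumed to be L2-normalized, so dot product = cosine similarity.
    """
    if len(src) != len(tgt) or not src:
        raise ValueError("src and tgt must be non-empty and equally long")
    total = 0.0
    for i, s in enumerate(src):
        # Similarity of source i to every target, sharpened by the temperature.
        logits = [dot(s, t) / temperature for t in tgt]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(src)


# Toy batch of unit vectors (hypothetical embeddings, 2 dimensions).
src_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_tgt = [[1.0, 0.0], [0.0, 1.0]]    # each target matches its source
misaligned_tgt = [[0.0, 1.0], [1.0, 0.0]]  # targets swapped

loss_aligned = in_batch_ranking_loss(src_batch, aligned_tgt)
loss_misaligned = in_batch_ranking_loss(src_batch, misaligned_tgt)
```

The loss is near zero when each source is most similar to its own translation and large when the pairs are swapped. A bidirectional variant would add the same term with `src` and `tgt` exchanged.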

Applications

The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:

Cross-lingual document and semantic search.
Automatic evaluation of machine translation quality.
Multilingual clustering, deduplication, and classification.
Serving as a universal encoder for zero-shot learning tasks.
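As an illustration of the first application, cross-lingual search reduces to embedding a query and a set of candidates and sorting by similarity. The sketch below keeps the encoder abstract: `rank_by_similarity` accepts any callable that maps a list of strings to a list of vectors. The stand-in `toy_encode` and its hand-written vectors are hypothetical and exist only so the example runs without downloading a model.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates, encode):
    """Return (candidate, score) pairs sorted from most to least similar.

    `encode` is any callable mapping a list of strings to a list of vectors.
    """
    vectors = [_normalize(v) for v in encode([query] + list(candidates))]
    q, rest = vectors[0], vectors[1:]
    scored = [(c, sum(a * b for a, b in zip(q, v))) for c, v in zip(candidates, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in encoder with hand-written 3-dimensional vectors.
_TOY_VECTORS = {
    "Where is the train station?": [1.0, 0.1, 0.0],
    "Wo ist der Bahnhof?": [0.9, 0.2, 0.1],
    "Me gusta el café.": [0.0, 0.2, 1.0],
}


def toy_encode(sentences):
    return [_TOY_VECTORS[s] for s in sentences]


results = rank_by_similarity(
    "Where is the train station?",
    ["Me gusta el café.", "Wo ist der Bahnhof?"],
    toy_encode,
)
```

With the toy vectors, the German translation is ranked ahead of the unrelated Spanish sentence. In practice the `encode` argument would be replaced by a real model's encoding function, for example the `encode` method of a sentence-transformers model loaded from the Hugging Face hub; that substitution is not exercised here.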

Reception and impact

LaBSE has been cited extensively in academic literature on cross-lingual representation learning.^[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.^[14]

Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.^[15]^[16]^[17]

References

[1] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
[2] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
[3] "tfhub.dev/google/LaBSE". TensorFlow Hub. Retrieved 2025-10-10.
[4] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
[5] "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. Retrieved 2025-10-10.
[6] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
[7] Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
[8] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
[9] Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
[10] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
[11] "Samanantar: The Largest Publicly Available Parallel Corpus". MIT Press. 2022. doi:10.1162/tacl_a_00452. Retrieved 2025-10-10.
[12] Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
[13] Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Transactions of the Association for Computational Linguistics. 8: 121–135. doi:10.1162/tacl_a_00343.
[14] "Notes on LaBSE". Ceshine AI Blog. 2021-02-01.
[15] Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
[16] Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "Efficient and Effective Massively Multilingual Sentence Embedding (EMS)". arXiv:2205.15744 [cs.CL].
[17] "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. Retrieved 2025-10-10.

External links

LaBSE repository
Language-Agnostic BERT Sentence Embedding (by Yinfei Yang and Fangxiaoyu Feng, Software Engineers, Google Research).
TensorFlow Hub – LaBSE
