Draft:LaBSE

LaBSE
Developer: Google Research
Initial release: July 15, 2020
Operating system: Cross-platform
Type: Open-source machine learning / Natural language processing
License: Apache License 2.0
Repository: tfhub.dev/google/LaBSE

LaBSE (Language-agnostic BERT Sentence Embedding) is an open-source sentence embedding model developed by Google Research and published in 2020.[1]

It extends the BERT language model with a multilingual dual-encoder architecture trained on parallel translation data, producing sentence vectors that can be compared semantically across more than one hundred languages.[2]

LaBSE is distributed via TensorFlow Hub and is widely used for cross-lingual information retrieval, semantic search, and machine translation evaluation.[3][4]

Overview

LaBSE was introduced by Google Research as part of its multilingual representation learning program. The model maps text from diverse languages into a shared 768-dimensional vector space, where semantically equivalent sentences are located close to each other.[5][6]

Unlike traditional translation-based systems, LaBSE relies on a single shared transformer encoder for all languages, allowing direct comparison between sentences without translation.[1]

Architecture

The system follows the structure of BERT-base (12 transformer layers, 12 attention heads) but employs a dual-encoder training setup similar to the Universal Sentence Encoder.[7][8]

Each sentence is tokenized using a joint multilingual WordPiece vocabulary covering 109 languages. The sentence representation is taken from the final-layer hidden state of the [CLS] token and L2-normalized, which yields a fixed-size vector. Training uses a translation ranking loss that raises the cosine similarity of parallel sentences relative to the non-parallel pairs in the same batch.[9][10]
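The pooling and comparison steps described above can be sketched as follows. This is a minimal pure-Python illustration, not LaBSE's implementation: the function names `pool`, `l2_normalize`, and `cosine` are hypothetical, the inputs are toy 2-dimensional vectors rather than 768-dimensional encoder outputs, and padding tokens are assumed to be absent. Both first-token ([CLS]) pooling and mean pooling are shown; which one a given checkpoint expects should be checked against its documentation.

```python
import math


def pool(token_states, strategy="cls"):
    """Reduce per-token vectors to one sentence vector.

    token_states: list of equal-length vectors, with the [CLS] token first.
    strategy: "cls" keeps the first token's vector, "mean" averages all tokens.
    """
    if strategy == "cls":
        return list(token_states[0])
    if strategy == "mean":
        dim = len(token_states[0])
        count = len(token_states)
        return [sum(tok[d] for tok in token_states) / count for d in range(dim)]
    raise ValueError(f"unknown pooling strategy: {strategy}")


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


# Toy "hidden states" for a two-token sentence (hypothetical values).
states = [[3.0, 4.0], [1.0, 0.0]]
sentence_vec = l2_normalize(pool(states, "cls"))
```

Once vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is why normalized embeddings are convenient for large-scale nearest-neighbor search.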

Training

LaBSE was trained on large multilingual corpora combining public datasets such as OPUS with internal translation data from Google.[11][12]

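Optimization employed Adam with in-batch negatives and a bidirectional additive-margin softmax loss: a softmax cross-entropy over the similarities between each sentence and every candidate translation in the batch, with a margin subtracted from the score of the true pair. According to the authors, LaBSE achieved state-of-the-art results on cross-lingual retrieval benchmarks such as BUCC and Tatoeba at the time of its release.[1]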
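A ranking objective with in-batch negatives can be sketched as below. This is an illustrative pure-Python version under stated assumptions, not the authors' training code: the function names are hypothetical, the embeddings are assumed to be L2-normalized already, and the `margin` and `scale` defaults are arbitrary illustrative values. With `margin=0` the function reduces to a plain scaled softmax cross-entropy.

```python
import math


def _mean_cross_entropy(logits):
    """Mean softmax cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def translation_ranking_loss(src, tgt, margin=0.3, scale=10.0):
    """Bidirectional in-batch ranking loss.

    src[i] and tgt[i] are unit-length embeddings of a translation pair;
    every other tgt[j] in the batch acts as a negative for src[i],
    and vice versa for the target-to-source direction.
    """
    # Pairwise similarities: rows index sources, columns index targets.
    logits = [[sum(a * b for a, b in zip(s, t)) for t in tgt] for s in src]
    # Subtract the margin from true pairs only, then apply the scale.
    for i in range(len(src)):
        logits[i][i] -= margin
    logits = [[scale * v for v in row] for row in logits]
    # Transposing the matrix gives the target-to-source direction.
    transposed = [list(col) for col in zip(*logits)]
    return _mean_cross_entropy(logits) + _mean_cross_entropy(transposed)


# Toy batch: two perfectly aligned pairs versus the same pairs swapped.
aligned = translation_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
shuffled = translation_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each sentence is most similar to its own translation and large when another sentence in the batch scores higher, so larger batches supply more negatives per step.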

Applications

The model is publicly available on TensorFlow Hub and integrated into popular frameworks such as Hugging Face Transformers and Spark NLP. Typical applications include:

  • Cross-lingual document and semantic search.
  • Automatic evaluation of machine translation quality.
  • Multilingual clustering, deduplication, and classification.
  • Serving as a universal encoder for zero-shot learning tasks.
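Two of the applications listed above, semantic search and deduplication, reduce to similarity comparisons over sentence vectors. The sketch below assumes the embeddings have already been computed and L2-normalized; the function names `top_k` and `near_duplicates`, the toy 2-dimensional vectors, and the 0.9 threshold are all hypothetical choices for illustration.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, corpus_vecs, k=3):
    """Return the k (index, score) pairs most similar to the query."""
    scored = [(idx, dot(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def near_duplicates(vecs, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if dot(vecs[i], vecs[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy corpus of unit vectors standing in for sentence embeddings.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = top_k([0.0, 1.0], corpus, k=2)
dupes = near_duplicates([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Because all languages share one vector space, the query and the corpus sentences may be in different languages. In practice the vectors would come from the model itself, for instance via the sentence-transformers package with the `sentence-transformers/LaBSE` checkpoint; that step is omitted here because it requires a model download. The pairwise loop is quadratic in corpus size, so large collections typically use an approximate nearest-neighbor index instead.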

Reception and impact

LaBSE has been cited extensively in academic literature on cross-lingual representation learning.[13] Independent evaluations report that it remains competitive with later multilingual embedding models such as LASER2 and multilingual Sentence-BERT.[14]

Its introduction marked a milestone in multilingual semantic similarity research and influenced subsequent releases of multilingual encoders in the open-source ecosystem.[15][16][17]

References

  1. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
  2. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
  3. "tfhub.dev/google/LaBSE". TensorFlow Hub. Retrieved 2025-10-10.
  4. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
  5. "Language-Agnostic BERT Sentence Embedding". Google Research Blog. 2020-08-18. Retrieved 2025-10-10.
  6. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
  7. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
  8. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
  9. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
  10. Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (2022). "Language-agnostic BERT Sentence Embedding". Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1: 878–891. doi:10.18653/v1/2022.acl-long.62.
  11. "Samanantar: The Largest Publicly Available Parallel Corpus". MIT Press. 2022. doi:10.1162/tacl_a_00452. Retrieved 2025-10-10.
  12. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
  13. Reimers, Nils; Gurevych, Iryna (2020). "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2004.09813.
  14. "Notes on LaBSE". Ceshine AI Blog. 2021-02-01.
  15. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2024). "EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 32: 2841–2856. arXiv:2205.15744. Bibcode:2024ITASL..32.2841M. doi:10.1109/TASLP.2024.3402064.
  16. Mao, Zhuoyuan; Chu, Chenhui; Kurohashi, Sadao (2022). "Efficient and Effective Massively Multilingual Sentence Embedding (EMS)". arXiv:2205.15744 [cs.CL].
  17. "Comparative Study of Multilingual Sentence Embedding Models for Semantic Search". Hugging Face Blog. 2023-03-15. Retrieved 2025-10-10.
