# FastBertTokenizer

[![NuGet version (FastBertTokenizer)](https://img.shields.io/nuget/v/FastBertTokenizer.svg?style=flat)](https://www.nuget.org/packages/FastBertTokenizer/) ![.NET Build](https://github.com/georg-jung/FastBertTokenizer/actions/workflows/ci.yml/badge.svg) [![codecov](https://codecov.io/github/georg-jung/FastBertTokenizer/graph/badge.svg?token=PEINHYEBGH)](https://codecov.io/github/georg-jung/FastBertTokenizer)

A fast and memory-efficient library for WordPiece tokenization as it is used by BERT. Tokenization correctness and speed are automatically evaluated in extensive unit tests and benchmarks. Native AOT compatible, with support for `netstandard2.0`.

## Goals

* Enabling you to run your AI workloads on .NET in production.
* **Correctness** - Results that are equivalent to [HuggingFace Transformers' `AutoTokenizer`'s](https://huggingface.co/docs/transformers/v4.33.0/en/model_doc/auto#transformers.AutoTokenizer) in all practical cases.
* **Speed** - Tokenization should be as fast as reasonably possible.
* **Ease of use** - The API should be easy to understand and use.

## Getting Started

```bash
dotnet new console
dotnet add package FastBertTokenizer
```

```csharp
using FastBertTokenizer;

var tok = new BertTokenizer();
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");
var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
var decoded = tok.Decode(inputIds.Span);
Console.WriteLine(decoded);

// Output:
// 101, 19544, 2213, 12997, 17421, 2079, 10626, 4133, 2572, 3388, 1012, 102
// [CLS] lorem ipsum dolor sit amet. [SEP]
```

[*example project*](src/examples/QuickStart/)

## Comparison to [BERTTokenizers](https://github.com/NMZivkovic/BertTokenizers)

* about one order of magnitude faster
* allocates more than one order of magnitude less memory
* [better whitespace handling](https://github.com/NMZivkovic/BertTokenizers/issues/24)
* [handles unknown characters correctly](https://github.com/NMZivkovic/BertTokenizers/issues/26)
* [does not throw if text is longer than the maximum sequence length](https://github.com/NMZivkovic/BertTokenizers/issues/18)
* handles Unicode control characters
* handles other alphabets, such as Greek, and right-to-left languages

Note that while [BERTTokenizers handles token types incorrectly](https://github.com/NMZivkovic/BertTokenizers/issues/18), it does support input of two pieces of text that are tokenized with a separator in between. *FastBertTokenizer* currently does not support this.

## Speed / Benchmarks

> tl;dr: FastBertTokenizer can encode 1 GB of text in around 2 s on a typical notebook CPU from 2020.

All benchmarks were performed on a typical end-user notebook, a ThinkPad T14s Gen 1:

```txt
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3527/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.204
```

Similar results can also be observed using [GitHub Actions](https://github.com/georg-jung/FastBertTokenizer/actions/workflows/benchmark.yml). Note, however, that benchmarking on shared CI runners has drawbacks and can lead to varying results.
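If you want to run similar measurements yourself, a minimal [BenchmarkDotNet](https://github.com/dotnet/BenchmarkDotNet) harness along the following lines can serve as a starting point. This is a sketch, not the repository's actual benchmark suite: the `corpus` directory and the way articles are loaded are placeholder assumptions, while `LoadFromHuggingFaceAsync` and `Encode` are the calls shown in the quick start above.

```csharp
using System;
using System.IO;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using FastBertTokenizer;

[MemoryDiagnoser]
public class TokenizeBench
{
    private readonly BertTokenizer _tok = new();
    private string[] _articles = Array.Empty<string>();

    [GlobalSetup]
    public void Setup()
    {
        // Synchronously wait for the vocabulary download; keeps the setup method simple.
        _tok.LoadFromHuggingFaceAsync("bert-base-uncased").GetAwaiter().GetResult();

        // Placeholder: load your own corpus here, e.g. one article per text file.
        _articles = Directory.GetFiles("corpus", "*.txt")
            .Select(File.ReadAllText)
            .ToArray();
    }

    [Benchmark]
    public long Singlethreaded()
    {
        long tokenCount = 0;
        foreach (var article in _articles)
        {
            var (inputIds, _, _) = _tok.Encode(article);
            tokenCount += inputIds.Length;
        }

        return tokenCount;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<TokenizeBench>();
}
```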
### on .NET 6.0 vs. on .NET 8.0

* `.NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2` vs. `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`
* Workload: encode up to 512 tokens from each of 15,000 articles from Simple English Wikipedia.
* Results: total tokens produced: 3,657,145; on .NET 8 that is ~11m tokens/s single-threaded and ~73m tokens/s multi-threaded (a simplified sketch of such a multi-threaded encoding loop is shown at the end of this README).

| Method                       | Runtime  | Mean      | Error    | StdDev   | Ratio | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|----------------------------- |--------- |----------:|---------:|---------:|------:|---------:|---------:|---------:|----------:|------------:|
| Singlethreaded               | .NET 6.0 | 450.39 ms | 7.340 ms | 6.866 ms |  1.00 |        - |        - |        - |      2 MB |        1.00 |
| MultithreadedMemReuseBatched | .NET 6.0 |  72.46 ms | 1.337 ms | 1.251 ms |  0.16 | 750.0000 | 250.0000 | 250.0000 |  12.75 MB |        6.39 |
|                              |          |           |          |          |       |          |          |          |           |             |
| Singlethreaded               | .NET 8.0 | 332.51 ms | 6.574 ms | 7.826 ms |  1.00 |        - |        - |        - |   1.99 MB |        1.00 |
| MultithreadedMemReuseBatched | .NET 8.0 |  50.83 ms | 0.999 ms | 1.995 ms |  0.15 | 500.0000 |        - |        - |  12.75 MB |        6.40 |

### vs. [SharpToken](https://github.com/dmitry-brazhenko/SharpToken)

* `SharpToken v2.0.2`
* `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`
* Workload: fully encode 15,000 articles from Simple English Wikipedia. Total tokens produced by FastBertTokenizer: 5,807,949 (~9.4m tokens/s single-threaded).

This isn't an apples-to-apples comparison, as BPE (what SharpToken does) and WordPiece encoding (what FastBertTokenizer does) are different tasks/algorithms. Both were applied to exactly the same texts/corpus, though.

| Method                        | Mean       | Error    | StdDev   | Gen0      | Gen1      | Allocated |
|------------------------------ |-----------:|---------:|---------:|----------:|----------:|----------:|
| SharpTokenFullArticles        | 1,551.9 ms | 25.82 ms | 24.15 ms | 5000.0000 | 2000.0000 |  32.56 MB |
| FastBertTokenizerFullArticles |   620.3 ms |  7.00 ms |  6.21 ms |         - |         - |   2.26 MB |

### vs. HuggingFace [tokenizers](https://github.com/huggingface/tokenizers) (Rust)

`tokenizers v0.19.1`

I'm not really experienced in benchmarking Rust code, but my attempts using criterion.rs (see `src/HuggingfaceTokenizer/BenchRust`) suggest that it takes tokenizers around

* batched/multi-threaded: ~2 s (~2.9m tokens/s)
* single-threaded: ~10 s (~0.6m tokens/s)

to produce 5,807,947 tokens from the same 15k Simple English Wikipedia articles. Contrary to what one might expect, this means that FastBertTokenizer, being a managed implementation, outperforms tokenizers. It should be noted, though, that tokenizers has a much more complete feature set, while FastBertTokenizer is specifically optimized for WordPiece/BERT encoding.

The tokenizers repo states: "Takes less than 20 seconds to tokenize a GB of text on a server's CPU." As 26 MB of text take ~2 s on my notebook CPU, 1 GB would take roughly 80 s. I think it is plausible that "a server's CPU" might be 4x as fast as my notebook's, so my results seem consistent with that claim. It is, however, also possible that I unintentionally handicapped tokenizers somehow. Please let me know if you think so!

### vs. [BERTTokenizers](https://github.com/NMZivkovic/BertTokenizers)

* `BERTTokenizers v1.2.0`
* `.NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX2`
* Workload: prefixes of the contents of 15k Simple English Wikipedia articles, preprocessed to make them encodable by BERTTokenizers.

| Method                                     | Mean       | Error    | StdDev   | Gen0        | Gen1       | Gen2      | Allocated  |
|------------------------------------------- |-----------:|---------:|---------:|------------:|-----------:|----------:|-----------:|
| NMZivkovic_BertTokenizers                  | 2,576.0 ms | 15.49 ms | 13.73 ms | 968000.0000 | 40000.0000 | 1000.0000 | 3430.51 MB |
| FastBertTokenizer_SameDataAsBertTokenizers |   229.8 ms |  4.55 ms |  6.23 ms |           - |          - |         - |    1.03 MB |

## Logo

Created by combining … in .NET brand color with ….
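As referenced from the benchmark results above, here is a simplified sketch of a multi-threaded encoding loop in the spirit of the `MultithreadedMemReuseBatched` scenario. It is not the repository's actual benchmark code: the corpus loading is a placeholder, the memory-reuse aspect is omitted for brevity, and it conservatively uses one tokenizer instance per worker rather than assuming a single instance is safe to share across threads. Only `LoadFromHuggingFaceAsync` and `Encode` from the quick start are used.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using FastBertTokenizer;

// Placeholder corpus: one article per text file in a local "corpus" directory.
string[] articles = Directory.GetFiles("corpus", "*.txt")
    .Select(File.ReadAllText)
    .ToArray();

int workers = Environment.ProcessorCount;

// Distribute the articles round-robin into one chunk per worker.
string[][] chunks = articles
    .Select((text, i) => (text, i))
    .GroupBy(x => x.i % workers, x => x.text)
    .Select(g => g.ToArray())
    .ToArray();

long[] counts = await Task.WhenAll(chunks.Select(chunk => Task.Run(async () =>
{
    // One tokenizer per worker; in a real application you would load the
    // vocabulary once instead of downloading it per worker.
    var tok = new BertTokenizer();
    await tok.LoadFromHuggingFaceAsync("bert-base-uncased");

    long tokenCount = 0;
    foreach (var text in chunk)
    {
        var (inputIds, _, _) = tok.Encode(text);
        tokenCount += inputIds.Length;
    }

    return tokenCount;
})));

Console.WriteLine($"Total tokens: {counts.Sum()}");
```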