# llama3.java

**Repository Path**: linghushaoxia/llama3.java

## Basic Information

- **Project Name**: llama3.java
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-06-07
- **Last Updated**: 2024-06-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Llama3.java

Practical [Llama 3](https://github.com/meta-llama/llama3) inference implemented in a single Java file.

This project is the successor of [llama2.java](https://github.com/mukel/llama2.java) based on [llama2.c](https://github.com/karpathy/llama2.c) by [Andrej Karpathy](https://twitter.com/karpathy) and his [excellent educational videos](https://www.youtube.com/c/AndrejKarpathy).

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the [Graal compiler](https://www.graalvm.org/latest/reference-manual/java/compiler).

## Features

- Single file, no dependencies
- [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Llama 3 tokenizer based on [minbpe](https://github.com/karpathy/minbpe)
- Llama 3 inference with Grouped-Query Attention
- Support for Q8_0 and Q4_0 quantizations
- Fast matrix-vector multiplication routines for quantized tensors using Java's [Vector API](https://openjdk.org/jeps/469)
- Simple CLI with `--chat` and `--instruct` modes.

*Screenshot: the interactive `--chat` mode in action.*
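To make the quantization items in the feature list more concrete, here is a minimal scalar sketch of a dot product between a `Q8_0`-quantized row and a float vector. It assumes the usual ggml `Q8_0` layout: blocks of 32 values, each stored as one little-endian float16 scale followed by 32 signed 8-bit quants, with `value = scale * quant`. The class `Q8DotSketch` and its methods are hypothetical names for illustration and are not taken from `Llama3.java`, whose routines use the Vector API rather than a plain loop.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Llama3.java.
final class Q8DotSketch {
    static final int BLOCK_SIZE = 32;              // values per Q8_0 block
    static final int BLOCK_BYTES = 2 + BLOCK_SIZE; // float16 scale + 32 int8 quants

    /** Quantizes values (length must be a multiple of 32) into Q8_0 blocks. */
    static ByteBuffer quantize(float[] values) {
        if (values.length % BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + BLOCK_SIZE);
        }
        int blocks = values.length / BLOCK_SIZE;
        ByteBuffer out = ByteBuffer.allocate(blocks * BLOCK_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int b = 0; b < blocks; b++) {
            // The per-block scale maps the largest magnitude to +/-127.
            float maxAbs = 0f;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                maxAbs = Math.max(maxAbs, Math.abs(values[b * BLOCK_SIZE + i]));
            }
            float scale = maxAbs / 127f;
            float inverse = scale == 0f ? 0f : 1f / scale;
            out.putShort(Float.floatToFloat16(scale)); // Java 20+ API
            for (int i = 0; i < BLOCK_SIZE; i++) {
                out.put((byte) Math.round(values[b * BLOCK_SIZE + i] * inverse));
            }
        }
        out.flip();
        return out;
    }

    /** Scalar dot product of a Q8_0 row with a float vector of the same length. */
    static float dot(ByteBuffer row, float[] x) {
        float sum = 0f;
        int blocks = x.length / BLOCK_SIZE;
        for (int b = 0; b < blocks; b++) {
            int base = b * BLOCK_BYTES;
            float scale = Float.float16ToFloat(row.getShort(base));
            float blockSum = 0f;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                blockSum += row.get(base + 2 + i) * x[b * BLOCK_SIZE + i];
            }
            // Apply the scale once per block instead of once per element.
            sum += scale * blockSum;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] weights = new float[64];
        float[] x = new float[64];
        float exact = 0f;
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (float) Math.sin(i);
            x[i] = (float) Math.cos(0.5 * i);
            exact += weights[i] * x[i];
        }
        System.out.println("exact dot:     " + exact);
        System.out.println("quantized dot: " + dot(quantize(weights), x));
    }
}
```

A matrix-vector product repeats this per row. Factoring the scale out of the inner loop leaves a plain multiply-accumulate over 32 elements, which is the kind of loop a SIMD implementation can map onto vector lanes.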

## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The `~4.3GB` pure `Q4_0` quantized model is recommended, please be gentle with [huggingface.co](https://huggingface.co) servers:

```
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model ~8GB
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
```

#### Optional: quantize to pure `Q4_0` manually

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure e.g. the `output.weights` tensor is quantized with `Q6_K`, instead of `Q4_0`.
A **pure** `Q4_0` quantization can be generated from a high precision (F32, F16, BFLOAT16) .gguf source with the `quantize` utility from [llama.cpp](https://github.com/ggerganov/llama.cpp) as follows:

```bash
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```

## Build and run

Java 21+ is required, in particular the [`MemorySegment` mmap-ing feature](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,long,long,java.lang.foreign.Arena)).

[`jbang`](https://www.jbang.dev/) is a perfect fit for this use case, just:

```
jbang Llama3.java --help
```

Or execute directly, also via [`jbang`](https://www.jbang.dev/):

```bash
chmod +x Llama3.java
./Llama3.java --help
```

#### Optional: Makefile + manually build and run

A simple [Makefile](./Makefile) is provided, run `make` to produce `llama3.jar` or manually:

```bash
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:

```bash
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
```

## Performance

**Important Note**
On GraalVM, please note that the Graal compiler doesn't support the Vector API yet, run with `-Dllama.VectorAPI=false`, but expect sub-optimal performance.
Vanilla OpenJDK 21+ is recommended for now, which supports the Vector API.

### llama.cpp

Vanilla `llama.cpp` built with `make -j 20`.

```bash
./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
```

Executed as follows:

```bash
./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  -n 512 \
  -s 42 \
  -p "<|start_of_header_id|>user<|end_of_header_id|>Why is the sky blue?<|eot_id|><|start_of_header_id|>assistant<|end_of_header_id|>\n\n" \
  --interactive-specials
```

Collected the **"eval time"** metric in tokens/s.

### Llama3.java

Running on OpenJDK 21.0.2.

```bash
jbang Llama3.java \
  --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  --max-tokens 512 \
  --seed 42 \
  --stream false \
  --prompt "Why is the sky blue?"
```

### Results

#### Notebook Intel 13900H 6pC+8eC/20T 64GB (5200) Linux 6.6.26

| Model                            | tokens/s | Implementation   |
|----------------------------------|----------|------------------|
| Llama-3-8B-Instruct-Q4_0.gguf    | 7.53     | llama.cpp        |
| Llama-3-8B-Instruct-Q4_0.gguf    | 6.95     | llama3.java      |
| Llama-3-8B-Instruct-Q8_0.gguf    | 5.16     | llama.cpp        |
| Llama-3-8B-Instruct-Q8_0.gguf    | 4.02     | llama3.java      |

#### Workstation AMD 3950X 16C/32T 64GB (3200) Linux 6.6.25

**Notes**
*Running on a single CCD e.g. `taskset -c 0-15 jbang Llama3.java ...` since inference is constrained by memory bandwidth.*

| Model                            | tokens/s | Implementation   |
|----------------------------------|----------|------------------|
| Llama-3-8B-Instruct-Q4_0.gguf    | 9.26     | llama.cpp        |
| Llama-3-8B-Instruct-Q4_0.gguf    | 8.03     | llama3.java      |
| Llama-3-8B-Instruct-Q8_0.gguf    | 5.79     | llama.cpp        |
| Llama-3-8B-Instruct-Q8_0.gguf    | 4.92     | llama3.java      |

## License

MIT