# word2vec-graph

**Repository Path**: everybit/word2vec-graph

## Basic Information

- **Project Name**: word2vec-graph
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-20
- **Last Updated**: 2026-04-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# word2vec graph

This visualization builds graphs of nearest neighbors from high-dimensional
word2vec embeddings.

![demo 1](https://i.imgur.com/dn0Egqo.gif)
![demo 2](https://i.imgur.com/Xtv1Haq.gif)
![demo words](https://i.imgur.com/zJKZEve.gif)

## Available Graphs

The dataset used for this visualization comes from [GloVe](https://nlp.stanford.edu/projects/glove/).
It has 6B tokens, a 400K-word vocabulary, and 300-dimensional vectors.

* [Distance < 0.9](https://anvaka.github.io/pm/#/galaxy/word2vec-wiki?cx=-4431&cy=3921&cz=-1124&lx=0.3701&ly=0.4218&lz=-0.0634&lw=0.8253&ml=300&s=1.75&l=1&v=d50_clean_small) -
In this visualization, an edge between two words is formed when the distance between their
vectors is smaller than 0.9. All words with non-word characters or digits are removed.
The final visualization is sparse, yet meaningful.

* [Distance < 1.0](https://anvaka.github.io/pm/#/galaxy/word2vec-wiki?cx=88&cy=-10541&cz=1431&lx=0.1555&ly=0.6672&lz=-0.1453&lw=0.7139&ml=300&s=1.75&l=1&v=d50_clean) -
Similar to the above, but the distance requirement is relaxed: words with a distance smaller
than 1.0 are given edges in the graph. All words with non-word characters or digits are removed.
The visualization becomes more populated as more words are added, yet it is still meaningful.

* [Raw; Distance < 0.9](https://anvaka.github.io/pm/#/galaxy/word2vec-wiki?cx=-7912&cy=-941&cz=-5655&lx=-0.3936&ly=-0.6815&lz=0.0636&lw=0.6137&ml=150&s=1.75&l=1&v=d50) (6.9 MB) -
Unlike the visualizations above, this one was not filtered and includes all words from the
dataset. The majority of the clusters formed here are numerical in nature. I didn't find this one
particularly interesting, yet I'm including it to show how word2vec finds numerical clusters.

### Common Crawl

I have also made a graph from the Common Crawl dataset (840B tokens, 2.2M vocab, 300d vectors).
Words with non-word characters or numbers were removed.

Many of the remaining clusters consist of words with spelling errors:

![spelling error](https://i.imgur.com/Ftj2Ce7.gif)

I had a hard time deciphering the meaning of many clusters here. The Wikipedia embeddings were
much more meaningful. Nevertheless, I want to keep this visualization so you can explore it as well:

* [Common Crawl visualization](https://anvaka.github.io/pm/#/galaxy/word2vec-crawl?cx=-2411&cy=6376&cz=-7215&lx=0.0797&ly=-0.8449&lz=-0.4924&lw=0.1930&ml=150&s=1.75&l=1&v=d300) - 28.4MB

# Intro and Details

[word2vec](https://en.wikipedia.org/wiki/Word2vec) is a family of algorithms
that allow you to find embeddings of words into high-dimensional vector spaces.

```
// For example
cat => [0.1, 0.0, 0.9]
dog => [0.9, 0.0, 0.0]
cow => [0.6, 1.0, 0.0]
```

Words whose vectors are close to each other usually share common contexts in the
corpus, so the distance between two vectors tells us how related the words are:

```
|cat - dog| = 1.20
|cat - cow| = 1.44

"cat" is closer to "dog" than it is to "cow".
```
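
As a rough sketch of how such a distance can be computed, assuming plain Euclidean distance (which the `|a - b|` notation suggests). The helper name `euclidean_distance` is made up for this example:

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance |a - b| between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional vectors from the example above.
cat = [0.1, 0.0, 0.9]
dog = [0.9, 0.0, 0.0]

print(f"|cat - dog| = {euclidean_distance(cat, dog):.2f}")  # |cat - dog| = 1.20
```

Real embeddings have hundreds of dimensions, but the computation is the same.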

## Building a graph

We can simply iterate over every single word in the dictionary and add it to
the graph as a node. But what would be an edge in this graph?

We draw an edge between two words if the distance between their embedding vectors is
shorter than a given threshold.
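
The edge rule can be sketched in a few lines. This is a naive all-pairs version for illustration only, not the script used in this repository; the function name `build_edges` and the toy vectors are assumptions:

```python
import math
from itertools import combinations

def build_edges(embeddings, threshold):
    """Return (word_a, word_b, distance) for every pair closer than threshold.

    `embeddings` maps each word to its vector. Every pair of words is
    compared, so this is only practical for small vocabularies.
    """
    edges = []
    for (word_a, vec_a), (word_b, vec_b) in combinations(embeddings.items(), 2):
        distance = math.dist(vec_a, vec_b)  # Euclidean distance (Python 3.8+)
        if distance < threshold:
            edges.append((word_a, word_b, distance))
    return edges

# Hypothetical toy data, not taken from the real dataset.
toy = {
    "cat": [0.1, 0.0, 0.9],
    "dog": [0.9, 0.0, 0.0],
    "kitten": [0.2, 0.1, 0.8],
}
for word_a, word_b, distance in build_edges(toy, threshold=0.9):
    print(f"{word_a} -- {word_b} ({distance:.2f})")  # cat -- kitten (0.17)
```

With a 400K vocabulary this would mean roughly 80 billion comparisons, which is why an index is needed in practice.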

Once the graph is constructed, I use the method described in [Your own graphs](https://github.com/anvaka/pm#your-own-graphs)
to construct visualizations.

*Note:* From a practical standpoint, searching for all nearest neighbors in a high-dimensional
space is a very CPU-intensive task. Building an index of vectors helps. I didn't
know a good library for this task, so I [consulted Twitter](https://twitter.com/anvaka/status/971812468950487040).
Amazing recommendations by [@gumgumeo](https://twitter.com/gumgumeo) and [@AMZoellner](https://twitter.com/AMZoellner)
led me to [spotify/annoy](https://github.com/spotify/annoy).
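
For illustration, here is roughly how an index replaces the all-pairs scan: for each item we ask only for its few nearest neighbors and keep those under the threshold. The function relies on a single query method, `get_nns_by_item(i, n, include_distances=True)`, which annoy's `AnnoyIndex` provides. The `BruteForceIndex` class is a hypothetical stand-in so the sketch runs without annoy installed, and the parameter names and values are assumptions, not the settings used for these visualizations:

```python
import math

class BruteForceIndex:
    """Hypothetical stand-in exposing the one query method used below."""

    def __init__(self, vectors):
        self.vectors = vectors

    def get_nns_by_item(self, i, n, include_distances=False):
        # Rank every item by distance to item i (the item itself comes first).
        ranked = sorted(
            (math.dist(self.vectors[i], v), j) for j, v in enumerate(self.vectors)
        )[:n]
        ids = [j for _, j in ranked]
        if include_distances:
            return ids, [d for d, _ in ranked]
        return ids

def neighbor_edges(index, item_count, neighbors, max_distance):
    """Collect undirected edges (i, j) between items closer than max_distance."""
    edges = set()
    for i in range(item_count):
        ids, distances = index.get_nns_by_item(i, neighbors, include_distances=True)
        for j, distance in zip(ids, distances):
            if j != i and distance < max_distance:
                edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Hypothetical toy vectors; item ids are their positions in the list.
vectors = [[0.1, 0.0, 0.9], [0.9, 0.0, 0.0], [0.2, 0.1, 0.8]]
index = BruteForceIndex(vectors)
print(neighbor_edges(index, len(vectors), neighbors=3, max_distance=0.9))  # [(0, 2)]
```

With annoy installed, an `AnnoyIndex` filled via `add_item` and finalized with `build` can be passed in place of the stand-in. Its results are approximate, so a few edges may be missed, and the distances it reports depend on the metric the index was created with.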


# Data

I'm using pre-trained word2vec models from the [GloVe](https://nlp.stanford.edu/projects/glove/)
project.
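
GloVe distributes its vectors as plain text: one token per line, followed by its space-separated components. A minimal loader sketch, with hypothetical function names (the scripts in this repository may read the files differently):

```python
def parse_vector_line(line, dimensions):
    """Split one line of a GloVe text file into (word, vector)."""
    parts = line.rstrip("\n").split(" ")
    # The last `dimensions` fields are numbers; whatever precedes them is the token.
    word = " ".join(parts[:-dimensions])
    vector = [float(value) for value in parts[-dimensions:]]
    return word, vector

def load_vectors(lines, dimensions):
    """Build a word -> vector dictionary from an iterable of text lines."""
    return dict(parse_vector_line(line, dimensions) for line in lines if line.strip())

# Two made-up 3-dimensional lines in the same layout as the real files.
sample = ["cat 0.1 0.0 0.9\n", "dog 0.9 0.0 0.0\n"]
vectors = load_vectors(sample, dimensions=3)
print(vectors["cat"])  # [0.1, 0.0, 0.9]
```

Passing an open file object instead of `sample` works the same way, since files iterate line by line. Counting fields from the right keeps the parser from breaking if a token itself contains a space.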


## Preprocessing

My original attempts to render `word2vec` graphs resulted in an overwhelming presence
of numerical clusters. `word2vec` models really love to put numerals together (and
I think it makes sense, intuitively). Alas, that made the visualizations not very
interesting to explore: I hopped from one cluster to another, just to find out
that one was dedicated to the numbers `2017 - 2300`, while the other was dedicated to `0.501 .. 0.403`.

In the Common Crawl word2vec encoding, I removed all words that had non-word characters
or numbers. In my opinion, this made the visualization more interesting to explore, yet
I still don't recognize a lot of clusters.
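
A filter along these lines could look like the sketch below. The exact rule used for the published graphs is not shown here, so the pattern, the treatment of underscores, and the acceptance of non-ASCII letters are all assumptions:

```python
import re

# Letters only: \W excludes non-word characters, \d digits, and _ the underscore.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")

def is_plain_word(token):
    """Return True if the token consists solely of letters."""
    return LETTERS_ONLY.fullmatch(token) is not None

tokens = ["cat", "2017", "0.501", "e-mail", "naïve", "covid19"]
print([t for t in tokens if is_plain_word(t)])  # ['cat', 'naïve']
```

Applying such a check before adding a word to the index removes the numerical clusters described above.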

# Local setup

## Prerequisites

Make sure [`node.js`](https://nodejs.org/en/) is installed.

```
git clone https://github.com/anvaka/word2vec-graph.git
cd word2vec-graph
npm install
```

Install [spotify/annoy](https://github.com/spotify/annoy)

## Building graph file

1. Download the vectors and extract them into the `graph-data` folder.
2. Run `python save_text_edges.py -h` to see how to point it to the newly extracted
vectors (also see the file content for more details).
3. Run `python save_text_edges.py`. Depending on the input vector file size,
this may take a while. The output file `edges.txt` will be saved in the
`graph-data` folder.
4. Run `node edges2graph.js graph-data/edges.txt`. This will save the graph in
binary format into the `graph-data` folder (`graph-data/labels.json`, `graph-data/links.bin`).
5. Now it's time to run the layout. There are two options: one is slow, the other is
much faster, especially on a multi-threaded CPU.

### Running layout with node

To generate the layout, you can run:

```
node --max-old-space-size=12000 layout.js
```

This will take a while to converge (the layout stops after 500 iterations).
Also note that we need to increase the maximum RAM allowed for the node process
(the `max-old-space-size` argument). I'm setting it to ~12GB, which was enough in my case.

### Running layout with C++

A much faster option is to compile the `layout++` module. You will need to manually
download and compile the [`anvaka/ngraph.native`](https://github.com/anvaka/ngraph.native) package.

On Ubuntu it was very straightforward: just run `./compile-demo` and the `layout++`
file will be created in the working folder. You can copy that file into this repository
and run:

```
./layout++ ./graph-data/links.bin
```

The layout will converge much faster, but you will need to manually kill it (Ctrl + C)
after 500-700 iterations.

You will find many `.bin` files. Just pick the one with the highest number
and copy it as `positions.bin` into the `graph-data/` folder. E.g.:

```
cp 500.bin ./graph-data/positions.bin
```
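
If you would rather not look up the highest number by hand, the same step can be scripted. This is a small sketch that assumes the layout snapshots are named `<iteration>.bin`, as the `500.bin` example suggests; the function names are hypothetical:

```python
import shutil
from pathlib import Path

def latest_layout_file(folder):
    """Return the `<number>.bin` file with the highest number, or None."""
    numbered = [p for p in Path(folder).glob("*.bin") if p.stem.isdecimal()]
    # Compare numerically, so that 1000.bin ranks above 500.bin.
    return max(numbered, key=lambda p: int(p.stem), default=None)

def install_positions(layout_folder, graph_data_folder):
    """Copy the latest layout snapshot to <graph_data_folder>/positions.bin."""
    latest = latest_layout_file(layout_folder)
    if latest is None:
        raise FileNotFoundError(f"no numbered .bin files in {layout_folder}")
    target = Path(graph_data_folder) / "positions.bin"
    shutil.copyfile(latest, target)
    return target
```

Calling `install_positions(".", "graph-data")` from the folder where `layout++` ran would then do the same as the `cp` command above.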

That's it. Now you have both the graph and the positions ready. You can follow the instructions from
[Your own graphs](https://github.com/anvaka/pm#your-own-graphs) to visualize your
new graph with `https://anvaka.github.io/pm/#/`.