# twitter-cache-trace

**Repository Path**: brucewanght/twitter-cache-trace

## Basic Information

- **Project Name**: twitter-cache-trace
- **Description**: This repository describes the traces from Twitter's in-memory caching (Twemcache/Pelikan) clusters.
- **Primary Language**: Unknown
- **License**: CC-BY-4.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-09-13
- **Last Updated**: 2021-09-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Anonymized Cache Request Traces from Twitter Production

### Trace Overview

This repository describes the traces from Twitter's in-memory caching ([Twemcache](https://github.com/twitter/twemcache)/[Pelikan](https://github.com/twitter/pelikan)) clusters. The current traces were collected from 54 clusters in Mar 2020. Each trace is one week long. More details are described in the following paper.

* [Juncheng Yang, Yao Yue, Rashmi Vinayak, A large scale analysis of hundreds of in-memory cache clusters at Twitter. _14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)_, 2020](https://www.usenix.org/conference/osdi20/presentation/yang).

---

### Trace Format

The traces are compressed with [zstd](https://github.com/facebook/zstd); to decompress, run `zstd -d /path/file`. The decompressed traces are plain text structured as comma-separated columns. Each row represents one request, with the following columns in order.

* `timestamp`: the time when the cache receives the request, in seconds
* `anonymized key`: the original key with anonymization
* `key size`: the size of the key in bytes
* `value size`: the size of the value in bytes
* `client id`: the anonymized client (frontend service) that sends the request
* `operation`: one of get/gets/set/add/replace/cas/append/prepend/delete/incr/decr
* `TTL`: the time-to-live (TTL) of the object set by the client; it is 0 when the request is not a write request

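
The column layout above can be read with a few lines of Python. The following is a minimal sketch rather than an official parser: the names `Request` and `parse_request` are hypothetical, the column order is assumed to match the list above, and the sample row in the comment is made up for illustration. Because an anonymized key might itself contain a comma, the sketch splits the timestamp from the left and the last five columns from the right.

```python
from typing import NamedTuple


class Request(NamedTuple):
    """One trace row (hypothetical container, not part of the repository)."""
    timestamp: int   # seconds
    key: str         # anonymized key, namespaces preserved
    key_size: int    # bytes
    value_size: int  # bytes
    client_id: str   # anonymized client, kept as text
    operation: str   # get/gets/set/add/...
    ttl: int         # 0 when the request is not a write


def parse_request(line: str) -> Request:
    """Parse one decompressed trace line into a Request."""
    # Split the timestamp off the left, then the five trailing columns off
    # the right, so a comma inside the key does not shift the other fields.
    timestamp, rest = line.rstrip("\n").split(",", 1)
    key, key_size, value_size, client_id, operation, ttl = rest.rsplit(",", 5)
    return Request(int(timestamp), key, int(key_size), int(value_size),
                   client_id, operation, int(ttl))


# Made-up example row, for illustration only:
# parse_request("0,nz:u:eeW511W3dcH3de3d15ec,25,120,7,get,0")
```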

Note that key anonymization preserves the namespaces. For example, if the anonymized key is `nz:u:eeW511W3dcH3de3d15ec`, the first two fields `nz` and `u` are namespaces. Namespaces are not necessarily delimited by `:`; different workloads use different delimiters and different numbers of namespaces. A sample of the traces is attached under samples.

---

### Trace Download

The full traces are large (2.8 TB in compressed form, 14 TB uncompressed) and can be downloaded from the following places.

#### Carnegie Mellon University PDL cluster

https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/open_source

#### SNIA

http://iotta.snia.org/tracetypes/17

#### Storj

See [storj](storj) for how to access (good for worldwide access, especially Asia and Europe, but not available after Dec 2020).

#### Baidu pan

https://pan.baidu.com/s/1Jm2nAW-UhsjXU6JYoA07LA access code: wcws (good for access from Asia, but the UI is only in Chinese)

The traces are split into smaller files of 1000000000 lines each (smaller for SNIA) and compressed with zstd, so a file named `clusterN.0.zst` contains the first 1000000000 requests of cluster N. Feel free to contact us if you have problems downloading the traces.

---

### Choice of traces for different evaluations

For different evaluation purposes, we recommend the following clusters/workloads.

* **miss ratio related (admission, eviction)**: cluster52, cluster17 (low miss ratio), cluster18 (low miss ratio), cluster24, cluster44, cluster45, cluster29.
* **write-heavy workloads**: cluster12, cluster15, cluster31, cluster37.
* **TTL-related**: mix of small and large TTLs: cluster52, cluster22, cluster25, cluster11; small TTLs only: cluster18, cluster19, cluster6, cluster7.
* *others?*: feel free to contact us if you are looking for a trace for a specific purpose.
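
For the miss-ratio evaluations mentioned above, a trace is typically replayed through a cache model. The sketch below is one simplified way to do that, not the method used in the paper: `lru_miss_ratio` and `READ_OPS` are hypothetical names, the cache is bounded by object count rather than bytes, TTLs are ignored, and every operation other than get/gets/delete is treated as a write that inserts or refreshes the key.

```python
from collections import OrderedDict

# Assumption: only get/gets count as reads when computing the miss ratio.
READ_OPS = {"get", "gets"}


def lru_miss_ratio(requests, capacity):
    """Replay (key, operation) pairs through an LRU cache.

    The cache holds at most `capacity` objects (sizes and TTLs are ignored).
    Returns read misses divided by reads, or 0.0 if there were no reads.
    """
    cache = OrderedDict()
    reads = misses = 0
    for key, op in requests:
        if op in READ_OPS:
            reads += 1
            if key in cache:
                cache.move_to_end(key)  # hit: mark as most recently used
                continue
            misses += 1                 # miss: fall through and fill the cache
        elif op == "delete":
            cache.pop(key, None)
            continue
        # Demand-fill after a read miss, or insert/refresh on a write.
        cache[key] = None
        cache.move_to_end(key)
        if len(cache) > capacity:
            cache.popitem(last=False)   # evict the least recently used key
    return misses / reads if reads else 0.0
```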

---

### More information about each workload is included under `stat/`

We release computed statistics for each cluster workload under `stat/`; the latest is [here](stat/2020Mar.md). Each field in the table is the mean value of the metric, taken either from production or from the traces. The fields are `production miss ratio`, `workload category` (1: storage, 2: computation, 3: transient item), `key size`, `value size`, `request rate`, `mean object frequency`, `one-hit-wonder ratio (%)`, `compulsory miss ratio (%)`, `common TTLs`, `working set size`, `operations`, `Zipf alpha`.

---

### Misc

* Please join our [discussion channel](http://groups.google.com/group/cache-trace) for questions and updates.
* We provide a **[trace bibliography](bibliography.bib)** of papers that have used and/or analyzed the traces, and encourage anybody who publishes one to add it to the bibliography by creating an issue or pull request on GitHub.

---

### Acknowledgement

We thank Carnegie Mellon University PDL, SNIA and Storj for hosting the traces.

### License

![Creative Commons CC-BY license](https://i.creativecommons.org/l/by/4.0/88x31.png)

The data and trace documentation are made available under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. By downloading or using them, you agree to the terms of this license.