# active-learning

**Repository Path**: cxxsheng/active-learning

## Basic Information

- **Project Name**: active-learning
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-06-26
- **Last Updated**: 2024-06-26

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Continuous Learning for Android Malware Detection (USENIX Security 2023)

Yizheng Chen, Zhoujie Ding, and David Wagner

Paper: https://arxiv.org/abs/2302.04332

## Datasets

Download [this](https://drive.google.com/file/d/1O0upEcTolGyyvasCPkZFY86FNclk29XO/view?usp=drive_link) from Google Drive. The zipped file contains DREBIN features of the APIGraph dataset and AndroZoo dataset we used in the paper.

Extract the downloaded file to `data/`, such that the datasets are under `data/gen_apigraph_drebin` and `data/gen_androzoo_drebin`.

* We collected `data/gen_apigraph_drebin` by downloading the sample hashes released by the APIGraph paper. The samples are from 2012 to 2018.
* We collected `data/gen_androzoo_drebin` by downloading apps from AndroZoo. The samples are from 2019 to 2021.

## Example Active Learning Run

The following example trains an `enc-mlp` model using `hi-dist-xent` loss, i.e., our Hierarchical Contrastive Classifier, and it runs active learning with 200 samples / month budget using our Psuedo Loss Sample Selector.

```
#! /bin/bash

SEQ=088
LR=0.003
OPT=sgd
SCH=step
DECAY=0.95
E=250
WLR=0.00015
WE=100
DATA=gen_apigraph_drebin
TRAIN_START=2012-01
TRAIN_END=2012-12
TEST_START=2013-01
TEST_END=2018-12
RESULT_DIR=results_ours
AL_OPT=adam

CNT=200

modeldim="512-384-256-128"
S='half'
B=1024
LOSS='hi-dist-xent'
TS=$(date "+%m.%d-%H.%M.%S")

nohup python -u relabel.py	                                \
            --data ${DATA}                                  \
            --benign_zero                                   \
            --mdate 20230501                                \
            --train_start ${TRAIN_START}                    \
            --train_end ${TRAIN_END}                        \
            --test_start ${TEST_START}                      \
            --test_end ${TEST_END}                          \
            --encoder simple-enc-mlp                        \
            --classifier simple-enc-mlp                     \
            --loss_func ${LOSS}                             \
            --enc-hidden ${modeldim}                        \
            --mlp-hidden 100-100                            \
            --mlp-dropout 0.2                               \
            --sampler ${S}                                  \
            --bsize ${B}                                    \
            --optimizer ${OPT}                              \
            --scheduler ${SCH}                              \
            --learning_rate ${LR}                           \
            --lr_decay_rate ${DECAY}                        \
            --lr_decay_epochs "10,500,10"                   \
            --epochs ${E}                                   \
            --encoder-retrain                               \
            --al_optimizer ${AL_OPT}                        \
            --warm_learning_rate ${WLR}                     \
            --al_epochs ${WE}                               \
            --xent-lambda 100                               \
            --display-interval 180                          \
            --al                                            \
            --count ${CNT}                                  \
            --local_pseudo_loss                             \
            --reduce "none"                                 \
            --sample_reduce 'mean'                          \
            --result experiments/020_revision/${RESULT_DIR}/gen_apigraph_cnt${CNT}_${SEQ}_warm_lr${LR}_${OPT}_${SCH}_${DECAY}_e${E}_${AL_OPT}_wlr${WLR}_we${WE}_test_${TEST_START}_${TEST_END}_cnt${CNT}.csv \
            --log_path experiments/020_revision/${RESULT_DIR}/gen_apigraph_cnt${CNT}_${SEQ}_warm_lr${LR}_${OPT}_${SCH}_${DECAY}_e${E}_${AL_OPT}_wlr${WLR}_we${WE}_test_${TEST_START}_${TEST_END}_cnt${CNT}_${TS}.log \
            >> experiments/020_revision/${RESULT_DIR}/gen_apigraph_cnt${CNT}_${SEQ}_warm_lr${LR}_${OPT}_${SCH}_${DECAY}_e${E}_${AL_OPT}_wlr${WLR}_we${WE}_test_${TEST_START}_${TEST_END}_cnt${CNT}_${TS}.log 2>&1 &
```

### Example Scripts

We used the scripts under `experiments/020_revision` to run experiments in the paper. We ran these jobs on a Slurm GPU cluster (thanks to Center for AI Safety). If you would like to run the same script on a GPU server, not managed by Slurm, you would need to remove the lines starting with `#SBATCH` and also the last line (i.e. `wait`).