# 3dnetmod

**Repository Path**: hi-c_baseline/3dnetmod

## Basic Information

- **Project Name**: 3dnetmod
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-03
- **Last Updated**: 2026-04-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


**NOTICE!**
===========

|   *PLEASE* use https://bitbucket.org/creminslab/cremins_lab_tadsubtad_calling_pipeline_11_6_2021.  Most current pipeline.  This repo is outdated.


**Introduction**
================
3DNetMod is a package of Python scripts that sensitively detects nested, hierarchical domains in Hi-C data.


**Installation**
================
To run 3DNetMod, download the source code from Bitbucket and install its dependencies.
Below is a step-by-step guide to downloading the code and installing all necessary dependencies using conda and pip.

**Download source code from Bitbucket**
  ::
  
  
    git clone https://bitbucket.org/creminslab/3dnetmod_method_v1.0_10_06_17
    
**Create a python virtual environment with the necessary dependencies** 
Below is an example of commands to run to install the necessary dependencies using conda and pip. 
  ::
  
    mkdir 3DNetMod_venv
    conda install python=2.7.5
    conda install virtualenv
    virtualenv 3DNetMod_venv
    source 3DNetMod_venv/bin/activate
    conda install numpy=1.10.1
    pip install bctpy==0.5.0
    pip install multiprocess==0.70.4
    pip install pandas==0.15.2
    pip install sklearn==0.0
    pip install patsy==0.4.1
    pip install pyparsing==2.1.10
    pip install pysam==0.8.4
    pip install python-dateutil==2.6.0
    pip install pytz==2016.10
    pip install dill==0.2.5
    pip install pyBigWig==0.3.3
    pip install seaborn==0.7.1
    pip install scipy==0.15.1
    pip install scikit-learn==0.17.1
    pip install interlap==0.2.2
    pip install git+https://github.com/ericsuh/statutils
    wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
    tar -zxvf bedtools-2.25.0.tar.gz
    cd bedtools2
    make
    cp -r bin/* ../3DNetMod_venv/bin/
    pip install pybedtools==0.7.10
    
    
**To run 3DNetMod, first activate the virtual environment you've created by running:**
  ::


    source 3DNetMod_venv/bin/activate
    

**Usage of 3DNetMod**
=====================

**Step 1. Pre-process**
  ::
  
  
      python 3DNetMod_preprocess.py settings.txt
  
| The pre-processing step takes genome-wide HiC data and splits it into overlapping regions to improve speed and sensitivity of domain detection (panel A below). 
|  
**Step 2. 3DNetMod-GPS-MMCP**
  ::
  
  
      python 3DNetMod_GPS_MMCP.py settings.txt
  
  
| The Gamma Plateau Sweep/Modularity Maximization and Consensus Partition (GPS-MMCP) steps identify domains across resolution scales by maximizing network modularity (panels B-G below).    
|  
**Step 3. 3DNetMod-HSVM**
  ::


      python 3DNetMod_HSVM.py settings.txt
  
| The Hierarchical Spatial Variance Minimization (HSVM) step identifies high-confidence domains by stratifying domains by size and then thresholding on boundary spatial variance (panels H-L below).
|  
|
|  All parameters for the method are contained within settings.txt. Prior to running the commands above, place the Hi-C .bed and .counts files in the input/ directory (see below).
|
|  Test input Hi-C data of chromosome 18 from Jiang et al. 2017 [1] are supplied with the repository. The settings_TEST.txt supplied with the repository contains the default settings used on the Jiang et al. 2017 [1] data in Norton et al. 2017.
|
.. image:: overview.png
|
| 
**Input files**
===============

Place the Hi-C interaction data files (one file for each Hi-C sample) and the .bed file (one for the whole set of Hi-C samples) into the input/ subdirectory. Hi-C .counts files should already be normalized (matrix balanced, etc.).

|

**Bed file form**
  ::
  

      chr1    0        40000     1
      chr1    40000    80000     2
      chr1    80000    120000    3
      chr1    120000   160000    4
      ....    .........    .........    .....
      chrX    178000000    178040000    60000   
 

  |
  |  1st col is the chromosome name with a lower-case "chr" prefix (X and Y stay upper case, as in chrX and chrY).  String value
  |  2nd col is the start genomic coordinate.  Integer value
  |  3rd col is the stop genomic coordinate.  Integer value
  |  4th col is the bin number.  Integer value
  |  Tab delimited
  |
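
The column rules above can be checked before running the pipeline. The following is a minimal sketch, not part of 3DNetMod: the function names ``parse_bed_line`` and ``read_bed`` are hypothetical, the chromosome pattern only covers numbered chromosomes plus X and Y, and the ``start < stop`` check is an added assumption rather than a documented requirement.

```python
import re

# Lower-case "chr" prefix followed by a number, X, or Y.
# Assumption: other contigs (e.g. chrM) are outside the scope of this sketch.
CHROM_RE = re.compile(r"^chr(\d+|X|Y)$")


def parse_bed_line(line):
    """Parse one tab-delimited line into (chrom, start, stop, bin_number)."""
    fields = line.strip().split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-delimited columns, got %d" % len(fields))
    chrom, start, stop, bin_number = fields
    if not CHROM_RE.match(chrom):
        raise ValueError("unexpected chromosome name: %r" % chrom)
    # int() raises ValueError for non-integer coordinates or bin numbers.
    start, stop, bin_number = int(start), int(stop), int(bin_number)
    if start >= stop:
        # Assumption: a bin must span at least one base.
        raise ValueError("start must be smaller than stop")
    return chrom, start, stop, bin_number


def read_bed(lines):
    """Parse every non-empty line of an iterable of .bed lines."""
    return [parse_bed_line(line) for line in lines if line.strip()]
```

For a file on disk, ``read_bed(open("input/my_bins.bed"))`` would return one tuple per bin, where ``my_bins.bed`` is a placeholder file name.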

**Hi-C interaction data file form**
  ::
    
    1     1    0.766649 
    1     2    10.98993
    1     3    56.00003
    1     4    3.222222
    2     2    0.988885
    2     3    0.400002
    3     3    5.344442
    3     4    10.22222   
  |
  | 1st col is bin1 of interaction pair.  Integer value
  | 2nd col is bin2 of interaction pair.  Integer value
  | 3rd col is the interaction count value.   Float value
  | Tab delimited
  | Ascending order of bin index 
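
To make the layout above concrete, here is a minimal sketch that reads such records and expands them into a dense matrix. It is not code from 3DNetMod: ``read_counts`` and ``to_dense`` are hypothetical names, and mirroring each value across the diagonal assumes the file lists only one triangle of a symmetric contact matrix, as the example rows suggest.

```python
def read_counts(lines):
    """Return (bin1, bin2, value) tuples from tab-delimited lines."""
    records = []
    for line in lines:
        if not line.strip():
            continue
        bin1, bin2, value = line.strip().split("\t")
        records.append((int(bin1), int(bin2), float(value)))
    # The format calls for ascending bin order; verify it rather than
    # silently re-sorting (sorted() is stable, so equal keys keep their order).
    if records != sorted(records, key=lambda r: (r[0], r[1])):
        raise ValueError("records are not in ascending bin order")
    return records


def to_dense(records, first_bin, n_bins):
    """Build an n_bins x n_bins matrix as a list of lists."""
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for bin1, bin2, value in records:
        i, j = bin1 - first_bin, bin2 - first_bin
        matrix[i][j] = value
        matrix[j][i] = value  # assumption: contacts are symmetric
    return matrix
```

Pairs that are absent from the file stay at 0.0 in the dense matrix.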

|   Note: The method assumes that all samples (cell types/replicates) to be analyzed together have been divided into the same genomic bins which can be accurately represented by a single .bed file.
|   

**Output files**
================
|
| Step 1 preprocess outputs: 
* "chr#"*"sample_#"*finalpvalues.counts and "chr#"*"sample_#"*final.bed located in input/ directory.
* File names contain parameters used in the preprocess portion of settings file (see parameters section). Additionally each chromosome file is a "super region" (3 consecutive  regions of a chromosome) with form "chr#.supperregion#"
|
|
| Step 2 GPS_MMCP outputs: 
* List of regions per chromosome that were not considered based on deficiency of counts in region ("bad regions") and left over list of regions for determining calls ("good regions") are placed in **output/GPS/bad_region_removal**
* Max gamma found per chromosome and the selected gammas used per region based on max gamma placed in **output/GPS/gamma_dynamic_range**
* 5 selected “good regions” randomly rewired prior to max gamma determination placed in **output/GPS/random_networks**
* Genomic coordinates for which domains will be removed if the chaos filter is turned on placed in **output/GPS/chaos_regions**
* All communities prior to GPS parameter filtering placed in **output/MMCP/communities_pre_filters**
* Communities found after filtering prior to redundancy removal placed in **output/MMCP/consensus_filtered_files**
* **Final GPS_MMCP pre-HSVM filtered calls used for step 3 in output/MMCP/unique_communities/results_files**
|
| OPTIONAL output for plots = True in settings file:
* Average number of communities vs gamma and Q vs gamma placed in **output/GPS/scatterplots**
* Plot of consensus partition per gamma placed in **output/variance_thresholding/consensus**
* Communities per partition across block per gamma placed in **output/MMCP/communities_across_numpart**
|
|
| Step 3 HSVM outputs:
* Distribution of sizes of domains for each size strata placed in **output/HSVM/size_hist**
* Histograms of variance distributions for each size stratum placed in **output/HSVM/variance_distributions**
* Values for each selected % AUC threshold placed in **output/HSVM/variance_thresholds**
* **Community calls placed in output/HSVM/variance_thresholded_communities/merged and output/HSVM/variance_thresholded_communities/unmerged (merging based on Boundary_buffer parameter. See parameters below)**
* **Optional finalized chaos filtered domain calls (if chaosfilter is True, see parameters below) placed in output/HSVM/chaos_filtered_communities**
* **Optional finalized domains that are consistent between biological replicates of a given cell type placed in output/FINAL_DOMAIN_CALLS/**
|
|
**Parameters**
==============
|
| **All parameters are placed in settings.txt.**  Parameters are grouped under separate headings (preprocess, GPS, MMCP, HSVM) that correspond to the stage of the method.  Filenames for each stage incorporate corresponding stage parameters. Parameters are summarized below. An asterisk after a parameter name (*) indicates that the parameter is optional and can be omitted.
|
   
|
| **##preprocess##**
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|    parameter                      |              Description                                  |      Recommended value      |
+===================================+===========================================================+=============================+
|  **bed_file**                     | Name of bedfile in input directory                        |          N/A                |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **sample_1**                     | Label for the first biological replicate of celltype 1    |          N/A                |                       
|                                   | Ex: WT1                                                   |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **counts_file_1**                | File name of the HiC interaction data for sample_1. This  |          N/A                |
|                                   | file should be placed in the input/ directory.            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **sample_2**\*                   | Optional (if not used, remove from the settings file).    |                             |
|                                   | sample_2 can be a different cell type from sample_1       |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **counts_file_2**\*              | Optional (if not used, remove from the settings file).    |          N/A                |
|                                   | File name of the HiC interaction data for sample_2.  This |                             |
|                                   | file should be placed in the input/ directory.            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **sample_3**\*                   | Optional (if not used, remove from the settings file.)    |         N/A                 |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **counts_file_3**\*              | Optional (if not used, remove from the settings file.)    |      N/A                    |
|                                   | File name of the HiC interaction data for sample_3. This  |                             |
|                                   | file should be placed in the input/ directory.            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **sample_4**\*                   | Optional (if not used, remove from the settings file.)    |         N/A                 |
|                                   | sample_4 can be any cell type.                            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **counts_file_4**\*              | Optional (if not used, remove from the settings file.)    |                             |
|                                   | File name of the HiC interaction data for sample_4. This  |     N/A                     |
|                                   | file should be placed in the input/ directory.            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **sample_n**\*                   | Optional.  Can continue to add labels for additional      |        N/A                  |
|                                   | biological samples.                                       |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **counts_file_n**\*              | Optional.  Can continue to add HiC interaction data in    |             N/A             |
|                                   | input/ directory. Corresponds to sample_n. Note that each |                             |
|                                   | sample must have a corresponding counts_file.             |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **region_size**                  | Size of regions into which the genome should be sliced.   | *150-300* is                |
|                                   | Units are bins. Integer value.                            | recommended for             | 
|                                   |                                                           | optimal performance at 40 kb|
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **overlap**                      | Number of overlapping bins between adjacent regions.      | *100*  recommended          |
|                                   | Integer value.                                            | for >= 20 kb resolution     |
|                                   |                                                           | *200* recommended           |
|                                   |                                                           | for 10 kb resolution        |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **logged**                       | Option to log counts output. *True* or *False*. Use       | *True* recommended unless   |
|                                   | *True* if input data are not logged and you wish to log   | input data are already      |
|                                   | them. Use *False* if input data are already logged or if  | logged.                     |
|                                   | you do not wish to log the data. If *True*, HiC           |                             |
|                                   | count values < 1 are thresholded to avoid negative values |                             |
|                                   | after logging and repulsion in louvain-like algorithm.    |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|  **processors**                   | Controls number of computing cores utilized in            | If running locally, set     |
|                                   | multiprocessing.Pool.  Depending on available resources,  | to the number of cores      |
|                                   | the maximum possible value should be number of chromosomes| available. For quad core,   | 
|                                   | multiplied by the number of samples. Integer value.       | set to *4*. If using a high |
|                                   |                                                           | performance computing       |
|                                   |                                                           | cluster, *20* is recommended|
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
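
As an illustration of how **region_size**, **overlap**, and **logged** interact, the sketch below slices a chromosome of ``n_bins`` bins into overlapping regions and log-transforms counts. It is a simplified stand-in, not the code in 3DNetMod_preprocess.py: the function names are hypothetical, the step between regions is assumed to be ``region_size - overlap``, "thresholded" is assumed to mean that values below 1 are raised to 1, and the natural logarithm is an arbitrary choice because the log base is not stated here.

```python
import math


def slice_regions(n_bins, region_size, overlap):
    """Return (start, stop) bin index pairs, stop exclusive, covering n_bins."""
    if n_bins <= 0:
        return []
    if not 0 <= overlap < region_size:
        raise ValueError("overlap must be >= 0 and smaller than region_size")
    step = region_size - overlap  # assumption: adjacent regions share `overlap` bins
    regions = []
    start = 0
    while True:
        stop = min(start + region_size, n_bins)
        regions.append((start, stop))
        if stop == n_bins:
            break
        start += step
    return regions


def log_counts(values):
    """Log-transform counts after raising values below 1 to 1.

    Clamping keeps every logged value >= 0, which matches the stated goal
    of avoiding negative values after logging.
    """
    return [math.log(max(value, 1.0)) for value in values]
```

With ``region_size = 200`` and ``overlap = 100``, a 500-bin chromosome yields the regions (0, 200), (100, 300), (200, 400), and (300, 500) under these assumptions.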

|
|
|
| **##GPS##**
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|    parameter                      |              Description                                  |      Recommended value      |
+===================================+===========================================================+=============================+
| **badregionfile**\*               | Optional parameter. Omit if not used. Name of a file      | We recommend                |
|                                   | located in the main directory containing genomic          | *mm9_cent_tel,*             |
|                                   | coordinates that should not be queried by 3DNetMod. We    | *mm10_cent_tel,*            |
|                                   | supply *mm9_cent_tel*, *mm10_cent_tel*, and               | *hg19_cent_tel,*            |
|                                   | *hg19_cent_tel*, which are text files of the genomic      | or a similar file           |
|                                   | coordinates of centromeres and telomeres in the mm9, mm10,| corresponding to your       | 
|                                   | and hg19 genomes, respectively.                           | genome of interest.         |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **badregionfilter**               | *True* or *False*. If True, regions with sparse counts    | *True* is recommended       |
|                                   | are removed from further analysis.                        |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **scale**                         | *genomewide* or individual chromosome such as *chr1*,     |      N/A                    |
|                                   | *chr2*, *chrX*, *chrY* etc. Initial input .bed and HiC    |                             |
|                                   | interaction data MUST have the chromosome specified in    |                             |
|                                   | the same format (*chr1*, *chr2*, *chrX*, *chrY*, etc).    |                             |
|                                   | If *genomewide* is used initially for preprocessing       |                             |
|                                   | step, the parameter can subsequently be adjusted to an    |                             |
|                                   | individual chromosome (e.g. *chr1*) for subsequent steps. |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **plateau**                       | Minimum number of consecutive gamma values with the same  | *3* is recommended at 40 kb |
|                                   | number of communities required for the gammas to be       | 10 - 20 works best for 8 kb |           
|                                   | considered in a 'plateau' during GPS. A larger value will | or 10 kb at high read       |
|                                   | lead to fewer gamma values queried with MMCP and fewer    | depth                       |
|                                   | resulting domains detected.                               |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **chaosfilter**                   | *True* or *False*. If True, domains will be subjected to  | *True* for 40 kb or lower.  |
|                                   | chaos filtering.                                          | *False* for 10 kb or higher |                         
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **chaos_pct**\*                   | Optional parameter. If chaosfilter = *True* , this        |  *0.85* is a good starting  |
|                                   | parameter sets the stringency of the chaos filter. Should |  value to try. However,     |
|                                   | be a decimal value from 0 to 1. Large values are more     |  multiple values should     |
|                                   | stringent and will result in fewer final domains.         |  be tested.                 |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **diagonal_density**\*            | Optional parameter. Removes regions that do not exceed    |  Currently set to very      |
|                                   | percent nonzero in diagonal specified.  Default           |  stringent value which may  |
|                                   | value if not provided is 0.95.                            |  lose region of interest    |
|                                   |                                                           |  Future update will relax   |
|                                   |                                                           |  default to 0.65            |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **consecutive_diagonal_zero**\*   | Optional parameter. Removes regions where consecutive     |  Although currently set     |
|                                   | zeros in diagonal exceeds number provided.  Default value |  to 3 it can probably be    |
|                                   | if not provided is currently 3 to deal with older low     |  relaxed to 20 to capture   |
|                                   | read sparsity.                                            |  more regions.  Future      |
|                                   |                                                           |  updates will relax default |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
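
The **plateau** definition in the table can be pictured as a run-length search over a gamma sweep. The sketch below is an illustration under that reading, not the GPS implementation: ``find_plateaus`` is a hypothetical name, and how 3DNetMod picks the gamma values to pass on to MMCP from each plateau is not covered.

```python
def find_plateaus(gammas, n_communities, plateau):
    """Return runs of consecutive gammas that share one community count.

    gammas        -- gamma values in sweep order
    n_communities -- number of communities found at each gamma
    plateau       -- minimum run length for a run to count as a plateau
    """
    plateaus = []
    run_start = 0
    for i in range(1, len(gammas) + 1):
        # A run ends at the end of the sweep or when the count changes.
        if i == len(gammas) or n_communities[i] != n_communities[run_start]:
            if i - run_start >= plateau:
                plateaus.append(gammas[run_start:i])
            run_start = i
    return plateaus
```

For community counts ``[2, 2, 2, 3, 4, 4, 4, 4]`` over eight gammas, ``plateau = 3`` keeps the first three and the last four gammas, while ``plateau = 4`` keeps only the last four, which shows why larger values lead to fewer gammas being queried.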


|
|
|
| **##MMCP##**
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|    parameter                      |              Description                                  |     Recommended Value       |
+===================================+===========================================================+=============================+
| **num_part**                      | Sets the number of times the louvain-like algorithm       | *20* is recommended.        |
|                                   | is applied for a given gamma value. A consensus partition |                             |
|                                   | is determined across the **num_part** number of partitions|                             |
|                                   | per gamma value.                                          |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **plots**                         | *True* or *False*. If *True*, .png plots for Q_v_gamma,   | *False recommended*         |
|                                   | boundary variance distributions, consensus plots, etc.    |                             |
|                                   | are produced for every region.  To reduce memory costs,   |                             |
|                                   | only use *True* if testing a particular chromosome. It is |                             |
|                                   | not recommended to run genomewide with *True*.            |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **pctile_threshold**              | Filters communities with counts in boundary within a given| *0* recommended. If Hi-C    |
|                                   | percentile (pct_value) of region counts that fall below   | data have low sequencing    |
|                                   | percent threshold of bins in boundary. Keep low to remove | depth, *1* is recommended.  |
|                                   | boundaries that have high percentage of empty counts.     |                             |
|                                   | Integer value (0-100). Only use on low sequencing depth   |                             |
|                                   | HiC.                                                      |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **pct_value**                     | Counts value that fall within percentage of all counts in | *0* recommended. If Hi-C    |             
|                                   | distribution for region. Decimal value (0-1).             | data have low sequencing    |
|                                   |                                                           | depth, *0.80* is            |
|                                   |                                                           | recommended.                |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
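
The table does not spell out how the consensus across **num_part** partitions is computed. One common building block in consensus clustering is a co-assignment (agreement) matrix, sketched below as an assumption rather than as the 3DNetMod method; ``agreement_matrix`` is a hypothetical name and each partition is taken to be a list of community labels, one per bin.

```python
def agreement_matrix(partitions):
    """Fraction of partitions in which each pair of bins shares a community.

    partitions -- list of label lists, all of the same length
    """
    n_nodes = len(partitions[0])
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for labels in partitions:
        if len(labels) != n_nodes:
            raise ValueError("all partitions must label the same bins")
        for i in range(n_nodes):
            for j in range(n_nodes):
                # Only label equality matters; label values may differ
                # between partitions.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = float(len(partitions))
    return [[count / total for count in row] for row in counts]
```

Entries near 1 mark pairs of bins that are grouped together in almost every run, so a higher **num_part** gives a finer-grained estimate at the cost of more runs of the louvain-like algorithm per gamma.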

|
|
|
| **##HSVM##**
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
|    parameter                      |              Description                                  |     Recommended Value       |
+===================================+===========================================================+=============================+
| **size_threshold**                | Lower size cutoff for domains (in units of bins). Final   | Minimum of 150 kb           |
|                                   | domains called have size greater than bin cutoff.         |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **size_s1**                       | Sets the upper size limit in number of base pairs of the  | *400000* recommended. If    |
|                                   | first (or only) size stratum, L1.                         | only one size stratum is    |
|                                   | size_threshold < L1 <= size_s1.                           | used, value should be the   |
|                                   |                                                           | region size.                |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+                                
| **size_s2**\*                     | Optional. Sets the upper size limit in bases of the second| *800000* recommended        |
|                                   | stratum. Domains within the second size stratum, L2, are: |                             |
|                                   | size_s1 < L2 <= size_s2.  If not used, remove parameter   |                             |
|                                   | and key from settings file.                               |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **size_s3**\*                     | Optional. Sets the upper size limit in bases of the third | *1600000* recommended       |
|                                   | size stratum. Domains within the third size stratum, L3,  |                             |
|                                   | are: size_s2 < L3 <= size_s3. If not used, remove         |                             |
|                                   | parameter and key from settings file.                     |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **size_s4**\*                     | Optional. Sets the upper size limit in bases of the       | *3000000* recommended       |
|                                   | fourth size stratum. Domains within the fourth size       |                             |
|                                   | stratum,L4, are: size_s3 < L4 <= size_s4. If not used,    |                             |
|                                   | remove parameter and key from settings file.              |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **size_s5**\*                     | Optional. Sets the upper size limit in bases of the fifth | *12000000* recommended      |
|                                   | size stratum. Domains within the fifth size, L5, are:     |                             |
|                                   | size_s4 < L5 <= size_s5.  If not used, remove parameter   |                             |
|                                   | and key from settings file.                               |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **size_sm**\*                     | Optional. Additional strata parameters can be included.   |                             |
|                                   | Domains within m size, LM, are:                           |                             |
|                                   | size_sm-1 < LM <= size_sm.                                |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_thresh1**                   | Variance threshold on size_s1 communities (smallest). If  |  It is highly recommended   |
|                                   | a value > 0 is supplied, the variance threshold is        |  to test multiple values    |
|                                   | interpreted as  % Area under the Curve (AUC) of the       |  for this parameter.        |
|                                   | variance distribution of domains within size_s1 size.     |                             |
|                                   | If a value of 0 is supplied, the variance threshold is    |                             |
|                                   | interpreted as 0 variance.                                |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_thresh2**\*                 | Optional. Variance threshold on size_s2 communities       | It is highly recommended    |
|                                   | (second smallest). Interpreted as AUC of domains within   | to test multiple values     |
|                                   | size_s2 sizes. 100 used for Won et al. 2016. If not used, | for this parameter.         |
|                                   | remove parameter value and key from settings file.        |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_thresh3**\*                 | Optional. Variance threshold on size_s3 communities.      | It is highly recommended    |
|                                   | Interpreted as AUC of domains within size_s3 sizes. 100   | to test multiple values     |
|                                   | used for Won et al. 2016. If not used, remove parameter   | for this parameter.         |
|                                   | value and key from settings file.                         |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_thresh4**\*                 | Optional. Variance threshold on size_s4 communities.      | It is  highly recommended   |
|                                   | Interpreted as AUC of domains within size_s4 sizes. 60    | to test multiple values     |
|                                   | used for Won et al. 2016. If not used, remove parameter   | for this parameter.         |
|                                   | value and key from settings file.                         |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_thresh5**\*                 | Optional. Variance threshold on size_s5 communities.      | *0* recommended to          |
|                                   | Interpreted as AUC of domains within size_s5 sizes.       | remove compartments.        |
|                                   | If not used, remove parameter value and key from settings |                             |
|                                   | file.                                                     |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **var_threshm**\*                 | Optional. Variance threshold on size_sm communities.      |                             |
|                                   | Interpreted as AUC of domains within size_sm sizes. If not|                             |
|                                   | used, remove from settings file.                          |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+
| **boundary_buffer**               | Tolerance in difference in domain boundary coordinates    | The bin size of data        |
|                                   | across replicates for two domains to be considered        | is recommended (e.g. *40000*|
|                                   | the same domain and subject to merging                    | for 40 kb binned data)      |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+ 
| **final_consistent_domains**      | *True* or *False*. If *True*, a file of final domain calls| *True* if sample_1 and      |
|                                   | that are consistent across biological replicates (within  | sample_2 are biological     |
|                                   | boundary_buffer of genomic coordinates of upstream and    | replicates and sample_3 and |
|                                   | downstream domain boundaries) will be written and can be  | sample_4 are biological     |
|                                   | found in output/FINAL_DOMAIN_CALLS.                       | replicates.                 |
|                                   | Assumes sample_1 and sample_2 are                         |                             |
|                                   | biological replicates of the same cell type. Assumes that |                             |
|                                   | sample_3 and sample_4 are biological replicates of the    |                             |
|                                   | same cell type.                                           |                             |
+-----------------------------------+-----------------------------------------------------------+-----------------------------+

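As a rough illustration of the boundary_buffer rule in the table above, the sketch below treats two domain calls as the same domain when both the upstream and the downstream boundary differ by no more than boundary_buffer. The function name, the (start, end) tuple representation, and the inclusive comparison are assumptions made for this example; the 3DNetMod scripts may implement the comparison differently.

```python
def is_same_domain(domain_a, domain_b, boundary_buffer):
    """Return True if both boundaries of two domains agree within the buffer.

    Each domain is a hypothetical (start, end) tuple in base pairs.
    """
    start_a, end_a = domain_a
    start_b, end_b = domain_b
    return (abs(start_a - start_b) <= boundary_buffer
            and abs(end_a - end_b) <= boundary_buffer)


# Example with 40 kb bins, so boundary_buffer = 40000.
rep1 = (1000000, 1800000)
rep2 = (1040000, 1800000)   # upstream boundary shifted by one bin
rep3 = (1120000, 1800000)   # upstream boundary shifted by three bins

print(is_same_domain(rep1, rep2, 40000))  # True
print(is_same_domain(rep1, rep3, 40000))  # False
```

With final_consistent_domains set to *True*, a check of this kind would be applied between the two replicates of each cell type.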

| NOTE: the method is flexible to different numbers of size strata. For example, if only 3 size strata are desired instead of the default 5 size strata, simply use the size_s1, size_s2, size_s3, var_thresh1, var_thresh2, and var_thresh3 parameters.
| 

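The stratum rule from the table (size_sm-1 < LM <= size_sm) can be sketched with a binary search over the ascending size bounds. This is only an illustration: the function name, the example bounds, and the handling of domains larger than the last bound are assumptions, not part of the 3DNetMod code.

```python
from bisect import bisect_left


def size_stratum(domain_size, size_bounds):
    """Return the 1-based stratum m with size_s(m-1) < domain_size <= size_sm.

    size_bounds is the ascending list [size_s1, size_s2, ..., size_sm].
    Returns None when the domain is larger than the last bound.
    """
    index = bisect_left(size_bounds, domain_size)
    if index == len(size_bounds):
        return None
    return index + 1


# Hypothetical upper bounds for three strata (illustrative values only).
bounds = [10, 25, 60]
print(size_stratum(10, bounds))   # 1  (10 <= size_s1)
print(size_stratum(11, bounds))   # 2  (size_s1 < 11 <= size_s2)
print(size_stratum(60, bounds))   # 3
print(size_stratum(61, bounds))   # None
```

Using three bounds here mirrors the 3-strata example in the note above; each stratum would then be filtered with its own var_thresh value.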

**Getting Started**
=====================
| For first-time users, we recommend working through the following tutorial on the test data supplied with the code.
| **First:** Follow the Installation instructions above for downloading the code and installing necessary dependencies.
| **Second:** View the settings_TEST.txt file. This file contains the settings for the test data (chromosome 18 of wild-type and Setdb1 knock-out cells from Jiang et al. 2017). Notice that the input Hi-C interaction data and .bed files are located in the input/ directory.
| **Third:** Run 3DNetMod using the parameters supplied in the settings_TEST.txt file.

**Commands to run:**
  ::
  
  
      python 3DNetMod_preprocess.py settings_TEST.txt
      python 3DNetMod_GPS_MMCP.py settings_TEST.txt
      python 3DNetMod_HSVM.py settings_TEST.txt

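If you prefer to launch the three steps from a single script, a small wrapper along the following lines could be used. It is a sketch, not part of the 3DNetMod distribution: the helper names are hypothetical, and it assumes that the script is started from the repository directory and that `python` resolves to the interpreter of the virtual environment created during installation.

```python
import subprocess

# The three scripts, in the order listed in the tutorial above.
PIPELINE_SCRIPTS = [
    "3DNetMod_preprocess.py",
    "3DNetMod_GPS_MMCP.py",
    "3DNetMod_HSVM.py",
]


def build_commands(settings_file, python_exe="python"):
    """Return one argv list per pipeline step for the given settings file."""
    return [[python_exe, script, settings_file] for script in PIPELINE_SCRIPTS]


def run_pipeline(settings_file, python_exe="python"):
    """Run each step in order; check_call raises if a step exits non-zero."""
    for command in build_commands(settings_file, python_exe):
        subprocess.check_call(command)


# Inspect the commands without executing them.
for command in build_commands("settings_TEST.txt"):
    print(" ".join(command))
```

Calling `run_pipeline("settings_TEST.txt")` would then execute the steps and stop at the first one that fails, so a later step never runs on incomplete output.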

| **Fourth:** Try modifying some of the parameter values in the settings file. For example, try changing var_thresh1 to 100. When parameter values are changed, the existing result files are not overwritten, because each parameter value is part of the file name. When working with a new data set, it is essential to test a range of variance threshold values and chaos filter values.
| **Fifth:** Copy settings_TEST.txt to a new settings file. Update the bed_file, sample_1, counts_file_1, etc. parameters for your Hi-C data. Place your Hi-C data (interaction counts files and .bed files) into the input/ directory.
| **Sixth:** Run the above commands using your new settings file. 


**Tips for Fine Tuning Calls**
==============================
|
| **When experimenting, it is most useful to test on a single chromosome first.**  The most important parameter values to fine tune for a given data set are the variance thresholds (var_thresh1, var_thresh2, etc.). We recommend performing a full sweep of variance threshold parameter values for each size stratum in increments of 10 (e.g. for var_thresh1, try 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 while holding the values for the other levels constant at 100).  Below, we provide heatmaps of domains called with multiple variance thresholds for each size stratum across 4 different data sets of different sequencing depth and resolution. Our selected variance thresholds for each level are indicated with a green box.
|

.. image:: Variance1.png

|

.. image:: Variance2.png

|

.. image:: Variance3.png

|

.. image:: Variance4.png

|
| **The trash community filter (i.e. pct_value and pctile_threshold) should be kept at 0 as default for high read depth data**.  If you observe egregiously wrong domain calls that are driven by noise in low-complexity libraries, it is best to use a very conservative (low) pct_value and (high) pct_threshold so that only the most egregious calls are removed (e.g. a pct_value of 0.01 and a pct_threshold of 80 will remove only communities where more than 80% of boundary counts are in the lowest 1% of count values).
|
|
| **When experimenting with the region size and overlap between regions in the pre-processing step, smaller region sizes are more sensitive to smaller subTADs and less sensitive to larger TADs. Larger region sizes achieve greater sensitivity to larger TADs with less sensitivity to smaller TADs.**  However, from experience, the region size should not exceed 1200 bins due to the computational costs of re-wiring large regions and finding consensus partitions on large regions.
|
| The run time per region as a function of the number of nodes scales roughly linearly for regions of up to ~300 nodes.  Below is a table of run times for different region sizes, measured on Won et al. 2016 Hi-C data with a plateau size of 3:

|
| **Method run time for a single region of different sizes**

+-----------------------+----------------------------+-----------------------------+
|   76 node region      |   151 node region          |   301 node region           |
+=======================+============================+=============================+
|    203.5 seconds      |    147.5 seconds           |    184.1 seconds            |
+-----------------------+----------------------------+-----------------------------+

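The variance threshold sweep recommended in this section (one stratum varied from 0 to 100 in steps of 10 while the others are held at 100) can be enumerated with a few lines of Python. The sketch only generates the parameter combinations; the function name is hypothetical, and writing each combination into a settings file is left out because the file layout is described elsewhere in this README.

```python
def variance_sweep(swept_key, all_keys, step=10, held_value=100):
    """Yield one parameter dict per swept value, holding other strata fixed."""
    for value in range(0, 101, step):
        settings = {key: held_value for key in all_keys}
        settings[swept_key] = value
        yield settings


keys = ["var_thresh1", "var_thresh2", "var_thresh3"]
sweep = list(variance_sweep("var_thresh1", keys))

print(len(sweep))                  # 11 combinations: 0, 10, ..., 100
print(sweep[3]["var_thresh1"])     # 30
print(sweep[3]["var_thresh2"])     # 100
```

Repeating the call with each var_thresh key in turn gives the full per-stratum sweep; running it on a single chromosome first keeps the number of runs manageable.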

**FAQs for trouble-shooting**
=============================
**I have too few/many domains.** There are 3 main parameters that can be adjusted to increase or decrease the number of domains that are identified:

1. **plateau**: this parameter defines how frequently gamma values are sampled when identifying domains. A smaller plateau value leads to more frequent gamma sampling, and thus more domains. The plateau size is defined in units of ‘gamma_step’. For example, a ‘plateau’ of 3 and a ‘gamma_step’ of 0.01 means that three consecutive gamma values in increments of 0.01 must have the same average number of communities in order for the gamma value to be considered stable. We recommend a plateau value of 3 for high resolution data and ~8 for lower resolution data with a ‘gamma_step’ of 0.01, but if you have a low number of domains and are missing many sub-TADs, consider decreasing the plateau size. Conversely, if you want to decrease the number of domains identified, consider increasing the plateau size.
2. **var_thresh**: this parameter sets the % area under the curve of the boundary spatial variance distribution that is used to determine a variance threshold. A different variance threshold can be set for each size stratum.  Try increasing AUC for a given stratum to increase the number of domains called and decreasing AUC for a given stratum to decrease the number of domains called.
3. If you are missing many sub-domains, it may be useful to decrease the region size (**3-6 Mb appears to work in most instances, corresponding to 150 nodes in 20 kb and 40 kb binned data, respectively**).  This will decrease sensitivity to large domains and compartments and increase sensitivity to smaller domains.

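To make the plateau criterion in item 1 concrete, the sketch below scans a list of average community counts (one per gamma value, spaced by gamma_step) and reports every run of identical values that is at least `plateau` samples long. The function name and the example counts are hypothetical, and the exact equality test is a simplification; how 3DNetMod picks a gamma value within each plateau is not shown here.

```python
def find_plateaus(mean_counts, plateau):
    """Return (start_index, run_length) for each run of equal values
    that spans at least `plateau` consecutive gamma samples."""
    runs = []
    start = 0
    for i in range(1, len(mean_counts) + 1):
        # Close the current run at the end of the list or when the value changes.
        if i == len(mean_counts) or mean_counts[i] != mean_counts[start]:
            if i - start >= plateau:
                runs.append((start, i - start))
            start = i
    return runs


# Hypothetical mean community counts at consecutive gamma values.
counts = [2, 2, 3, 3, 3, 4, 5, 5, 5, 5]
print(find_plateaus(counts, 3))  # [(2, 3), (6, 4)]
print(find_plateaus(counts, 4))  # [(6, 4)]
```

In this toy example, lowering the plateau from 4 to 3 doubles the number of stable gamma ranges, which is the mechanism by which a smaller plateau yields more domains.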
| Additionally, if you have too few domains, it is possible that other settings are too stringent. Consider not using the chaos filter, decreasing the size threshold if it is set too high, and using the default pct_value (0) and pct_threshold (0). If you have too many domains, consider increasing the stringency of the chaos filter.

**What if I don't have access to a high performance computing cluster?** The 3DNetMod method can be run on a laptop if individual chromosomes are run in series. Simply change 'scale' in the settings file to the chromosome of interest and run the method on each chromosome individually. For example, you could create 23 different settings files (settings_chr1.txt, settings_chr2.txt, etc.) for 23 different chromosomes. If you have a quad-core laptop, you could run four chromosomes at the same time in 4 different terminal windows.
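
One way to organise the per-chromosome runs is to group the settings files into batches that match the number of cores. The sketch below only builds the batches; the file names follow the settings_chr1.txt pattern used as an example above, and both helper functions are hypothetical.

```python
def settings_files(n_chromosomes=23):
    """Return hypothetical per-chromosome settings file names."""
    return ["settings_chr%d.txt" % n for n in range(1, n_chromosomes + 1)]


def batches(items, batch_size):
    """Split items into consecutive groups of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Four chromosomes at a time, as suggested for a quad-core laptop.
groups = batches(settings_files(), 4)
print(len(groups))   # 6 (five groups of four, one group of three)
print(groups[0])     # ['settings_chr1.txt', 'settings_chr2.txt', 'settings_chr3.txt', 'settings_chr4.txt']
```

Each file in a batch can then be passed to the three 3DNetMod scripts in its own terminal window before moving on to the next batch.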