This folder (SNPlocs.Hsapiens.dbSNP144.GRCh38/inst/tools/) contains the
tools used for building the OnDiskLongTable directory structure contained
in this package (in inst/extdata/) from the dbSNP dump files.

dbSNP Home Page:

  https://www.ncbi.nlm.nih.gov/snp/

Here is how this OnDiskLongTable directory structure was made:

  1. Download the ds_flat_ch*.flat.gz files for chromosomes 1-22, X, Y,
     and MT from:

       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat

     You can use the download_ds_flat.sh script located in this folder
     for this.

  2. Uncompress the downloaded files.
     These uncompressed files are the "source files".
     NB: The ASN.1 flat file format (and many other formats used in
     the snp section of the FTP site) is described here:

       ftp://ftp.ncbi.nih.gov/snp/00readme.txt

  3. Check the source files with, for example:

       ./prechecking.sh path/to/ds_flat_ch16.flat

     and pay attention to the output.
     Note that the final number of SNPs per chromosome will be lower than
     the number of records tagged with "snp" because of the additional
     filtering performed in steps 5 and 6 below.

  4. Compile filter2_ds_flat.c with:

       gcc -Wall filter2_ds_flat.c -o filter2_ds_flat

  5. Adjust settings in make_tmp_files.sh and run it. This script will extract
     and curate the SNPs from the flat files (dropping those that don't satisfy
     filtering criteria 1-3 described in man/package.Rd), and dump them in
     temporary files (in the directory controlled by the TMP_DIR variable set
     in the script).
     This step took 2h08 on rhino3 (64-bit Ubuntu 14.04 with 56 cores and
     384GB of RAM) and produced a total of 133,030,779 SNPs.

  6. Adjust settings in build_OnDiskLongTable.sh and run it from the
     inst/extdata/ folder of the SNPlocs.Hsapiens.dbSNP144.GRCh38 package
     source tree. This script will read the SNPs from the temporary files
     generated in step 5, perform additional filtering (dropping those that
     don't satisfy criterion 4 described in man/package.Rd), and dump them in
     an OnDiskLongTable directory structure (this directory structure will be
     created in the current folder).
     This step took about 28 min on rhino3 and produced a total of
     133,030,779 SNPs (no SNPs were dropped from the SNPs produced at step 5).
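
As an illustration of the file set referred to in steps 1 to 3, here is a
minimal sketch that enumerates the 25 chromosome names and the URLs of the
corresponding compressed dump files. It is not the download_ds_flat.sh script
shipped in this folder. The function names (list_chromosomes, list_dump_urls)
are hypothetical, and the exact file name pattern
(ds_flat_ch<chrom>.flat.gz) is an assumption inferred from the glob in step 1
and the example path in step 3.

```shell
#!/bin/sh
# Base URL taken from step 1 above.
BASE_URL="ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat"

# Print the chromosome names, one per line: 1 to 22, then X, Y, MT.
# (Hypothetical helper, not part of the tools in this folder.)
list_chromosomes() {
    i=1
    while [ "$i" -le 22 ]; do
        echo "$i"
        i=$((i + 1))
    done
    echo X
    echo Y
    echo MT
}

# Print the URL of the compressed dump file for each chromosome.
# ASSUMPTION: files are named ds_flat_ch<chrom>.flat.gz.
list_dump_urls() {
    list_chromosomes | while read -r chrom; do
        echo "${BASE_URL}/ds_flat_ch${chrom}.flat.gz"
    done
}

list_dump_urls
```

The printed list can be fed to any download tool, and the same chromosome
loop can be reused to run ./prechecking.sh on each uncompressed file.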

NOTE: Not all SNPs are consistent with the reference genome, i.e. the ambiguity
letter associated with a SNP (which represents its alleles with respect to the
plus strand) is not necessarily compatible with the nucleotide found in GRCh38
at the position reported for the SNP. For example, on chromosome 1,
5561 of 10352408 SNPs (0.054%) have alleles that are inconsistent with the
reference sequence. To get the number of inconsistent SNPs on chromosome 1:

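  library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
  snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
  ## Get all the SNPs on chromosome 1 and concatenate their ambiguity
  ## letters into a single DNAString.
  chr1_snps <- snpsBySeqname(snps, "1")
  chr1_alleles <- mcols(chr1_snps)$alleles_as_ambig
  chr1_alleles <- DNAString(paste(chr1_alleles, collapse=""))

  library(BSgenome.Hsapiens.UCSC.hg38)
  genome <- BSgenome.Hsapiens.UCSC.hg38
  ## Count the positions where the reference base does not match the
  ## ambiguity letter (fixed=FALSE makes IUPAC ambiguity codes act as
  ## wildcards).
  neditAt(genome$chr1[pos(chr1_snps)], chr1_alleles, fixed=FALSE)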
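
To make the notion of "compatible" concrete, here is a simplified sketch of
the per-position test: a reference base is consistent with a SNP if it is one
of the bases encoded by the SNP's IUPAC ambiguity letter. This only
illustrates the idea and is not how Biostrings implements the comparison.
The function names (iupac_bases, is_consistent) are hypothetical, and the
sketch assumes the reference base is one of A, C, G, T (an N in the
reference is reported as an invalid input).

```shell
#!/bin/sh
# Print the bases represented by an IUPAC nucleotide ambiguity letter.
# Returns 1 for an unknown letter. (Hypothetical helper.)
iupac_bases() {
    case "$1" in
        A) echo A ;;   C) echo C ;;   G) echo G ;;   T) echo T ;;
        M) echo AC ;;  R) echo AG ;;  W) echo AT ;;
        S) echo CG ;;  Y) echo CT ;;  K) echo GT ;;
        V) echo ACG ;; H) echo ACT ;; D) echo AGT ;; B) echo CGT ;;
        N) echo ACGT ;;
        *) return 1 ;;
    esac
}

# is_consistent REF_BASE AMBIG_LETTER
# Exit status 0: REF_BASE is one of the alleles encoded by AMBIG_LETTER.
# Exit status 1: it is not (the SNP is inconsistent with the reference).
# Exit status 2: invalid input (ASSUMPTION: REF_BASE must be A, C, G or T).
is_consistent() {
    case "$1" in
        A|C|G|T) ;;
        *) return 2 ;;
    esac
    bases=$(iupac_bases "$2") || return 2
    case "$bases" in
        *"$1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# R encodes the alleles A and G.
is_consistent A R && echo "A is consistent with R"
is_consistent C R || echo "C is inconsistent with R"
```

Counting the positions for which this test fails over a whole chromosome is
what the neditAt() call above reports.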

