High-priority TODO list
=======================

o Document makeogrannote() and related utilities.

o Investigate BIG inconsistency between local IgBLAST and web IgBLAST:
  https://www.ncbi.nlm.nih.gov/igblast/

  Try running the former with the -remote option to execute the search
  remotely, and compare the results. Collect as much data on this issue as
  possible and send an email to the IgBLAST folks at NCBI
  (blast-help@ncbi.nlm.nih.gov or nlm-support@nlm.nih.gov).

o Mismatch/indel summarization:

  The goal is to summarize information about the mismatches and indels
  between the query sequences (BCR or TCR nucleotide sequences) and the
  germline gene allele sequences that they're aligned to.

  Deliverables:

  - tabulate_mismatches(), tabulate_insertions(), tabulate_deletions():
    take the AIRR-formatted data.frame and return a matrix of counts
    with one row per query sequence (i.e. one row per row in the data.frame)
    and 7 columns: fwr1, cdr1, fwr2, cdr2, fwr3, cdr3, fwr4.
    By default, the counts are for the mismatches/indels at the nucleotide
    level. What we count is the number of nucleotides involved in
    mismatches or indels, not the number of events. For example, an
    insertion of 3 nucleotides counts for 3, not for 1.

  - Discussed at Hyrien lab meeting on Oct 29:
    (a) support summarization at amino acid level
    (b) report % identity per CDR/FR region

  Questions:

  - Should we also add columns for the V, D, J, C regions?
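
  As a starting point for the counting rule described above (nucleotides
  involved, not events), here is a minimal sketch in base R. It assumes that
  the query and germline are available as two gapped strings of equal length
  with "-" marking gaps, as in the AIRR 'sequence_alignment' and
  'germline_alignment' fields. The helper name count_aln_events() is
  hypothetical, and the sketch works on a whole alignment rather than per
  region.

```r
## Hypothetical helper: count the nucleotides involved in mismatches,
## insertions, and deletions between two gapped, equal-length strings.
## Assumption: "-" is the only gap character.
count_aln_events <- function(query_aln, germline_aln) {
    stopifnot(nchar(query_aln) == nchar(germline_aln))
    q <- strsplit(query_aln, "", fixed=TRUE)[[1]]
    g <- strsplit(germline_aln, "", fixed=TRUE)[[1]]
    ins <- g == "-" & q != "-"  # letter in query, gap in germline
    del <- q == "-" & g != "-"  # gap in query, letter in germline
    mm <- q != g & q != "-" & g != "-"  # two different letters
    c(mismatches=sum(mm), insertions=sum(ins), deletions=sum(del))
}

## One mismatch, one 3-nucleotide insertion, one 1-nucleotide deletion:
count_aln_events("ACGTTTAC-T",
                 "ACCT---CGT")
## mismatches=1, insertions=3, deletions=1
```

  Producing the 7 per-region columns would additionally require mapping each
  alignment column to its region, which this sketch does not attempt.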

o igbrowser() improvements:
  - Display pairwise alignment between BCR query sequence and germline
    V/D/J/C sequences.
  - Take a look at visualization tool from IMGT/V-QUEST for inspiration.

o Maybe implement the following advice given by IgBLAST when using one
  of the num_alignments_V/D/J arguments:

  Warning messages:
  1: In .parse_and_issue_warnings(stderr_file) :
    Warning: To obtain better run time performance, please run blastdb_aliastool
    -seqid_file_in <INPUT_FILE_NAME> -seqid_file_out <OUT_FILE_NAME> and use
    <OUT_FILE_NAME> as the argument to -seqidlist


Things to do at BioC 3.22 release time
======================================

o Update README.md:
  - Update "Install and load igblastr" section.

o Advertise igblastr:
  - Announce on various bioc-community Slack channels.
  - Announce on the FH-Data Slack (fhdata.slack.com) on channels
    #r-user-comm and #general.
  - Announce on LinkedIn.
  - Try to get an entry in the next R Journal advertising igblastr.
  - Bioinformatics accepts short articles introducing new software.


Low-priority TODO list
======================

o Older versions of IgBLAST (e.g. version 1.19.0) don't necessarily include
  the same data as the latest version (version 1.22.0). However, we
  always initialize igblastr_cache(LIVE_IGDATA) with the content of
  inst/extdata/igdata_store/ (a.k.a. "the igdata store") regardless of what
  version of IgBLAST is used by igblastr (note that the igdata store contains
  the data included in IgBLAST 1.22.0). This means that, if we use a version
  of IgBLAST that includes different data, then igblastr_cache(LIVE_IGDATA)
  in its original state already differs from the data included in the
  IgBLAST that we are using.
  There are some problems with this:
  1. The content of igblastr_cache(LIVE_IGDATA) is not guaranteed to be
     compatible with all versions of IgBLAST.
  2. If we're using a version of IgBLAST that includes data that differs
     from the igdata store, then from the very start (i.e. before
     any run of update_live_igdata()), igdata_info() shows differences
     between live and original auxiliary files. Running reset_live_igdata()
     of course doesn't help because it's a no-op when
     igblastr_cache(LIVE_IGDATA) is in its original state.
  3. Running update_live_igdata() doesn't help either because the updates
     available at NCBI are for the latest version of IgBLAST.
  Proposed solution:
  (a) Implement an internal helper (e.g. compatible_with_igdata_store())
      that can quickly compare the data included in an installation of
      IgBLAST with the igdata store. Should return TRUE or FALSE to indicate
      whether the data is identical or not.
  (b) When using a non-compatible IgBLAST (could be an internal or external
      installation), reset igblastr_cache(LIVE_IGDATA) with the data included
      in that IgBLAST.
      More generally, calling set_igblast_root() should:
      - reset igblastr_cache(LIVE_IGDATA) with the data included in the
        selected IgBLAST;
      - raise an error if the internal_data/ or optional_file/ subdir is
        missing in the selected IgBLAST;
      - print a message that suggests running update_live_igdata() only
        when selecting a compatible IgBLAST (message should be similar to
        what .print_tip_if_live_igdata_needs_check() does when is.infinite(dt),
        see R/zzz.R).
      Open question: When do we do the above if the IgBLAST to use is an
      external IgBLAST installation selected via IGBLAST_ROOT?
  (c) Disable update_live_igdata() if we're using a non-compatible IgBLAST.
      In this case, the function should raise an error with an error message
      that explains the situation.
  (d) The .onLoad() hook should only print the "igblastr tip" if igblastr
      already has access to an IgBLAST installation (i.e. if get_igblast_root()
      works) and if that installation is compatible with the igdata store.
      Note that since update_live_igdata() is disabled when the selected
      IgBLAST is non-compatible, time_since_live_igdata_last_checked() will
      always return Inf in that case, but we shouldn't even need to call
      time_since_live_igdata_last_checked() in that case.
  (e) Change what igdata_info() displays when the selected IgBLAST is
      non-compatible. For example 'last_checked:' and 'last_updated:' could
      display something like 'checking is disabled' and 'updating is disabled'.
      Or don't display these fields.
  (f) About install_igblast(): ncbi-igblast-1.22.0+.dmg is missing the
      internal_data/ and optional_file/ folders but install_igblast() is
      able to fix this.
      Make sure that the fix is applied only when installing version 1.22.0.
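
  The comparison helper proposed in (a) could be built on top of something
  like the following sketch, which checks whether two directories contain
  the same files with the same content. The function name
  same_dir_content() and the choice of MD5 checksums are assumptions made
  for illustration only. Which subdirs of an IgBLAST installation to compare
  with the igdata store is left open.

```r
## Hypothetical helper: TRUE if 'dir1' and 'dir2' contain the same set of
## relative file paths and each pair of files has the same MD5 checksum.
same_dir_content <- function(dir1, dir2) {
    stopifnot(dir.exists(dir1), dir.exists(dir2))
    files1 <- sort(list.files(dir1, recursive=TRUE))
    files2 <- sort(list.files(dir2, recursive=TRUE))
    if (!identical(files1, files2))
        return(FALSE)  # the two trees don't even list the same files
    md5_1 <- unname(tools::md5sum(file.path(dir1, files1)))
    md5_2 <- unname(tools::md5sum(file.path(dir2, files2)))
    identical(md5_1, md5_2)
}
```

  Note that list.files() skips hidden files by default, so this sketch
  ignores them.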

o Migrate code in R/AIRR-utils.R from OGRDB API v1 to OGRDB API v2.
  See OGRDB API v2.0.0 Guide here:
  https://github.com/airr-community/ogrdb/blob/master/schema/ogrdb_api_v2_guide.md

o Add igblastp(), a wrapper to the igblastp standalone executable included
  in IgBLAST. Requested by Dr Iman Haddad in an email from Aug 12, 2025.

o Add 'clonotype_out' arg to igblastn(). Add examples in man page and
  vignette that use this functionality.

o It was mentioned that some people use mixeR to analyse TCR sequences.
  How does this compare to using igblastn(..., ig_seqtype="TCR")?

o Add functionality to install/use the updated internal and/or auxiliary
  files that are sometimes made available at:
    https://ftp.ncbi.nih.gov/blast/executables/igblast/release/patch/
  See https://ncbi.github.io/igblast/cook/How-to-set-up.html for the details.

o Add bibliography to vignette. See AuthoringRmdVignettes.Rmd vignette in
  BiocStyle for how to do this.

o Add Seqinfo to Imports (but wait until BioC 3.23 for that). Note
  that we'll still need GenomeInfoDb just for list_ftp_dir().

o Clarify provenance of 1279067_1_Paired_sequences.fasta.gz and its licence.
  Give appropriate credit. See https://opig.stats.ox.ac.uk/webapps/oas/

o More investigation to assess the consequences of using the static auxiliary
  data included in IgBLAST.

o Figure out a way to automatically stamp AIRR germline dbs with a
  version number that makes it possible to go back in time when needed.

o One should be able to pass the name of an IMGT germline db to
  install_IMGT_germline_db(), or a vector of names.

o Improve read_igblastn_fmt7_output.Rd man page (e.g. document customized
  format 7 and list_outfmt7_specifiers()) as well as associated unit tests (in
  tests/testthat/test-outfmt7-utils.R).

o Make 'num_threads' an explicit argument with a default of 4. The doc should
  show how to specify a higher but still reasonable custom value based on
  detectCores().
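
  One possible way to derive such a value is sketched below. The helper
  name suggest_num_threads() and the cap of 8 threads are arbitrary
  assumptions. parallel::detectCores() can return NA, so the sketch falls
  back to a single thread in that case.

```r
## Hypothetical helper: leave one core free, never go below 1 thread,
## and never exceed 'max_threads' (8 is an arbitrary cap).
suggest_num_threads <- function(max_threads=8L,
                                ncores=parallel::detectCores()) {
    if (is.na(ncores))
        return(1L)  # number of cores unknown: be conservative
    as.integer(max(1L, min(max_threads, ncores - 1L)))
}

suggest_num_threads(ncores=4L)   # 3
suggest_num_threads(ncores=64L)  # 8
```

  Once 'num_threads' is an explicit argument, the result could be passed
  along with something like num_threads=suggest_num_threads().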

o Parse $footer part of output format 7.

o Implement parsing of output formats 3 and 4?

o Set environment variable IGDATA so that IgBLAST can find the internal_data
  directory. Note that IGDATA must be set to the **parent** directory of the
  internal_data directory, not to the internal_data directory itself.
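
  A minimal sketch of how this could be done from R is shown below. The
  helper name set_igdata_from_internal_data() is hypothetical. The only
  thing taken from the note above is that IGDATA gets the parent directory
  of internal_data.

```r
## Hypothetical helper: given the path to an internal_data directory,
## set IGDATA to its parent directory and return that path invisibly.
set_igdata_from_internal_data <- function(internal_data_dir) {
    stopifnot(dir.exists(internal_data_dir),
              basename(internal_data_dir) == "internal_data")
    parent <- dirname(normalizePath(internal_data_dir))
    Sys.setenv(IGDATA=parent)
    invisible(parent)
}
```

  Sys.setenv() only affects the current R session and the processes that it
  launches afterwards, so the variable would need to be set before invoking
  the IgBLAST executable.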

o Great resource for how to use AIRR Community Reference germline sets with
  IgBLAST: https://williamdlees.github.io/receptor_utils/_build/html/airrc_sets_with_igblast.html
  In particular, the author seems to be using version 2 of the OGRDB REST API:
    https://ogrdb.airr-community.org/api_v2
  but where is this API documented?
  All the download utilities implemented in igblastr/R/AIRR-utils.R use
  the OGRDB REST API at
    https://ogrdb.airr-community.org/api
  which is poorly documented and is somewhat confusing (see below).

o Investigate the following mysteries about the germline sets provided
  by AIRR/OGRDB:

  1. The OGRDB API at https://ogrdb.airr-community.org/api/ allows downloading
     the germline sequences in 2 formats: ungapped or ungapped_ex.
     Which format is appropriate to use with IgBLAST?
     Note that downloading germline sets directly by clicking on
     the "FASTA Ungapped" links here
       https://ogrdb.airr-community.org/germline_sets/Homo%20sapiens
     or
       https://ogrdb.airr-community.org/germline_sets/Mus%20musculus
     seems to retrieve the "ungapped_ex" sequences for Human and the "ungapped"
     sequences for Mouse. Confusing!

  2. For some Mouse strains, OGRDB seems to provide germline sequences
     only for a limited number of loci/groups. For example for strain A/J,
     only sequences from the light chain (i.e. groups IGKV, IGKJ, IGLV,
     and IGLJ) seem to be available.
     See https://ogrdb.airr-community.org/germline_sets/Mus%20musculus

o Implement install_AIRR_germline_db(). Will download the germline sequences
  from https://ogrdb.airr-community.org/ (link provided by Kellie).

