gia: Genomic Interval Arithmetic

0.2

2 months ago

Changelog

File Support

Native support for all types of interval files

  1. BED3
  2. BED4
  3. BED6
  4. BED12
  5. BedGraph
  6. Generic BED (BED3 + columns)
  7. GTF

Specialized functions for HTSlib data structures

  1. BAM
  2. VCF / BCF

Auto-determined Naming Schemes

  1. Users do not need to specify whether files are named or unnamed.
  2. The file format is determined automatically and defaults to generic BED if it cannot be identified.
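As a rough illustration of what automatic detection can look like, the sketch below guesses a BED flavor from the column count of the first record and treats a file as "named" when its chromosome column is not an integer. This is only a sketch under stated assumptions: `GuessedFormat`, `guess_format`, and `is_named` are hypothetical names, not gia's API, and the heuristic ignores BedGraph (which also has four columns) and GTF.

```rust
/// Hypothetical format labels for illustration only; not gia's actual types.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GuessedFormat {
    Bed3,
    Bed4,
    Bed6,
    Bed12,
    GenericBed,
}

/// Return the first line that is neither empty nor a `#` comment.
fn first_record(text: &str) -> Option<&str> {
    text.lines()
        .find(|l| !l.trim().is_empty() && !l.starts_with('#'))
}

/// Guess the format from the number of tab-separated columns,
/// falling back to generic BED when the count is not recognized.
pub fn guess_format(text: &str) -> GuessedFormat {
    match first_record(text).map(|line| line.split('\t').count()) {
        Some(3) => GuessedFormat::Bed3,
        Some(4) => GuessedFormat::Bed4,
        Some(6) => GuessedFormat::Bed6,
        Some(12) => GuessedFormat::Bed12,
        _ => GuessedFormat::GenericBed,
    }
}

/// Assumption: a file is "named" when its chromosome column
/// does not parse as an unsigned integer.
pub fn is_named(text: &str) -> bool {
    match first_record(text) {
        Some(line) => line
            .split('\t')
            .next()
            .map_or(true, |chr| chr.parse::<usize>().is_err()),
        None => true,
    }
}
```

For example, `guess_format("chr1\t10\t20\n")` yields `GuessedFormat::Bed3`, and a five-column record falls back to `GuessedFormat::GenericBed`.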

BGZIP Support

  1. FASTA
  2. VCF

Utilities

Current List of Utilities for Native BED Files

  1. Closest
  2. Cluster
  3. Complement
  4. Coverage
  5. Extend
  6. Flank
  7. Get Fasta
  8. Intersect
  9. Merge
  10. Random
  11. Sample
  12. Segment
  13. Shift
  14. Sort
  15. Spacing
  16. Subtract
  17. Unionbedg
  18. Window

Specialized HTSlib Utilities

  1. BAM Convert
  2. BAM Coverage
  3. BAM Filter
  4. BCF Filter

Stranded Methods

  1. Closest
  2. Coverage
  3. Extend
  4. Flank
  5. Get Fasta
  6. Intersect
  7. Merge
  8. Subtract
  9. Window
  10. BAM Coverage
  11. BAM Filter
  12. BCF Filter

Multiple Inputs

  1. Closest
  2. Coverage
  3. Intersect
  4. Subtract
  5. Window

Commit Changelog

🚀 Features

File Support

  • Added bed6 support for get_fasta
  • Implement bed6 support for merge subcommand. added format argument to cli
  • Increment bedrs version to 0.1.10
  • Implement Coordinates for references to NumericBed6 for code generalization
  • Made intersect (inplace) compatible with bed6 file format inputs. also refactored internal function calls to have streamed match branch selection inside the function instead of within main
  • Add support for bed6 with subtract submodule.
  • Add a reorder trait which operates on different coordinate types
  • Add an extra method onto NumericBed6 to return and update the name
  • Added a named argument to random to allow for named genome inputs. using new refactored genome struct
  • Added compression threads and level as global arguments to the cli
  • Update bedrs version and added rayon feature
  • Added parallel sorting argument on sort
  • Added support for bed12 file format
  • Added 3 column output as a function within writenamediter for merging
  • Update gia version to 0.2
  • Added auto determination of string/numeric format with BedReader
  • Added flanking function to gia with an optional genome file
  • Added percentage to flank command
  • Added shifting subcommand to gia
  • Added window overlaps as subcommand
  • Added an interval depth structure for fast serde serialization
  • Added a naive implementation of coverage
  • Added direct type conversion to inputs
  • Use direct type conversion in closest
  • Added bed4 as an auto-determined input format
  • Added an ambiguous input format which reads in all 3+ columns into a tab-delim string
  • Added a split translater which keeps an internal translator for the chr and metadata separately
  • Added a split translater which contains two internal translaters. one for the chr translation and one for the meta translation. During sorting, only the chr translater is sorted which heavily reduces the amount of keys to reorder
  • Skip commented lines in input matching
  • Added in gtf set parsing
  • Added in reading functions for gtf
  • Added spacing to the cli
  • Added implementation for spacing - as well as a type for spacing interval outputs which appends a Score to the TSV to incorporate dots for Nulls
  • Added a command which wraps the segmentation algorithm
  • Added a unionbedg command to cli as well as a shared multiinput which accepts gt 1 filename
  • Added a bedreader over bedgraph files
  • Implementation of the unionbedg algorithm using a union over the bed sets, segmentation, then intersections
  • Added a specific writing utility for segments with variable score slices used in unionbedg without reinit csv writers and flushing
  • Added bedgraph to generic dispatch mechanisms
  • Added a cluster command which uses the depth interval struct for writing out
  • Added noodles for bam parsing
  • Added a bam subcommand with an internal convert subcommand to convert bam into bed
  • Added an unimplemented warning with bail instead of panic
  • Added bam output options
  • Added cli interface for mixed inputs bam/bed and a filter command which can be used to select bam intervals that meet overlap criteria
  • Moved bam parsing functions into a shared utility directory
  • Added new dispatch for bam and header with variable bed format
  • Added convenience tool for pulling chr idx directory without specifying a group
  • Implementation of the bam filtering algorithm given an interval file as b
  • Added invert as an output predicate to bam filter. It is a bit slower than bedtools, so should compare whether noodles or htslib is faster for writing
  • Added htslib and removed noodles
  • Added a vcf filtering method - borrowing API of rust_htslib-bam centric methods. Also renamed some overlapping namespaces to delineate bam and vcf origins
  • Added clone derive for all subcommand args
  • Add both format and compression status to single output format
  • Added stranded method to Growth to propagate stranded methods to flank, window, extend
  • Added stranded and specific stranded methods to merge. Also added a demote parameter so that merge will by default return the same output format as input format but can be demoted to bed3 if specified
  • Added stranded methods to bam filter
  • Added strandedness to closest and match bedrs-2.0 api for call
  • Added a bam coverage command which accepts a BAM/BED input and counts the number of BAM records that overlap at the BED record
  • Added thread count to bam coverage when reading bam
  • Added threaded option to bam convert
  • Refactored get-fasta into a module and write a bgzip get-fasta using rust-htslib
  • Changed b to allow multiple inputs and set up a ranking system for type demotion
  • Build dispatch with multiple rhs option
  • Added multiple b-file concatenation to all dual input commands
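The unionbedg entry above describes a union over the BED sets, followed by segmentation and then intersections. A minimal sketch of that idea on a single chromosome might look like the following. The `Record` alias and the `union_bedgraph` function are hypothetical and are not taken from gia or bedrs; the sketch also assumes half-open coordinates and that records within one input do not overlap.

```rust
/// Hypothetical record: (start, end, score), half-open, on one chromosome.
type Record = (u64, u64, f64);

/// For every segment bounded by consecutive breakpoints, report the score
/// each input contributes, or `None` where that input has no coverage.
pub fn union_bedgraph(inputs: &[Vec<Record>]) -> Vec<(u64, u64, Vec<Option<f64>>)> {
    // 1. Union: pool every start and end coordinate from all inputs.
    let mut points: Vec<u64> = inputs
        .iter()
        .flatten()
        .flat_map(|&(start, end, _)| [start, end])
        .collect();
    points.sort_unstable();
    points.dedup();

    let mut segments = Vec::new();
    // 2. Segmentation: consecutive breakpoints bound each candidate segment.
    for pair in points.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        // 3. Intersection: a record either contains the segment or misses it,
        //    because all record boundaries are breakpoints.
        let scores: Vec<Option<f64>> = inputs
            .iter()
            .map(|set| {
                set.iter()
                    .find(|&&(s, e, _)| s <= start && end <= e)
                    .map(|&(_, _, score)| score)
            })
            .collect();
        // Skip gaps that no input covers.
        if scores.iter().any(Option::is_some) {
            segments.push((start, end, scores));
        }
    }
    segments
}
```

With inputs `[(0, 10, 1.0), (20, 30, 2.0)]` and `[(5, 25, 3.0)]`, the segment from 5 to 10 carries both scores, while the segment from 10 to 20 carries only the second input's score. The linear scan per segment keeps the sketch short; a sorted sweep would be the natural choice for large inputs.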

๐Ÿ› Bug Fixes

  • Fix bug in tests where wrong read function was imported
  • Modify tests for new shorthand
  • Fix keyword 'about' to 'description'
  • Sort was not retranslating the name field of bed6
  • Allow named chr names in input genome file to extend
  • Update cli to remove argument bounds on inverse for windows
  • Update tests to use new cli
  • Fix bug in tests where columns were being split on newline instead of tab
  • Bug where intersect was skipping sorting file pairs
  • Force meta intervals to always be named because their metadata must always be interpreted as a string
  • Update subtract tests to follow inheritance rules of scores. remove score types from generic
  • Update segment ordering to match bedtools ordering
  • Take explicit end of vcf for structural variants
  • Rename stream in tests
  • Update formatting
  • Update dual generics on StrandedBed3 to match bedrs-2.0 development

🚜 Refactor

  • Remove write_records, write_named_records, and write_records_with by implementing WriteIterImpl for references to coordinates
  • Folded write_records and all associated versions into the WriteIter trait to avoid handling multiple versions of essentially the same code. Needed to also handle generic translaters for this.
  • Needed to specify a specific type to the None in intersect write
  • Remove dead code for format set - will implement in a future version in a different branch
  • Move internal read methods to private to limit number of public read methods
  • Since unnamed iter was already generalized it didn't make sense to include it in bed3. instead I created a new file 'iter' and reexport it from there publicly
  • Create a new struct for a genome with multiple build styles
  • Allow genome to accept an externally provided translater in cases where named bed inputs are read in first
  • Take an external compression threads and level so they are not fixed at compile time
  • Use full rust version of gzp to avoid external cmake dependency
  • Use bedrs 0.2 for lib
  • Remove all mentions of Containers and use IntervalContainer structs instead
  • Update tests to use IntervalContainer structs instead
  • Use bedrs built-in types instead of custom-spun bed6 and bed12 as well as GenomicIntervals
  • Create a BedReader struct which handles file IO and autodetermines input format
  • Include flate2 for input instead of niffler
  • Remove InputFormat impl as it is rolled into the BedReader
  • Use BedReader for sort module
  • Update sample to use new input/field format scheme and bedreader
  • Updated merge module with new format inputs and also generalized streaming iterator to all unnamed file formats
  • Update get_fasta with new formats and generalize initialization of fasta and interval reader before writing
  • Update extend to use new input format specs
  • Major refactor of intersection to handle mixed file formats and using the bedreader struct
  • Refactor closest to use mixed file formats - required rewriting the pairs struct to handle mixed interval types as well as named conversions
  • Update subtract to use mixed file formats and dispatch pattern. can fully remove overlaps module now since that is handled internally by bedrs
  • Remove all old read pairs code since it is handled better via dispatch and bedreader
  • Update extend methods to use built-in bounds
  • Use built-in methods for calculating percentages and bounding extensions in bedrs
  • Used owned find iter to avoid constant rebuffering of output
  • Move cli to separate module
  • Have closest use new argument dispatching and argument folding
  • Update complement to use new cli flattening
  • Update coverage with new flattened arg structure and introduced an overlap_predicates that can be shared
  • Update extend to use new flattened arg structure and introduced percentages
  • Update flank to use new flattened arg structure and introduced percentages
  • Update flank to set left bound at zero
  • Remove name map completely
  • Update get_fasta with new flattened arg structure
  • Update intersect and merge with flattened cli
  • Update random with flattened cli
  • Update sample with flattened cli
  • Update shift with flattened cli
  • Update sort with flattened cli
  • Update windows with flattened cli
  • Update subtract with flattened cli and unnested the module
  • Refactor closest to use args,params design
  • Remove all single-match dispatches and created a macro to handle expressions
  • Refactor all dispatch methods using macros
  • Update file names to avoid nested mod conflicts
  • Update dispatch_pair to use internal crate to avoid importing multiple macros at call sites
  • Rename get_handle to get_writer for clarity
  • Rename all output_handle to writer
  • Put translater into separate module and split sub structs and traits into files
  • Used dispatch_single macro
  • Made implementation of Reorder generic to avoid code bloat
  • Update all bam IO with htslib for increased performance
  • Rename all VCF naming instances to BCF to emphasize binary implementations over plaintext
  • Use updated query API from bedrs-2.0 and build wrapper over StrandMethod that is CLI parseable
  • Cleaner implementation using find
  • Major refactor of the IO system using macros to build up similar functionality between the differently named file io functions
  • Use bedreader functions for complement instead of direct read func
  • Move all dispatch methods into separate submodule and use nested macros for better shared behavior and less redundant code
  • Put shared import into dispatch to reduce imports at each command
  • Clean up macro system with single dispatch using nested macros in match to autogenerate functions
  • Remove old single dispatch methods
  • Move paired into separate module for macros
  • Use nested macro system for multi and standard input dispatch
  • Include importing input format in dispatch to clean up imports in all commands
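Several entries above replace hand-written match statements with dispatch macros. The general pattern can be sketched as a macro that matches on a runtime format value and calls a generic function with the corresponding concrete type. Everything below (`Format`, `Record`, `Bed3`, `Bed6`, `dispatch_format!`, `column_count`) is a hypothetical stand-in and does not reproduce gia's actual macros or types.

```rust
/// Hypothetical runtime tag for the input format.
#[derive(Debug, Clone, Copy)]
pub enum Format {
    Bed3,
    Bed6,
}

/// Hypothetical trait implemented by each concrete record type.
pub trait Record {
    const COLUMNS: usize;
}

pub struct Bed3;
pub struct Bed6;

impl Record for Bed3 {
    const COLUMNS: usize = 3;
}
impl Record for Bed6 {
    const COLUMNS: usize = 6;
}

/// A generic function that would otherwise need one match arm per call site.
pub fn column_count<R: Record>() -> usize {
    R::COLUMNS
}

/// Expand to a match that instantiates `$func` with the concrete type
/// belonging to the runtime `$format` value.
macro_rules! dispatch_format {
    ($format:expr, $func:ident) => {
        match $format {
            Format::Bed3 => $func::<Bed3>(),
            Format::Bed6 => $func::<Bed6>(),
        }
    };
}

/// Call site: a single macro invocation instead of a repeated match.
pub fn columns_for(format: Format) -> usize {
    dispatch_format!(format, column_count)
}
```

Adding a new format under this scheme means touching the enum and the macro once, rather than every command that branches on the format.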

📚 Documentation

  • Added documentation on README (#42)
  • Link to documentation site on the README
  • Remove old documentation before lexicographical sorting of chr
  • Added documentation to bedreader struct
  • Updated help menu to provide useful demarcations between option types following the subcommand flatten pattern
  • Added documentation of output format choices and set default to bcf
  • Added documentation to fasta format to reflect gzip
  • Added documentation to macro system on bedreader
  • Update help text of multiple bed input to b

🎨 Styling

  • Change fasta header to better separate the different fields of bed6
  • Cargo fmt
  • Cargo formatting
  • Use same short char for input format across subcommands
  • Replace write methods with correct writeln and write_all
  • Update extend to follow similar format as other functions with format and added growth warning args
  • Update tests to avoid checks

🧪 Testing

  • Included tests of bed6 file format in get fasta.
  • Added testing support for bed6 merge subcommand
  • Add tests for gia sort with bed6 format
  • Add test for intersect with bed6 input file format
  • Add test for bed6 file format to match expected
  • Add test for correct retranslation of name field
  • Added testing for named and unnamed genome IO
  • Added testing for gzip and bgzip IO
  • Added testing for par sort
  • Added testing for bed12 for applicable commands
  • Fixed testing for new cli for get fasta merge and sample
  • Updated testing for closest - no longer returning intervals that do not have a closest match instead of generating a default
  • Removing old commented code that is redundant to tests
  • Added testing functions for flank
  • Added testing functions for shift
  • Added testing for windows subcommand
  • Added testing for coverage
  • Update tests for subtract to reflect csv output of floats
  • Added testing suite for spacing subcommand
  • Added testing for gia segment
  • Added testing for unionbedg
  • Added testing for cluster
  • Added testing for bam conversion to bed and for filtering bam files given a bed file
  • Added testing for vcf and bcf file formats
  • Update tests to use demote flag
  • Added testing for gzip fasta input
  • Added testing for different get-fasta flags

โš™๏ธ Miscellaneous Tasks

  • Increment patch version
  • Increment patch version
  • Update Cargo.toml to include required keywords for publish on crates (#44)
  • Publish gh-pages on push to main
  • Added homepage to toml and bump version
  • Make new genome struct visible through crate
  • Update random Genome instantiation to accept an external translater
  • Remove unused code for old genome io
  • Bump patch
  • Update all match_output signatures to include compression level and compression threads and propagate up the function signatures
  • Bump patch
  • Bump patch
  • Added formatting to CI
  • Remove old commented code
  • Bump patch version
  • Update patch version - large changes to internal structure but very little changes to the cli
  • Update patch version
  • Update patch version
  • Update patch version
  • Update patch version
  • Update dependency versions
  • Update patch version
  • Update patch version
  • Update patch version
  • Update gia to match bedrs and no longer use GenomicInterval
  • Updated Score functionality for bed6 and bed12
  • Update patch version
  • Remove dead code
  • Update dependencies
  • Update patch version
  • Bump patch version
  • Bump patch version
  • Update formatting
  • Bump patch version
  • Update complement to use args.params pattern
  • Update random to use args.params pattern
  • Bump patch version
  • Update patch version
  • Update patch version
  • Update patch version
  • Update patch version
  • Update patch version
  • Update bedrs to live 2 version