gia: Genomic Interval Arithmetic

0.2

2 months ago

Changelog

File Support

Native support for the following interval file formats

BED3
BED4
BED6
BED12
BedGraph
Generic BED (BED3 + columns)
GTF

Specialized functions for HTSlib data structures

BAM
VCF / BCF

Auto-determined Naming Schemes

Users do not need to specify whether files are named or unnamed.
The file format is determined automatically and defaults to generic BED when it cannot be determined.
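
As a rough illustration of the fallback behavior, the sketch below guesses a format from the column count of the first non-comment line and falls back to generic BED otherwise. The names `GuessedFormat`, `guess_format`, and `is_named` are hypothetical and are not gia's API. gia's actual detection may use different rules. Column count alone cannot, for example, tell BED4 apart from BedGraph.

```rust
/// Hypothetical format labels for this sketch; these are not gia's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GuessedFormat {
    Bed3,
    Bed4,
    Bed6,
    Bed12,
    GenericBed,
}

/// Guess a format from the first non-empty, non-comment line of `text`.
/// Returns `None` when no such line exists or it has fewer than three columns.
pub fn guess_format(text: &str) -> Option<GuessedFormat> {
    let line = text
        .lines()
        .find(|l| !l.trim().is_empty() && !l.starts_with('#'))?;
    match line.split('\t').count() {
        0..=2 => None,
        3 => Some(GuessedFormat::Bed3),
        4 => Some(GuessedFormat::Bed4),
        6 => Some(GuessedFormat::Bed6),
        12 => Some(GuessedFormat::Bed12),
        // Anything else with at least three columns is treated as BED3 + extra columns.
        _ => Some(GuessedFormat::GenericBed),
    }
}

/// Assumption for this sketch: a record is "named" when its chromosome
/// column is not a plain integer.
pub fn is_named(line: &str) -> bool {
    line.split('\t')
        .next()
        .map_or(true, |chr| chr.parse::<u64>().is_err())
}
```

Calling `guess_format` on a five-column line yields `GuessedFormat::GenericBed`, which mirrors the documented default.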

BGZIP Support

FASTA
VCF

Utilities

Current List of Utilities for Native BED Files

Closest
Cluster
Complement
Coverage
Extend
Flank
Get Fasta
Intersect
Merge
Random
Sample
Segment
Shift
Sort
Spacing
Subtract
Unionbedg
Window
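
The commit log for this release describes Unionbedg as a union over the bed sets, followed by segmentation and then intersections. The sketch below illustrates that idea under simplifying assumptions: a single chromosome, non-overlapping intervals within each input, and a naive linear scan per segment. `Rec` and `union_bedgraph` are hypothetical names, not gia's or bedrs's API.

```rust
/// Hypothetical BedGraph-like record on a single chromosome (half-open interval).
#[derive(Debug, Clone, PartialEq)]
pub struct Rec {
    pub start: u64,
    pub end: u64,
    pub score: f64,
}

/// One output row: segment start, segment end, and one optional score per input set.
pub type UnionRow = (u64, u64, Vec<Option<f64>>);

/// Naive union of several BedGraph-like sets.
/// 1. Union: gather every start and end coordinate across all sets.
/// 2. Segmentation: consecutive sorted breakpoints define candidate segments.
/// 3. Intersection: for each segment, look up the covering record in each set.
pub fn union_bedgraph(sets: &[Vec<Rec>]) -> Vec<UnionRow> {
    let mut points: Vec<u64> = sets
        .iter()
        .flatten()
        .flat_map(|r| [r.start, r.end])
        .collect();
    points.sort_unstable();
    points.dedup();

    let mut rows = Vec::new();
    for w in points.windows(2) {
        let (start, end) = (w[0], w[1]);
        let scores: Vec<Option<f64>> = sets
            .iter()
            .map(|set| {
                set.iter()
                    .find(|r| r.start <= start && end <= r.end)
                    .map(|r| r.score)
            })
            .collect();
        // Skip gaps that no input set covers.
        if scores.iter().any(Option::is_some) {
            rows.push((start, end, scores));
        }
    }
    rows
}
```

A `None` score corresponds to a set with no value over that segment, which a writer could render as a placeholder such as a dot.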

Specialized HTSlib Utilities

BAM Convert
BAM Coverage
BAM Filter
BCF Filter

Stranded Methods

Closest
Coverage
Extend
Flank
Get Fasta
Intersect
Merge
Subtract
Window
BAM Coverage
BAM Filter
BCF Filter
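
To make the idea of a stranded method concrete, the following sketch shows a strand-aware overlap test. It is only an illustration: `Strand`, `StrandMode`, `Interval`, and `overlaps` are hypothetical names, chromosomes are omitted, and the treatment of unknown strands is an assumption rather than gia's documented behavior.

```rust
/// Hypothetical strand label for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strand {
    Forward,
    Reverse,
    Unknown,
}

/// Hypothetical strand policy: ignore strand, require the same strand,
/// or require opposite strands.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrandMode {
    Ignore,
    Same,
    Opposite,
}

/// Half-open interval on a single chromosome.
#[derive(Debug, Clone, Copy)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
    pub strand: Strand,
}

/// Returns true when `a` and `b` overlap positionally and satisfy `mode`.
/// Assumption: an unknown strand never satisfies `Same` or `Opposite`.
pub fn overlaps(a: &Interval, b: &Interval, mode: StrandMode) -> bool {
    let positional = a.start < b.end && b.start < a.end;
    let strand_ok = match mode {
        StrandMode::Ignore => true,
        StrandMode::Same => a.strand != Strand::Unknown && a.strand == b.strand,
        StrandMode::Opposite => matches!(
            (a.strand, b.strand),
            (Strand::Forward, Strand::Reverse) | (Strand::Reverse, Strand::Forward)
        ),
    };
    positional && strand_ok
}
```

Keeping the strand policy separate from the positional test lets one predicate serve several commands, such as intersect, subtract, and window.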

Multiple Inputs

Closest
Coverage
Intersect
Subtract
Window
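
The commit log mentions a ranking system for type demotion when several b-files are given. One plausible reading, sketched here with hypothetical names (`Format`, `common_format`, `demote`), is that mixed inputs are reduced to the lowest-ranked format among them by dropping trailing columns. This is an assumption about the design, not a description of gia's implementation.

```rust
/// Hypothetical format ranking; deriving `Ord` orders variants by declaration,
/// so `Bed3` is the lowest rank and `Bed12` the highest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Format {
    Bed3,
    Bed4,
    Bed6,
    Bed12,
}

impl Format {
    /// Number of leading columns each format keeps.
    pub fn columns(self) -> usize {
        match self {
            Format::Bed3 => 3,
            Format::Bed4 => 4,
            Format::Bed6 => 6,
            Format::Bed12 => 12,
        }
    }
}

/// The format every input can be demoted to: the lowest rank present.
pub fn common_format(formats: &[Format]) -> Option<Format> {
    formats.iter().copied().min()
}

/// Demote one record by keeping only the leading columns of `target`.
/// Returns `None` if the record has too few columns.
pub fn demote(fields: &[&str], target: Format) -> Option<Vec<String>> {
    let n = target.columns();
    if fields.len() < n {
        return None;
    }
    Some(fields[..n].iter().map(|&f| f.to_string()).collect())
}
```

With inputs in BED12, BED6, and BED4, `common_format` returns `Format::Bed4`, and each record is truncated to its first four columns before concatenation.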

Commit Changelog

🚀 Features

File Support

Added bed6 support for get_fasta
Implement bed6 support for merge subcommand. added format argument to cli
Increment bedrs version to 0.1.10
Implement Coordinates for references to NumericBed6 for code generalization
Made intersect (inplace) compatible with bed6 file format inputs. Also refactored internal function calls so that streamed match branch selection happens inside the function instead of within main
Add support for bed6 with subtract submodule.
Add a reorder trait which operates on different coordinate types
Add an extra method onto NumericBed6 to return and update the name
Added a named argument to random to allow for named genome inputs, using the new refactored genome struct
Added compression threads and level as global arguments to the cli
Update bedrs version and added rayon feature
Added parallel sorting argument on sort
Added support for bed12 file format
Added 3 column output as a function within writenamediter for merging
Update gia version to 0.2
Added auto determination of string/numeric format with BedReader
Added flanking function to gia with an optional genome file
Added percentage to flank command
Added shifting subcommand to gia
Added window overlaps as subcommand
Added an interval depth structure for fast serde serialization
Added a naive implementation of coverage
Added direct type conversion to inputs
Use direct type conversion in closest
Added bed4 as an auto-determined input format
Added an ambiguous input format which reads in all 3+ columns into a tab-delim string
Added a split translater which keeps an internal translator for the chr and metadata separately
Added a split translater which contains two internal translaters. one for the chr translation and one for the meta translation. During sorting, only the chr translater is sorted which heavily reduces the amount of keys to reorder
Skip commented lines in input matching
Added in gtf set parsing
Added in reading functions for gtf
Added spacing to the cli
Added implementation for spacing - as well as a type for spacing interval outputs which appends a Score to the TSV to incorporate dots for Nulls
Added a command which wraps the segmentation algorithm
Added a unionbedg command to cli as well as a shared multiinput which accepts more than one filename
Added a bedreader over bedgraph files
Implementation of the unionbedg algorithm using a union over the bed sets, segmentation, then intersections
Added a specific writing utility for segments with variable score slices used in unionbedg without reinit csv writers and flushing
Added bedgraph to generic dispatch mechanisms
Added a cluster command which uses the depth interval struct for writing out
Added noodles for bam parsing
Added a bam subcommand with an internal convert subcommand to convert bam into bed
Added an unimplemented warning with bail instead of panic
Added bam output options
Added cli interface for mixed inputs bam/bed and a filter command which can be used to select bam intervals that meet overlap criteria
Moved bam parsing functions into a shared utility directory
Added new dispatch for bam and header with variable bed format
Added convenience tool for pulling chr idx directory without specifying a group
Implementation of the bam filtering algorithm given an interval file as b
Added invert as an output predicate to bam filter. It is a bit slower than bedtools, so it is worth comparing whether noodles or htslib is faster for writing
Added htslib and removed noodles
Added a vcf filtering method - borrowing API of rust_htslib-bam centric methods. Also renamed some overlapping namespaces to delineate bam and vcf origins
Added clone derive for all subcommand args
Add both format and compression status to single output format
Added stranded method to Growth to propagate stranded methods to flank, window, extend
Added stranded and specific stranded methods to merge. Also added a demote parameter so that merge will by default return the same output format as input format but can be demoted to bed3 if specified
Added stranded methods to bam filter
Added strandedness to closest and match bedrs-2.0 api for call
Added a bam coverage command which accepts a BAM/BED input and counts the number of BAM records that overlap each BED record
Added thread count to bam coverage when reading bam
Added threaded option to bam convert
Refactored get-fasta into a module and write a bgzip get-fasta using rust-htslib
Changed b to allow multiple inputs and set up a ranking system for type demotion
Build dispatch with multiple rhs option
Added multiple b-file concatenation to all dual input commands
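
The split translater entry above notes that only the chr translater is reordered during sorting, which keeps the number of keys to reorder small. A minimal sketch of that idea follows. The struct layout and method names are hypothetical and chosen for illustration, and gia's own types may be organized differently.

```rust
use std::collections::HashMap;

/// Hypothetical name <-> index table.
#[derive(Default)]
pub struct Translater {
    name_to_idx: HashMap<String, usize>,
    idx_to_name: Vec<String>,
}

impl Translater {
    /// Insert `name` if it is new and return its index.
    pub fn add(&mut self, name: &str) -> usize {
        if let Some(&idx) = self.name_to_idx.get(name) {
            return idx;
        }
        let idx = self.idx_to_name.len();
        self.name_to_idx.insert(name.to_string(), idx);
        self.idx_to_name.push(name.to_string());
        idx
    }

    pub fn name(&self, idx: usize) -> Option<&str> {
        self.idx_to_name.get(idx).map(String::as_str)
    }

    /// Reassign indices in lexicographic name order.
    /// Returns a map from old index to new index.
    pub fn reorder_lexicographic(&mut self) -> Vec<usize> {
        // `order[new] = old` after sorting by name.
        let mut order: Vec<usize> = (0..self.idx_to_name.len()).collect();
        order.sort_by(|&a, &b| self.idx_to_name[a].cmp(&self.idx_to_name[b]));
        let mut old_to_new = vec![0; order.len()];
        for (new, &old) in order.iter().enumerate() {
            old_to_new[old] = new;
        }
        self.idx_to_name = order
            .iter()
            .map(|&old| self.idx_to_name[old].clone())
            .collect();
        for (new, name) in self.idx_to_name.iter().enumerate() {
            self.name_to_idx.insert(name.clone(), new);
        }
        old_to_new
    }
}

/// Two independent tables: chromosome names and free-form metadata.
#[derive(Default)]
pub struct SplitTranslater {
    pub chr: Translater,
    pub meta: Translater,
}

impl SplitTranslater {
    /// Only the (typically small) chromosome table is reordered;
    /// metadata indices are left untouched.
    pub fn reorder(&mut self) -> Vec<usize> {
        self.chr.reorder_lexicographic()
    }
}
```

Because metadata values such as interval names are usually far more numerous than chromosomes, leaving the `meta` table alone avoids most of the reordering work.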

🐛 Bug Fixes

Fix bug in tests where wrong read function was imported
Modify tests for new shorthand
Fix keyword 'about' to 'description'
Sort was not retranslating the name field of bed6
Allow named chr names in input genome file to extend
Update cli to remove argument bounds on inverse for windows
Update tests to use new cli
Fix bug in tests where columns were being split on newline instead of tab
Bug where intersect was skipping sorting file pairs
Force meta intervals to always be named because their metadata must always be interpreted as a string
Update subtract tests to follow inheritance rules of scores. remove score types from generic
Update segment ordering to match bedtools ordering
Take explicit end of vcf for structural variants
Rename stream in tests
Update formatting
Update dual generics on StrandedBed3 to match bedrs-2.0 development

🚜 Refactor

Remove write_records, write_named_records, and write_records_with by implementing WriteIterImpl for references to coordinates
Folded write_records and all associated versions into the WriteIter trait to avoid handling multiple versions of essentially the same code. Needed to also handle generic translaters for this.
Needed to specify a specific type to the None in intersect write
Remove dead code for format set - will implement in a future version in a different branch
Move internal read methods to private to limit number of public read methods
Since unnamed iter was already generalized, it didn't make sense to include it in bed3. Instead I created a new file 'iter' and reexport it from there publicly
Create a new struct for a genome with multiple build styles
Allow genome to accept an externally provided translater in cases where named bed inputs are read in first
Take an external compression threads and level so they are not fixed at compile time
Use full rust version of gzp to avoid external cmake dependency
Use bedrs 0.2 for lib
Remove all mentions of Containers and use IntervalContainer structs instead
Update tests to use IntervalContainer structs instead
Use bedrs built-in types instead of custom-spun bed6 and bed12 as well as GenomicIntervals
Create a BedReader struct which handles file IO and autodetermines input format
Include flate2 for input instead of niffler
Remove InputFormat impl as it is rolled into the BedReader
Use BedReader for sort module
Update sample to use new input/field format scheme and bedreader
Updated merge module with new format inputs and also generalized streaming iterator to all unnamed file formats
Update get_fasta with new formats and generalize initialization of fasta and interval reader before writing
Update extend to use new input format specs
Major refactor of intersection to handle mixed file formats and using the bedreader struct
Refactor closest to use mixed file formats - required rewriting the pairs struct to handle mixed interval types as well as named conversions
Update subtract to use mixed file formats and dispatch pattern. The overlaps module can now be fully removed since that is handled internally by bedrs
Remove all old read pairs code since it is handled better via dispatch and bedreader
Update extend methods to use built-in bounds
Use built-in methods for calculating percentages and bounding extensions in bedrs
Used owned find iter to avoid constant rebuffering of output
Move cli to separate module
Have closest use new argument dispatching and argument folding
Update complement to use new cli flattening
Update coverage with new flattened arg structure and introduced an overlap_predicates that can be shared
Update extend to use new flattened arg structure and introduced percentages
Update flank to use new flattened arg structure and introduced percentages
Update flank to set left bound at zero
Remove name map completely
Update get_fasta with new flattened arg structure
Update intersect and merge with flattened cli
Update random with flattened cli
Update sample with flattened cli
Update shift with flattened cli
Update sort with flattened cli
Update windows with flattened cli
Update subtract with flattened cli and unnested the module
Refactor closest to use args,params design
Remove all single-match dispatches and created a macro to handle expressions
Refactor all dispatch methods using macros
Update file names to avoid nested mod conflicts
Update dispatch_pair to use internal crate to avoid importing multiple macros at call sites
Rename get_handle to get_writer for clarity
Rename all output_handle to writer
Put translater into separate module and split sub structs and traits into files
Used dispatch_single macro
Made implementation of Reorder generic to avoid code bloat
Update all bam IO with htslib for increased performance
Rename all VCF naming instances to BCF to emphasize binary implementations over plaintext
Use updated query API from bedrs-2.0 and build wrapper over StrandMethod that is CLI parseable
Cleaner implementation using find
Major refactor of the IO system using macros to build up similar functionality between the differently named file io functions
Use bedreader functions for complement instead of direct read func
Move all dispatch methods into separate submodule and use nested macros for better shared behavior and less redundant code
Put shared import into dispatch to reduce imports at each command
Clean up macro system with single dispatch using nested macros in match to autogenerate functions
Remove old single dispatch methods
Move paired into separate module for macros
Use nested macro system for multi and standard input dispatch
Include importing input format in dispatch to clean up imports in all commands
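
Several entries above replace hand-written match statements with dispatch macros. The sketch below shows the general pattern: a `macro_rules!` macro expands to a match over the input format and calls one generic function with the matching record type. All names here (`InputFormat`, `Record`, `dispatch_by_format`, `total_length`) are hypothetical and simplified. This is not gia's macro system.

```rust
/// Hypothetical runtime tag for the detected input format.
#[derive(Debug, Clone, Copy)]
pub enum InputFormat {
    Bed3,
    Bed6,
}

/// Minimal record interface shared by all formats in this sketch.
pub trait Record: Sized {
    fn parse(line: &str) -> Option<Self>;
    fn span(&self) -> u64;
}

pub struct Bed3 {
    pub start: u64,
    pub end: u64,
}

pub struct Bed6 {
    pub start: u64,
    pub end: u64,
    pub strand: char,
}

impl Record for Bed3 {
    fn parse(line: &str) -> Option<Self> {
        let f: Vec<&str> = line.split('\t').collect();
        if f.len() < 3 {
            return None;
        }
        Some(Bed3 { start: f[1].parse().ok()?, end: f[2].parse().ok()? })
    }
    fn span(&self) -> u64 {
        self.end.saturating_sub(self.start)
    }
}

impl Record for Bed6 {
    fn parse(line: &str) -> Option<Self> {
        let f: Vec<&str> = line.split('\t').collect();
        if f.len() < 6 {
            return None;
        }
        Some(Bed6 {
            start: f[1].parse().ok()?,
            end: f[2].parse().ok()?,
            strand: f[5].chars().next()?,
        })
    }
    fn span(&self) -> u64 {
        self.end.saturating_sub(self.start)
    }
}

/// Expand to a match that calls `$func::<T>` with the record type for the format.
macro_rules! dispatch_by_format {
    ($format:expr, $func:ident, $($arg:expr),*) => {
        match $format {
            InputFormat::Bed3 => $func::<Bed3>($($arg),*),
            InputFormat::Bed6 => $func::<Bed6>($($arg),*),
        }
    };
}

/// Generic work function: sum of interval lengths over all parseable lines.
pub fn total_length<R: Record>(text: &str) -> u64 {
    text.lines().filter_map(R::parse).map(|r| r.span()).sum()
}

/// Single entry point; the macro picks the concrete record type.
pub fn run(format: InputFormat, text: &str) -> u64 {
    dispatch_by_format!(format, total_length, text)
}
```

Supporting a new format then means adding one match arm in the macro rather than editing every command's own match statement.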

📚 Documentation

Added documentation on README (#42)
Link to documentation site on the README
Remove old documentation before lexicographical sorting of chr
Added documentation to bedreader struct
Updated help menu to provide useful demarcations between option types following the subcommand flatten pattern
Added documentation of output format choices and set default to bcf
Added documentation to fasta format to reflect gzip
Added documentation to macro system on bedreader
Update help text of multiple bed input to b

🎨 Styling

Change fasta header to better separate the different fields of bed6
Cargo fmt
Cargo formatting
Use same short char for input format across subcommands
Replace write methods with correct writeln and write_all
Update extend to follow similar format as other functions with format and added growth warning args
Update tests to avoid checks

🧪 Testing

Included tests of bed6 file format in get fasta.
Added testing support for bed6 merge subcommand
Add tests for gia sort with bed6 format
Add test for intersect with bed6 input file format
Add test for bed6 file format to match expected
Add test for correct retranslation of name field
Added testing for named and unnamed genome IO
Added testing for gzip and bgzip IO
Added testing for par sort
Added testing for bed12 for applicable commands
Fixed testing for new cli for get fasta merge and sample
Updated testing for closest, which no longer returns intervals that lack a closest match (previously a default was generated)
Removing old commented code that is redundant to tests
Added testing functions for flank
Added testing functions for shift
Added testing for windows subcommand
Added testing for coverage
Update tests for subtract to reflect csv output of floats
Added testing suite for spacing subcommand
Added testing for gia segment
Added testing for unionbedg
Added testing for cluster
Added testing for bam conversion to bed and for filtering bam files given a bed file
Added testing for vcf and bcf file formats
Update tests to use demote flag
Added testing for gzip fasta input
Added testing for different get-fasta flags

⚙️ Miscellaneous Tasks

Increment patch version
Increment patch version
Update Cargo.toml to include required keywords for publish on crates (#44)
Publish gh-pages on push to main
Added homepage to toml and bump version
Make new genome struct visible through crate
Update random Genome instantiation to accept an external translater
Remove unused code for old genome io
Bump patch
Update all match_output signatures to include compression level and compression threads and propagate up the function signatures
Bump patch
Bump patch
Added formatting to CI
Remove old commented code
Bump patch version
Update patch version - large changes to internal structure but very little changes to the cli
Update patch version
Update patch version
Update patch version
Update patch version
Update dependency versions
Update patch version
Update patch version
Update patch version
Update gia to match bedrs and no longer use GenomicInterval
Updated Score functionality for bed6 and bed12
Update patch version
Remove dead code
Update dependencies
Update patch version
Bump patch version
Bump patch version
Update formatting
Bump patch version
Update complement to use args.params pattern
Update random to use args.params pattern
Bump patch version
Update patch version
Update patch version
Update patch version
Update patch version
Update patch version
Update bedrs to live 2 version