GOR.jl Documentation
Purpose
GOR.jl is a Julia library to operate on genome ordered streams. Elements of genome ordered streams are sorted by chromosome and position, the first two items of each row element. Streaming allows operations on data sets that are larger than available memory, and genomic order speeds up relational operations like joins.
GOR.jl supports the Tables.jl interface. This means it works with all sources and sinks that conform to the Tables.jl interface and are ordered by chromosome and position, e.g.
- CSV files,
- SQLite3 tables,
- Parquet files,
- and DataFrames.
GOR.jl allows creation of complex pipelines by joining together operators using the |> syntax. Each operator is implemented as a Julia iterator that is parameterized on input and output types. This allows for easy extension of the library by user-defined operations.
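The pipeline style described above can be sketched in plain Julia. This is an illustration only and does not use GOR.jl: the helper names `keep` and `apply` are hypothetical, and the rows are ordinary NamedTuples whose first two fields play the role of chromosome and position.

```julia
# Sketch only: Base Julia, not GOR.jl's implementation.
# Rows are NamedTuples whose first two fields are chromosome and position.
rows = [
    (Chrom = "chr1", Pos = 100, Ref = "A"),
    (Chrom = "chr1", Pos = 250, Ref = "G"),
    (Chrom = "chr2", Pos = 50,  Ref = "T"),
]

# Hypothetical one-argument forms: each returns a function of the stream,
# which is what makes `stream |> operator(args)` possible.
keep(pred) = stream -> Iterators.filter(pred, stream)
apply(f) = stream -> Iterators.map(f, stream)

# Both stages are lazy iterators; nothing is computed until `collect`.
pipeline = rows |>
    keep(r -> r.Chrom == "chr1") |>
    apply(r -> (Chrom = r.Chrom, Pos = r.Pos, End = r.Pos + 1))

result = collect(pipeline)
```

Because each stage wraps the previous one lazily, rows flow through one at a time, which is the property that lets a streaming pipeline handle inputs larger than memory.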
I/O
GOR.GorFile — Method
GorFile(path; delim = " ", limit = 10_000, first = nothing, last = nothing)
Open gor file at path as genome ordered stream.
A gor file is a tab-delimited file with a header, whose first two columns correspond to Chrom and Pos. The file needs to be sorted by (Chrom, Pos), using e.g. the unix command sort -k1,1 -k2,2n.
Arguments
- path: path to file
- delim: file delimiter
- limit: number of rows to use for type inference
- first: first coordinate to report (chrom, pos) or nothing
- last: last coordinate to report (chrom, pos) or nothing
GOR.write_gor — Method
write_gor(rows, path)
rows |> write_gor(path)
Write genome ordered stream rows as tab-delimited text file.
GOR.GorzFile — Method
GorzFile(path; limit = 10_000, first = nothing, last = nothing)
Open compressed gor file (.gorz) at path as genome ordered stream.
Use up to limit number of rows for type inference.
GOR.ParquetFile — Method
ParquetFile(path)
Open Parquet file at path as genome ordered stream.
This uses the implementation in Parquet.jl, which is not very mature yet. The iterator struct contains state, so it is best to wrap construction in a function, e.g. data = () -> ParquetFile("data.parquet"), and call data() whenever a fresh stream is needed.
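The reason for wrapping a stateful iterator in a function can be shown with a small sketch. It does not use ParquetFile: `Iterators.Stateful` from Base stands in for any iterator that remembers its read position, and the sample rows are made up.

```julia
# Sketch only: Iterators.Stateful stands in for any iterator that carries
# its own read position; it is not ParquetFile itself.
source = [("chr1", 10), ("chr1", 20), ("chr2", 5)]

stateful = Iterators.Stateful(source)
first_pass = collect(stateful)    # consumes the iterator
second_pass = collect(stateful)   # nothing left to read

# Wrapping construction in a zero-argument function gives a fresh
# iterator on every call.
data = () -> Iterators.Stateful(source)
pass_a = collect(data())
pass_b = collect(data())
```

The second pass over the shared iterator is empty, while each call to `data()` starts again from the first row.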
GOR.write_parquet — Method
write_parquet(rows, path)
rows |> write_parquet(path)
Write genome ordered stream rows as Parquet file to path.
Operations on streams
GOR.verifyorder — Method
verifyorder(rows)
rows |> verifyorder
Verify order of genome ordered stream rows.
An iterator that checks if its input is ordered by (elt[1], elt[2]). The iterator passes the input rows through, but throws an ErrorException if rows are out of order.
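A pass-through order check of this kind could look roughly like the following. This is a minimal sketch in Base Julia, not the GOR.jl source: the type name `OrderCheck` is hypothetical, and chromosomes are compared as plain strings.

```julia
# Sketch only, not GOR.jl's implementation.
# Wraps any iterator and checks (row[1], row[2]) against the previous row.
struct OrderCheck{I}
    rows::I
end

Base.IteratorSize(::Type{<:OrderCheck}) = Base.SizeUnknown()

function Base.iterate(it::OrderCheck, state = nothing)
    # state is nothing at the start, otherwise (previous key, inner state)
    next = state === nothing ? iterate(it.rows) : iterate(it.rows, state[2])
    next === nothing && return nothing
    row, inner = next
    key = (row[1], row[2])
    if state !== nothing && key < state[1]
        throw(ErrorException("rows out of order: $(key) after $(state[1])"))
    end
    return row, (key, inner)   # pass the row through unchanged
end

good = [("chr1", 10), ("chr1", 20), ("chr2", 5)]
bad = [("chr1", 20), ("chr1", 10)]

passed = collect(OrderCheck(good))

threw = false
try
    collect(OrderCheck(bad))
catch err
    global threw = err isa ErrorException
end
```

The check costs one tuple comparison per row and keeps only the previous key, so it can sit anywhere in a streaming pipeline.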
GOR.select — Method
select(rows, columns::Symbol...)
rows |> select(columns::Symbol...)
Select columns from genome ordered stream rows.
GOR.filter — Method
filter(rows, predicate)
rows |> filter(predicate)
Filter rows to include only rows that fulfil predicate.
GOR.rename — Method
rename(rows, args::Pair...)
rows |> rename(args::Pair...)
Rename columns in genome ordered stream rows. Old and new column names are specified as :oldcol => :newcol.
GOR.mutate — Method
mutate(rows, cols::Tuple, func)
mutate(rows, col::Symbol, func)
rows |> mutate(cols::Tuple, func)
rows |> mutate(col::Symbol, func)
Add or replace columns cols computed as cols = func(row) to genome ordered stream rows.
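The "add or replace" behaviour on a single row can be illustrated with NamedTuples and `merge` from Base. The helpers `add_column` and `add_columns` are hypothetical names for this sketch and are not part of GOR.jl; how GOR.jl builds the new row internally is not stated in this documentation.

```julia
# Sketch only: per-row add-or-replace using Base.merge on NamedTuples.
row = (Chrom = "chr1", Pos = 100, Ref = "A")

# Single column: func returns one value.
add_column(row, col::Symbol, func) =
    merge(row, NamedTuple{(col,)}((func(row),)))

# Several columns: func returns a tuple with one value per column.
add_columns(row, cols::Tuple, func) =
    merge(row, NamedTuple{cols}(func(row)))

added = add_column(row, :End, r -> r.Pos + 1)
replaced = add_column(row, :Ref, r -> lowercase(r.Ref))
both = add_columns(row, (:Start, :End), r -> (r.Pos - 1, r.Pos + 1))
```

`merge` appends fields that are new and overwrites fields that already exist while keeping their position, so the chromosome and position stay in the first two slots.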
GOR.merge — Method
merge(left, right)
left |> merge(right)
Merge left and right genome ordered streams.
The output stream contains the union of the columns of both inputs, with type promotion.
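The core of merging two genome ordered streams is a standard two-way merge on the (chromosome, position) key. The sketch below is a simplified illustration in Base Julia: `merge_sorted` is a hypothetical name, it works on in-memory vectors rather than lazy streams, and it assumes both inputs already share the same columns, so the column union and type promotion are not shown.

```julia
# Sketch only: two-way merge of inputs sorted by (row[1], row[2]).
function merge_sorted(left, right)
    out = Vector{Union{eltype(left), eltype(right)}}()
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if (left[i][1], left[i][2]) <= (right[j][1], right[j][2])
            push!(out, left[i])
            i += 1
        else
            push!(out, right[j])
            j += 1
        end
    end
    append!(out, left[i:end])    # whatever remains is already in order
    append!(out, right[j:end])
    return out
end

left = [(Chrom = "chr1", Pos = 10), (Chrom = "chr2", Pos = 5)]
right = [(Chrom = "chr1", Pos = 15), (Chrom = "chr3", Pos = 1)]
merged = merge_sorted(left, right)
```

Only the current head of each input is compared at each step, which is why genomic order makes this operation cheap.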
GOR.map — Method
map(rows, func)
rows |> map(func)
Apply function func to elements of genome ordered stream rows. The function func should return a NamedTuple.
NOTE: Julia cannot infer the type of NamedTuples with Union{Missing,T}. This means that if the input stream rows has columns of type Union{Missing,T}, the pipeline will probably fail.
GOR.join — Method
join(left, right; kind = :snpsnp, leftjoin = false, window = 0)
left |> join(right; kind = :snpsnp, leftjoin = false, window = 0)
Join genome ordered streams left and right on (elt[1], elt[2]).
Arguments
- leftjoin::Bool: should left join be performed
- kind::Symbol: how should overlap be determined (:snpsnp, :snpseg, :segsnp, :segseg)
- window::Int: allow window base pairs difference in position
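To make the role of window and leftjoin concrete, here is a minimal position join on two sorted inputs. It is a sketch under assumptions, not GOR.jl's algorithm: `window_join` is a hypothetical name, a match is assumed to mean same chromosome and positions at most window apart (a reading of the :snpsnp case), unmatched left rows are paired with missing when leftjoin is true, and the inputs are in-memory vectors.

```julia
# Sketch only: point-to-point join on sorted inputs, not GOR.jl's code.
function window_join(left, right; window = 0, leftjoin = false)
    out = []
    start = 1
    for l in left
        # Skip right rows that lie before the window of this left row.
        # Since left is sorted, they cannot match any later left row either.
        while start <= length(right) &&
              (right[start][1] < l[1] ||
               (right[start][1] == l[1] && right[start][2] < l[2] - window))
            start += 1
        end
        matched = false
        j = start
        while j <= length(right) && right[j][1] == l[1] &&
              right[j][2] <= l[2] + window
            push!(out, (l, right[j]))
            matched = true
            j += 1
        end
        if leftjoin && !matched
            push!(out, (l, missing))
        end
    end
    return out
end

left = [("chr1", 100, "a"), ("chr1", 200, "b"), ("chr2", 50, "c")]
right = [("chr1", 100, "x"), ("chr1", 205, "y"), ("chr2", 500, "z")]

exact = window_join(left, right)
near = window_join(left, right; window = 10)
kept = window_join(left, right; leftjoin = true)
```

With window = 0 only the rows at chr1:100 pair up; widening the window to 10 also pairs chr1:200 with chr1:205; with leftjoin = true all three left rows appear in the output.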
Grouping
GOR.groupby — Function
rows |> groupby(n=0, groupcols = []; aggregates...)
Group genome ordered stream by position window and groupcols. Summarize groups using aggregates.
For each window and combination of grouping columns, compute the specified online statistics.
Arguments
- n::Int: group by window of size n, genomewide for n=0
- groupcols::Vector{Symbol}: additional grouping columns
- aggregates: online statistics to compute
Currently the following aggregators are implemented. See OnlineStats.jl for more ideas.
GOR.Sum — Type
Sum(column::Symbol, val = 0.0)
Aggregator for sum of column.
GOR.Count — Type
Count()
Aggregator for count.
GOR.Avg — Method
Avg(column::Symbol)
Aggregator for average of column.
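Windowed grouping with count, sum and average can be computed in a single pass over sorted input, which is the idea behind online statistics. The sketch below is plain Julia and does not call GOR.jl or OnlineStats.jl: `window_stats` is a hypothetical name, windows are assumed to be aligned at multiples of n, only n > 0 is handled, and there are no extra grouping columns.

```julia
# Sketch only: one-pass window aggregation over rows sorted by (Chrom, Pos).
function window_stats(rows, n::Int, column::Symbol)
    n > 0 || throw(ArgumentError("this sketch only handles n > 0"))
    out = NamedTuple[]
    current = nothing      # (chromosome, window start) of the open group
    cnt = 0
    total = 0.0
    for r in rows
        key = (r.Chrom, fld(r.Pos, n) * n)
        if current !== nothing && key != current
            # The input is sorted, so the previous group is complete.
            push!(out, (Chrom = current[1], Start = current[2],
                        Count = cnt, Sum = total, Avg = total / cnt))
            cnt, total = 0, 0.0
        end
        current = key
        cnt += 1
        total += getproperty(r, column)
    end
    if current !== nothing
        push!(out, (Chrom = current[1], Start = current[2],
                    Count = cnt, Sum = total, Avg = total / cnt))
    end
    return out
end

rows = [
    (Chrom = "chr1", Pos = 10,  Depth = 2.0),
    (Chrom = "chr1", Pos = 90,  Depth = 4.0),
    (Chrom = "chr1", Pos = 150, Depth = 6.0),
    (Chrom = "chr2", Pos = 20,  Depth = 1.0),
]
stats = window_stats(rows, 100, :Depth)
```

Because the input is genome ordered, a group can be emitted as soon as the key changes, so only one group's running totals are held in memory at a time.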