GOR.jl Documentation

Purpose

GOR.jl is a Julia library for operating on genome ordered streams. The elements (rows) of a genome ordered stream are sorted by chromosome and position, which are the first two items of each row. Streaming allows operations on data sets that are larger than available memory, and genomic order speeds up relational operations such as joins.

GOR.jl supports the Tables.jl interface. This means it works with any source or sink that conforms to the Tables.jl interface and is ordered by chromosome and position, e.g.

  • CSV files,
  • SQLite3 tables,
  • Parquet files,
  • DataFrames.

GOR.jl allows creation of complex pipelines by chaining operators with the |> syntax. Each operator is implemented as a Julia iterator that is parameterized on its input and output types, which makes it easy to extend the library with user-defined operations.
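The chaining style described above can be sketched with plain Julia, without GOR.jl. The operators keep and apply below are hypothetical stand-ins, not GOR.jl functions; each one takes its arguments and returns a function of the stream, which is one way the one-argument forms such as rows |> filter(predicate) could work. GOR.jl's actual mechanism may differ.

```julia
# Hypothetical operators (not GOR.jl's own): each returns a function of the stream.
keep(pred) = rows -> Iterators.filter(pred, rows)
apply(func) = rows -> (func(row) for row in rows)

# Rows are NamedTuples whose first two fields are chromosome and position.
rows = [(Chrom = "chr1", Pos = 100), (Chrom = "chr1", Pos = 300), (Chrom = "chr2", Pos = 50)]

# Both operators are lazy; nothing is computed until collect pulls rows through.
result = rows |>
    keep(row -> row.Pos >= 100) |>
    apply(row -> (Chrom = row.Chrom, Pos = row.Pos, Bin = row.Pos ÷ 100)) |>
    collect
```

Because every stage is a lazy iterator, only one row at a time needs to be held in memory, which matches the streaming motivation given in the Purpose section.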

I/O

GOR.GorFile - Method
GorFile(path; delim = "\t", limit = 10_000, first = nothing, last = nothing)

Open gor file at path as genome ordered stream.

A gor file is a tab-delimited file with a header, whose first two columns correspond to Chrom and Pos. The file must be sorted by (Chrom, Pos), e.g. with the Unix command sort -k1,1 -k2,2n.
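To make the file layout concrete, here is a minimal sketch that parses gor-style text with Base Julia only and checks the (Chrom, Pos) order. The helper read_gor_text is hypothetical and is not how GorFile is implemented; it assumes the position column holds integers and treats every other column as a string.

```julia
# Hypothetical helper, not part of GOR.jl: parse tab-delimited gor text into rows.
function read_gor_text(io::IO)
    # The header line supplies the column names.
    header = Tuple(Symbol.(split(readline(io), '\t')))
    rows = NamedTuple[]
    for line in eachline(io)
        fields = split(line, '\t')
        # Assumption: column 2 is an integer position, all other columns are strings.
        vals = (String(fields[1]), parse(Int, fields[2]), String.(fields[3:end])...)
        push!(rows, NamedTuple{header}(vals))
    end
    return rows
end

text = "Chrom\tPos\tRef\nchr1\t100\tA\nchr1\t250\tG\nchr2\t40\tT\n"
rows = read_gor_text(IOBuffer(text))

# Genome order: lexicographic on chromosome name, then numeric on position.
in_order = issorted(rows; by = row -> (row.Chrom, row.Pos))
```

Note that chromosome names compare as strings here, so "chr10" sorts before "chr2", which is the same order that sort -k1,1 produces.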

Arguments

  • path: path to file
  • delim: file delimiter
  • limit: number of rows to use for type inference
  • first: first coordinate to report (chrom, pos) or nothing
  • last: last coordinate to report (chrom, pos) or nothing

GOR.write_gor - Method
write_gor(rows, path)
rows |> write_gor(path)

Write genome ordered stream rows as tab-delimited text file.

GOR.GorzFile - Method
GorzFile(path; limit = 10_000, first = nothing, last = nothing)

Open compressed gor file (.gorz) at path as genome ordered stream.

Use up to limit number of rows for type inference.

GOR.ParquetFile - Method
ParquetFile(path)

Open Parquet file at path as genome ordered stream.

This uses the implementation in Parquet.jl, which is not very mature yet. The iterator struct contains state, so it is best to wrap the call in a function that creates a fresh iterator for each pass, e.g. data = () -> ParquetFile("data.parquet").
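The reason for the data = () -> ... pattern can be illustrated with Base Julia's Iterators.Stateful, used here only as a stand-in for an iterator that carries state. Whether ParquetFile is exhausted in exactly this way is an assumption; the sketch only shows why a zero-argument function that builds a new iterator is the safer handle.

```julia
# A stateful iterator is consumed by the first pass.
stateful = Iterators.Stateful([1, 2, 3])
first_pass = collect(stateful)    # reads all three elements
second_pass = collect(stateful)   # nothing left to read

# Wrapping construction in a function gives a fresh iterator on every call.
make_data = () -> Iterators.Stateful([1, 2, 3])
pass_a = collect(make_data())
pass_b = collect(make_data())
```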

GOR.write_parquet - Method
write_parquet(rows, path)
rows |> write_parquet(path)

Write genome ordered stream rows as Parquet file to path.


Operations on streams

GOR.verifyorder - Method
verifyorder(rows)
rows |> verifyorder

Verify order of genome ordered stream rows.

An iterator that checks if its input is ordered by (elt[1], elt[2]). The iterator passes the input rows through, but throws an ErrorException if rows are out of order.
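A pass-through order check of this kind can be sketched as a small custom iterator. The type OrderCheck below is hypothetical and is not GOR.jl's source; it only mirrors the documented behaviour: rows are yielded unchanged, and an ErrorException is thrown as soon as a row's (elt[1], elt[2]) key is smaller than the previous one.

```julia
# Hypothetical stand-in for the documented behaviour, not GOR.jl's implementation.
struct OrderCheck{I}
    rows::I
end

Base.IteratorSize(::Type{<:OrderCheck}) = Base.SizeUnknown()

function Base.iterate(it::OrderCheck, state = nothing)
    # state is nothing before the first row, otherwise (previous key, inner state).
    next = state === nothing ? iterate(it.rows) : iterate(it.rows, state[2])
    next === nothing && return nothing
    row, inner = next
    key = (row[1], row[2])
    # Tuples compare lexicographically: chromosome first, then position.
    if state !== nothing && isless(key, state[1])
        throw(ErrorException("rows out of order: $(key) after $(state[1])"))
    end
    return row, (key, inner)
end

good = [(Chrom = "chr1", Pos = 10), (Chrom = "chr1", Pos = 20), (Chrom = "chr2", Pos = 5)]
checked = collect(OrderCheck(good))
```

The check keeps only the previous key, so it adds constant memory overhead regardless of stream length.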

GOR.select - Method
select(rows, columns::Symbol...)
rows |> select(columns::Symbol...)

Select columns from genome ordered stream rows.
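On a single row, column selection amounts to building a smaller NamedTuple. The helper pick below is hypothetical and uses only Base Julia; whether GOR.jl's select always retains the chromosome and position columns is not stated here, so the sketch simply keeps exactly the names it is given.

```julia
# Hypothetical per-row helper: keep only the named fields, in the given order.
pick(row::NamedTuple, cols::Symbol...) = NamedTuple{cols}(row)

row = (Chrom = "chr1", Pos = 100, Ref = "A", Alt = "T")
picked = pick(row, :Chrom, :Pos, :Alt)
```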

GOR.filter - Method
filter(rows, predicate)
rows |> filter(predicate)

Filter rows to include only rows that fulfil predicate.

GOR.rename - Method
rename(rows, args::Pair...)
rows |> rename(args::Pair...)

Rename columns in genome ordered stream rows. Old and new column names are specified as :oldcol => :newcol.
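The :oldcol => :newcol convention can be sketched for a single row with Base Julia. The helper rename_columns is hypothetical and not part of GOR.jl; it keeps the values and column order and swaps only the names that appear in the given pairs.

```julia
# Hypothetical per-row helper: rename fields according to old => new pairs.
function rename_columns(row::NamedTuple, renames::Pair{Symbol,Symbol}...)
    mapping = Dict(renames...)
    # Names without an entry in the mapping are kept as they are.
    newnames = map(name -> get(mapping, name, name), keys(row))
    return NamedTuple{newnames}(values(row))
end

row = (Chrom = "chr1", Pos = 100, Ref = "A")
renamed = rename_columns(row, :Ref => :Allele)
```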

GOR.mutate - Method
mutate(rows, cols::Tuple, func)
mutate(rows, col::Symbol, func)
rows |> mutate(cols::Tuple, func)
rows |> mutate(col::Symbol, func)

Add or replace columns cols in genome ordered stream rows. The new values are computed per row as cols = func(row).
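The add-or-replace behaviour can be sketched per row with Base's merge on NamedTuples. The helper with_columns is hypothetical and not GOR.jl's implementation; it assumes that func returns a single value for a Symbol and a tuple of values for a Tuple of column names.

```julia
# Hypothetical per-row helpers built on Base.merge for NamedTuples.
# A new name is appended; an existing name keeps its position and gets the new value.
with_columns(row::NamedTuple, col::Symbol, func) =
    merge(row, NamedTuple{(col,)}((func(row),)))
with_columns(row::NamedTuple, cols::Tuple, func) =
    merge(row, NamedTuple{cols}(func(row)))

row = (Chrom = "chr1", Pos = 100, Ref = "A")
added = with_columns(row, :End, r -> r.Pos + 1)
replaced = with_columns(row, :Pos, r -> r.Pos - 1)
both = with_columns(row, (:Start, :End), r -> (r.Pos - 1, r.Pos))
```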

GOR.merge - Method
merge(left, right)
left |> merge(right)

Merge left and right genome ordered streams.

The output stream contains the union of the columns of both inputs, with type promotion.
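Merging two sorted inputs while taking the union of their columns can be sketched as a classic two-pointer merge. The function merge_sorted is hypothetical, works on vectors rather than streams, and assumes that a column absent from one input is filled with missing; how GOR.jl promotes types in this case is not specified here.

```julia
# Hypothetical sketch, not GOR.jl's merge: two-pointer merge of sorted row vectors.
function merge_sorted(left, right)
    # Union of column names, left columns first.
    cols = Tuple(union(keys(first(left)), keys(first(right))))
    # Assumption: columns missing from a row are filled with `missing`.
    pad(row) = NamedTuple{cols}(map(c -> get(row, c, missing), cols))
    key(row) = (row[1], row[2])
    out = []
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if isless(key(right[j]), key(left[i]))
            push!(out, pad(right[j])); j += 1
        else
            push!(out, pad(left[i])); i += 1
        end
    end
    # One side is exhausted; append whatever remains of the other.
    append!(out, pad.(left[i:end]))
    append!(out, pad.(right[j:end]))
    return out
end

left = [(Chrom = "chr1", Pos = 1, A = 1.0), (Chrom = "chr1", Pos = 5, A = 2.0)]
right = [(Chrom = "chr1", Pos = 3, B = "x")]
merged = merge_sorted(left, right)
```

Because both inputs are already in genome order, the merge needs a single pass and never has to re-sort the output.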

GOR.map - Method
map(rows, func)
rows |> map(func)

Apply function func to elements of genome ordered stream rows. The function func should return a NamedTuple.

NOTE: Julia cannot infer the type of NamedTuples with fields of type Union{Missing,T}. If the input stream rows has columns of type Union{Missing,T}, the pipeline will therefore probably fail.

GOR.join - Method
join(left, right; kind = :snpsnp, leftjoin = false, window = 0)
left |> join(right; kind = :snpsnp, leftjoin = false, window = 0)

Join genome ordered streams left and right on (elt[1], elt[2]).

Arguments

  • leftjoin::Bool: should left join be performed
  • kind::Symbol: how should overlap be determined (:snpsnp, :snpseg, :segsnp, :segseg)
  • window::Int: allow window base pairs difference in position
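The role of window for point positions can be sketched as follows. This is an assumed reading of the :snpsnp kind (same chromosome, positions at most window base pairs apart) and not GOR.jl's implementation; the segment kinds are left out because their column layout is not described here. The names snp_match and naive_snp_join are hypothetical.

```julia
# Assumed :snpsnp rule: same chromosome and positions at most `window` apart.
snp_match(l, r; window = 0) = l[1] == r[1] && abs(l[2] - r[2]) <= window

# Naive quadratic inner join, for illustration only. It returns matching
# (left row, right row) pairs; a streaming join would exploit the sort order.
function naive_snp_join(left, right; window = 0)
    return [(l, r) for l in left for r in right if snp_match(l, r; window = window)]
end

left = [(Chrom = "chr1", Pos = 100), (Chrom = "chr1", Pos = 200)]
right = [(Chrom = "chr1", Pos = 103, Gene = "G1"), (Chrom = "chr2", Pos = 100, Gene = "G2")]

exact = naive_snp_join(left, right)               # positions must be equal
near = naive_snp_join(left, right; window = 5)    # allow up to 5 bp difference
```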

Grouping

GOR.groupby - Function
rows |> groupby(n=0, groupcols = []; aggregates...)

Group genome ordered stream by position window and groupcols. Summarize groups using aggregates.

For each window and combination of grouping columns, compute the specified online statistics.

Arguments

  • n::Int: group by window of size n, genome-wide for n = 0
  • groupcols::Vector{Symbol}: additional grouping columns
  • aggregates: online statistics to compute
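Window grouping can be sketched with a dictionary keyed by chromosome and window index. The function window_sums is hypothetical and makes two assumptions that this page does not confirm: windows are fixed bins [0, n), [n, 2n), and so on, and n = 0 places every row in one genome-wide group. Additional grouping columns are omitted for brevity.

```julia
# Hypothetical sketch, not GOR.jl's groupby: sum one column per position window.
function window_sums(rows, n::Int, col::Symbol)
    totals = Dict{Tuple{String,Int},Float64}()
    for row in rows
        # Assumption: n == 0 means a single genome-wide group; otherwise the
        # window index is the position divided by the window size.
        group = n == 0 ? ("genome", 0) : (row[1], row[2] ÷ n)
        totals[group] = get(totals, group, 0.0) + row[col]
    end
    return totals
end

rows = [
    (Chrom = "chr1", Pos = 10, Depth = 1.0),
    (Chrom = "chr1", Pos = 90, Depth = 2.0),
    (Chrom = "chr1", Pos = 150, Depth = 4.0),
    (Chrom = "chr2", Pos = 20, Depth = 8.0),
]
per_window = window_sums(rows, 100, :Depth)
genome_wide = window_sums(rows, 0, :Depth)
```

With genome ordered input, all rows of a window arrive consecutively, so a streaming version could emit each group as soon as the window changes instead of holding a dictionary.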

Currently the following aggregators are implemented. See OnlineStats.jl for more ideas.

GOR.Sum - Type
Sum(column::Symbol, val = 0.0)

Aggregator for sum of column.

GOR.Avg - Method
Avg(column::Symbol)

Aggregator for average of column.
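An average can be maintained online, one value at a time, without storing the column. The sketch below shows the standard running-mean update; RunningMean and update! are hypothetical names and do not reflect how Avg or OnlineStats.jl are implemented.

```julia
# Hypothetical online aggregator: keeps only a count and the current mean.
mutable struct RunningMean
    n::Int
    mean::Float64
end
RunningMean() = RunningMean(0, 0.0)

# Incremental update: move the mean towards x by 1/n of the difference.
function update!(stat::RunningMean, x)
    stat.n += 1
    stat.mean += (x - stat.mean) / stat.n
    return stat
end

stat = RunningMean()
for depth in (2.0, 4.0, 9.0)
    update!(stat, depth)
end
```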

