The datasalad documentation
datasalad is a pure-Python library with a collection of utilities for
working with data in the vicinity of Git and git-annex. While this is a
foundational library from and for the DataLad project, its implementations are standalone, and are meant to
be equally well usable outside the DataLad system.
A focus of this library is efficient communication with subprocesses, such as Git or git-annex commands, which read and produce data in some format.
Here is a demo of what can be accomplished with this library. The following
code queries a remote git-annex repository via a git annex find command
running over an SSH connection in batch mode. The output in JSON-lines format
is then itemized and decoded to native Python data types. Both inputs and
outputs are iterables with meaningful items, even though at a lower level
information is transmitted as an arbitrarily chunked byte stream.
>>> from more_itertools import intersperse
>>> from pprint import pprint
>>> from datasalad.runners import iter_subproc
>>> from datasalad.itertools import (
...     itemize,
...     load_json,
... )
>>> # a bunch of photos we are interested in
>>> interesting = [
...     b'DIY/IMG_20200504_205821.jpg',
...     b'DIY/IMG_20200505_082136.jpg',
... ]
>>> # run `git-annex find` on a remote server in a repository
>>> # that has these photos in the worktree.
>>> with iter_subproc(
...     ['ssh', 'photos@pididdy.local',
...      'git -C "collections" annex find --json --batch'],
...     # the remote process is fed the file names,
...     # separated by newlines, to make git-annex write
...     # a report in JSON-lines format
...     inputs=intersperse(b'\n', interesting),
... ) as remote_annex:
...     # we loop over the output of the remote process.
...     # this is originally a byte stream downloaded in arbitrary
...     # chunks, so we itemize at any newline separator.
...     # each item is then decoded from JSON-lines format to
...     # native data types
...     for rec in load_json(itemize(remote_annex, sep=b'\n')):
...         # for this demo we just pretty-print it
...         pprint(rec)
{'backend': 'SHA256E',
 'bytesize': '3357612',
 'error-messages': [],
 'file': 'DIY/IMG_20200504_205821.jpg',
 'hashdirlower': '853/12f/',
 'hashdirmixed': '65/qp/',
 'humansize': '3.36 MB',
 'key': 'SHA256E-s3357612--700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'keyname': '700a52971714c2707c2de975f6015ca14d1a4cdbbf01e43d73951c45cd58c176.jpg',
 'mtime': 'unknown'}
{'backend': 'SHA256E',
 'bytesize': '3284291',
...
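To illustrate the idea behind the itemization step, here is a minimal standard-library sketch of how an arbitrarily chunked byte stream can be turned into newline-delimited items and decoded from JSON-lines format. This is not datasalad's implementation: the helper names split_lines and decode_json_lines, as well as the sample chunks, are hypothetical and only serve the illustration.

```python
import json
from typing import Any, Iterable, Iterator


def split_lines(chunks: Iterable[bytes], sep: bytes = b'\n') -> Iterator[bytes]:
    """Yield separator-delimited items from arbitrarily chunked bytes.

    Hypothetical helper for illustration, not part of datasalad.
    """
    pending = b''
    for chunk in chunks:
        pending += chunk
        # all pieces before the last separator are complete items,
        # the remainder is kept until more data arrives
        *complete, pending = pending.split(sep)
        yield from complete
    if pending:
        # trailing item that was not terminated by a separator
        yield pending


def decode_json_lines(lines: Iterable[bytes]) -> Iterator[Any]:
    """Decode each item as one JSON document (hypothetical helper)."""
    for line in lines:
        yield json.loads(line)


# two JSON records, cut into chunks at arbitrary positions (made-up data)
chunks = [b'{"file": "a.j', b'pg"}\n{"fi', b'le": "b.jpg"}\n']
records = list(decode_json_lines(split_lines(chunks)))
print(records)
```

Because both helpers are generators, items are produced as soon as enough bytes have arrived, and the consumer never needs to know where the chunk boundaries were.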
Package overview
Also see the Module Index.
Handling of Git's pathspecs with subdirectory mangling support
Context manager to communicate with a subprocess using iterables
Various iterators, e.g., for subprocess pipelining and output processing
High-level utilities for execution of subprocesses
Hierarchical, multi-source settings management
Why datasalad?
This is a base library for DataLad, hence the name Data-sa-Lad. The sa
might stand for “support assemblage”, or “smart assets”. More importantly, the
library is a mixture of more-or-less standalone utilities that “make up the
salad”.
After ~10 years of developing DataLad, these utilities have been factored out of the codebase to form a clearer, faster, better-documented, and more accessible set of building blocks for the next decade.