[clug] DGSH - directed graph shell. adding parallelism to shell & pipes

Brenton Ross rossb at fwi.net.au
Fri Jul 14 03:58:33 UTC 2017


I've had a preliminary look at dgsh, and I'm not overly taken with the
approach they took.
They have replaced the normal Unix pipe interface for stdin and stdout
with sockets, which means that the core utilities (and anything else you
want to use via pipes) have to be the versions modified for dgsh. This
will mean having two versions of these programs, which is a bit
problematic. There is also the question of maintenance - over time the
two versions could drift apart as bug fixes and enhancements are applied.
If dgsh eventually becomes a normal part of a Unix/Linux distribution
then we could end up with two groups of incompatible programs, requiring
wrappers and other kludges to do something that has been easy since
about 1972.
I also note that their design only applies to stdin and stdout. The
stderr stream remains in its current form.

However, it got me wondering if there was another way, one that did not
require modifying the programs.
I think I could add a couple of extensions to VICI that would cover a
lot of dgsh's capabilities, and have some further advantages.

The first change would be to introduce named streams - the data flows
could be given a label. If a program connected to a named stream used
the name as a filename parameter, then VICI would substitute the label
with the path to a Unix named pipe. This would allow programs to connect
to multiple pipes. Of course it would not help for the cases where dgsh
has modified the actual interface to the program, such as grep having
multiple inputs and outputs, but you could create a modified grep with
that capability that would still be compatible with bash etc.
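A minimal sketch of the label substitution described above, in Python.
This is not VICI code: the function names (make_named_streams,
substitute_labels), the ".fifo" suffix and the example labels are all
hypothetical, and it assumes a label is only recognised when it appears
as a whole argument.

```python
import os
import stat
import tempfile


def make_named_streams(labels, directory):
    """Create one Unix named pipe (FIFO) per label; return label -> path."""
    paths = {}
    for label in labels:
        path = os.path.join(directory, label + ".fifo")  # hypothetical naming
        os.mkfifo(path)  # creating a FIFO does not block; opening it can
        paths[label] = path
    return paths


def substitute_labels(argv, paths):
    """Replace any argument that is exactly a stream label with its FIFO path."""
    return [paths.get(arg, arg) for arg in argv]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_named_streams(["left", "right"], tmp)
        # "left" and "right" are stream labels; other arguments pass through.
        print(substitute_labels(["comm", "-12", "left", "right"], paths))
        # Confirm that the substituted paths really are named pipes.
        print(all(stat.S_ISFIFO(os.stat(p).st_mode) for p in paths.values()))
```

The program being launched needs no changes: it just sees ordinary
filename arguments, which is the point of the approach.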

The second change is to introduce what I call a "manifold". This object
can have any number of stdin and stdout streams. It would have several
modes of operation:

     1. Sequential, where it reads from its first stream until it's
        exhausted (closed), then reads from the second until that is
        finished, etc.
     2. Merge, where any input is sent immediately to the output (line
        by line)
     3. Parallel, where reading blocks until something is ready on all
        the input streams. This would help to synchronise processing.
     4. Copy, where each input is sent to all the output streams
     5. Distribute, where the input lines are sent to the output streams
        in round-robin fashion.
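Four of the modes in the list above can be sketched on in-memory lists
of lines, in Python. The function names are hypothetical and this is
only an illustration, not a design. Merge is left out because its
output order depends on when data arrives, which plain lists cannot
show. Two assumptions are mine: Copy and Distribute read their inputs
one after another, and Parallel stops at the shortest input.

```python
from itertools import chain, cycle


def sequential(inputs):
    """Mode 1: drain the first input completely, then the second, and so on."""
    return chain.from_iterable(inputs)


def parallel(inputs):
    """Mode 3: emit one tuple per step, taking one line from every input.

    zip() stops at the shortest input (an assumption; the behaviour for
    inputs of different lengths is not specified).
    """
    return zip(*inputs)


def copy(inputs, n_outputs):
    """Mode 4: every input line goes to every output."""
    outputs = [[] for _ in range(n_outputs)]
    for line in sequential(inputs):
        for out in outputs:
            out.append(line)
    return outputs


def distribute(inputs, n_outputs):
    """Mode 5: input lines are dealt to the outputs in round-robin order."""
    outputs = [[] for _ in range(n_outputs)]
    for line, out in zip(sequential(inputs), cycle(outputs)):
        out.append(line)
    return outputs
```

For example, distribute([["a", "b", "c"]], 2) deals "a" and "c" to the
first output and "b" to the second.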

The manifold would start a new thread for each of its output streams to
achieve the multiprocessing capability of dgsh.
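A rough sketch of the thread-per-output idea, in Python, for Copy mode
only. The names (copy_manifold, _writer) are hypothetical, and plain
lists stand in for the output streams; a real manifold would write to
pipes instead.

```python
import queue
import threading

_EOF = object()  # sentinel marking the end of a stream


def _writer(q, sink):
    """Drain one queue into one sink until the end-of-stream sentinel arrives."""
    while True:
        line = q.get()
        if line is _EOF:
            break
        sink.append(line)


def copy_manifold(lines, sinks):
    """Send every line to every sink, using one writer thread per sink."""
    queues = [queue.Queue() for _ in sinks]
    threads = [
        threading.Thread(target=_writer, args=(q, sink))
        for q, sink in zip(queues, sinks)
    ]
    for t in threads:
        t.start()
    for line in lines:
        for q in queues:
            q.put(line)
    for q in queues:
        q.put(_EOF)  # tell each writer that the input is finished
    for t in threads:
        t.join()
```

With a queue and a thread per output, a slow reader on one output
stream does not hold up delivery to the others. The queues here are
unbounded, so a very slow reader would let its queue grow without
limit.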

Hence, I think it would have been possible to create dgsh without having
to fork the core utility programs to create a new set of incompatible
programs.

Brenton


On Wed, 2017-07-12 at 21:44 +1000, Brenton Ross via linux wrote:

> Steve,
> 
> Thanks for posting this. 
> I have been contemplating adding something similar to VICI. 
> I will have to read up on this to see how they manage the interaction
> between the data flow and the flow of control. 
> 
> Cheers
> Brenton
> 
> On Wed, 2017-07-12 at 09:35 +1000, steve jenkin via linux wrote:
> 
> > This is an interesting take on a 25+ year-old idea of ‘Multipipes’ in the Unix shell. Much more than the ‘parallel’ command or managing a bunch of named pipes.
> > This one is based on ‘bash’ with another 12 or so commands modified to read & write to multiple pipes.
> > 
> > One that appeals to me is ‘grep’. It takes 0-2 input streams and writes to 0-4 streams.
> > 	Available output streams (via arguments): matching files, non-matching files, matching lines, and non-matching lines
> > 
> > The paper uses the same examples & diagrams as the website, but has much more discussion, a good history of the topic and 46 references.
> > 
> > The design & examples are about a very Unix-y thing: streaming data and processing it just once. Not having to save intermediate files and reprocess them multiple times.
> > In a world of many cores and ‘Big Data’, being able to ‘naturally’ process data streams in parallel is an important new facility.
> > It’s even useful at the other end of the spectrum where I/O bandwidth & storage space is limited. On low-power, low-performance “IoT” devices like Single Board Computers and low-end smartphones.
> > Will we see a version built for ‘busybox’? It’s possible because of the design’s “coupling and cohesion” choices.
> > 
> > They’ve thought about the design and implementation - restricting it to a small syntax change to the (bash) shell.
> > Not sure how well tested & debugged it is, but because of the design you’d think there wouldn’t be many bugs.
> > 
> > regards
> > steve
> > 
> > ———————————
> > 
> > dgsh — directed graph shell
> > <https://www.spinellis.gr/sw/dgsh/#intro>
> > > The directed graph shell, dgsh (pronounced /dæɡʃ/ — dagsh), provides an expressive way to construct sophisticated and efficient big data set and stream processing pipelines using existing Unix tools as well as custom-built components.
> > > It is a Unix-style shell (based on bash) allowing the specification of pipelines with non-linear non-uniform operations.
> > > These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the operation's processing throughput.
> > > 
> > > If you want to get a feeling on how dgsh works in practice, skip right down to the examples section.
> > > 
> > > For a more formal introduction to dgsh or to cite it in your work, see:
> > > Diomidis Spinellis and Marios Fragkoulis. Extending Unix Pipelines to DAGs. IEEE Transactions on Computers, 2017. doi: 10.1109/TC.2017.2695447
> > 
> > 
> > Nuclear magnetic resonance processing - 12-stage pipeline run in parallel
> > <https://www.spinellis.gr/sw/dgsh/#NMRPipe>
> > 
> > 
> > Extending Unix Pipelines to DAGs
> > 	Diomidis Spinellis, Senior Member, IEEE
> > 	Marios Fragkoulis
> > <http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7903579>
> > > 
> > > Abstract—The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines using standard Unix tools, as well as third-party and custom-built components. 
> > > Dgsh allows the specification of pipelines that perform non-uniform non-linear processing. 
> > > These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput. 
> > > A number of existing Unix tools have been adapted to take advantage of the shell’s multiple pipe input/output capabilities. 
> > > The shell supports visualization of the process graphs, which can also aid debugging. 
> > > Dgsh was evaluated through a number of common data processing and domain-specific examples, and was found to offer an expressive way to specify processing topologies, while also generally increasing processing throughput.
> > > 
> > > Index Terms—Process-level parallelism, Unix, pipeline, pipes and filters architecture
> > 
> > --
> > Steve Jenkin, IT Systems and Design 
> > 0412 786 915 (+61 412 786 915)
> > PO Box 38, Kippax ACT 2615, AUSTRALIA
> > 
> > mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
> > 
> > 
> 
> 



