How can we deal with large datasets? For example, I have to work with hundreds of GB of data; how would this be integrated into Popper?
For datasets that are too large to be managed by Git, solutions such as a PFS, Git LFS, Datapackages, and CKAN, among others, exist. These tools and services allow users to manage large datasets and version-control them. From the point of view of Popper, this is just another tool that gets invoked as part of the execution of a pipeline. As part of our documentation, we have an example on how to use datapackages, and another on how to use data.world.
How can Popper capture more complex workflows? For example, automatically restarting failed tasks?
A Popper pipeline is a simple sequence of bash scripts. Popper is not a replacement for scientific workflow engines; instead, its goal is to capture the highest-level workflow: the human interaction with a terminal. For more on this, please take a look at the Popper vs. other software section of our documentation.
Can I follow Popper in computational science research, as opposed to computer science?
Yes. The goal is for Popper to be a domain-agnostic experimentation protocol. We have examples of how to follow Popper in distinct domains: atmospheric science, computational neuroscience, genomics, and applied math.
How can the Popper protocol be applied to applications that take large quantities of computer time?
The popper run command has a --skip argument that can be used to
execute a pipeline in multiple steps. For example, assuming we have
a pipeline with the following stages: setup, run, post-run and
validate, you could run:
popper run --skip post-run,validate
which would run the first part (setup and execution). Later, you
either manually check whether your jobs are done or automate the
check in the post-run script (SLURM, for instance, has a way of
querying the status of your jobs), and then run:
popper run --skip setup.sh,run.sh
The above will execute only the second half of your pipeline. The
post-run script could either first check the status of your jobs or
just assume that they're done, scp the data to your local machine
(or move the output to a NAS, etc.), and then invoke the analysis.
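
As a rough illustration of the automated check mentioned above, a post-run script might poll SLURM before fetching results. This is only a sketch under assumptions: the function names and the job ID argument are hypothetical and not part of Popper, and it assumes the job was submitted with SLURM so that squeue (with its -h and -j options) can see it.

```bash
#!/usr/bin/env bash
# Sketch of helpers for a post-run stage. The function names and the job ID
# argument are hypothetical; they are not provided by Popper.

# Succeeds while SLURM still lists the job (pending, running, etc.).
# `squeue -h -j ID` prints one line per matching job without a header,
# so empty output means the job has left the queue.
job_is_active() {
  [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]
}

# Polls until the job leaves the queue, sleeping between checks.
# Usage: wait_for_job JOB_ID [INTERVAL_SECONDS]
wait_for_job() {
  local job_id="$1" interval="${2:-60}"
  while job_is_active "$job_id"; do
    sleep "$interval"
  done
}
```

A post-run script could call wait_for_job with the ID reported at submission time, then copy the output with scp and start the analysis.
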
Another practice we have been following is to have a specific set of
parameters for the pipeline with the goal of running a smaller-scale
simulation/analysis. The idea is to use this when running on a CI
service such as Travis in order to test the entire pipeline in a
relatively short amount of time (Travis times out jobs after 3
hours). This ends up looking like a conditional in a stage:
depending on the environment (in this case, whether a CI variable is
defined), the parametrization and setup are different, but the rest
of the pipeline runs in the same fashion. While this approach doesn't
execute the actual original simulation, at least it lets us test the
integrity of the scripts.
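
The conditional described above could look roughly like the following in a setup stage. This is a minimal sketch, not the snippet from the original pipeline: GRID_SIZE, NUM_STEPS and their values are hypothetical placeholders, and the only assumption about the environment is that the CI service defines a non-empty CI variable (Travis sets CI=true).

```bash
#!/usr/bin/env bash
# Sketch of a setup stage that scales the run down on CI. GRID_SIZE and
# NUM_STEPS are hypothetical parameters and the values are placeholders.

select_params() {
  if [ -n "${CI:-}" ]; then
    # CI is defined and non-empty: use a small, fast configuration.
    GRID_SIZE=32
    NUM_STEPS=100
  else
    # No CI variable: use the full-scale parameters.
    GRID_SIZE=1024
    NUM_STEPS=100000
  fi
}

select_params
echo "grid=${GRID_SIZE} steps=${NUM_STEPS}"
```

Later stages read the same variables either way, so only the parametrization differs between the CI run and the full run.
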