## New pipeline initialization
Create a Git repository:
```bash
mkdir mypaper
cd mypaper
git init
echo '# mypaper' > README.md
git add .
git commit -m 'first commit'
```
Initialize the popper repository and add the configuration file to git:
```bash
popper init
git add .
git commit -m 'adds .popper.yml file'
```
Initialize a pipeline:

```bash
popper init myexp
```
Show what this did:
```bash
ls -l pipelines/myexp
```
Commit the “empty” pipeline:
```bash
git add pipelines/myexp
git commit -m 'adding myexp scaffold'
```
## Executing a pipeline
To automatically run a pipeline:
```bash
popper run myexp
```
or, to execute all the pipelines in a project, omit the pipeline name:

```bash
popper run
```

Once a pipeline is run, one can show the logs:
```bash
ls -l pipelines/myexp/popper/host
```
## Reusing existing pipelines
When starting an experiment, it is often useful to use existing pipelines as scaffolding for the operations we wish to carry out. The Popperized GitHub organization exists as a curated list of existing Popperized experiments and examples, for the purpose of both learning and scaffolding new projects. Additionally, the CLI includes capabilities to easily sift through and import these pipelines.
## Searching for existing pipelines
The Popper CLI is capable of searching for premade and template pipelines that you can modify for your own uses. You can use the `popper search` command to find pipelines using keywords. For example, to search for pipelines that use docker you can simply run:
```
$ popper search docker
[####################################] Searching in popperized | 100%

Search results:

> popperized/popper-readthedocs-examples/docker-data-science
> popperized/swc-lesson-pipelines/docker-data-science
```
By default, this command will look inside the Popperized GitHub organization, but you can configure it to search the GitHub organization or repository of your choice with the `popper search --add <org-or-repo-name>` command. If you've added more organizations, you may list them with `popper search --ls`, or remove one with `popper search --rm <org-or-repo-name>`.
Additionally, when searching for a pipeline, you may choose to include the contents of the readme in your search by providing the additional `--include` flag to `popper search`.
## Importing existing pipelines
Once you have found a pipeline you're interested in importing, you can use `popper add` plus the full pipeline name to add the pipeline to your project:
```
$ popper add popperized/popper-readthedocs-examples/docker-data-science
Downloading pipeline docker-data-science as docker-data-science...
Updating popper configuration...
Pipeline docker-data-science has been added successfully.
```
This will download the contents of the repo to your project tree and register it in your `.popper.yml` configuration file. If you want to add the pipeline inside a different folder, you can also specify that in the `popper add` command:
```
$ popper add popperized/popper-readthedocs-examples/docker-data-science docker-pipeline
Downloading pipeline docker-data-science as docker-pipeline...
Updating popper configuration...
Pipeline docker-pipeline has been added successfully.

$ tree mypaper
mypaper
└── pipelines
    └── docker-pipeline
        ├── README.md
        ├── analyze.sh
        ├── docker
        │   ├── Dockerfile
        │   ├── app.py
        │   ├── generate_figures.py
        │   └── requirements.txt
        ├── generate-figures.sh
        ├── results
        │   ├── naive_bayes.png
        │   ├── naive_bayes_results.csv
        │   ├── svm_estimator.png
        │   └── svm_estimator_results.csv
        └── setup.sh
```
You can also tell `popper add` to instead pull the pipeline from another git branch by optionally providing the `--branch <branch-name>` option to the command.
## Continuously validating a pipeline
The following is the list of steps that are verified when validating a pipeline:
- For every pipeline, trigger an execution by sequentially invoking all the scripts for all the defined stages of the pipeline.
- After the pipeline finishes, if a `validate.sh` script is defined, parse its output.
- Keep track of every pipeline and report their status.
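As a sketch of what such a validation stage might contain, consider the script below; both the `[true]`/`[false]` output format and the specific checks are assumptions made for illustration, not taken from the official tooling:

```shell
#!/usr/bin/env bash
# Hypothetical validate.sh stage. Each validation prints one line:
# a [true]/[false] marker followed by the statement being checked
# (this output format is an assumption made for illustration).

check() {
  # check <statement> <command...>: run the command, report the result
  statement="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "[true] $statement"
  else
    echo "[false] $statement"
  fi
}

# hypothetical checks on artifacts produced by earlier stages
check "results file was generated" test -f results/output.csv
check "figures were rendered"      test -d results/figures
```

A CI service can then count the `[false]` lines to decide whether the validations passed.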
There are three possible statuses for every pipeline: `FAIL`, `PASS` and `GOLD`. There are two possible values for the status of a validation: `true` or `false`. When the pipeline status is `FAIL`, the list of validations is empty, since the pipeline execution has failed and validations are not able to execute at all. When the pipeline status is `GOLD`, the status of all validations is `true`. When the pipeline runs correctly but one or more validations fail (pipeline's status is `PASS`), the status of one or more validations is `false`.
The CLI tool includes a `run` subcommand that can be executed to test a pipeline locally. This subcommand is the same one executed by the PopperCI service, so the output of its invocation should be, in most cases, the same as the one obtained when PopperCI executes it. This helps in cases where one is testing locally. To execute a test locally:
```
cd my/paper/repo
popper run myexperiment
[####################################] None
status: SUCCESS
```
The status of the execution, as well as the `stdout` and `stderr` output for each stage, is stored in the `popper/host` directory inside your pipeline. In addition to the `host` directory, a new directory will be created for every environment you set your pipeline to run on.
```
popper/host
├── popper_status
├── post-run.sh.err
├── post-run.sh.out
├── run.sh.err
├── run.sh.out
├── setup.sh.err
├── setup.sh.out
├── teardown.sh.err
├── teardown.sh.out
├── validate.sh.err
└── validate.sh.out
```
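As a sketch of how these logs might be inspected after a run (the directory and file contents below are fabricated for illustration; in practice they are produced by `popper run`):

```shell
# simulate the log layout shown above (illustration only)
mkdir -p pipelines/myexp/popper/host
echo "SUCCESS" > pipelines/myexp/popper/host/popper_status
echo "setup done" > pipelines/myexp/popper/host/setup.sh.out

# read the overall status recorded for the last execution
cat pipelines/myexp/popper/host/popper_status

# inspect the stdout captured for an individual stage
cat pipelines/myexp/popper/host/setup.sh.out
```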
These files are added to the `.gitignore` file, so they won't be committed to the git repository when doing `git add`. To quickly remove them, one can clean the working tree:
```bash
# get list of files that would be deleted
# include directories (-d)
# include ignored files (-x)
git clean -dx --dry-run

# remove --dry-run and add --force to actually delete files
git clean -dx --force
```
By default, `popper run` will set a timeout on the execution of your pipelines. You may modify the timeout using the `--timeout` flag, in the form of `popper run --timeout 600s`. You can also disable the timeout altogether by setting `--timeout` to 0.
We maintain a badging service that can be used to keep track of the status of a pipeline. Badges are commonly used to denote the status of a software project with respect to a certain aspect, e.g. whether the latest version can be built without errors, or the percentage of code that unit tests cover (code coverage). The badges available for Popper are shown in the figure above. If badging is enabled, after the execution of a pipeline, the status of the pipeline is recorded in the badging server, which keeps track of statuses for every revision of every pipeline.
Users can include a link to the badge in the `README` page of a pipeline, which can be displayed on the web interface of the version control system (GitHub in this case). The CLI tool can generate these links:

```bash
popper badge --service popper
```

which prints to `stdout` the text that should be added to the `README` file of the pipeline.
## Visualizing a pipeline
Popper gives a user the ability to visualize the workflow of a pipeline using the `popper workflow pipeline_name` command. The command generates a workflow diagram corresponding to a Popper pipeline, in the `.dot` format. The string defining the graph is printed to `stdout` so it can be piped into other tools.
For example, to generate a PNG file, one can make use of the graphviz CLI tools:

```bash
popper workflow mypipe | dot -T png -o mypipe.png
```

For a pipeline named `co2-emissions`:

```bash
popper workflow co2-emissions | dot -T png -o co2_workflow.png
```
This will generate the dot graph for the `co2-emissions` pipeline (the rendered image is not shown here).
## Adding metadata to a project
Metadata can be added to a project using the `popper metadata` command, which adds a key-value pair to the repository (to the `.popper.yml` file):

```bash
popper metadata --add author='Jane Doe'
```
The above adds the metadata item `author` to the project. To retrieve the list of keys:

```bash
popper metadata
```
And one removes a key by doing:
```bash
popper metadata --rm author
```
## The popper.yml configuration file
The `popper` command reads the `.popper.yml` file in the root of a project to figure out how to execute pipelines. While this file can be manually created and modified, the `popper` command makes changes to this file depending on which commands are executed.
The project folder we will use as example looks like the following:
```
$> tree -a -L 2 my-paper
my-paper/
├── .git
├── .popper.yml
├── paper
└── pipelines
    ├── analysis
    └── data-generation
```
That is, it contains three pipelines named `paper`, `data-generation` and `analysis`. The `.popper.yml` for this project looks like the following:
```yaml
metadata:
  access_right: open
  license: CC-BY-4.0
  publication_type: article
  upload_type: publication

pipelines:
  paper:
    envs:
    - host
    path: paper
    stages:
    - build
  data-generation:
    envs:
    - host
    path: pipelines/data-generation
    stages:
    - first
    - second
    - post-run
    - validate
    - teardown
  analysis:
    envs:
    - host
    path: pipelines/analysis
    stages:
    - run
    - post-run
    - validate
    - teardown

popperized:
- github/popperized
```
At the top level of the YAML file there are entries named `metadata`, `pipelines` and `popperized`. The `pipelines` YAML entry specifies the details for all the available pipelines. For each pipeline, there is information about:

- the environment(s) in which the pipeline is executed.
- the path to that pipeline.
- the various stages that are present in it.
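Since `.popper.yml` is plain YAML, it can be inspected with standard shell tools. The snippet below writes a minimal file modeled on the example above (a hypothetical project) and extracts the pipeline names from it:

```shell
# write a minimal .popper.yml modeled on the example above
cat > .popper.yml <<'EOF'
pipelines:
  paper:
    envs:
    - host
    path: paper
  analysis:
    envs:
    - host
    path: pipelines/analysis
EOF

# list pipeline names: the keys indented two spaces under 'pipelines:'
sed -n '/^pipelines:/,/^[^ ]/s/^  \([A-Za-z0-9_-]*\):$/\1/p' .popper.yml
```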
The `paper` pipeline is generated by executing `popper init paper` and has, by default, a single stage named `build`.
The `envs` entry in `.popper.yml` specifies the environment in which a pipeline is executed as part of the `popper run` command. The available environments include:

- `host`. The pipeline is executed directly on the host.
- `alpine-3.4`, `ubuntu-xenial`, `centos-7.2`. For each of these, `popper run` is executed within a Docker container whose base image is the given Linux distribution. The container has `docker` available inside it, so other containers can be executed from within the pipeline.
The `popper init` command can be used to initialize a pipeline. By default, `host` is the registered environment. The `--env` flag of `popper init` can be used to specify another environment. For example:
```bash
popper init mypipe --env=alpine-3.4
```
The above specifies that the pipeline named `mypipe` will be executed inside a Docker container using the `alpine-3.4` popper check image.
To add more environment(s):
```bash
popper env mypipe --add ubuntu-xenial,centos-7.2
```
To remove an environment from the pipeline:

```bash
popper env mypipe --rm centos-7.2
```
The `stages` YAML entry specifies the sequence of stages that are executed by the `popper run` command. By default, the `popper init` command generates scaffold scripts for `setup.sh`, `run.sh`, `post-run.sh`, `validate.sh` and `teardown.sh`. If any of those are not present when the pipeline is executed using `popper run`, they are just skipped (without throwing an error). At least one stage needs to be executed, otherwise `popper run` throws an error.
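A pipeline's stages are just shell scripts inside the pipeline directory. The snippet below scaffolds a hypothetical pipeline by hand with a `setup.sh` and a `run.sh` (names follow the default stages; contents are placeholders for illustration):

```shell
# create a hypothetical pipeline directory with two stage scripts;
# popper run would execute them in the order listed under 'stages'
mkdir -p pipelines/mypipe

cat > pipelines/mypipe/setup.sh <<'EOF'
#!/usr/bin/env bash
# fetch or generate the input data used by the experiment
mkdir -p data
echo "sample input" > data/input.txt
EOF

cat > pipelines/mypipe/run.sh <<'EOF'
#!/usr/bin/env bash
# main experiment logic: reads data/ and writes results/
mkdir -p results
wc -w < data/input.txt > results/word_count.txt
EOF

chmod +x pipelines/mypipe/setup.sh pipelines/mypipe/run.sh
```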
If arbitrary names are desired for the stages of a pipeline, the `--stages` flag of the `popper init` command can be used. For example:

```bash
popper init arbitrary_stages \
  --stages 'preparation,execution,validation'
```
The above line generates the configuration for the `arbitrary_stages` pipeline shown in the example.
The `metadata` YAML entry specifies a set of key-value pairs that describe and give information about a project.
By default, a project’s metadata will be initialized with the following key-value pairs:
```
$> popper metadata
access_right: open
license: CC-BY-4.0
publication_type: article
upload_type: publication
```
A custom key-value pair can be added using the `popper metadata --add KEY=VALUE` command. For example:
```bash
popper metadata --add year=2018
```
This adds a metadata entry `year` to the metadata, which will now look like:

```yaml
access_right: open
license: CC-BY-4.0
publication_type: article
upload_type: publication
year: '2018'
```
To remove the entry `year` from the metadata, the `popper metadata --rm KEY` command can be used, as shown below:

```bash
popper metadata --rm year
```
## Archiving and DOI generation
Currently, the Popper CLI tool integrates with services like Zenodo and FigShare for archiving.
The first step is to create an account on Zenodo and generate an API token. Follow these steps (copied from [here](http://developers.zenodo.org/#creating-a-personal-access-token)):
- Register for a Zenodo account if you don’t already have one.
- Go to your Applications, to create a new token.
- Select the OAuth scopes you need (you need at least `deposit:write` and `deposit:actions`).
Now add some required metadata.
```bash
popper metadata --add title='<Your Title>'
popper metadata --add author1='<First Last, email@example.com, Affiliation>'
popper metadata --add abstract='<A short description of your repo>'
popper metadata --add keywords='<comma, separated, keywords>'
```
Now use the `popper archive` command to perform the archiving:

```bash
popper archive --service zenodo
```
Enter the token when prompted, and you will have a DOI available for your repository.
## Popperized repositories catalog
The `popperized` YAML entry specifies the list of GitHub organizations and repositories that contain popperized pipelines. By default, it points to the `github/popperized` organization. This list is used to look for pipelines as part of the `popper search` command.