The tragedy of data science is that 79% of an analyst’s time goes to data preparation. Data preparation is not only tedious, it steals time from analysis.
A data package is an abstraction that encapsulates and automates data preparation. More specifically, a data package is a tree of serialized data wrapped in a Python module. Each data package has a unique handle, a revision history, and a web page. Packages are stored in a server-side registry that enforces access control.
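As a mental model, the "tree of serialized data wrapped in a Python module" can be sketched with two tiny classes. This is an illustrative toy, not Quilt's implementation — in a real package the leaves hold serialized data that is deserialized on demand:

```python
# Hypothetical sketch of a data package as a tree -- NOT Quilt's actual
# implementation, just an illustration of the abstraction.
class GroupNode:
    """An inner node: holds child groups or data leaves as attributes."""
    def __init__(self, **children):
        for name, child in children.items():
            setattr(self, name, child)

class DataNode:
    """A leaf node: wraps a payload (a real package would lazily deserialize)."""
    def __init__(self, payload):
        self._payload = payload
    def data(self):
        return self._payload

# A package is then a tree with a stable set of attribute names:
fremont_bike = GroupNode(
    README=DataNode("Hourly bicycle counts on Seattle's Fremont Bridge"),
    counts=DataNode([4, 9, 1]),
)
print(fremont_bike.counts.data())  # → [4, 9, 1]
```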
Example: Bike for Your Rights
Suppose you wish to analyze bicycle traffic on Seattle’s Fremont Bridge. You could locate the source data, download it, parse it, index the date column, etc. — as Jake Vanderplas demonstrates — or you could install the data as a package in less than a minute:
$ pip install quilt  # requires HDF5; details below
$ quilt install akarve/fremont_bike
Now we can load the data directly into Python:
from quilt.data.akarve import fremont_bike
In contrast to files, data packages require very little data preparation. Package users can jump straight to the analysis.
Less is More
The Jupyter notebooks shown in Fig. 1 perform the same analysis on the same data. The notebooks differ only in data injection. On the left we see a typical file-based workflow: download files, discover file formats, write scripts to parse, clean, and load the data, run the scripts, and finally begin analysis. On the right we see a package-based workflow: install the data, import the data, and begin the analysis. The key takeaway is that file-based workflows require substantial data preparation (red) prior to analysis (green).
Figure 1. File-based workflows (left) require significantly more prep than package-based workflows (right)
(Both notebooks are available on GitHub.)
Data Packages in Detail
First, install the Quilt client:
$ pip install quilt
Get a Data Package
Recall how we acquired the Fremont Bridge data:
$ quilt install akarve/fremont_bike
quilt install connects to a remote registry and materializes a package on the calling machine.
quilt install is similar in spirit to git clone or npm install, but it scales to big data, keeps your source code history clean, and handles serialization.
Work with Package Data
To simplify dependency injection, Quilt rolls data packages into a Python module so that you can import data like you import code:
from quilt.data.akarve import fremont_bike
Importing large data packages is fast since disk I/O is deferred until the data are referenced in code. At the moment of reference, binary data are copied from disk into main memory. Since there’s no parsing overhead, deserialization is five to twenty times faster than loading data from text files.
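The effect is easy to demonstrate with pandas alone. The sketch below uses pickle as a stand-in for Quilt's binary store: the binary read copies bytes into memory with no parsing, while the CSV read must parse every character. Exact speedups depend on the data and the disk.

```python
import time
import numpy as np
import pandas as pd

# Sample frame standing in for the bike counts (the real package isn't assumed).
df = pd.DataFrame(np.random.rand(100_000, 2),
                  columns=["West Sidewalk", "East Sidewalk"])

df.to_csv("counts.csv", index=False)  # text: must be parsed on every read
df.to_pickle("counts.pkl")            # binary: bytes map straight to memory

t0 = time.perf_counter()
from_csv = pd.read_csv("counts.csv")
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
from_bin = pd.read_pickle("counts.pkl")
t_bin = time.perf_counter() - t0

print(f"CSV read: {t_csv:.3f}s, binary read: {t_bin:.3f}s")
```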
We can see that fremont_bike is a group containing two items:

>>> fremont_bike
<GroupNode '/Users/akarve/quilt_packages/akarve/fremont_bike':''>
README
counts
A group contains other groups and, at its leaves, contains data:
>>> fremont_bike.counts.data()
                     West Sidewalk  East Sidewalk
Date
2012-10-03 00:00:00              4              9
2012-10-03 01:00:00              4              6
2012-10-03 02:00:00              1              1
...
[39384 rows x 2 columns]
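Once .data() returns a DataFrame, the analysis itself is short. As an illustration, with a small synthetic stand-in for the real counts (since the package may not be installed locally), resampling the hourly counts to weekly totals is one line:

```python
import numpy as np
import pandas as pd

# Stand-in for fremont_bike.counts.data(): hourly counts on a DatetimeIndex.
idx = pd.date_range("2012-10-03", periods=24 * 14, freq="h")  # two weeks, hourly
counts = pd.DataFrame(
    {"West Sidewalk": np.arange(len(idx)) % 5,
     "East Sidewalk": np.arange(len(idx)) % 7},
    index=idx,
)

# One line from hourly counts to weekly totals, thanks to the date index:
weekly = counts.resample("W").sum()
print(weekly.head())
```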
Create a Package
Let’s start with some source data. How do we convert source files into a data package? We’ll need a configuration file, conventionally called build.yml, that tells quilt how to structure a package. Fortunately, we don’t need to write build.yml by hand. quilt generate creates a build file that mirrors the contents of any directory:
$ quilt generate src
Let’s open the file that we just generated, src/build.yml:

contents:
  Fremont_Hourly_Bicycle_Counts_October_2012_to_present:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
  README:
    file: README.md
contents dictates the structure of a package.
Let’s edit build.yml to shorten the Python name for our data. Oh, and let’s index on the “Date” column:
contents:
  counts:
    file: Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv
    index_col: Date
    parse_dates: True
  README:
    file: README.md
counts — or any name that we write in its place — is the name that package users will type to access the data extracted from the CSV file. Behind the scenes, index_col and parse_dates are passed to pandas.read_csv as keyword arguments.
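In other words, the build step boils down to the familiar pandas call below; the toy in-memory CSV here is just a hypothetical stand-in for the source file:

```python
import io
import pandas as pd

# A tiny stand-in for the Fremont source CSV.
csv = io.StringIO(
    "Date,West Sidewalk,East Sidewalk\n"
    "2012-10-03 00:00:00,4,9\n"
    "2012-10-03 01:00:00,4,6\n"
)

# build.yml's index_col and parse_dates become read_csv keyword arguments.
counts = pd.read_csv(csv, index_col="Date", parse_dates=True)
print(counts.index.dtype)  # the index is parsed into datetimes, ready for time-series work
```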
Now we can build our package:
$ quilt build YOUR_NAME/fremont_bike src/build.yml
...
src/Fremont_Hourly_Bicycle_Counts_October_2012_to_present.csv...
100%|███████████████████████████| 1.13M/1.13M [00:09<00:00, 125KB/s]
Saving as binary dataframe...
Built YOUR_NAME/fremont_bike successfully.
You’ll notice that quilt build takes a few seconds to construct the date index.
The build process has two key advantages: 1) parsing and serialization are automated; 2) packages are built once for the benefit of all users — there’s no repetitive data prep.
Push to the Registry
We’re ready to push our package to the registry, where it’s stored for anyone who needs it:
$ quilt login  # accounts are free; only registered users can push
$ quilt push YOUR_NAME/fremont_bike
The package now resides in the registry and has a landing page populated by src/README.md. Landing pages look like this.
Packages are private by default, so you’ll see a 404 unless you’re logged in to the registry. To publish a package, use:

$ quilt access add YOUR_NAME/fremont_bike public
To share a package with a specific user, replace public with their Quilt username.
Package handles, such as akarve/fremont_bike, provide a common frame of reference that can be reproduced by any user on any machine. But what happens if the data changes?
quilt log tracks changes over time:
# run in the same directory where you ran: quilt install akarve/fremont_bike
$ quilt log akarve/fremont_bike
Hash                           Pushed               Author
495992b6b9109a1f9d5e209d6...   2017-04-14 14:33:40  akarve
24bb9d6e9d80000d9bc5fdc1e...   2017-03-29 20:42:43  akarve
03d2450e755cf45fbbf9c3635...   2017-03-29 17:40:47  akarve
quilt install -x allows us to install historical snapshots:
$ quilt install akarve/fremont_bike -x 24bb9d6e9d80000d9bc5fdc1e89a0a77c40da33da5a054b05cdec29755ac408b
The upshot for reproducibility is that we no longer run models on “some data,” but on specific hash versions of specific packages.
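The principle behind those hashes can be sketched in a few lines of Python. This is an illustration of content addressing, not Quilt's exact hashing scheme: hash the serialized bytes of the data, and any change to the data changes the hash.

```python
import hashlib
import pandas as pd

def data_hash(df: pd.DataFrame) -> str:
    """Content hash of a DataFrame's serialized bytes (illustrative only)."""
    payload = df.to_csv().encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

v1 = pd.DataFrame({"West Sidewalk": [4, 4, 1]})
v2 = v1.copy()
v2.loc[0, "West Sidewalk"] = 5  # the data changed...

print(data_hash(v1)[:12])
print(data_hash(v2)[:12])       # ...so the hash changes too
```

A model run recorded against such a hash pins down exactly which bytes it consumed.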
Data packages make for fast, reproducible analysis by simplifying data prep, eliminating parsing, and versioning data. In round numbers, data packages speed both I/O and data preparation by a factor of 10.
In future articles we’ll virtualize data packages across Python, Spark, and R.
To learn more visit QuiltData.com.
The Quilt client is open source. Visit our GitHub repository to contribute.
Appendix: Command summary

$ quilt generate DIR                    # create a build.yml mirroring a directory
$ quilt build USER/PACKAGE BUILD_FILE   # build a package from a build file
$ quilt login                           # authenticate to the registry
$ quilt push USER/PACKAGE               # upload a package to the registry
$ quilt install USER/PACKAGE [-x HASH]  # download a package (optionally a snapshot)
$ quilt log USER/PACKAGE                # show a package’s revision history
$ quilt access add USER/PACKAGE public  # publish (or share with a named user)
1. We plan to transition to Apache Parquet in the near future.