Remap: Nested Data Multitool for Python
This entry is the first in a series of "cookbooklets" showcasing more advanced Boltons. If all goes well, the next 5 minutes will literally save you 5 hours.
Intro
Data is everywhere, especially within itself. That's right, whether it's public APIs, document stores, or plain old configuration files, data will nest. And that nested data will find you.
UI fads aside, developers have always liked "flat". Even Python, so often turned to for data wrangling, only has succinct built-in constructs for dealing with flat data. List comprehensions, generator expressions, map/filter, and itertools are all built for flat work. In fact, the allure of flat data is likely a direct result of this common gap in most programming languages.
Let's change that. First, let's meet this nested adversary. Provided you overlook my taste in media, it's hard to fault nested data when it reads as well as this YAML:
    reviews:
      shows:
        - title: Star Trek - The Next Generation
          rating: 10
          review: Episodic AND deep. <3 Data.
          tags: ['space']
        - title: Monty Python's Flying Circus
          rating: 10
          tags: ['comedy']
      movies:
        - title: The Hitchiker's Guide to the Galaxy
          rating: 6
          review: So great to see Mos Def getting good work.
          tags: ['comedy', 'space', 'life']
        - title: Monty Python's Meaning of Life
          rating: 7
          review: Better than Brian, but not a Holy Grail, nor Completely Different.
          tags: ['comedy', 'life']
          prologue:
            title: The Crimson Permanent Assurance
            rating: 9
Even this very straightforwardly nested data can be a real hassle to manipulate. How would one add a default review for entries without one? How would one convert the ratings to a 5-star scale? And what does all of this mean for more complex real-world cases, exemplified by this excerpt from a real GitHub API response:
    [{
        "id": "3165090957",
        "type": "PushEvent",
        "actor": {
          "id": 130193,
          "login": "mahmoud",
          "gravatar_id": "",
          "url": "https://api.github.com/users/mahmoud",
          "avatar_url": "https://avatars.githubusercontent.com/u/130193?"
        },
        "repo": {
          "id": 8307391,
          "name": "mahmoud/boltons",
          "url": "https://api.github.com/repos/mahmoud/boltons"
        },
        "payload": {
          "push_id": 799258895,
          "size": 1,
          "distinct_size": 1,
          "ref": "refs/heads/master",
          "head": "27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260",
          "before": "0d6486c40282772bab232bf393c5e6fad9533a0e",
          "commits": [
            {
              "sha": "27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260",
              "author": {
                "email": "mahmoud@hatnote.com",
                "name": "Mahmoud Hashemi"
              },
              "message": "switched reraise_visit to be just a kwarg",
              "distinct": true,
              "url": "https://api.github.com/repos/mahmoud/boltons/commits/27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260"
            }
          ]
        },
        "public": true,
        "created_at": "2015-09-21T10:04:37Z"
    }]
The astute reader may spot some inconsistency and general complexity, but don't run away.
Remap, the recursive map, is here to save the day.
Remap is a Pythonic traversal utility that creates a transformed copy of your nested data. It uses three callbacks -- visit, enter, and exit -- and is designed to accomplish the vast majority of tasks by passing only one function, usually visit. The API docs have full descriptions, but the basic rundown is:
visit transforms an individual item
enter controls how container objects are created and traversed
exit controls how new container objects are populated
It may sound complex, but the examples shed a lot of light. So let's get remapping!
Normalize keys and values
First, let's import the modules and data we'll need.
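    import json
    import yaml  # https://pypi.python.org/pypi/PyYAML
    from boltons.iterutils import remap  # https://pypi.python.org/pypi/boltons

    # media_reviews and github_events are strings holding the YAML and JSON above
    review_map = yaml.safe_load(media_reviews)  # safe_load is enough for plain data
    event_list = json.loads(github_events)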
Now let's turn back to that GitHub API data. Earlier one may have been annoyed by the inconsistent type of id. event['repo']['id'] is an integer, but event['id'] is a string. When sorting events by ID, you would not want string ordering.
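To see why string ordering is a problem, here is a quick illustration using only built-ins. The first ID comes from the event above; the second, '999', is made up for the comparison.

```python
# IDs as strings, the way event['id'] arrives. '999' is a hypothetical second ID.
string_ids = ['3165090957', '999']
int_ids = [int(i) for i in string_ids]

# Strings compare character by character, so '3...' sorts before '9...'
sorted_as_strings = sorted(string_ids)

# Integers compare by magnitude
sorted_as_ints = sorted(int_ids)

print(sorted_as_strings)
print(sorted_as_ints)
```

The string sort puts the ten-digit ID first, while the integer sort puts it last.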
With remap, fixing this sort inconsistency couldn't be easier:
    from boltons.iterutils import remap

    def visit(path, key, value):
        if key == 'id':
            return key, int(value)
        return key, value

    remapped = remap(event_list, visit=visit)

    assert remapped[0]['id'] == 3165090957

    # You can even do it in one line:
    remap(event_list, lambda p, k, v: (k, int(v)) if k == 'id' else (k, v))
By default, visit gets called on every item in the root structure, including lists, dicts, and other containers, so let's take a closer look at its signature. visit takes three arguments we're going to see in all of remap's callbacks:
path is a tuple of keys leading up to the current item
key is the current item's key
value is the current item's value
key and value are exactly what you would expect, though it may bear mentioning that the key for a list item is its index. path refers to the keys of all the parents of the current item, not including the key. For example, looking at the GitHub event data, the commit author's name's path is (0, 'payload', 'commits', 0, 'author'), because the key, name, is located in the author of the first commit in the payload of the first event.
As for the return signature of visit, it's very similar to the input. Just return the new (key, value) you want in the remapped output.
Drop empty values
Next up, GitHub's move away from Gravatars left an artifact in their API: a blank 'gravatar_id' key. We can get rid of that item, and any other blank strings, in a jiffy:
    drop_blank = lambda p, k, v: v != ""
    remapped = remap(event_list, visit=drop_blank)

    assert 'gravatar_id' not in remapped[0]['actor']
Unlike the previous example, instead of a (key, value) pair, this visit is returning a bool. For added convenience, when visit returns True, remap carries over the original item unmodified. Returning False drops the item from the remapped structure.
With the ability to arbitrarily transform items, pass through old items, and drop items from the remapped structure, it's clear that the visit function makes the majority of recursive transformations trivial. So many tedious and error-prone lines of traversal code turn into one-liners that remap with a visit callback is usually all one needs. With that said, the next recipes focus on remap's more advanced callable arguments, enter and exit.
Convert dictionaries to OrderedDicts
So far we've looked at actions on remapping individual items, using the visit callable. Now we turn our attention to actions on containers, the parent objects of individual items. We'll start doing this by looking at the enter argument to remap.
    # from collections import OrderedDict
    from boltons.dictutils import OrderedMultiDict as OMD
    from boltons.iterutils import remap, default_enter

    def enter(path, key, value):
        if isinstance(value, dict):
            return OMD(), sorted(value.items())
        return default_enter(path, key, value)

    remapped = remap(review_map, enter=enter)
    assert list(remapped['reviews'].keys())[0] == 'movies'
    # True because 'reviews' is now ordered and 'movies' comes before 'shows'
The enter callable controls both if and how an object is traversed. Like visit, it accepts path, key, and value. But instead of (key, value), it returns a tuple of (new_parent, items). new_parent is the container that will receive items remapped by the visit callable. items is an iterable of (key, value) pairs that will be passed to visit. Alternatively, items can be False, to tell remap that the current value should not be traversed, but that's getting pretty advanced. The API docs have some other enter details to consider.
Also note how this code builds on the default remap logic by calling through to the default_enter function, imported from the same place as remap itself. Most practical use cases will want to do this, but of course the choice is yours.
Sort all lists
The last example used enter to interact with containers before they were traversed. This time, to sort all lists in a structure, we'll use remap's final callable argument: exit.
    from boltons.iterutils import remap, default_exit

    def exit(path, key, old_parent, new_parent, new_items):
        ret = default_exit(path, key, old_parent, new_parent, new_items)
        if isinstance(ret, list):
            try:
                ret[:] = sorted(ret)  # sorts in place
            except TypeError:
                pass  # Python 3 cannot order some lists, such as lists of dicts
        return ret

    remap(review_map, exit=exit)
Similar to the enter example, we're building on remap's default behavior by importing and calling default_exit. Looking at the arguments passed to exit and default_exit, there's the path and key that we're used to from visit and enter. value is there, too, but it's named old_parent, to differentiate it from the new value, appropriately called new_parent. At the point exit is called, new_parent is just an empty structure as constructed by enter, and exit's job is to fill that new container with new_items, a list of (key, value) pairs returned by remap's calls to visit. Still with me?
Either way, here we don't interact with the arguments. We just call default_exit and work on its return value, new_parent, sorting it in-place if it's a list. Pretty simple! In fact, very attentive readers might point out this can be done with visit, because remap's very next step is to call visit with the new_parent. You'll have to forgive the contrived example and let it be a testament to the rarity of overriding exit. Without going into the details, enter and exit are most useful when teaching remap how to traverse nonstandard containers, such as non-iterable Python objects. As mentioned in the "drop empty values" example, remap is designed to maximize the mileage you get out of the visit callback. Let's look at an advanced use that shows why that's true.
Collect interesting values
Sometimes you just want to traverse a nested structure, and you don't need the result. For instance, say we wanted to collect the full set of tags used in media reviews. Let's create a remap-based function, get_all_tags:
    def get_all_tags(root):
        all_tags = set()

        def visit(path, key, value):
            all_tags.update(value['tags'])
            return False

        remap(root, visit=visit, reraise_visit=False)

        return all_tags

    print(get_all_tags(review_map))
    # {'space', 'comedy', 'life'} (set ordering may vary)
Like the first recipe, we've used the visit argument to remap, and like the second recipe, we're just returning False, because we don't actually care about the contents of the resulting structure.
What's new here is the reraise_visit=False keyword argument, which tells remap to keep any item that causes a visit exception. This practical convenience lets visit functions be shorter, clearer, and just more EAFP (easier to ask forgiveness than permission). Reducing the example to a one-liner is left as an exercise to the reader.
Add common keys
As a final advanced remap example, let's look at adding items to structures. Through the examples above, we've learned that visit is best-suited for 1:1 transformations and dropping values. This leaves us with two main approaches for addition. The first uses the enter callable and is suitable for making data consistent and adding data which can be overridden.
    from boltons.iterutils import remap, default_enter

    base_review = {'title': '', 'rating': None, 'review': '', 'tags': []}

    def enter(path, key, value):
        new_parent, new_items = default_enter(path, key, value)
        try:
            new_parent.update(base_review)
        except AttributeError:
            pass  # lists, strings, and numbers have no update()
        return new_parent, new_items

    remapped = remap(review_map, enter=enter)

    assert remapped['reviews']['shows'][1]['review'] == ''
    # True, the placeholder review is holding its place
The second method uses the exit callback to override values and calculate new values from the new data.
    from boltons.iterutils import remap, default_exit

    def exit(path, key, old_parent, new_parent, new_items):
        ret = default_exit(path, key, old_parent, new_parent, new_items)
        try:
            ret['review_length'] = len(ret['review'])
        except (KeyError, TypeError):
            pass  # no 'review' key, or not a dict at all
        return ret

    remapped = remap(review_map, exit=exit)

    assert remapped['reviews']['shows'][0]['review_length'] == 27
    assert remapped['reviews']['movies'][0]['review_length'] == 42
    # True times two.
By now you might agree that remap is making such feats positively routine. Come for the nested data manipulation, stay for the number jokes.
Corner cases
This whole guide has focused on data that came from "real-world" sources, such as JSON API responses. But there are certain rare cases which typically only arise from within Python code: self-referential objects. These are objects that contain references to themselves or their parents. Have a look at this trivial example:
    self_ref = []
    self_ref.append(self_ref)
The experienced programmer has probably seen this before, but most Python coders might even think the second line is an error. It's a list containing itself, and it has the rather cool repr: [[...]].
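A few quick checks, using only built-ins, confirm what that repr is saying: the list's only element is the list itself. The naive_depth helper below is hypothetical, included to show the kind of hand-written traversal that keeps recursing on such data until Python's recursion limit stops it.

```python
self_ref = []
self_ref.append(self_ref)

# The single element is the very same object as its container
is_own_child = self_ref[0] is self_ref

def naive_depth(obj):
    # Hypothetical helper: recursion with no memory of what it has already seen
    if not isinstance(obj, list):
        return 0
    deepest = 0
    for item in obj:
        deepest = max(deepest, naive_depth(item))
    return 1 + deepest

try:
    naive_depth(self_ref)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(repr(self_ref), is_own_child, hit_limit)
```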
Now, this is pretty rare, but reference loops do come up in programming. The good news is that remap handles these just fine:
    print(repr(remap(self_ref)))
    # prints "[[...]]"
The more common corner case that arises is that of duplicate references, which remap also handles with no problem:
    my_set = set()
    dupe_ref = (my_set, [my_set])
    remapped = remap(dupe_ref)

    assert remapped[0] is remapped[-1][-1]
    # True, of course
Two references to the same set go in, two references to a copy of that set come out. That's right: only one copy is made, and then used twice, preserving the original structure.
Wrap-up
If you've made it this far, then I hope you'll agree that remap is useful enough to be your new friend. If that wasn't enough detail, then there are the docs. remap is well-tested, but making something this general-purpose is a tricky area. Please file bugs and requests. Don't forget about pprint and repr/reprlib, which can help with reading large structures. As always, stay tuned for future boltons cookbooklets, and much much more.