Remap: Nested Data Multitool for Python

This entry is the first in a series of "cookbooklets" showcasing more advanced Boltons. If all goes well, the next 5 minutes will literally save you 5 hours.

Intro

Data is everywhere, especially within itself. That's right, whether it's public APIs, document stores, or plain old configuration files, data will nest. And that nested data will find you.

UI fads aside, developers have always liked "flat". Even Python, so often turned to for data wrangling, only has succinct built-in constructs for dealing with flat data. List comprehensions, generator expressions, map/filter, and itertools are all built for flat work. In fact, the allure of flat data is likely a direct result of this common gap in most programming languages.

Let's change that. First, let's meet this nested adversary. Provided you overlook my taste in media, it's hard to fault nested data when it reads as well as this YAML:

reviews:
  shows:
    - title: Star Trek - The Next Generation
      rating: 10
      review: Episodic AND deep. <3 Data.
      tags: ['space']
    - title: Monty Python's Flying Circus
      rating: 10
      tags: ['comedy']
  movies:
    - title: The Hitchhiker's Guide to the Galaxy
      rating: 6
      review: So great to see Mos Def getting good work.
      tags: ['comedy', 'space', 'life']
    - title: Monty Python's Meaning of Life
      rating: 7
      review: Better than Brian, but not a Holy Grail, nor Completely Different.
      tags: ['comedy', 'life']
      prologue:
        title: The Crimson Permanent Assurance
        rating: 9

Even this very straightforwardly nested data can be a real hassle to manipulate. How would one add a default review for entries without one? How would one convert the ratings to a 5-star scale? And what does all of this mean for more complex real-world cases, exemplified by this excerpt from a real GitHub API response:

[{
    "id": "3165090957",
    "type": "PushEvent",
    "actor": {
      "id": 130193,
      "login": "mahmoud",
      "gravatar_id": "",
      "url": "https://api.github.com/users/mahmoud",
      "avatar_url": "https://avatars.githubusercontent.com/u/130193?"
    },
    "repo": {
      "id": 8307391,
      "name": "mahmoud/boltons",
      "url": "https://api.github.com/repos/mahmoud/boltons"
    },
    "payload": {
      "push_id": 799258895,
      "size": 1,
      "distinct_size": 1,
      "ref": "refs/heads/master",
      "head": "27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260",
      "before": "0d6486c40282772bab232bf393c5e6fad9533a0e",
      "commits": [
        {
          "sha": "27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260",
          "author": {
            "email": "mahmoud@hatnote.com",
            "name": "Mahmoud Hashemi"
          },
          "message": "switched reraise_visit to be just a kwarg",
          "distinct": true,
          "url": "https://api.github.com/repos/mahmoud/boltons/commits/27a4bc1b6d1da25a38fe8e2c5fb27f22308e3260"
        }
      ]
    },
    "public": true,
    "created_at": "2015-09-21T10:04:37Z"
}]

The astute reader may spot some inconsistency and general complexity, but don't run away.

Remap, the recursive map, is here to save the day.

Remap is a Pythonic traversal utility that creates a transformed copy of your nested data. It uses three callbacks -- visit, enter, and exit -- and is designed to accomplish the vast majority of tasks by passing only one function, usually visit. The API docs have full descriptions, but the basic rundown is: visit transforms an individual item, enter controls whether and how a container is traversed, and exit controls how the new container is populated.

It may sound complex, but the examples shed a lot of light. So let's get remapping!

Normalize keys and values

First, let's import the modules and data we'll need.

import json
import yaml  # https://pypi.python.org/pypi/PyYAML
from boltons.iterutils import remap  # https://pypi.python.org/pypi/boltons

# media_reviews and github_events hold the YAML and JSON text shown above
review_map = yaml.safe_load(media_reviews)

event_list = json.loads(github_events)

Now let's turn back to that GitHub API data. Earlier one may have been annoyed by the inconsistent type of id. event['repo']['id'] is an integer, but event['id'] is a string. When sorting events by ID, you would not want string ordering.
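To see why string ordering is a problem, here is a small standard-library illustration. The first two IDs are made up for the example; only the last one comes from the event above.

```python
# Hypothetical IDs, plus the real event ID from the excerpt above
ids_as_strings = ['9', '10', '3165090957']
ids_as_ints = [int(i) for i in ids_as_strings]

# Strings compare character by character, so '10' sorts before '9'
print(sorted(ids_as_strings))  # ['10', '3165090957', '9']

# Integers compare numerically, which is what we want for IDs
print(sorted(ids_as_ints))     # [9, 10, 3165090957]
```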

With remap, fixing this sort inconsistency couldn't be easier:

from boltons.iterutils import remap

def visit(path, key, value):
    if key == 'id':
        return key, int(value)
    return key, value

remapped = remap(event_list, visit=visit)

assert remapped[0]['id'] == 3165090957

# You can even do it in one line:
remap(event_list, lambda p, k, v: (k, int(v)) if k == 'id' else (k, v))

By default, visit gets called on every item in the root structure, including lists, dicts, and other containers, so let's take a closer look at its signature. visit takes three arguments we're going to see in all of remap's callbacks: path, key, and value.

key and value are exactly what you would expect, though it may bear mentioning that the key for a list item is its index. path refers to the keys of all the parents of the current item, not including the key. For example, looking at the GitHub event data, the commit author's name's path is (0, 'payload', 'commits', 0, 'author'), because the key, name, is located in the author of the first commit in the payload of the first event.
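One way to get a feel for path is to follow it by hand. This sketch uses only the standard library and a trimmed-down copy of the event above; it walks the path to reach the parent, then uses the key to reach the value.

```python
from functools import reduce
import operator

# Trimmed-down copy of the GitHub event, keeping only the relevant branch
event_list = [{
    'payload': {
        'commits': [
            {'author': {'email': 'mahmoud@hatnote.com',
                        'name': 'Mahmoud Hashemi'}},
        ],
    },
}]

path = (0, 'payload', 'commits', 0, 'author')  # keys of all the parents
key = 'name'                                   # key of the item itself

# Apply each step of the path in turn: event_list[0]['payload']...
parent = reduce(operator.getitem, path, event_list)

print(parent[key])  # Mahmoud Hashemi
```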

As for the return signature of visit, it's very similar to the input. Just return the new (key, value) you want in the remapped output.

Drop empty values

Next up, GitHub's move away from Gravatars left an artifact in their API: a blank 'gravatar_id' key. We can get rid of that item, and any other blank strings, in a jiffy:

drop_blank = lambda p, k, v: v != ""
remapped = remap(event_list, visit=drop_blank)

assert 'gravatar_id' not in remapped[0]['actor']

Unlike the previous example, instead of a (key, value) pair, this visit is returning a bool. For added convenience, when visit returns True, remap carries over the original item unmodified. Returning False drops the item from the remapped structure.

With the ability to arbitrarily transform items, pass through old items, and drop items from the remapped structure, it's clear that the visit function makes the majority of recursive transformations trivial. So many tedious and error-prone lines of traversal code turn into one-liners that remap with a visit callback is usually all one needs. With that said, the next recipes focus on remap's more advanced callable arguments, enter and exit.

Convert dictionaries to OrderedDicts

So far we've looked at actions on remapping individual items, using the visit callable. Now we turn our attention to actions on containers, the parent objects of individual items. We'll start doing this by looking at the enter argument to remap.

# from collections import OrderedDict
from boltons.dictutils import OrderedMultiDict as OMD
from boltons.iterutils import remap, default_enter

def enter(path, key, value):
    if isinstance(value, dict):
        return OMD(), sorted(value.items())
    return default_enter(path, key, value)

remapped = remap(review_map, enter=enter)
assert list(remapped['reviews'].keys())[0] == 'movies'
# True because 'reviews' is now ordered and 'movies' comes before 'shows'

The enter callable controls both if and how an object is traversed. Like visit, it accepts path, key, and value. But instead of (key, value), it returns a tuple of (new_parent, items). new_parent is the container that will receive items remapped by the visit callable. items is an iterable of (key, value) pairs that will be passed to visit. Alternatively, items can be False, to tell remap that the current value should not be traversed, but that's getting pretty advanced. The API docs have some other enter details to consider.

Also note how this code builds on the default remap logic by calling through to the default_enter function, imported from the same place as remap itself. Most practical use cases will want to do this, but of course the choice is yours.

Sort all lists

The last example used enter to interact with containers before they were traversed. This time, to sort all lists in a structure, we'll use remap's final callable argument: exit.

from boltons.iterutils import remap, default_exit

def exit(path, key, old_parent, new_parent, new_items):
    ret = default_exit(path, key, old_parent, new_parent, new_items)
    if isinstance(ret, list):
        try:
            ret.sort()
        except TypeError:
            pass  # e.g., a list of dicts, which Python 3 cannot order
    return ret

remap(review_map, exit=exit)

Similar to the enter example, we're building on remap's default behavior by importing and calling default_exit. Looking at the arguments passed to exit and default_exit, there's the path and key that we're used to from visit and enter. value is there, too, but it's named old_parent, to differentiate it from the new value, appropriately called new_parent. At the point exit is called, new_parent is just an empty structure as constructed by enter, and exit's job is to fill that new container with new_items, a list of (key, value) pairs returned by remap's calls to visit. Still with me?

Either way, here we don't interact with the arguments. We just call default_exit and work on its return value, new_parent, sorting it in-place if it's a list. Pretty simple! In fact, very attentive readers might point out this can be done with visit, because remap's very next step is to call visit with the new_parent. You'll have to forgive the contrived example and let it be a testament to the rarity of overriding exit. Without going into the details, enter and exit are most useful when teaching remap how to traverse nonstandard containers, such as non-iterable Python objects. As mentioned in the "drop empty values" example, remap is designed to maximize the mileage you get out of the visit callback. Let's look at an advanced use that shows why.

Collect interesting values

Sometimes you just want to traverse a nested structure, and you don't need the result. For instance, say we wanted to collect the full set of tags used in media reviews. Let's create a remap-based function, get_all_tags:

def get_all_tags(root):
    all_tags = set()

    def visit(path, key, value):
        all_tags.update(value['tags'])
        return False

    remap(root, visit=visit, reraise_visit=False)

    return all_tags

print(get_all_tags(review_map))
# {'space', 'comedy', 'life'} (set order may vary)

Like the first recipe, we've used the visit argument to remap, and like the second recipe, we're just returning False, because we don't actually care about the contents of the resulting structure.

What's new here is the reraise_visit=False keyword argument, which tells remap to keep any item that causes a visit exception. This practical convenience lets visit functions be shorter, clearer, and just more EAFP. Reducing the example to a one-liner is left as an exercise to the reader.

Add common keys

As a final advanced remap example, let's look at adding items to structures. Through the examples above, we've learned that visit is best-suited for 1:1 transformations and dropping values. This leaves us with two main approaches for addition. The first uses the enter callable and is suitable for making data consistent and adding data which can be overridden.

base_review = {'title': '',
               'rating': None,
               'review': '',
               'tags': []}

def enter(path, key, value):
    new_parent, new_items = default_enter(path, key, value)
    try:
        new_parent.update(base_review)
    except AttributeError:
        pass  # lists and non-containers have no update()
    return new_parent, new_items

remapped = remap(review_map['reviews'], enter=enter)

assert remapped['shows'][1]['review'] == ''
# True, the placeholder review is holding its place

The second method uses the exit callback to override values and calculate new values from the new data.

def exit(path, key, old_parent, new_parent, new_items):
    ret = default_exit(path, key, old_parent, new_parent, new_items)
    try:
        ret['review_length'] = len(ret['review'])
    except (KeyError, TypeError):
        pass  # no 'review' key, or not a dict at all
    return ret

remapped = remap(review_map['reviews'], exit=exit)

assert remapped['shows'][0]['review_length'] == 27
assert remapped['movies'][0]['review_length'] == 42
# True times two.

By now you might agree that remap is making such feats positively routine. Come for the nested data manipulation, stay for the number jokes.

Corner cases

This whole guide has focused on data that came from "real-world" sources, such as JSON API responses. But there are certain rare cases which typically only arise from within Python code: self-referential objects. These are objects that contain references to themselves or their parents. Have a look at this trivial example:

self_ref = []
self_ref.append(self_ref)

Experienced programmers have probably seen this before, but many Python coders might think the second line is an error. It's a list containing itself, and it has the rather cool repr: [[...]].

Now, this is pretty rare, but reference loops do come up in programming. The good news is that remap handles these just fine:

print(repr(remap(self_ref)))
# prints "[[...]]"

The more common corner case that arises is that of duplicate references, which remap also handles with no problem:

my_set = set()

dupe_ref = (my_set, [my_set])
remapped = remap(dupe_ref)

assert remapped[0] is remapped[-1][-1]
# True, of course

Two references to the same set go in, two references to a copy of that set come out. That's right: only one copy is made, and then used twice, preserving the original structure.

Wrap-up

If you've made it this far, then I hope you'll agree that remap is useful enough to be your new friend. If that wasn't enough detail, then there are the docs. remap is well-tested, but making something this general-purpose is a tricky area. Please file bugs and requests. Don't forget about pprint and repr/reprlib, which can help with reading large structures. As always, stay tuned for future boltons cookbooklets, and much much more.

