Announcing glom: Restructured Data for Python
This post introduces glom, Python's missing operator for nested objects and data.
If you're an easy sell, full API docs and a tutorial are already available at glom.readthedocs.io. Harder sells, this 5-minute post is for you. Really hard sells met me at PyCon, where I gave this 5-minute talk.
The Spectre of Structure
In the Python world, there's a saying: "Flat is better than nested."
Maybe times have changed or maybe that adage just applies more to code than data. In spite of the warning, nested data continues to grow, from document stores to RPC systems to structured logs to plain ol' JSON web services.
After all, if "flat" were the be-all and end-all, why would namespaces be one honking great idea? Nobody likes artificial flatness, and nobody wants to call a function with 40 arguments.
Nested data is tricky though. Reaching into deeply structured data can get you some ugly errors. Consider this simple line:
value = target.a['b']['c']
That single line can result in at least four different exceptions, each less helpful than the last:
AttributeError: 'TargetType' object has no attribute 'a'
KeyError: 'b'
TypeError: 'NoneType' object has no attribute '__getitem__'
TypeError: list indices must be integers, not str
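To see how one line fans out into several failure modes, here is a small plain-Python sketch. TargetType, WithA, and failure_of are hypothetical names made up for the demonstration. Only the exception types are recorded, since the exact messages vary between Python versions (on Python 3, for instance, the None case reads "'NoneType' object is not subscriptable").

```python
class TargetType:
    """Hypothetical stand-in for the object in the line above."""


class WithA(TargetType):
    """Same object, but with an 'a' attribute holding whatever we pass in."""

    def __init__(self, a):
        self.a = a


def failure_of(target):
    """Run the access and report which exception type it raised, if any."""
    try:
        target.a['b']['c']
    except Exception as exc:
        return type(exc).__name__
    return None


cases = {
    'no attribute a': TargetType(),
    'missing key b': WithA({}),
    'b is None': WithA({'b': None}),
    'b is a list': WithA({'b': []}),
    'all present': WithA({'b': {'c': 1}}),
}

results = {label: failure_of(obj) for label, obj in cases.items()}
for label, name in results.items():
    print(label, '->', name)
```

Two different shapes of bad data both surface as TypeError, and neither error says where in the path things went wrong.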
Clearly, we need our tools to catch up to our nested data.
Enter glom.
Restructuring Data
glom is a new approach to working with data in Python, featuring:
- Path-based access for nested structures
- Declarative data transformation using lightweight, Pythonic specifications
- Readable, meaningful error messages
- Built-in data exploration and debugging features
A tool as simple and powerful as glom attracts many comparisons.
While similarities exist, and are often intentional, glom differs from other offerings in a few ways:
Going Beyond Access
Many nested data tools simply perform deep gets and searches, stopping short after solving the problem posed above. Realizing that access almost always precedes assignment, glom takes the paradigm further, enabling total declarative transformation of the data.
By way of introduction, let's start off with space-age access, the classic "deep-get":
from glom import glom
target = {'galaxy': {'system': {'planet': 'jupiter'}}}
spec = 'galaxy.system.planet'
output = glom(target, spec)
# output = 'jupiter'
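For intuition about what a dotted-path deep-get involves, here is a minimal sketch in plain Python. The deep_get function is a hypothetical helper written for this post, not glom's implementation, and its rules (keys for mappings, integer indexes for sequences, attributes for everything else) are a simplifying assumption.

```python
from collections.abc import Mapping, Sequence
from types import SimpleNamespace


def deep_get(target, path):
    """Walk a dotted path one segment at a time (illustrative only)."""
    cur = target
    for part in path.split('.'):
        if isinstance(cur, Mapping):
            cur = cur[part]            # dict-like: look up by key
        elif isinstance(cur, Sequence) and not isinstance(cur, str):
            cur = cur[int(part)]       # list-like: look up by position
        else:
            cur = getattr(cur, part)   # anything else: attribute access
    return cur


target = {'galaxy': {'system': {'planet': 'jupiter'}}}
output = deep_get(target, 'galaxy.system.planet')

# The same path syntax reaches through objects and lists as well.
mixed = SimpleNamespace(galaxy={'systems': [{'planet': 'jupiter'}]})
mixed_output = deep_get(mixed, 'galaxy.systems.0.planet')
```

Even this toy version has to decide between three kinds of access at every step, which is the bookkeeping a path-based tool takes off your hands.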
Some quick terminology:
- target is our data, be it dict, list, or any other object
- spec is what we want output to be
With output = glom(target, spec) committed to memory, we're ready for some new requirements.
Our astronomers want to focus in on the Solar system, and represent planets as a list. Let's restructure the data to make a list of names:
target = {'system': {'planets': [{'name': 'earth'}, {'name': 'jupiter'}]}}
glom(target, ('system.planets', ['name']))
# ['earth', 'jupiter']
And let's say we want to capture a parallel list of moon counts with the names as well:
target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}
spec = {'names': ('system.planets', ['name']),
        'moons': ('system.planets', ['moons'])}
glom(target, spec)
# {'names': ['earth', 'jupiter'], 'moons': [1, 69]}
We can react to changing data requirements as fast as the data itself can change, naturally restructuring our results, despite the input's nested nature. Like a list comprehension, but for nested data, our code mirrors our output.
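For comparison, here is roughly what the same restructuring looks like written by hand with list comprehensions. This is only a sketch of the equivalent plain Python, and the variable names are my own.

```python
target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

# Reach into the nesting once, then build each output list by hand.
planets = target['system']['planets']
output = {
    'names': [planet['name'] for planet in planets],
    'moons': [planet['moons'] for planet in planets],
}
print(output)
```

The hand-written version produces the same result, but every access in it can raise one of the bare exceptions shown earlier, and the shape of the output is spread across several statements instead of living in one spec.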
And we're just getting started.
True Python-Native
Most other implementations are limited to a particular data format or pure model, be it jmespath or XPath/XSLT. glom makes no such sacrifices of practicality, harnessing the full power of Python itself.
Going back to our example, let's say we wanted to get an aggregate moon count:
target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}
glom(target, {'moon_count': ('system.planets', ['moons'], sum)})
# {'moon_count': 70}
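One way to read the spec above: strings are paths, tuples chain steps, a one-element list maps over items, dicts build new dicts, and callables are simply called. The toy interpreter below sketches that reading in plain Python. mini_glom is a hypothetical name, it covers only these five shapes, and it is not glom's actual code.

```python
from functools import reduce


def mini_glom(target, spec):
    """Toy interpreter for a small subset of spec shapes (illustrative only)."""
    if isinstance(spec, str):
        # Dotted path: walk one dict key at a time (no attribute support here).
        return reduce(lambda cur, key: cur[key], spec.split('.'), target)
    if isinstance(spec, tuple):
        # Tuple: feed each step's result into the next step.
        return reduce(mini_glom, spec, target)
    if isinstance(spec, list):
        # One-element list: apply the inner spec to every item of the target.
        return [mini_glom(item, spec[0]) for item in target]
    if isinstance(spec, dict):
        # Dict: build a new dict, evaluating each value against the target.
        return {name: mini_glom(target, sub) for name, sub in spec.items()}
    if callable(spec):
        # Any callable: hand it the current value.
        return spec(target)
    raise TypeError('unsupported spec: %r' % (spec,))


target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}
result = mini_glom(target, {'moon_count': ('system.planets', ['moons'], sum)})
print(result)
```

Because the last step of the tuple is an ordinary callable, swapping sum for max, len, or a lambda needs no changes to the interpreter.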
With glom, you have full access to Python at any given moment. Pass values to functions, whether built-in, imported, or defined inline with lambda. But glom doesn't stop there.
Now we get to one of my favorite features by far. Leaning into Python's power, we unlock the following syntax:
from glom import T
spec = T['system']['planets'][-1].values()
glom(target, spec)
# ['jupiter', 69]  (a dict_values view on Python 3)
What just happened?
T stands for target, and it acts as your data's stunt double. T records every key you get, every attribute you access, every index you index, and every method you call. And out comes a spec that's usable like any other.
No more worrying if an attribute is None or a key isn't set. Take that leap with T. T never raises an exception, so worst case you get a meaningful error message when you run glom() on it.
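To make the stunt-double idea concrete, here is a rough sketch of the recording trick in plain Python. The Recorder class and replay function are hypothetical names invented for this illustration, and this is not how glom implements T. It only shows that overriding __getitem__, __getattr__, and __call__ is enough to capture a chain of operations now and run it later.

```python
class Recorder:
    """Hypothetical stand-in for T: records operations instead of doing them."""

    def __init__(self, ops=()):
        self._ops = tuple(ops)

    def __getitem__(self, key):
        return Recorder(self._ops + (('item', key),))

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so _ops is found directly.
        if name.startswith('__'):
            raise AttributeError(name)  # leave dunder lookups alone
        return Recorder(self._ops + (('attr', name),))

    def __call__(self, *args, **kwargs):
        return Recorder(self._ops + (('call', (args, kwargs)),))


def replay(recorder, target):
    """Apply the recorded operations to real data, in order."""
    cur = target
    for kind, arg in recorder._ops:
        if kind == 'item':
            cur = cur[arg]
        elif kind == 'attr':
            cur = getattr(cur, arg)
        else:
            args, kwargs = arg
            cur = cur(*args, **kwargs)
    return cur


target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

# Building the spec touches no data, so nothing can fail at this point.
spec = Recorder()['system']['planets'][-1].values()
values = list(replay(spec, target))
print(values)
```

Since the recording step never looks at real data, any failure is deferred to replay time, where a tool has the full path in hand and can report exactly which step went wrong.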
And if you're ok with the data not being there, just set a default:
glom(target, T['system']['comets'][-1], default=None) # None
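The default pattern is easy to picture with a small plain-Python helper. get_or_default and the _MISSING sentinel are hypothetical names for this sketch, and the set of exceptions it catches is my own simplification, not a description of glom's internals.

```python
_MISSING = object()  # sentinel, so that None can be a legitimate default


def get_or_default(target, steps, default=_MISSING):
    """Follow a list of keys and indexes, falling back to default on failure."""
    cur = target
    try:
        for step in steps:
            cur = cur[step]
    except (KeyError, IndexError, TypeError):
        if default is _MISSING:
            raise  # no default given: let the original error through
        return default
    return cur


target = {'system': {'planets': [{'name': 'earth', 'moons': 1},
                                 {'name': 'jupiter', 'moons': 69}]}}

last_comet = get_or_default(target, ['system', 'comets', -1], default=None)
last_planet = get_or_default(target, ['system', 'planets', -1, 'name'])
print(last_comet, last_planet)
```

The sentinel is what separates "no default supplied" from "the default is None", so missing data only passes silently when the caller has asked for that.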
Finally, null-coalescing operators for Python!
But so much more. This kind of dynamism is what made me fall in love with Python. No other language could do it quite like this.
That's why glom will always be a Python library first and a CLI second. Oh, didn't I mention there was a CLI?
Library first, then CLI
Tools like jq provide a lot of value on the console, but leave a dubious path forward for further integration. glom's full-featured command-line interface is only a stepping stone to using it more extensively inside application logic.
$ pip install glom
$ curl -s https://api.github.com/repos/mahmoud/glom/events \
  | glom '[{"type": "type", "date": "created_at", "user": "actor.login"}]'
Which gets us:
[
  {
    "date": "2018-05-09T03:39:44Z",
    "type": "WatchEvent",
    "user": "asapzacy"
  },
  {
    "date": "2018-05-08T22:51:46Z",
    "type": "WatchEvent",
    "user": "CameronCairns"
  },
  {
    "date": "2018-05-08T03:27:27Z",
    "type": "PushEvent",
    "user": "mahmoud"
  },
  {
    "date": "2018-05-08T03:27:27Z",
    "type": "PullRequestEvent",
    "user": "mahmoud"
  }
  ...
]
Piping hot JSON into glom with a cool Python literal spec, with pretty-printed JSON out. A great way to process and filter API calls, and explore some data. Something genuinely enjoyable, because you know you won't be stuck in a pipe dream.
Everything on the command line ports directly into production-grade Python, complete with better error handling and limitless integration possibilities.
Next steps
Never before glom have I put a piece of code into production so quickly.
Within two weeks of the first commit, glom had proven worth its weight in gold, with glom specs replacing Django REST Framework code 2x to 5x their size, making the codebase faster and more readable. Meanwhile, glom's core is so tight that we're on pace to have more docs and tests than code very soon.
The glom() function is stable, along with the rest of the API, unless otherwise specified.
A lot of other features are baking or in the works. For now, we'll be focusing on the following growth areas:
- Validation functionality, in the vein of schema and cerberus
- CLI robustness, better error messages, etc.
- Extension API, clean up some internal code, open up extensions
- Automatic registration of default behaviors for co-installed packages (e.g., Django)
We'll be talking about all of this and more at PyCon, so swing by if you can. In either case, I hope you'll try glom out and let us know how it goes!