💻 Welcome to the Content Lake: An Introduction to Using Content as Data
Presented by: Knut Melvær
Originally aired on June 11, 2023 @ 7:30 PM - 8:00 PM EDT
Content isn't just what goes onto a web page or something that you mostly stick in a Markdown file. When you start to treat content as data, you unlock a lot of possibilities. In this talk, we'll go through some of the things you can do with a content lake in terms of querying and editing content programmatically.
English
Developer Week
Transcript (Beta)
Hi! Thanks for having me on Cloudflare TV. I must say Cloudflare Pages looks pretty awesome and I can't wait to see what the community is going to do with it and I look forward to playing with it too.
But first, let me introduce myself. I'm Knut. I work at a company called Sanity.io, and we are a startup that makes APIs and tools for developers who want to work with structured content in a new and slightly better way. At Sanity, we are making a few different things for working with structured content.
We have something called the Sanity Studio. It's an open source and customizable content editing environment with full real-time collaboration that you can fire up in your browser and you can also extend it with React and JavaScript.
We have something called the Content Lake, which I will be getting into today. It's a full real-time document store with indexed relationships between documents, and I will talk about that in a moment.
We have something called GROQ, which is an open source query language for JSON.
We have CDNs and on-demand image transformations, open source libraries, specifications and tooling for structured content, and, most importantly maybe, we have an active, welcoming and inclusive community with thousands of people who are working with structured content, sharing their work and helping each other.
So before I dive into the Content Lake, I want to take a step back, about 20 years back actually.
Most of you probably recognize this syntax.
It's Markdown. It's a Markdown file, and indeed many of us are now editing our content with Markdown in our blogs or even on our websites.
Especially in the Jamstack world, you will find static site generators that can take Markdown files and turn them into HTML.
We also use it in GitHub comments, and at one point we also used it in Slack before they rolled out their new editor.
So Markdown was invented by John Gruber back in the early 2000s.
He invented it because he wanted a neater way to generate HTML and make it more readable, because he wrote a lot, and back then we were actually making content for websites by writing HTML by hand.
But around the same time we also had this other development.
We had a huge boom of inventions that made the web more accessible for content editors.
Not long after Gruber launched Markdown, we got WordPress.
It came on the scene, and together with similar CMSs it gave us user-friendly content editing on the web.
I feel a bit nostalgic when I see this code line still.
But these developments also brought with them some specific ideas and ways to think about content on the web and digital content that are still sticking around in a huge way.
So one of them is that content is structured mainly in the context of a page, in a specific hierarchy, or maybe as a post in a continuous feed. Like pages and posts.
We also wrapped the content in HTML to control the specific presentation of the web page, with a few fields for metadata, mainly to orchestrate where and when it ends up.
Like we have a title, maybe a slug, and publish date fields. And so-called dynamic content within this page can be included by creating special shortcodes or other things that let authors insert syntax that we can then translate to something else, usually some rendered HTML.
And the content is mostly contained within a website.
And it can only be edited through specific user interfaces like the WordPress dashboard, or directly in a flat file, like a Markdown file, by committing it to Git or something.
But then in the last ten years or so, we have had this emergence of new frameworks that have changed the way we are building for the web and building for digital experiences.
We have things like React and Angular and Vue and Svelte.
On the mobile platforms, we have Flutter and Swift and so on.
And then we also have a new range of devices like Alexa and Google Assistant.
And they have their own markup languages that they use. So, this creates some new opportunities and some new problems.
And they also sparked a new way to think about content.
And that's what we at Sanity are doing all the time. So, a third thing that we see now is that documents are increasingly composed of not only like paragraphs of text and images.
We also have video embeds and specialized components.
If you're making a marketing website, you want to have calls to action and email signups and so on.
And we might want to insert product data inside of our rich text.
And for the last five years, we have seen a boom of so-called block-based editors.
So, Notion would be one that many use. You have the new Gutenberg editor for WordPress.
And even for Markdown, you have this new specification called MDX that lets you insert JSX components, like these richer components that control the presentation.
But there's a flip side to some of this. And this is where the old ideas are sticking.
So, we are still pretty much dealing with HTML in the end.
We are storing HTML in databases and we take it out and we put it on the website almost directly.
And it's a bit tedious to work with HTML over APIs, even though it's wonderful for like the browser.
HTML in browser is awesome, but not on APIs.
And I will try to elaborate on that. So, we wanted to not have to deal with parsing HTML.
This is a famous Stack Overflow answer where someone asked how they can use regex to parse HTML.
And I guess the point here is you shouldn't, perhaps.
But we want to avoid this thing. And you might recognize this from your own work if you do frontend development.
Sometimes you get HTML through an API or from a file and you have to use the dangerouslySetInnerHTML prop if you're writing React, or something similar in other frameworks.
It doesn't feel great to use a prop that starts with "dangerously".
And you have little control of what happens inside of it.
So, that brings us to something called Portable Text.
And it's kind of like block content that you can take with you anywhere you want.
It looks like this. It's basically JSON.
It's JSON in a specific way. That's why we wrote a specification for it.
So, we can have a standard or predictable way to parse this JSON.
Of course, it's not the intention that you should actually write Portable Text yourself.
You should have an editor, a graphical user interface that takes care of that for you.
So, Portable Text is a pure JSON and structured data thing.
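The slide is not part of the transcript, so here is a minimal sketch of what such a Portable Text array can look like, written as plain Python data. The sample content, the `_key` values and the `plain_text` helper are made up for illustration; the general shape (typed blocks holding typed spans with marks) follows the Portable Text specification as I understand it.

```python
import json

# Hypothetical sample content; field names follow the Portable Text convention
# of typed blocks that contain typed spans.
portable_text = [
    {
        "_type": "block",
        "_key": "b1",
        "style": "normal",
        "markDefs": [],
        "children": [
            {"_type": "span", "_key": "s1", "text": "Content is ", "marks": []},
            {"_type": "span", "_key": "s2", "text": "data", "marks": ["strong"]},
            {"_type": "span", "_key": "s3", "text": ", not markup.", "marks": []},
        ],
    },
    # A custom block type sits next to text blocks in the same array.
    {"_type": "code", "_key": "b2", "language": "groq", "code": "*[_type == 'product']"},
]


def plain_text(blocks):
    """Join the span texts of every text block, one line per block."""
    lines = []
    for block in blocks:
        if block.get("_type") == "block":
            lines.append("".join(span["text"] for span in block["children"]))
    return "\n".join(lines)


print(plain_text(portable_text))
print(json.dumps(portable_text[0]["children"][1]))
```

Because the whole structure is plain JSON-compatible data, it survives a round trip through `json.dumps` and `json.loads` unchanged, which is what makes it easy to move between systems.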
We have libraries that can take a Portable Text array and parse it to whatever.
To HTML, to React, Vue, Svelte, Clojure, .NET, Markdown, you name it.
And you can also use it extensively to have these complex data structures over content.
Think about a footnote that contains rich text. Portable Text is able to express that.
And we also have an open source editor coming soon. And this is how you interact with Portable Text.
It is an array of blocks. You give it to a serializer and then you can map the different block types to components that you author.
And the data, the content, will just be passed as props, as pure data.
So, you don't have to deal with HTML and stuff like that. Because it's only data.
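To make the serializer idea concrete, here is a small sketch that maps block types to render functions and produces HTML. It is not one of the official Portable Text libraries; the function names, the `SERIALIZERS` table and the supported styles and marks are assumptions chosen for the example.

```python
from html import escape

# Assumed subset of styles and marks, for illustration only.
STYLE_TAGS = {"normal": "p", "h1": "h1", "h2": "h2", "blockquote": "blockquote"}
MARK_TAGS = {"strong": "strong", "em": "em", "code": "code"}


def render_span(span):
    """Escape the span text and wrap it once per known mark."""
    text = escape(span["text"])
    for mark in span.get("marks", []):
        tag = MARK_TAGS.get(mark)
        if tag:
            text = f"<{tag}>{text}</{tag}>"
    return text


def render_block(block):
    """Render a text block using the tag that matches its style."""
    tag = STYLE_TAGS.get(block.get("style", "normal"), "p")
    inner = "".join(render_span(span) for span in block.get("children", []))
    return f"<{tag}>{inner}</{tag}>"


def render_code(block):
    """Render a custom code block; the content arrives as plain data."""
    language = escape(block.get("language", ""))
    return f'<pre data-language="{language}"><code>{escape(block["code"])}</code></pre>'


# The mapping from block type to "component".
SERIALIZERS = {"block": render_block, "code": render_code}


def to_html(blocks, serializers=SERIALIZERS):
    """Render each block with its serializer; unknown types are skipped."""
    parts = []
    for block in blocks:
        render = serializers.get(block["_type"])
        if render is not None:
            parts.append(render(block))
    return "\n".join(parts)


sample = [
    {"_type": "block", "style": "h2", "children": [
        {"_type": "span", "text": "Hello ", "marks": []},
        {"_type": "span", "text": "world", "marks": ["strong"]},
    ]},
    {"_type": "code", "language": "groq", "code": "*[price < 10]"},
]
print(to_html(sample))
```

Swapping the functions in `SERIALIZERS` for ones that emit another format is all it takes to target a different output, since the input is only data.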
However, the editor experience is just one side of the story.
Let's see if I had a video here.
Now, you should be able to mark up your text with Speech Synthesis Markup Language.
And here is the result when pushing "Speak text". Now, you should try it yourself.
So, this video just played before I got to the point.
But what you just heard was kind of an editor for speech synthesis. This is typically what you put into a voice assistant.
And with Portable Text, you're able to express the properties of how a text should be read by a voice assistant.
So, here we have marked up this text with different ways of speaking and putting different emphasis and so on.
So, that is just an example of what you can do with Portable Text and how portable it is.
Hence the name, I guess.
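As a rough illustration of the speech use case, the same kind of block array can be rendered to SSML instead of HTML. The mark names `emphasis` and `slow` and the helper functions are hypothetical (the editor shown in the video is not in the transcript); `<speak>`, `<p>`, `<emphasis>` and `<prosody>` are standard SSML elements.

```python
from xml.sax.saxutils import escape

# Hypothetical mark names mapped to standard SSML elements.
MARK_TO_SSML = {
    "emphasis": ('<emphasis level="strong">', "</emphasis>"),
    "slow": ('<prosody rate="slow">', "</prosody>"),
}


def span_to_ssml(span):
    """Escape the text and wrap it in the SSML element for each known mark."""
    text = escape(span["text"])
    for mark in span.get("marks", []):
        open_tag, close_tag = MARK_TO_SSML.get(mark, ("", ""))
        text = f"{open_tag}{text}{close_tag}"
    return text


def to_ssml(blocks):
    """Render text blocks as SSML paragraphs inside a single <speak> root."""
    paragraphs = []
    for block in blocks:
        if block.get("_type") != "block":
            continue  # non-text blocks have no spoken form in this sketch
        inner = "".join(span_to_ssml(span) for span in block.get("children", []))
        paragraphs.append(f"<p>{inner}</p>")
    return "<speak>" + "".join(paragraphs) + "</speak>"


spoken = [{"_type": "block", "children": [
    {"_type": "span", "text": "Read this ", "marks": []},
    {"_type": "span", "text": "slowly", "marks": ["slow"]},
]}]
print(to_ssml(spoken))
```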
Right. So, that's Portable Text. It's this JSON format that you can bring to whatever frontend framework you have.
But having rich text block content in JSON also allows for something else.
And that takes us to something we call GROQ, which is a query language for JSON.
So, I mentioned that Sanity.io has this real-time document store. And a document store is basically just a collection of JSON documents.
You probably can't see it, but this is just me dumping all the JSON documents in our content lake.
And if you have all this data, you need a way to get it out.
And only the data that you need. And in the way and shape and form that you need it in.
And when we made this content lake backend, we thought about how we should be able to access this.
Of course, we could generate REST APIs or even a GraphQL endpoint.
But we wanted to have it a bit more flexible. We didn't want to always have to prescribe resolvers in GraphQL and so on.
And we wanted to be able to take out just the data we wanted, and not the whole payload that you would get from a REST endpoint.
So, we made a query language for JSON.
And we called it GROQ. It's short for Graph-Relational Object Queries.
And that's because it also lets you combine data in different ways.
That's the real power of the content lake. You can take different documents and match them and join them on different criteria.
And GROQ consists of some different components.
I will show you what a GROQ query looks like. So, this is a fairly simple GROQ query.
What it does is ask for all the documents that have the type product.
And then, in what's called a projection, it returns the ID, the name, description and price for those products, and then orders this array of results by price in descending order.
So, this is slightly reminiscent of what you have in GraphQL or even SQL in a way.
And if you break this query down, we can see that the star represents all of your content.
All the documents in your content lake. And then we have these square brackets.
And inside of them, we can have something we call filter expressions. So, these are different logical expressions.
Whether something is true or false. If it's true, include it.
If it's false, throw it out. And then we have something called a projection within these curly brackets, where we can pick the fields.
We can also make subsequent filter queries and join data and all sorts of stuff that we don't have time to get into here.
And then we have this last thing where we can also compound more GROQ to further shape data.
In this case, we order results by price.
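The query on the slide is not in the transcript. The sketch below holds a reconstruction of it as a string, based only on the description above, and then walks through the same three steps (filter, projection, ordering) in plain Python over made-up documents. The Python function is an illustration of the semantics, not how the query engine works.

```python
# Reconstructed from the spoken description; the exact query on the slide may differ.
GROQ_QUERY = '*[_type == "product"]{_id, name, description, price} | order(price desc)'

# Hypothetical documents standing in for "all of your content" (the star).
documents = [
    {"_id": "p1", "_type": "product", "name": "Mug", "description": "Holds coffee", "price": 12, "sku": "M-1"},
    {"_id": "a1", "_type": "author", "name": "Knut"},
    {"_id": "p2", "_type": "product", "name": "Poster", "description": "Looks nice", "price": 30, "sku": "P-9"},
]


def run_equivalent(docs):
    """Mimic the query: filter by type, project four fields, order by price descending."""
    matches = [doc for doc in docs if doc.get("_type") == "product"]      # [_type == "product"]
    fields = ("_id", "name", "description", "price")
    projected = [{key: doc.get(key) for key in fields} for doc in matches]  # {_id, name, ...}
    return sorted(projected, key=lambda doc: doc["price"], reverse=True)    # | order(price desc)


for row in run_equivalent(documents):
    print(row)
```

Note that the projection drops fields that were not asked for (here the made-up `sku`), which is the "only the data that you need" part of the argument.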
And you can get pretty advanced with this. A real example: just recently, we launched a new API versioning system.
And then we had to rewrite a lot of our documentation.
And all of our documentation is, of course, in Portable Text.
And what we did here was to ask our content lake to return all the documentation articles.
Specifically, all the documentation articles that mentioned API in a code-marked string.
So, we can ask our content lake to return all the code examples in our documentation that mention API, to tease out all the places we have an endpoint.
Then we can change that. And that shows some of the power of having structured content and something like Portable Text.
It makes all your content reachable by queries. That's pretty, pretty powerful.
We can also do other things since we also have relationships. And when you upload an image to Sanity, we take that image, we analyze it, we write that metadata on a document.
So, you have metadata about the color palette, dimensions, and other things.
And then you can use that to query stuff. So, this is an example where I queried: give me all the documents that have a poster, kind of like a main image, where the main image has white as its dominant title color.
So, all the images where you would put a white title over it to have the best contrast, give me all of that content and return the type, title, and score.
Because I also want to boost the documents that mention data and are guides.
So, this is just a teaser, a demonstration of what you can do with GROQ.
It's pretty flexible and versatile.
And we are kind of like in the last part.
Because now we have talked a lot about distribution and getting content out.
And this is something that a lot of CMS technologies will talk about.
Omnichannel. You can push and publish content. But we want to think about content not only as something we are pushing out and writing into a text field.
It's also something that we should augment and work with through APIs.
And that's why the other story, about writing and mutating and patching data in the content lake, is important.
So, this is the studio.
And this is where humans are working. And the studio is a client side single page application.
And to make that work, we need a bunch of APIs that it talks to through the browser.
And all of these APIs, you have access to them.
There's kind of like no hidden thing. And they needed to be pretty flexible.
Because everything is happening in real time. So, you should be able to listen to changes.
But also send these super granular patches. Because you don't want to kind of overwrite the whole document each time you're doing something like pushing a new letter into the rich text field.
So, we made a patch system. Here it is expressed in JSON.
We also have SDKs for this. And it's fairly simple. You point it to an ID.
And you can change any field. So, in this example, we are patching a document.
We are setting a name. And then we are also patching another document and inserting a reference to this first document that we made.
So, you can also have transactions and do multiple things in the same turn.
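The JSON from the slide is not in the transcript. As a hedged sketch, the snippet below builds a transaction body in the shape I understand Sanity's HTTP mutation API to use (a `mutations` array of `patch` objects with `id` and `set`); the document IDs, the `author` field name and the `build_transaction` helper are hypothetical, and nothing is sent over the network. Check the field names against the official API reference before relying on them.

```python
import json


def build_transaction(person_id, name, article_id):
    """Build one request body that carries two patches, applied together."""
    return {
        "mutations": [
            # First patch: set a name on one document.
            {"patch": {"id": person_id, "set": {"name": name}}},
            # Second patch: point another document at the first one by reference.
            {"patch": {"id": article_id, "set": {
                "author": {"_type": "reference", "_ref": person_id},
            }}},
        ]
    }


payload = build_transaction("person-1", "Ada Lovelace", "article-1")
print(json.dumps(payload, indent=2))
```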
And this is a real example from our documentation's Portable Text.
So, this is a cropped-out code block.
And I'm not sure if you can actually see this.
But it has the language groq in lowercase. And let's say that we wanted to make sure that all the language properties in code blocks inside of a documentation article were in uppercase.
Because GROQ should be in uppercase. Then we can write this kind of patch operation.
And it is also just a demonstration of how granular you can go with this patch API.
But here we can see that we are patching this specific document.
You can also build a series of patches on different documents, of course.
And then we are going into the Portable Text array.
We are picking out all of the code blocks.
That's the underscore type is code thing. And then we're also picking just the blocks that have groq as their language property.
And then we are setting that value to GROQ in uppercase.
So, you don't have to kind of like reimplement the whole object structure or anything like that.
You can kind of like point exactly to the place you want to change something and change that.
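The patch on the slide is not in the transcript, so rather than guess at the exact path syntax, here is a plain-Python sketch of what that targeted patch does to the array: select blocks whose `_type` is `code` and whose `language` is `groq`, and change only that one property. The `set_code_language` helper and the sample body are made up for illustration.

```python
import copy


def set_code_language(body, old, new):
    """Return a copy of the block array where matching code blocks get a new language.

    Only the `language` property of matching blocks changes; every other
    block and property is left exactly as it was.
    """
    patched = copy.deepcopy(body)
    for block in patched:
        if block.get("_type") == "code" and block.get("language") == old:
            block["language"] = new
    return patched


body = [
    {"_type": "block", "children": [{"_type": "span", "text": "A query:", "marks": []}]},
    {"_type": "code", "language": "groq", "code": "*[_type == 'post']"},
    {"_type": "code", "language": "javascript", "code": "console.log('hi')"},
]
for block in set_code_language(body, "groq", "GROQ"):
    print(block["_type"], block.get("language"))
```

With a patch API, the same selection is expressed as a path inside the patch, so the client never has to download, rebuild and upload the whole document.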
And you can do this without worrying about race conditions or me just typing at the same time in the studio.
Because we can also look for the revision IDs and so on to prevent that.
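The revision check can be pictured as optimistic locking. The in-memory store below is a hypothetical stand-in for the backend: a patch may carry the revision it was prepared against, and it is rejected if the document has changed since. The `_rev` field and the `if_revision_id` parameter mirror what I believe Sanity calls `_rev` and `ifRevisionID`; treat those names as assumptions.

```python
import uuid


class RevisionConflict(Exception):
    """Raised when a patch was prepared against a stale revision."""


class InMemoryStore:
    """Toy document store that bumps a revision ID on every write."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, fields):
        self._docs[doc_id] = {**fields, "_id": doc_id, "_rev": uuid.uuid4().hex}
        return self._docs[doc_id]["_rev"]

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def patch(self, doc_id, set_fields, if_revision_id=None):
        doc = self._docs[doc_id]
        # Refuse the write if someone else changed the document in the meantime.
        if if_revision_id is not None and doc["_rev"] != if_revision_id:
            raise RevisionConflict(f"{doc_id} changed since revision {if_revision_id}")
        doc.update(set_fields)
        doc["_rev"] = uuid.uuid4().hex
        return doc["_rev"]


store = InMemoryStore()
seen_rev = store.put("article-1", {"title": "Draft"})
store.patch("article-1", {"title": "Edited by a human"})  # a concurrent edit
try:
    store.patch("article-1", {"title": "Edited by a script"}, if_revision_id=seen_rev)
except RevisionConflict as error:
    print("rejected:", error)
```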
And this is also what real time means. That you can kind of like interact with content through APIs and also while humans are working in the studio.
And that's pretty awesome and opens up a lot of opportunities and possibilities to just maintain and improve and work with your content the way you want to.
So, this is what we mean by content as data.
So, it means that we need to rethink some things when it comes to content.
Content can, of course, be something you just stick in a Markdown file or store as HTML in a table in a SQL database.
But that gives you these constraints that don't really make sense anymore.
So, what we propose is to kind of like rethink some ways of approaching content.
And you don't have to use Sanity necessarily to do this.
But it's worth thinking about. So, first of all, it means that we should escape this idea of the page-based content model.
Where you fit your content into the concept of pages.
Or files with this YAML front matter. You should structure it after what it means.
So, don't think about product pages. Think about products.
Don't think about authors. Think about persons. Because you can use that person data for author information, but also a bunch of other things within your content lake.
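One way to picture the difference is a small model where a person is its own document and other documents reference it in whatever role they need. The types and field names below are hypothetical and only meant to illustrate the modeling advice, not any particular schema format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    id: str
    name: str


@dataclass(frozen=True)
class Article:
    id: str
    title: str
    author_ref: str  # ID of a Person acting as author


@dataclass(frozen=True)
class Event:
    id: str
    title: str
    speaker_refs: tuple = ()  # IDs of Persons acting as speakers


def appearances(person, articles, events):
    """Collect every place one person is referenced, grouped by role."""
    return {
        "author_of": [a.title for a in articles if a.author_ref == person.id],
        "speaker_at": [e.title for e in events if person.id in e.speaker_refs],
    }


knut = Person("person-1", "Knut")
articles = [Article("a1", "Content as data", "person-1")]
events = [Event("e1", "Developer Week", ("person-1",))]
print(appearances(knut, articles, events))
```

Had the model only known about "authors", the speaker role would have needed a second, duplicated record for the same human.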
You should not only be able to distribute your content to wherever you need it.
If it ends up in a browser, a voice assistant app, whatever.
You should also be able to interact with it programmatically. So, that means that you should be able to run migration scripts and patches and talk to back office systems to kind of like update that product information so everything is correct and you don't have kind of like this double bookkeeping situation.
And that requires, we think at least, that your content is living on a realtime backend.
So, that your APIs and humans don't step on each other. So, that was the meat of the presentation.
If you think this is interesting or want to check out GROQ or Portable Text or any of the other things I mentioned, you can go to some different places to explore more.
One of those places is GROQ Arcade. And this is a playground where you can load any JSON from a URL and use GROQ to query it.
If you just go to groq.dev, there's this to-do list data and you can write GROQ and see how it works.
And there's some other examples here.
You can, for example, do complex queries on Pokémon like this.
It's pretty fun to mess around. And you can also share data and queries and kind of like check it out.
And you can also use GROQ in your command line.
So, wherever you have JSON, you can use GROQ to query it. It is pretty neat, I must say.
And also, if you are looking at this web page, you need to click the tail of the Q and find the hidden Easter egg that I just told you is there.
And if you want to try out Sanity, you can go to Sanity.io.
If you look under Exchange, you can also see the projects that people are putting together.
And there's plugins and frameworks that you can integrate with and starters if you just want to check it out for a bit.
And of course, if you join our Slack community, you can also meet people that are building with Sanity and sharing their knowledge.
So, you're very welcome to join us there. And that's it for my talk. Thank you for watching.
Thanks for having me on the Cloudflare Developer Week. I look forward to seeing all the other amazing talks that are going to happen here.
So, thank you.
My name is Justin Hennessy.
I'm the VP of Engineering at Neto. Okay. So, I understand Neto is an e-commerce platform based in Australia.
Tell us a little bit more about it.
Neto is an omni-channel sales platform for retailers and wholesalers.
So, essentially what it allows us to do is enable the retailers and wholesalers to sell their products in multitudes of sales channels.
Tell us about the importance of automation in your business.
I came on board as the lead automation engineer.
So, I think automation is key to anything in this day and age.
Like, if you're not looking at ways to automate the low-value work and then put your people in the high-value areas or high-leverage areas, I think you're just going to get left behind.
So, you know, as a technology company, obviously it's critical for us to make sure that automation is at the core of what we do.
When did Neto begin working with Cloudflare? So, in the beginning, when Neto was looking to migrate from an old cloud provider, we also wanted to improve what we call our go-live flow or our onboarding flow for merchants.
And a big part of that was obviously provisioning a website, a custom domain name, and a custom SSL certificate.
Requesting and getting granted that certificate in the old process took two domain experts full time.
It was a very lengthy and technical process which, you know, could sometimes take up to two or three weeks.
So, you can imagine, you know, a customer who's itching to get online, that kind of barrier, you know, presents a pretty big problem.
So, what Cloudflare enabled us to do was to literally automate that onboarding or go-live process to almost a one-click process.
And it also allowed us to diversify the people that could actually do that process.
So, now anybody in the business can make that, you know, set a customer live with a very simple process, and it's very rapid.
So, that's where we started. What are some of the security challenges you face in your business and how are you managing them?
Any online service has to take security very seriously, and at Neto, security is job zero.
So, we always bake in thinking and process and tooling around security. So, what Cloudflare does for us is literally gives us a really good protective layer on the very edge of our platform.
So, things like DDoS mitigation, web application firewall protection, all of that obviously is then translated into a really solid base of security for all of our merchants as well.
Security is obviously front of mind for Neto as a business, and online e-commerce presents a lot of security challenges.
So, denial of service attacks, cross-site scripting, we have automated attacks that are trying to find exploits in our, you know, in our forms and our, you know, our platform generally.
So, prior to having Cloudflare, obviously we had measures in place, but what we've gained from Cloudflare is a consolidation of that strategy.
So, we are able to look through a single lens and we can look at all of the aspects of our security for the platforms.
And I think it's probably safe to say that now more than ever, a good online strategy is crucial to success.