Parsing rules

Let’s say you want to connect your page to another page on a different website. There’s a problem: that other site doesn’t support HDOCs. Can we work around that? It turns out yes, we can. HDOCs are very simple. They’re basically a title and the content (plus a few optional fields). If you can locate those two things on a page, you can build an HDOC locally even if the original page isn’t an HDOC.

In other words, we need to be able to parse arbitrary web pages. But all client apps must parse them the same way. If they don’t, you’ll end up in a situation where a floating link works in one app and looks broken in another. And you won’t be able to fix it, because the underlying text is slightly different depending on how each client parsed the page. We need deterministic parsing. That’s where parsing rules come in.

Here’s what a URL with parsing rules may look like:

https://example.com/some-page#pr=c/body/t/.main-title/r/.page,.some-class/d/.date/a/.author

Everything after #pr= is the parsing rules section. It’s a set of key–value pairs where each value is a selector. Most of them are optional. The only required one is the content selector. Because of that, for many pages URLs with parsing rules will look much simpler, for example:

https://example.com/some-page#pr=c/body

Here are all the supported keys and their selectors:

  • c — content selector (required)
  • t — title selector (optional if the title is in an <h1> and is unambiguous)
  • r — list of selectors to remove from the final content (optional). Multiple selectors are separated by commas. Each selector is URL-encoded individually.
  • d — publication date selector (optional)
  • a — author name selector (optional)

All selectors must be URL-encoded.
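
To make the format concrete, here's a sketch in TypeScript of how a client might turn the #pr= fragment into a set of rules. The key/value pairing and the field names are my reading of the format described above, not a reference implementation:

  // Sketch: parsing the #pr= fragment into a rules object.
  // Assumes keys and values alternate, separated by "/", and that each
  // selector is URL-encoded individually (so a literal "/" inside a
  // selector would appear as %2F). The special keywords text, wppage and
  // wppost (described later) are not handled here.
  interface ParsingRules {
    content?: string;   // c (required)
    title?: string;     // t
    remove?: string[];  // r (comma-separated list)
    date?: string;      // d
    author?: string;    // a
  }

  function parseRulesFragment(url: string): ParsingRules | null {
    const hash = new URL(url).hash;            // e.g. "#pr=c/body/t/.main-title"
    if (!hash.startsWith('#pr=')) return null;
    const parts = hash.slice('#pr='.length).split('/');
    const rules: ParsingRules = {};
    for (let i = 0; i + 1 < parts.length; i += 2) {
      const key = parts[i];
      const value = parts[i + 1];
      if (key === 'c') rules.content = decodeURIComponent(value);
      else if (key === 't') rules.title = decodeURIComponent(value);
      else if (key === 'd') rules.date = decodeURIComponent(value);
      else if (key === 'a') rules.author = decodeURIComponent(value);
      else if (key === 'r') rules.remove = value.split(',').map(decodeURIComponent);
    }
    return rules;
  }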

Behind the scenes, a client app simply calls querySelector with the selector you provided. For the “remove” list, it calls querySelectorAll.
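
In rough TypeScript, and reusing the ParsingRules shape from the sketch above, that step might look like this (the cloning approach and the returned field names are illustrative choices, not the actual HDOC schema):

  // Sketch: applying parsing rules to a parsed Document.
  function applyRules(doc: Document, rules: ParsingRules) {
    const content = doc.querySelector(rules.content ?? 'body');
    if (!content) throw new Error('content selector matched nothing');

    // Strip everything in the "r" list from a copy of the content element.
    const cleaned = content.cloneNode(true) as Element;
    for (const selector of rules.remove ?? []) {
      cleaned.querySelectorAll(selector).forEach((el) => el.remove());
    }

    // Title: explicit selector if given, otherwise fall back to the page's <h1>.
    const titleEl = doc.querySelector(rules.title ?? 'h1');

    return {
      title: titleEl?.textContent?.trim() ?? '',
      content: cleaned.innerHTML,
      date: rules.date ? doc.querySelector(rules.date)?.textContent?.trim() : undefined,
      author: rules.author ? doc.querySelector(rules.author)?.textContent?.trim() : undefined,
    };
  }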

How to find selectors using the LZ Desktop app

If you use the LZ Desktop app together with the LZ Desktop Helper Chrome extension, you can download pages directly from your browser. For some websites (currently only English Wikipedia and Gutenberg.org), downloaded pages already include parsing rules in their URLs. For most sites, though, the app relies on a text-density parsing library that tries to guess where the main content is. The problem: pages parsed this way cannot be used as connected pages. You can view them locally and even create visible connections, but if you publish them, nobody—including you—will see those connections when viewing the page online. That’s intentional and prevents the “different clients parse differently” issue that breaks floating links.

Your first step should be to right-click the downloaded page and choose “Download using parsing rules.” The app downloads the page again, but this time it doesn’t analyse text density. Instead, it goes through a predefined list of selectors and uses the first one that matches something meaningful on the page. This approach works about 50% of the time. Sometimes you’ll get clean parsing rules; sometimes the page will look broken. In the broken cases, you’ll need to create parsing rules manually.
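
Conceptually, that predefined-list fallback looks something like the sketch below. The selector list and the "meaningful" check are invented for illustration; they are not the app's actual ones:

  // Sketch: trying a predefined list of likely content selectors and
  // using the first one that matches something substantial.
  const CANDIDATE_SELECTORS = ['article', 'main', '#content', '.post-content', '.entry-content'];

  function guessContentSelector(doc: Document): string | null {
    for (const selector of CANDIDATE_SELECTORS) {
      const el = doc.querySelector(selector);
      // "Something meaningful" is judged here, very roughly, by text length.
      if (el && (el.textContent ?? '').trim().length > 200) return selector;
    }
    return null; // nothing matched; manual parsing rules are needed
  }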

By the way, this is exactly why LZ Desktop uses two parsing methods: the automatic one is more reliable for personal reading, and the deterministic one is necessary for publishing visible connections.

How to find selectors manually

If LZ Desktop didn’t help—or you don’t want to use it—you can always pick selectors yourself.

  1. Open the page in Chrome.
  2. Open Developer Tools (View > Developer > Developer Tools).
  3. Use the “select element” arrow to inspect the page.
  4. Locate the element that contains all the real content (without comments, navigation, ads, etc.).
  5. Check its classes, IDs, or anything else that could work as a selector.

If you’re unsure about selector syntax, ask ChatGPT. Show it the relevant pieces of HTML and ask what selectors you can use with the querySelector method.
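
Candidate selectors are easy to test directly in the DevTools console before you commit to them. The class names below are made-up examples:

  // Run these in the DevTools console on the page you want to connect.
  document.querySelector('.main-title');           // should return the title element
  document.querySelector('article.post-content');  // should return the content element
  document.querySelectorAll('.comments, nav');     // elements you might list under "r"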

You can also open the LZ Desktop Helper popup, click “Configure URL”, and type your parsing rules there. The extension will handle URL encoding. Then you can download the page into LZ Desktop using those rules. Through trial and error you’ll eventually discover the right set.
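
If you want to build the fragment by hand instead (or double-check what the extension produces), the encoding is just encodeURIComponent applied to each selector. The selectors below are made-up examples:

  // Sketch: building a #pr= fragment by hand.
  const content = encodeURIComponent('div.post > article');  // "div.post%20%3E%20article"
  const title = encodeURIComponent('h1.entry-title');
  const remove = ['.comments', 'nav.sidebar'].map(encodeURIComponent).join(',');
  const fragment = `#pr=c/${content}/t/${title}/r/${remove}`;
  // => "#pr=c/div.post%20%3E%20article/t/h1.entry-title/r/.comments,nav.sidebar"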

Sometimes this is easy. Sometimes it’s frustrating. And some pages are effectively impossible because all the classes are machine-generated nonsense.

Good news

Once you figure out the parsing rules, every client will parse the page identically. Users won’t even know parsing rules exist—everything “just works.” Only the author suffers during setup.

Another bit of good news: for many websites, the same parsing rules work across all pages. If you use the LZ Desktop Helper extension, you can configure site-wide rules, and every page you download from that site will automatically use them.
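
For example, if c/article.post-content works for one article on some (hypothetical) blog, the same rule will usually work for every other article there:

https://blog.example.com/first-post#pr=c/article.post-content
https://blog.example.com/another-post#pr=c/article.post-content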

In the future we may create a public database of parsing rules for different websites and integrate it with the extension, so you won’t have to come up with parsing rules as often, especially for the more popular websites.

And over time—this is the optimistic scenario—HDOCs become common enough that parsing rules aren’t needed as often.

Special cases

A special parsing rule exists for plain-text pages:

https://example.com/some-page#pr=text

Here "text" means the page is plain text and should be treated as such. In this mode, if the parser finds a line starting with Title: ..., it treats that line as the page title.
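
Here's a sketch of how a client might handle this mode; whether the Title: line stays in the content, and how strictly it is matched, are my assumptions:

  // Sketch: handling #pr=text.
  function parsePlainText(body: string) {
    // Assumption: the first line starting with "Title:" supplies the title.
    const titleLine = body.split('\n').find((line) => line.startsWith('Title:'));
    return {
      title: titleLine ? titleLine.slice('Title:'.length).trim() : '',
      content: body,
    };
  }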

There are also two WordPress-specific parsing modes:

https://example.com/some-page#pr=wppage
https://example.com/some-post#pr=wppost

WordPress powers roughly 40% of the web, and most WP sites expose a public API. Using that API we can get clean content that isn’t polluted by themes and plugins. HDOCs generated this way may even include comments with pagination. These keywords (wppage and wppost) tell the client to use the API instead of scraping the original HTML.

The client app will call URLs that look like this:

https://example.com/wp-json/wp/v2/pages?slug=some-page
https://example.com/wp-json/wp/v2/posts?slug=some-post
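
Here's a sketch of the wppost case. The title.rendered, content.rendered, and date fields are standard WordPress REST API response fields; the surrounding function is illustrative:

  // Sketch: fetching a post through the WordPress REST API for #pr=wppost.
  async function fetchWpPost(origin: string, slug: string) {
    const res = await fetch(`${origin}/wp-json/wp/v2/posts?slug=${encodeURIComponent(slug)}`);
    if (!res.ok) throw new Error(`WP API request failed: ${res.status}`);
    const posts = await res.json();         // the API returns an array of matching posts
    if (!Array.isArray(posts) || posts.length === 0) {
      throw new Error('no post found for that slug');
    }
    return {
      title: posts[0].title.rendered,       // HTML string
      content: posts[0].content.rendered,   // HTML string
      date: posts[0].date,
    };
  }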

How do you know if you can use wppost or wppage for a given web page?

When you choose “Download using parsing rules” in LZ Desktop, the app first pretends the site is WordPress and tests its API. If a request succeeds, the newly downloaded page will contain #pr=wppage or #pr=wppost in its URL. If the requests fail, the app falls back to trying predefined CSS selectors, as described earlier.
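
In other words, the check is roughly the following; the probe order and the exact success test are assumptions rather than the app's documented behaviour:

  // Sketch: deciding between wppost, wppage, and the CSS-selector fallback.
  async function detectWpMode(origin: string, slug: string): Promise<'wppost' | 'wppage' | null> {
    for (const [mode, endpoint] of [['wppost', 'posts'], ['wppage', 'pages']] as const) {
      try {
        const res = await fetch(`${origin}/wp-json/wp/v2/${endpoint}?slug=${encodeURIComponent(slug)}`);
        if (!res.ok) continue;
        const found = await res.json();
        if (Array.isArray(found) && found.length > 0) return mode;
      } catch {
        // not a WordPress site, or the API is blocked; keep going
      }
    }
    return null; // fall back to guessing CSS selectors, as described earlier
  }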

Using WordPress APIs may feel like a hack, but it can save a lot of work. And if HDOCs ever become popular enough, WordPress could start supporting them natively (right now you need a plugin), the same way it ships with built-in RSS support. If that happens, client apps will be able to ignore wppage and wppost entirely and simply load the HDOC directly from the URL.