Scroll Datasets: source code for CSVs

More examples of Scroll Datasets from datasets.scroll.pub.

April 21, 2024 - The source code for this blog post contains a dataset about the planets and generates this HTML file as well as a CSV, a TSV, and a JSON file. It demonstrates Scroll Datasets.

Scroll Datasets are normal plain text blog posts written in Scroll that also contain structured data and output that data into formats ready for data visualization and analysis tools.

Scroll Datasets are line-oriented but represent one or more tables. You might call them deconstructed CSVs or deconstructed spreadsheets.

Quick Code Example:

This dataset has 2 measures (columns) and 2 concepts (rows).
Documentation, column definitions, rows and *any notes/markup/content* can go in the same file.

# Measures (aka Header, aka Columns, aka Schema)
id: string
moons: int

# Concepts (aka Rows)
::
id: mars
moons: 2
I verified moon count with Google. - BY
::
id: jupiter
moons: 63
The moons of Jupiter have their own Wikipedia Page
 https://en.wikipedia.org/wiki/Moons_of_Jupiter moons of Jupiter
::
writeDataset demo.csv

The code above generates an HTML page and demo.csv that contains this:

id,moons
mars,2
jupiter,63
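
To make the mapping from the flat file to demo.csv concrete, here is a minimal Python sketch. It is not Scroll's actual implementation. It assumes that a measure definition looks like `name: type`, that a line containing only `::` starts a new concept, and that every other line is free-form notes or a command. The names `parse_dataset` and `to_csv` are hypothetical.

```python
import csv
import io

SOURCE = """\
# Measures (aka Header, aka Columns, aka Schema)
id: string
moons: int

# Concepts (aka Rows)
::
id: mars
moons: 2
I verified moon count with Google. - BY
::
id: jupiter
moons: 63
The moons of Jupiter have their own Wikipedia Page
::
writeDataset demo.csv
"""


def parse_dataset(text):
    """Return (columns, rows) from a flat dataset; a sketch, not Scroll's parser."""
    columns, rows, current = [], [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "::":
            # Assumed separator: everything after it belongs to a new concept.
            current = {}
            rows.append(current)
            continue
        key, sep, value = line.partition(": ")
        if not sep or " " in key:
            continue  # notes, headings, and commands carry no cell data
        if current is None:
            columns.append(key)  # before the first "::" this is a measure
        elif key in columns:
            current[key] = value  # only declared measures become cells
    return columns, [row for row in rows if row]


def to_csv(columns, rows):
    """Serialize rows in measure order, with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


columns, rows = parse_dataset(SOURCE)
print(to_csv(columns, rows), end="")
```

Notes such as the Google remark are skipped because they do not match a declared measure, which is what lets prose and data share one file in this sketch.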

Overview:

How to use

FAQ

Isn't it a better idea to enhance existing spreadsheet GUIs with LLM generation capabilities?

Almost certainly. Using Scroll for datasets will be much slower and worse than future spreadsheet apps with carefully crafted LLM integrations.

However, it's important to also have simple, lower tech, timeless tools and Scroll Datasets is one of those.

Can't you do this same thing with YAML and/or Markdown?

Yes! You can easily achieve the same thing as LLMs & Scroll Datasets using LLMs & YAML, or LLMs & YAML & Markdown.

For YAML, just put your documentation and schema in YAML comments up top and then have a tiny script to read that YAML and dump CSV/TSV/JSON or whatever. YAML gives you loads of data structures to use and is widely supported in many languages. But generating HTML from the same file would require more work.
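
The "tiny script" for the YAML route might look like the sketch below. To stay dependency-free it starts from `records`, a hypothetical list of dictionaries standing in for what a YAML parser (for example PyYAML's `yaml.safe_load`) would return. The helper names and the output filenames are illustrative assumptions as well.

```python
import csv
import io
import json

# Stand-in for parsed YAML; the shape (a list of mappings) is an assumption.
records = [
    {"id": "mars", "moons": 2},
    {"id": "earth", "moons": 1, "aka": "Pale Blue Dot"},
]


def column_names(rows):
    """Collect keys in first-seen order so sparse rows still line up."""
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return names


def to_delimited(rows, delimiter):
    """Render rows as delimited text; missing cells are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=column_names(rows),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


outputs = {
    "planets.csv": to_delimited(records, ","),
    "planets.tsv": to_delimited(records, "\t"),
    "planets.json": json.dumps(records, indent=2),
}
for filename, content in outputs.items():
    print(f"--- {filename}")
    print(content)
```

Writing each entry of `outputs` to disk is then a one-line loop; it is left out here so the sketch has no side effects.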

If you want to intermix markup content with your datasets, you can use Markdown to add the marked up content and then have code sections embedding the YAML and a tiny script to parse out those YAML blocks and write your data to disk.
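
The block-extraction half of that Markdown approach can be sketched with a regular expression. This assumes the data lives in fenced code blocks whose info string is exactly `yaml`. The sample document and the name `extract_yaml_blocks` are made up for illustration, and each extracted block would still need to go through a YAML parser.

```python
import re

FENCE = "`" * 3  # built programmatically so no literal fence appears in this code

MARKDOWN = f"""\
# Planets

Mars has two moons.

{FENCE}yaml
id: mars
moons: 2
{FENCE}

Jupiter has many more.

{FENCE}yaml
id: jupiter
moons: 63
{FENCE}
"""

# Match a fence opened with the "yaml" info string and capture its body.
YAML_BLOCK = re.compile(
    rf"^{FENCE}yaml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown):
    """Return the raw text of each yaml code block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown)]


blocks = extract_yaml_blocks(MARKDOWN)
for block in blocks:
    print(block, end="")
    print("--")
```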

So, why use Scroll for storing datasets instead of YAML?

Either can do the job. I expect the Scroll design to end up being more ergonomic, but that might not be true or may be unimportant.

If you don't like Scroll's (evolving) version and want to switch, it will always be straightforward to automatically refactor to YAML.
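
As a rough illustration of such a refactor, the sketch below turns already-parsed concepts into a YAML sequence using only the standard library. Values are written as JSON-style scalars, which YAML parsers generally accept, and keys are assumed to be simple identifiers. The function name `to_yaml` and the input shape are assumptions, not part of Scroll.

```python
import json

# Hypothetical parsed concepts; in practice these would come from a Scroll file.
concepts = [
    {"id": "mars", "moons": 2},
    {"id": "earth", "moons": 1, "aka": "Pale Blue Dot"},
]


def to_yaml(rows):
    """Emit a YAML sequence of mappings, one list item per concept."""
    lines = []
    for row in rows:
        prefix = "- "  # the first key of each concept opens a new list item
        for key, value in row.items():
            # json.dumps quotes strings and leaves numbers bare.
            lines.append(f"{prefix}{key}: {json.dumps(value)}")
            prefix = "  "
    return "\n".join(lines) + "\n"


print(to_yaml(concepts), end="")
```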

What other related work is out there?

This is a simple pattern to implement, so it has likely been done a few times before. Please let me know so I can include links to, and learn from, any prior art.

What are the advanced features?

+ Planned.

What is the origin of Scroll Datasets?

LLM dataset generation is a major breakthrough in datasets. Scroll Datasets are, at best, a minor improvement. They are designed to work alongside LLMs to help solve the Dataset Needed problem.

Scroll Datasets evolved out of TrueBase. Scroll Datasets have eliminated the need for the TrueBase software (and existing TrueBase sites should be migrated to Scroll Datasets), but were informed by the TrueBase build experience.

Although Scroll Datasets are designed for a world with LLMs, the design is meant to be useful without them as well, and would also have been mildly useful 30 years ago.

What were the design goals?

Why are measures and concepts root-level features and not indented?

The normal way to implement this in Scroll would be something like:

measures
 id string
 moons int
concept
 id mars
 moons 2
concept
 id jupiter
 moons 63

The flat design was chosen for ergonomic reasons. Datasets seem like they might be useful enough to be worth breaking from Scroll convention a bit. Like all things in Scroll, Datasets are an experiment, and maybe this design will evolve.

Planets Dataset

Below is the dataset embedded in this Scroll file.

id title diameter surfaceGravity yearsToOrbitSun moons aka
mars Mars 6794 4 1.881 2
jupiter Jupiter 142984 25 11.86 63
earth Earth 12756 10 1 1 Pale Blue Dot
mercury Mercury 4879 4 0.241 0
saturn Saturn 120536 9 29.46 64
uranus Uranus 51118 8 84.01 27
venus Venus 12104 9 0.615 0
neptune Neptune 49572 11 164.79 14

Measures

id: string
title: string
diameter: int

What is the diameter of the planet?

surfaceGravity: int

What is the surface gravity of the planet?

yearsToOrbitSun: float

How many Earth years does it take for the planet to orbit the Sun?

moons: int

How many moons does the planet have?

aka: string

What are the alternative names for the planet?
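
The measure definitions above carry enough information to type-check a concept. The sketch below shows one hypothetical way a consumer of the dataset might use them; it does not describe how Scroll itself treats types or undeclared fields. The `MEASURES` mapping and the `coerce` function are illustrative names.

```python
# Measure names and types copied from the definitions above.
MEASURES = {
    "id": str,
    "title": str,
    "diameter": int,
    "surfaceGravity": int,
    "yearsToOrbitSun": float,
    "moons": int,
    "aka": str,
}


def coerce(concept):
    """Split raw string fields into (typed declared fields, undeclared fields)."""
    typed, extra = {}, {}
    for key, raw in concept.items():
        if key in MEASURES:
            typed[key] = MEASURES[key](raw)  # raises ValueError on bad input
        else:
            extra[key] = raw  # kept as-is so nothing is silently dropped
    return typed, extra


earth = {
    "id": "earth",
    "title": "Earth",
    "diameter": "12756",
    "surfaceGravity": "10",
    "yearsToOrbitSun": "1",
    "moons": "1",
    "aka": "Pale Blue Dot",
    "hasLife": "true",
}
typed, extra = coerce(earth)
print(typed)
print(extra)
```

Fields without a measure definition, such as `hasLife`, end up in `extra` rather than being discarded, so a caller can decide whether to warn, ignore, or extend the schema.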

Concepts


id: mars
title: Mars
diameter: 6794
surfaceGravity: 4
yearsToOrbitSun: 1.881
moons: 2

id: jupiter
title: Jupiter
diameter: 142984
surfaceGravity: 25
yearsToOrbitSun: 11.86
moons: 63

The moons of Jupiter have their own Wikipedia Page


id: earth
title: Earth
diameter: 12756
surfaceGravity: 10
yearsToOrbitSun: 1
moons: 1
aka: Pale Blue Dot
hasLife: true
wikipedia: https://en.wikipedia.org/wiki/Earth
age: 4500000000

Note: It was only during the 19th century that geologists realized Earth's age was at least many millions of years.


id: mercury
title: Mercury
diameter: 4879
surfaceGravity: 4
yearsToOrbitSun: 0.241
moons: 0

id: saturn
title: Saturn
diameter: 120536
surfaceGravity: 9
yearsToOrbitSun: 29.46
moons: 64

id: uranus
title: Uranus
diameter: 51118
surfaceGravity: 8
yearsToOrbitSun: 84.01
moons: 27

id: venus
title: Venus
diameter: 12104
surfaceGravity: 9
yearsToOrbitSun: 0.615
moons: 0

id: neptune
title: Neptune
diameter: 49572
surfaceGravity: 11
yearsToOrbitSun: 164.79
moons: 14
