Cheaper by the Dozen, compiling the Internet, the value of global collaboration

Continuing my analysis of the paper at https://iopscience.iop.org/article/10.3847/1538-4357/ad31a1
 
I have been treating it as a blob of text and links. The 5+ Billion humans using the Internet can find it with their browsers, but it is not easily absorbed as a piece of knowledge. There are different versions: the HTML+images at that URL, a PDF copy with all the serious limitations of that closed format, and an ePub version which only tries to create “yet another proprietary format for print”. Another group trying to control knowledge, not facilitate it.
 
I am concerned with the high cost these formats impose on humans just trying to read these “papers” (with all the attendant baggage of centuries of paper publishing and its habits and limitations). That affects the rate of global knowledge diffusion and use. It is hundreds of times slower than working with pre-analysed “knowledge”. The whole world is working at 1/1000th of its potential. (A world with too much concentrated in the hands and minds of a few.)
 
I am concerned NONE of the large language model wrapper program groups (LWPs) is even trying to compile the papers (in the sense of compile by computer for exact uses). That means, for the foreseeable future all those LWPs are going to give 30% or 70% hints to answers, and many times 100% wrong or misleading responses. They are NOT answers, they are guessing. And they are not learning.
 
Today I was trying to trace the data used in the paper and the blizzard of related studies. I am patient, I am good, I am very flexible and smart. It still took me a few days to figure out what was going on, more than the usual “academic paper chase”.
 
This is interesting. They want to model how stars and planets form. I was aware of that, but put off by the boasting and shouting. And by the horrible writing (writing should be the clear and efficient transmission of useful knowledge). I hate memorizing someone's made-up names. I hate memorizing ten sets of units for every group's favorite insider language, and “today’s fad methods in modeling”. I hate when they name some method after a person or persons, then expect the whole world to just know it. And they make no effort to use Standard Internet units.
 
Please understand what I am trying to do. I come from an old “systems analysis” background. I was a toddler when Cheaper by the Dozen came out (we were just half a dozen kids). I analyze systems and methods. I know “methods and procedures analysis”. I look at whole systems (if you leave any part out, you will be missing critical parts and the whole will not work). That is why I say “grok is to look at the whole of something”. Like “grok the internet”, “grok solar system exploration”, “grok DNA based life”, “grok all young stars and places that can form stars”. That sort of thing. It is a good method, it is efficient, it “cuts to the chase”. It “gets to the essentials”. It encourages “grok how to grok – infinitely deep”.
 
In parsing this one paper, it is not easy to be complete. It is part of a blizzard, not well written down, of many related and interlocking studies and notes. It captures the dreams and projects and hopes of many humans and corporations (any formal group of humans). Many of those pieces are difficult to find. It is like trying to pick up jello, or a billion grains of sand with tweezers and a magnifying glass, or catch flying needles with chopsticks. I have worked on many problems that take me hundreds of hours of continuous effort and focus to solve. So if I have to spend a few hundred, or even a few thousand hours at something, at my pace and efficiency, then that is how large the problem is, not related to any intrinsic difficulty.
 
Think of an object where the bonds between the elementary parts are all “chemical bonds” with pairwise bond energies under 20 electron volts per bond. Or where the energy to extract one element takes 20 eV. That is chemistry. Those small energies are why EM has to use huge tanks of liquid oxygen and fuel to make his rockets go. So if I look at a website like site:NIH.gov and see 463 Million entry points, I know it is mostly all the same. The difficulty of analyzing any one node is pretty much the same as any other. The cost is just some average cost per node times the number of nodes.
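That linear cost estimate can be written down as a few lines of arithmetic. This is only a sketch: the 463 Million node count is the figure quoted above, but the one second per node is a made-up placeholder, not a measurement, and the function name is just for illustration.

```python
def total_cost_seconds(node_count: int, avg_seconds_per_node: float) -> float:
    """Linear cost model: every node costs about the same to analyze."""
    return node_count * avg_seconds_per_node


NODES = 463_000_000             # entry points quoted in the text for site:NIH.gov
ASSUMED_SECONDS_PER_NODE = 1.0  # hypothetical placeholder, not measured

SECONDS_PER_YEAR = 365.25 * 24 * 3600

seconds = total_cost_seconds(NODES, ASSUMED_SECONDS_PER_NODE)
years = seconds / SECONDS_PER_YEAR
print(f"{seconds:.3e} s of serial work, roughly {years:.1f} years")
```

Under that assumption the size of the job scales directly with the number of nodes, which is the whole point: the problem is large, not intrinsically hard.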
 
But none of the nodes are in standard units, and the cost to translate any one raw node into standard form, so it can be used with ALL the rest, can be nearly infinite. In the world of “nuclear” and “atomic”, the bonds can be in keV and MeV (kiloelectronvolts and megaelectronvolts).
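Here is a small sketch of what “translate into standard form” could look like for just these energy units. The electron volt is exactly 1.602176634e-19 joules under the 2019 SI definitions. The function and table names are my own for illustration, and the 1 MeV value in the comparison is an example number, not taken from the paper.

```python
import math

EV_IN_JOULES = 1.602176634e-19  # exact under the 2019 SI definitions

# Multipliers that bring each unit to plain eV (illustrative table)
UNIT_TO_EV = {"eV": 1.0, "keV": 1.0e3, "MeV": 1.0e6}


def to_joules(value: float, unit: str) -> float:
    """Convert an energy in eV, keV or MeV to joules (SI)."""
    if unit not in UNIT_TO_EV:
        raise ValueError(f"unknown energy unit: {unit!r}")
    return value * UNIT_TO_EV[unit] * EV_IN_JOULES


chemical = to_joules(20, "eV")   # upper end of the chemical bond range above
nuclear = to_joules(1, "MeV")    # example nuclear-scale energy (assumed value)

print(f"20 eV  = {chemical:.3e} J")
print(f"1 MeV  = {nuclear:.3e} J")
print(f"ratio  = {nuclear / chemical:.0f}")
```

Once every value is in one unit, the comparison between any two nodes is a single division, with no insider language to memorize.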
 
[ That allows for atomic fuels that are 10s or 100s or 1000s of times chemical fuels, and materials likewise stronger. Energy does NOT all have to go into mass. Energy can be stored in rotation, vibration and in tension. I am not sure you can see what I am doing. I am working on two problems at once, which are topologically very close. The spatial distribution, velocity, complexity and many of the methods for organizing the information for one problem are nearly identical to what is needed for any other. My brother Clif would say “it is all bits” and his job is to “move bits”.]
 
The LLMs/LWPs are all using arbitrary tokens. They go look at the blobs, find common things, assign them a code and put things into collections. The problem, because they are not working with “global open tokens for all things, in all languages”, is that the names and codes they assign are all different between groups, and NOT accessible to most of the 8.1 Billion humans. And the LLMs and LWPs are all bleating out “We are doing this to make good AIs, not just to enrich ourselves”.
 
I tried to say this in a long post a few days ago. But I needed to add to it, and that is NOT possible on Twitter(X). So I deleted the Twitter(X) copy and might one day put a current copy back. I suggested to EM that he needs to stomp on his people at X (what I still call Twitter(X)) so they follow his rules, his vision, his goals, his methods and ways of looking at the world. I said it nicely: that he should have a general staff meeting, or whatever they do, and ask them to look deeply at the legacy systems and methods left over from the former owners. Also, he has so many humans using Twitter(X), and they are faced with bad tools. And “WTF is this new thing coming out without any explanation or way to ask for help?”. All I can hope is that his people are as smart as he is, and more diligent than he is, to keep at it until it is done right. I wouldn’t let so many things blow up, but then I also know the value of letting things blow up.
 
OK, translate that page. I can do it manually and be much faster myself when faced with tens of thousands of millions of such things. But it is the LLMs, LWPs, search engines, hosting companies, site navigators, site builders, site users, browsers, computers, devices – that need to change. All of them.
 
But to get truly nuclear changes, the whole world needs central resources that cannot be gamed, manipulated, corrupted, fiddled. That means incorruptible machine intelligence and algorithms, and incorruptible humans. And that is just not possible. But I think it might be the only way. Or “keep operating at 1/10,000th speed”.
 
I did clarify that knowledge cannot be completely and efficiently stored in these rigid print and screen formats like PDF, ePub, HTML, and countless “document formats” and most “data formats”. I think I have looked at and used most every type of data and algorithm storage method on the Internet and in print, and many video and sound and sensor data stream methods. I have not left out much. One human, billions of things, something won’t get looked at by me. But the world is full of people a lot smarter than me. I can tell you many of them never learned “some things are more important than life itself”.
 
This is not complete. But I compressed many of the essentials here in a form I understand. I will just have to try to explain in ways that make sense to others. It is important, or I would not take time to write.
 
Compiling human and computer languages now starts with parsing and tokenization. And those tokens need to be tied to real things that can be said and explained and shown in all human languages, in many domain specific languages. NOT IMPOSSIBLE.
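To show the idea in miniature, here is a sketch of tokens tied to shared concept identifiers across languages. Everything in it is an assumption for illustration: the registry, the IDs like C0001, and the three-language word lists are invented, and no existing LLM or standard works from this table.

```python
import re

# Hypothetical open registry: one ID per real thing, with its name in each language.
# The IDs and entries are invented for this example.
CONCEPTS = {
    "C0001": {"en": "star", "es": "estrella", "de": "Stern"},
    "C0002": {"en": "planet", "es": "planeta", "de": "Planet"},
}


def build_index(concepts):
    """Map (language, lowercased word) to the shared concept ID."""
    index = {}
    for concept_id, names in concepts.items():
        for lang, word in names.items():
            index[(lang, word.lower())] = concept_id
    return index


INDEX = build_index(CONCEPTS)


def tokenize(text, lang):
    """Split text into words and attach a concept ID where one is known."""
    words = re.findall(r"\w+", text.lower())
    return [(word, INDEX.get((lang, word))) for word in words]


def concept_ids(text, lang):
    """Keep only the IDs, dropping words the registry does not know."""
    return [cid for _, cid in tokenize(text, lang) if cid is not None]


english = concept_ids("The star and the planet", "en")
spanish = concept_ids("La estrella y el planeta", "es")
print(english, spanish, english == spanish)
```

The two sentences reduce to the same sequence of IDs, so anything learned about one is usable with the other. A real registry would need billions of entries, handling of phrases and ambiguity, and open governance; this only shows the shape of it.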
 
I will look at the DOM and Blink again. I just wish the Internet were not so dominated by narrow commercial interests. Global collaboration with open methods is so much more efficient, but it is also hard work.
 
Filed as (Cheaper by the Dozen, compiling the Internet, the value of global collaboration)
 
Richard Collins, The Internet Foundation
 
The Internet Foundation Internet policies, global issues, global open lossless data, global open collaboration

