Parsers that write parsers from examples – a global standard tokenizer for everything accessible from the Internet

Benny, the longest I have waited for a reply (and got one) is 5.5 years, so compared to that you were nearly instantaneous.

I did write a regex parser for JavaScript and experimented with several languages common on the Internet.  My brother, Clif, wrote a parser for about 30 or 40 computer languages and text data formats.  He expanded that to binary data formats as well (his business is universal translation of data; the computer and text formats came later).

Regex is hard to learn, hard to read, and hard to maintain.  As you said, it takes trees and trees of specifics for each context and situation. Not impossible for one-off problems, but tedious, and it relies too much on human memory and hand tools.
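To illustrate the point, here is a minimal sketch of the usual regex-tokenizer pattern: every token type needs its own hand-written rule, and each new context or language multiplies the rules. The names (TOKEN_SPEC, tokenize) and the tiny expression grammar are illustrative only, not any particular existing tool.

```python
import re

# Each token type needs its own hand-maintained pattern; a real
# language needs dozens more of these, varying by context.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(\.\d+)?"),   # integer or decimal
    ("NAME",   r"[A-Za-z_]\w*"),  # identifier
    ("OP",     r"[+\-*/=]"),      # single-character operators
    ("SKIP",   r"\s+"),           # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Scan text left to right, emitting (kind, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind != "SKIP":
            tokens.append((kind, m.group()))
    return tokens
```

For example, `tokenize("x = 3.5 + y")` yields the pairs NAME/OP/NUMBER/OP/NAME. The point is that all of this is hand-built per language, which is exactly the maintenance burden described above.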

I am working on nuclear data standards right now (notation, units, equations, conversions, sensors, software, simulations, design, operations, global data flows) and on mathematical symbolic and simulation methods (connecting symbolic equations to solvers and simulators that keep exact symbolic relations losslessly for large projects). I have been working on standard formats for all human knowledge every day for the last 25 years, and for about three decades before that.

Thank you for explaining what you are doing.  I do not see any way I can help you (even if I understood where you are going with what I think you are doing), except to point out what Clif is doing at CollinsSoftware.com.  His methods are complicated, but he seems to have made them work for him, and he has enough insight to build compilers for new languages and formats in a few hours.  He is still the “human in the loop” and cannot write a parser that can write a parser for a new language – from examples. The brute-force statistical methods from the AIs (I spent most of last year and this year in those groups) won’t work. They absolutely cannot work. But a B-tree kind of parser and storage would at least put the data into a tree that might, over time, pick up some good algorithms, visualizations, and mappings. A global standard tokenizer.
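The "put the data into a tree" idea above can be sketched roughly as follows: split any input into generic runs of characters by Unicode class (no per-language rules), then store the resulting token-class sequences in a prefix tree with counts, so that structure shared by many examples accumulates and can later be mined. Everything here (classify, TrieNode, add_example) is an illustrative assumption on my part, not an existing system.

```python
import unicodedata

def classify(ch):
    """Map a character to its coarse Unicode class:
    L=letter, N=number, P=punctuation, Z=space, S=symbol, ..."""
    return unicodedata.category(ch)[0]

def tokenize(text):
    """Group consecutive characters of the same class into runs.
    This needs no language-specific rules at all."""
    tokens, run, run_cls = [], "", None
    for ch in text:
        cls = classify(ch)
        if cls == run_cls:
            run += ch
        else:
            if run:
                tokens.append((run_cls, run))
            run, run_cls = ch, cls
    if run:
        tokens.append((run_cls, run))
    return tokens

class TrieNode:
    """One node of the token-class prefix tree, with a visit count."""
    def __init__(self):
        self.count = 0
        self.children = {}

def add_example(root, text):
    """Store one example's token-class sequence; shared prefixes
    across many examples accumulate counts for later analysis."""
    node = root
    for cls, _lexeme in tokenize(text):
        node = node.children.setdefault(cls, TrieNode())
        node.count += 1
```

After feeding it "abc123" and "xy987", both examples share the path letters-then-digits, so that branch of the tree carries count 2. Whether counts on such a tree are enough to recover real grammars from examples is exactly the open question; this only shows the storage side.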

I am talking to Google Bard and OpenAI ChatGPT about relativistic equations right now.  And they are really, really terrible at it. They need a complete and reliable way to do science, technology, engineering, mathematics, computing, finance, planning, databases, communication, and more. No losses, no mistakes, nothing left out or hidden.

Their people do not understand what human workers do or how to learn it efficiently. Their programmers do not understand how to see the whole of things and keep it all in mind. They still do not understand that for true AIs to “learn” they must have permanent memory, and lots of it: never forgetting anything, with everything ready at hand to make work efficient at corporate, industry, country, and global scale.

Richard

Richard K Collins

The Internet Foundation: Internet policies, global issues, global open lossless data, global open collaboration
