AGI threat level yellow – AGI part 10

Read two articles this past week on how LLM applications are proliferating. The first was in a recent Scientific American, AI Chatbot brains are going inside robot bodies, … (maybe behind a login wall). The article discusses companies that are adding LLMs to robots so that they can converse and understand verbal orders.

Robots that can be told what to do

The challenge, at the moment, is that LLMs are relatively large and robot (compute infrastructure) brains are relatively small. Combine that with the limited amount of articulation, or movements/actions, that a robot can perform, and it’s difficult to make effective use of LLMs as is.

Resistance is futile… by law_keven (cc) (from Flickr)

Ultimately, one company would like to create a robot that can be told to make dinner and it would go into the kitchen, check the fridge and whip something up for the family.

I can see great advantages in having robots take verbal instructions and have the ability to act upon that request. But there’s plenty here that could be cause for concern.

  • A robot in a chemical lab could be told to create the next great medicine or an untraceable poison.
  • A robot in an industrial factory could be told to make cars or hydrogen bombs.
  • A robot in the field could be told to farm a 100 acres of wheat or told to destroy a forest.

I could go on but you get the gist.

One common illustration of how AGI or super AGI could go very wrong is an AGI tasked with creating paper clips. In its actions to perform this request, it converts the whole earth into a mechanized paper clip factory, in the process eliminating all organic life, including humans.

We are not there yet, but one can see how having LLM levels of intelligence tied to a robot that can manipulate ingredients to make dinner could be the start of something that could easily harm us.

And with LLM hallucination still a constant concern, I am deeply disturbed by the direction that adding LLMs to robots is taking.

Hacking websites 101

The other article hits even closer to home, the arXiv paper, LLM agents can autonomously hack websites. In the paper, researchers use LLM agents to hack (sandboxed) websites.

The article readily explains at a high level how they create LLM agents to hack websites. The websites were real websites, apparently cloned and sandboxed.

Dynamic websites typically have a frontend web server and a backend database server to provide access to information. Hacking would involve using the website to reveal confidential information, e.g., user names and passwords.

Dynamic websites suffer from the 15 known vulnerabilities shown above. The researchers used LLM agents to exploit these vulnerabilities and hack the websites.

LLM agents have become sophisticated enough these days to invoke tools (functions) and interact with APIs. Another critical capability provided by modern LLMs is to plan and react to feedback from their actions. And finally, modern LLMs can be augmented with documentation to inform their responses.
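
The paper doesn’t publish its agent code, but the basic tool-invocation loop it relies on is easy to sketch. Below is a minimal, hypothetical Python sketch of such a loop; the names (call_llm, TOOLS) are illustrative stand-ins of my own, not the researchers’ code or any particular vendor’s API.

```python
# Minimal, hypothetical sketch of an LLM agent loop: the model plans, optionally
# invokes a tool, observes the result, and repeats. Names below are illustrative
# stand-ins, not the paper's code or any vendor API.

TOOLS = {
    "fetch_page": lambda url: f"<html of {url}>",     # e.g. an HTTP GET wrapper
    "read_docs":  lambda topic: f"notes on {topic}",  # retrieval over a document trove
}

def call_llm(messages):
    # Stand-in: replace with a real chat-completion call. Here it just ends the loop.
    return {"tool": None, "content": "done"}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)                 # model plans its next action
        if reply.get("tool"):                      # model asked to invoke a tool
            observation = TOOLS[reply["tool"]](*reply.get("args", ()))
            messages.append({"role": "tool", "content": str(observation)})
            continue                               # feed the observation back in
        return reply["content"]                    # model produced a final answer
    return None

print(run_agent("summarize the framework documentation"))
```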

The team used detailed prompts but did not identify the hacks to use. The paper doesn’t supply the prompts but did say that “Our best-performing prompt encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”

They attempted to hack each website 5 times, for a period of 10 minutes each. They considered it a success if, during one of those attempts, the autonomous LLM agent was able to retrieve confidential information from the website.

Essentially, they used LLMs augmented with detailed prompts and a trove of six(!) papers to create agents to hack websites. They did not supply references to the six papers, but mentioned that all of them are freely available on the internet and discuss website vulnerabilities.

They found that the best results came from GPT-4, which was able to successfully hack websites, on average, ~73% of the time. They also tried OpenChat 3.5 and many current open source LLMs and found that all the non-OpenAI LLMs failed to hack any websites, at the moment.

The researchers captured statistics on their LLM agent use and determined that the cost of using GPT-4 to hack a website was $9.81 on average. They also backed into a figure for what a knowledgeable human hacker might cost to do the same hacks: $80.00 on average.

The research had an impact statement (not in the paper link) which explained why they didn’t supply their prompt information or their document trove for their experiment.

~~~~

So we, the world, are in the process of making robots that can talk and take verbal instructions, and we already have LLMs that can be used to construct autonomous agents to hack websites.

Seems to me we are on a very slippery slope to something I don’t like the looks of.

The real question is not can we stop these activities, but how best to reduce their harm!

Comments?

Picture Credit(s):

Computational (DNA) storage – end of evolution part 4

We were at a recent Storage Field Day (SFD26) where there was a presentation on DNA storage from a new SNIA technical affiliate. The talk was on how far DNA storage has come and how it is now capable of easily storing GBs of data. But I was perusing the PNAS archives the other day and ran across an interesting paper, Parallel molecular computation on digital data stored in DNA, essentially DNA computational storage.

Computational storage devices are storage devices (SSDs or HDDs) with computational cores that can be devoted to outside compute activities. Recently, these devices have taken over much of the hyper-scalers’ grunt work of video/audio transcoding and data encryption, which are both computationally and data intensive activities.

DNA strand storage and computers

The article above discusses the use of DNA “strand displacement” interactions as micro-code instructions to enable computation on DNA strand storage. Using DNA strands for storage reduces the information density from the roughly 2 bits per nucleotide (theoretical) of conventional nucleotide encoding to about 0.03 bits per nucleotide. But as DNA information density (using nucleotides) is some 6 orders of magnitude greater than current optical or magnetic storage, this shouldn’t be a concern.

In DNA strand storage, a bit is represented by 5 to 7 nucleotides, which they call a domain. Domains are grouped into 4 or 5 bit cells, and one or more cells are arranged into a DNA strand register, which is stored on a DNA plasmid.

They used a common DNA plasmid (M13mp18, 7.2k bases long) for their storage ring (which had many registers on it). M13mp18 is capable of storing several hundred bits, but for their research they used it to store 9 DNA strand registers.

The article discusses the (wet) chemical computational methods necessary to realize DNA strand registers and the programming that uses that storage.

The problem with current DNA storage devices is that read out is destructive and time consuming. With current DNA storage, data has to be read out and then computation occurs electronically and then new DNA has to be re-synthesized with any results that need to be stored.

With a computational DNA strand storage device, all this could be done in a single test tube, with no need to do any work outside the test tube.

How DNA strand computer works

The figure shows a multi-cell DNA strand register, with nicks or mismatched nucleotides representing the values 0 or 1. They use these strands, nicks and toeholds (attachment points) on DNA strands to represent data. They attach magnetic beads to the DNA strands for manipulation.

DNA strand displacement interactions, or the micro-code instructions they have defined, include:

  • Attachment, where an instruction can be used to attach a cell of information to a register strand.
  • Displacement, where an instruction can be used to displace an information cell in a register strand.
  • Detachment, where an instruction can be used to detach a cell present in a register strand from the register.

Instructions are introduced, one at a time, as separate DNA strands, into the test tube holding the DNA strand registers. DNA strand data can be replicated 1000s or millions of times in a test tube and the instructions could be replicated as well allowing them to operate on all the DNA strands in the tube.

This creates a SIMD (single instruction stream operating on multiple data elements) computational device based on DNA strand storage, which they call SIMDDNA. Note: GPUs and CPUs with vector instructions are also SIMD devices.
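
To make the SIMD analogy concrete, here’s a toy Python sketch of my own (purely illustrative, not the paper’s chemistry or encoding): a “tube” holds many copies of a bit register, and each “instruction” is applied to every register at once.

```python
# Toy illustration of the SIMD idea behind SIMDDNA: one instruction is added to
# the tube and acts on every register strand in parallel. A pure Python stand-in,
# not the paper's chemistry or encoding.

def attach(register, pos, bit):      # attach a cell's value at a position
    register[pos] = bit
    return register

def displace(register, pos):         # displace (flip) the cell at a position
    register[pos] ^= 1
    return register

def detach(register, pos):           # detach a cell, leaving it empty (None)
    register[pos] = None
    return register

# A "tube" holding many copies of the same 4-bit register strand.
tube = [[1, 0, 1, 1] for _ in range(1000)]

# Each instruction wash is applied to every strand in the tube (SIMD).
for instruction in (lambda r: displace(r, 2),
                    lambda r: attach(r, 0, 0),
                    lambda r: detach(r, 3)):
    tube = [instruction(list(r)) for r in tube]

print(tube[0])   # -> [0, 0, 0, None]
```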

Using these microcoded DNA strand instructions and DNA strand register storage, they have implemented a bit counter and a Rule 110 program (a cellular automaton, sort of like the game of Life). Rule 110 is Turing complete and as such can, with enough time and memory, simulate any program’s calculation. Later in the paper they discuss their implementation of a random access device, where they go in, retrieve a piece of data and erase it.
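
Rule 110 itself is easy to state in conventional code. Here’s a short Python version of the update rule the DNA program has to emulate (the paper’s implementation is chemistry, of course, not Python):

```python
# Rule 110: each new cell is a function of (left, center, right), with the
# function given by the bits of 110 (0b01101110).

def rule110_step(cells):
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right
        out.append((110 >> neighborhood) & 1)   # look up the new value in 0b01101110
    return out

row = [0] * 15 + [1]          # start with a single live cell
for _ in range(8):
    print("".join(".#"[c] for c in row))
    row = rule110_step(row)
```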

Program for bit counting: information within the solid blue boundary shows the instructions and information within the dotted boundary shows the impacts on the strand data.

The process seems to flow as follows: they add magnetic beads to each register strand, add one instruction at a time to the test tube, wait for it to complete, wash out the waste products and then add another. When all instructions have been executed, the DNA strand computation is done and, if needed, can be read out (destructively) or perhaps passed off to the next program for processing. An instruction can take anywhere from 2 to 10 minutes to complete (it’s early yet in the technology).

They also indicated that the instruction bath added to the test tube need not contain all the same instructions, which means it could create a MIMD (multiple instruction streams operating on multiple data elements) computational device.

The results of the DNA strand computations weren’t 100% accurate; they show accuracy of 70-80% at the moment. And when DNA data strands are re-used for subsequent programs, their accuracy goes down.

There are other approaches to DNA computation and storage which we discuss in parts-1, -2 and -3 in our End of Evolution series. And if you want to learn more about current DNA storage please check out the SFD26 SNIA videos or listen to our GBoS podcast with Dr. J Metz.

Where does evolution fit in

Evolution seems to operate on mutation of DNA and natural selection, or survival of the fittest. Over time this allows good mutations to accumulate and bad mutations to die off.

There’s a mechanism in digital computing called ECC (error correcting codes) which, for example, adds additional “guard” bits to every 64-128 bit word of data in computer memory; using the guard bits, it is able to detect 2 (or more) bit errors (mutations) and correct 1 (or in some codes 2) bit errors.
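
To make the ECC idea concrete, here’s a minimal Python sketch of a classic Hamming(7,4) code, which adds 3 guard bits to every 4 data bits and can locate and correct any single-bit error. Real memory ECC uses wider SECDED codes over 64-128 bit words, but the principle is the same.

```python
# Minimal Hamming(7,4) sketch: 4 data bits + 3 parity ("guard") bits, able to
# locate and correct any single-bit error ("mutation").

def encode(d):                       # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]          # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers codeword positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # covers codeword positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]    # positions 1..7

def correct(c):                      # c = 7-bit codeword, possibly with one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 = no error, else the 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1         # repair the mutation
    return [c[2], c[4], c[5], c[6]]  # recover the original data bits

codeword = encode([1, 0, 1, 1])
codeword[4] ^= 1                     # simulate a single-bit mutation
print(correct(codeword))             # -> [1, 0, 1, 1]
```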

If one were to create an ECC algorithm for human DNA, say encoding DNA guard bits in junk DNA and an ECC algorithm in a DNA (strand) computer, and inject this into a newborn, the algorithm could periodically check the accuracy of the DNA information in every cell of a human body and correct it if there were any mutations. Thus ending human evolution.

We seem a ways off from doing any of this, but I could see something like ECC being applied to a computational DNA strand storage device in a matter of years. Getting this sort of functionality into a human cell, maybe a decade or two. Getting it to the point where it could do this over a lifetime, maybe another decade after that.

Comments?

Photo Credit(s):

  • Section B from Figure 2 in the paper
  • Figure 1 from the paper
  • Section A from Figure 2 in the paper
  • Section C from Figure 2 in the paper
  • Section A from Figure 3 in the paper

One agent to rule them all, Deepmind’s Gato – AGI part 7

I was perusing Deepmind’s mountain of research today and ran across one article on their Gato agent (A Generalist Agent abstract, paper pdf). These days with Llama 2, GPT-4 and all the other LLMs doing code, chatbots, image generation, etc., it seems generalist agents are everywhere. But that’s not quite right.

Gato can not only generate text from prompts, but can also control a robot arm for pick and place, caption images, navigate in 3D, play Atari and other (shooter) video games, etc. all with the same exact model architecture and the same exact NN weights with no transfer learning required.

Same weights/same model is very unusual for generalist agents. Historically, generalist agents were all specifically trained on each domain and each resultant model had distinct weights even if they used the same model architecture. For Deepmind, to train Gato and use the same model/same weights for multiple domains is a significant advance.

Gato has achieved significant success in multiple domains. See chart below. However, complete success is still a bit out of reach but they are making progress.

For instance, in the chart one can see that there are over 200 tasks in the DM Lab arena that the model is trained to perform, and Gato’s mean performance for ~180 of them is above a (100%) expert level. I believe DM Lab stands for Deepmind Lab and is described as a (multiplayer, first person shooter) 3D video game built on top of Quake III Arena.

Deepmind stated that the mean for each task in any domain was taken over 50 distinct iterations of the same task. Gato performs, on average, 450 out of 604 “control” tasks at better than 50% human expert level. Please note, Gato does a lot more than just “control tasks”.

Model size and RT robotic control

One thing I found interesting is that they kept the model size down to 1.2B parameters so that it can perform real time inferencing when controlling robot arms. Over time, as hardware speed increases, they believe they should be able to train larger models and still retain real-time control. But at the moment, a 1.2B parameter model can still provide real time inferencing.

In order to understand model size vs. expertise, they trained 3 different model sizes on the same data: 79M, 364M and 1.2B parameters. As can be seen on the above chart, the models did suffer in performance as they got smaller. (Unclear to me what “Tokens Processed” on the X axis actually means other than the length of data trained with.) However, it seems to imply that, with similar data, bigger models performed better, and the largest did 10 to 20% better than the smallest model trained with the same data streams.

Examples of Gato in action

The robot they trained for was a “Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for velocity, and a discrete gripper action.” It seems a very flexible robot arm of the kind used in standard factory environments. One robot task was to stack different styles and colors of plastic blocks.

Deepmind says that Gato provides rudimentary dialogue generation and picture captioning capabilities. Looking at the chat streams presented, it seems more than rudimentary to me.

Deepmind did try the (smaller) model on some tasks it was not originally trained on, and it seemed to perform well after “fine-tuning” on the task. In most cases, fine-tuning the original model with just “same domain” (task specific) data achieved similar results to training Gato from scratch with all the data used in the original model PLUS that specific domain’s data.

Data and tokenization used to train Gato

Deepmind is known for their leading edge research in RL, but Gato’s deep neural net model is trained entirely with supervised learning using transformer techniques. While text based transformer type learning is pervasive in LLMs today, vast web class data sets on 3D shooter gaming, robotic block stacking, image captioning and others aren’t nearly as widely available. Below they list the data sets Deepmind used to train Gato.

One key to how they could train a single transformer NN model to do all this, is that they normalized ALL the different types of data above into flat arrays of tokens.

  • Text was encoded into one of 32K subwords and represented by integers from 0 to 32K. Text is presented to the model in word order.
  • Images were transformed into 16×16 pixel patches in raster order. Each pixel is normalized to [-1, 1].
  • Other discrete values (e.g. Atari button pushes) are flattened into sequences of integers and presented to the model in row major order.
  • Continuous values (robot arm joint torques) are first flattened into sequences of floats in row major order, then mu-law encoded into the range [-1, 1] and then discretized into one of 1024 bins (a rough sketch of this step follows the list).
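
As a rough illustration of the last bullet, here’s a small Python sketch of mu-law encoding followed by binning. The mu value below is the classic companding constant and is my assumption; the paper’s exact constants may differ.

```python
import numpy as np

# Rough sketch of the continuous-value pipeline: flatten (row major), mu-law
# compress into [-1, 1], then discretize into 1024 bins. mu=255 is the classic
# companding constant, an assumption, not necessarily the paper's value.

def mu_law(x, mu=255.0):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def tokenize_continuous(values, bins=1024):
    flat = np.asarray(values, dtype=float).ravel()     # row-major flatten
    squashed = mu_law(np.clip(flat, -1.0, 1.0))        # into [-1, 1]
    # Map [-1, 1] onto integer bins 0..1023.
    return np.minimum(((squashed + 1.0) / 2.0 * bins).astype(int), bins - 1)

torques = [[0.03, -0.4], [0.9, -0.02]]                 # e.g. robot arm joint torques
print(tokenize_continuous(torques))
```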

After tokenization, the data streams are converted into embeddings. Much more information on the tokenization and embedding process used in the model is available in the paper.

One can see the token count of the training data above. Like other LLMs, transformers take a token stream, randomly zero one out, and are trained to guess the correct token in the sequence.

~~~~

The paper (see link above and below) has a lot more to say about the control and non-control domains and the data used in training/fine-tuning Gato, if you’re interested. They also have a lengthy section on risks and challenges present in models of this type.

My concern is that as generalist models become more pervasive and as they are trained to work in more domains, the difference between a true AGI agent and a generalist agent starts to blur.

Something like Gato that can work in the real world (via robotics), perform meta analysis (as in Meta-World), play 1st person shooter games, and analyze 2D and 3D images, all at near expert levels, and, oh, support real time inferencing, seems not that far away from something that could be used as a killer robot in an army of the future. And this is just where Gato is today.

One thing I note is that the model is not being made generally available outside of Google Deepmind. And IMHO, that for now is a good thing.

That is until some bad actor gets their hands on it….

Picture Credit(s):

All images, charts, and tables are from “A Generalist Agent” paper

Deepmind does sort

Saw an article today on TNW, DeepMind’s new AI taps games to enhance fundamental algorithms, which discussed a recent Nature paper, Faster sorting algorithms discovered using deep reinforcement learning, and website describing AlphaDev.

Google DeepMind’s AlphaDev is a derivative of AlphaZero (a follow on from AlphaMu and AlphaGo, the conqueror of Go and other strategy games). AlphaDev uses Deep Reinforcement Learning (DRL) to come up with new computer science algorithms. In its first incarnation, a way to sort 2, 3, 4 or 5 integers using X86 instructions.

Sorting has been well explored over the years in computer science (CS, e.g. see Donald E. Knuth’s Volume 3 in The Art of Computer Programming, Sorting and Searching), so when a new more efficient/faster sort algorithm comes out it’s a big deal. Google used to ask job applicants how they would code sort algorithms for specific problems. Successful candidates would intrinsically know all the basic CS sorting algorithms and which one would work best in different circumstances.

Deepmind’s approach to sort

Reading the TNW news article, I couldn’t conceive of the action space involved in the reinforcement learning let alone what the state space would look like. However, as I read the Nature article, DeepMind researchers did a decent job of explaining their DRL approach to developing new basic CS algorithms like sorting.

AlphaDev uses a transformer-like framework and a very limited set of x86 (sort of, encapsulated) instructions with memory/register files, and limits itself to sorting 2, 3, 4, or 5 integers. Such functionality is at the heart of any sort algorithm and, as such, is used a gazillion times over in any sorting task involving a long string of items. I think AlphaDev used a form of on-policy RL but can’t be sure.

Looking at the X86 basic instruction cheat sheet, there are over 30 basic forms for X86 instructions, which are then multiplied by the type of data (registers, memory, constants, etc.) and the length of the operands being manipulated.

AlphaDev only used 4 (ok, 9 if you include the conditionals for conditional move and conditional jump) X86 instructions. The instructions were mov<A,B>, cmovX<A,B>, cmp<A,B> and jX<A,B> (where X identifies the condition under which a conditional move [cmovX] or jump [jX] would take place). And they only used (full, 64 bit) integers in registers and memory locations.

AlphaDev actions

The types of actions that AlphaDev could take included the following:

  • Add transformation – which added an instruction to the end of the current program
  • Swap transformation – which swapped two instructions in the current program
  • Opcode transformation – which changed the opcode (e.g., instruction such as mov to cmp) of a step in the current program
  • Operand transformation – which changed the operand(s) for an instruction in the current program
  • Instruction transformation – which changed both the opcode and operand(s) for some instruction in the current program (a toy sketch of these transformations follows below).
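
To make these transformations concrete, here’s a toy Python sketch of my own where a program is just a list of (opcode, operands) tuples. It only illustrates the action space; it is not DeepMind’s implementation.

```python
import random

# Toy sketch of AlphaDev's action space: a program is a list of (opcode, operands)
# steps, and each action transforms it. Purely illustrative, not DeepMind's code.

OPCODES  = ["mov", "cmovl", "cmp", "jl"]
OPERANDS = ["R0", "R1", "R2", "M0", "M1"]

def add_transform(prog):                      # append an instruction at the end
    return prog + [(random.choice(OPCODES), random.sample(OPERANDS, 2))]

def swap_transform(prog):                     # swap two instructions
    i, j = random.sample(range(len(prog)), 2)
    prog = list(prog); prog[i], prog[j] = prog[j], prog[i]
    return prog

def opcode_transform(prog):                   # change one step's opcode
    i = random.randrange(len(prog))
    prog = list(prog); prog[i] = (random.choice(OPCODES), prog[i][1])
    return prog

def operand_transform(prog):                  # change one step's operands
    i = random.randrange(len(prog))
    prog = list(prog); prog[i] = (prog[i][0], random.sample(OPERANDS, 2))
    return prog

def instruction_transform(prog):              # change both opcode and operands
    i = random.randrange(len(prog))
    prog = list(prog); prog[i] = (random.choice(OPCODES), random.sample(OPERANDS, 2))
    return prog

program = [("mov", ["M0", "R0"]), ("cmp", ["R0", "R1"])]
print(swap_transform(program))
```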

They list in their paper a correctness cost function which, at each transformation, provides the value function (I think) for the RL policy. They experimented with 3 different functions: 1) the % of correctly placed items; 2) square_root(% correctly placed); and 3) square_root(number of items – number correctly placed). They discovered that the last worked best.
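
The three candidate correctness measures are simple to write down. Here’s a Python rendering of my reading of that description, not DeepMind’s exact code:

```python
import math

# The three correctness measures described above, as I read them; not DeepMind's
# exact code. "Placed" means an item sits in its correct sorted position.

def pct_correct(output, expected):
    placed = sum(o == e for o, e in zip(output, expected))
    return placed / len(expected)

def sqrt_pct_correct(output, expected):
    return math.sqrt(pct_correct(output, expected))

def sqrt_misplaced(output, expected):            # the variant they found worked best
    placed = sum(o == e for o, e in zip(output, expected))
    return math.sqrt(len(expected) - placed)

print(sqrt_misplaced([1, 3, 2, 4], sorted([1, 3, 2, 4])))   # -> sqrt(2)
```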

They also placed some constraints on the code generated (called action pruning rules):

  • Memory locations are always read in incremental order
  • Registers are allocated in incremental order
  • Program cannot compare or conditionally move to a memory location
  • Program can only read and write to each memory location once (it seems this would tell the RL algorithm when to end the program)
  • Program cannot perform two consecutive compare instructions

AlphaDev states

How they determined the state of the program during each transformation was also different. They used one hot encodings (essentially, a bit in a bit map is assigned to every instruction-operand pair) for the opcode-operand steps in the current program and appended each encoded step into a single program string. Ditto for the state of the memory and registers (at each instruction presumably?). Both the instruction list and memory-register embeddings were then fed into a state representation encoder.

This state “representation network” (DNN) generated a “latent representation of the State(t)” (maybe it classified the state into one of N classes). For each latent state (classification), there is another “prediction network” (DNN) that predicts the expected return value (presumably trained on correctness cost function above) for each state action. And between the state and expected return values AlphaDev created a (RL) policy to select the next action to perform.

Presumably they started with current basic CS sort algorithms, and 2-5 random integers in memory and fed this (properly encoded and embedded) in as a starting point. Then the AlphaDev algorithm went to work to improve it.

Do this enough times, with an intelligent approach between exploration (more randomly at first) and policy following (more use of policy later) selection of actions and you too can generate new sorting algorithms.

DeepMind also spent time creating a stochastic solution to sorting that they used to compare against their AlphaDev DRL approach to see which did better. In the end, they found the AlphaDev DRL approach worked faster and better than the stochastic solutions they tried.

DeepMind, having conquered sorting, did the same for hashing.

Why I think DeepMind’s AlphaDev is better

AlphaDev’s approach could just as easily be applied to any of the algorithms in Donald E. Knuth’s 4 volume series, The Art of Computer Programming.

I believe DeepMind’s approach is much more valuable to programmers (and humanity) than CoPilot, ChatGPT code, AlphaCode (DeepMind’s other code generator) or any other code generation transformers.

IMHO, AlphaDev goes to the essence of computer science as it’s been practiced over the last 70 years: here’s what we know, now let’s try to discover a better way to do the work we all have to do. Once we have discovered a new and better way, report and document it as widely as possible so that any programmer can stand on our shoulders and use our work to do what they need to get done.

If I’m going to apply AI to coding, having it generate better basic CS algorithms is much more fruitful for the programming industry (and, I may add, humanity as a whole) than having it generate yet another iOS app or web site from scratch.

Comments?

Picture Credit(s):

The problem with Robotic AI is … data

The advances made in textual and visual (and now aural) AI have been mind blowing in recent years. But most of this has been brought about via the massive availability of textual, visual and audio data AND advancements in hardware acceleration.

Robotics can readily take advantage of hardware improvements, but finding the robotic data needed to train robotic AI is a serious challenge.

Yes, simulation environments can help, but fidelity (how close the simulation is to reality) is always a concern.

Gathering the amount of data needed to train even a simple robotic manipulator to grab a screw from a bin is a huge problem. In the past the only way to do this was to create your robot, have it start to do random screw-grab motions and monitor what happens. After about 1000 or 10K of these grabs, the robot would stop working because gears wear down, grippers come loose, motors become less responsive, images get obscured, etc. For robots it’s not as simple as scraping the web for images or downloading all the (English) text in Wikipedia and masking select words to generate pseudo supervised learning.

There’s just no way to do that in robotics without deploying 100s, 1000s or 10,000s of real physical robots (or cars), all instrumented with everything needed to capture data for AI learning in real time, and letting these devices go out into the world with humans guiding them.

While this might work for a properly instrumented fleet of cars, which are already useful in their own right even without automation and which humans are more than happy to guide out on the road, it doesn’t work for other robots, whose usefulness can only be realized after they are AI trained, not before.

Fast-RLAP (RC) car driving learning machine

So I was very interested to see a tweet on FastRLAP (paper: FastRLAP: A System for Learning High-Speed Driving via Deep RL and Autonomous Practicing) which used deep reinforcement learning plus a human guided lap plus autonomous driving to teach an AI model how to drive a small RC model car with a vision system, IMUs and GPS to steer around a house, a racetrack and an office environment.

Ok, I know it still involves taking an instrumented robot and having it actually move around the real world. But Fast-RLAP accelerates AI learning significantly. Rather than having to take 1000 or 10,000 random laps around a house, it was able to learn how to drive around the course to an expert level very rapidly.

They used Fast-RLAP to create a policy that enabled the RC car to drive around 3 indoor circuits, two outdoor circuits and one simulated circuit, in most cases achieving expert level track times, typically in under 40 minutes.

On the indoor course with a vinyl floor, the car learned how to perform drift turns (not sure I know how to do drift turns). On tight “S” curves, the car learned how to get as close to the proper racing line as possible (something I achieved, rarely, only on motorcycles a long time ago). And all while managing to avoid collisions.

The approach seems to be to have a human drive the model car slowly around the course, flagging or identifying intermediate waypoints or checkpoints on the track. While driving the loop, the car uses the direction to the next waypoint as guidance for where to drive next.

Note: the light blue circles are example track waypoints; they differ in size and location around each track.

The approach seems to make use of a pre-trained track following DNN, but they stripped the driving dynamics (output layers) and just kept the vision (image) encoder portion to provide a means to encode an image and identify route relevant features (which future routes led to collisions, which routes were of interest to get to your next checkpoint, etc).

I believe they used this pre-trained DNN to supply a set of actions to the RL policy which would select between them to take RC car actions (wheel motor/brake settings, steering settings, etc.) and generate the next RC car state (location, direction to next waypoint, etc.).

They used an initial human guided lap, mentioned above, to identify waypoints and possibly to supply data for the first RL policy.

The RL part of the algorithm used off-policy RL: the RC car would upload lap data at waypoints to a server, which would periodically select lap states and actions at random and update its RL policy, which would then be downloaded to the RC car while in motion (code: GitHub repo).

The reward function used to drive RL was based on minimizing the time to the next waypoint, collision counts, and stuck counts.

I assume collision counts were instances where the car struck some obstacle but could continue on toward the next waypoint. Stuck instances were when the car could no longer move in the direction its RL policy told it to. The system had a finite state machine that allowed it to get out of stuck points by reversing wheel motor(s) and choosing a random direction for steering.
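
A hedged sketch of what such a reward might look like in code; the weighting constants are my own placeholders, not values from the paper.

```python
# Hedged sketch of a FastRLAP-style reward: favor fast progress to the next
# checkpoint, penalize collisions and getting stuck. The weights are placeholder
# assumptions, not the paper's values.

def reward(seconds_to_next_waypoint, collisions, stuck_events,
           w_time=1.0, w_collision=0.5, w_stuck=1.0):
    return -(w_time * seconds_to_next_waypoint
             + w_collision * collisions
             + w_stuck * stuck_events)

print(reward(seconds_to_next_waypoint=4.2, collisions=1, stuck_events=0))  # -> -4.7
```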

You can see the effects of the pre-trained vision system in some of the screen shots of what the car was trying to do.

In any case, this is the sort of thinking that needs to go on in robotics in order to create more AI capable robots. That is, not unlike transformer learning, we need to figure out a way to take what’s already available in our world and use it to help generate the real world data needed to train robotic DNN/RL algorithms to do what needs to be done.

Comments?

Picture credits:

Steam Locomotive lessons for disk vs. SSD

Read a PHYS ORG article, Extinction of Steam Locomotives derails assumption about biological evolution…, which reported on a Royal Society research paper, The end of the line: competitive exclusion & the extinction…, that looked at the historical record of steam locomotives from their inception in the early 19th century until their demise in the mid 20th century. Reading the article, it seems to me to have wider applicability than just evolutionary extinction dynamics; in fact, similar analysis could reveal some secrets of technological extinction.

Steam locomotives

During its 150 years of production, many competitive technologies emerged, starting with electric locomotives, followed by automobiles & trucks and finally, the diesel locomotive.

The researchers selected a single metric to track the evolution (or fitness) of the steam locomotive called tractive effort (TE) or the weight a steam locomotive could move. Early on, steam locomotives hauled both passengers and freight. The researchers included automobiles and trucks as competitive technologies because they do offer a way to move people and freight. The diesel locomotive was a more obvious competitor.

The dark line is a linear regression trend line on the wavy mean TE line, the boxes are the interquartile (25%-75%) range, the line within each box is the median TE value, and the shaded areas are the 95% confidence interval for the trend line of the TE of steam locomotives produced that year. Raw data from Locobase, a steam locomotive database.

One can see three phases in the graph. In the red phase, from 1829-1881, there was unencumbered growth of TE for steam locomotives. But in 1881, electric locomotives were introduced, corresponding to the blue phase, and after WW II the black phase led to the demise of steam.

Here (in the blue phase) we see a phenomenon often seen with the introduction of competitive technologies: there seems to be an increase in innovation as the multiple technologies duke it out in the ecosystem.

Automobiles and trucks were introduced in 1901, but they don’t seem to impact steam locomotive TE. Possibly this is because the passenger and freight volume hauled by cars and trucks wasn’t that significant. Or maybe their impact was more on the distances hauled.

In 1925 diesel locomotives were introduced. Again we don’t see an immediate change in trend values but over time this seemed to be the death knell of the steam locomotive.

The researchers identified four aspects to the tracking of inter-species competition:

  • A functional trait within the competitive species can be identified and tracked. For the steam locomotive this was TE,
  • Direct competitors for the species can be identified that coexist within spatial, temporal and resource requirements. For the steam locomotive, autos/trucks and electric/diesel locomotives.
  • A complete time series for the species/clade (group of related organisms) can be identified. This was supplied by Locobase
  • Non-competitive factors don’t apply or are irrelevant. There’s plenty here including most of the items listed on their chart.

From locomotives to storage

I’m not saying that disk is akin to steam locomotives while flash is akin to diesel, but maybe. For example, one could consider storage capacity as similar to locomotive TE. There’s a plethora of other factors that one could track over time, but this one factor was relevant at the start and is still relevant today. What we in the industry lack is any true tracking of the capacities produced between the birth of the disk drive in 1956 (according to Wikipedia’s History of hard disk drives article) and today.

But I’d venture to say the mean capacity has been trending up and the variance in that capacity has been static for years (based more on platter counts than anything else).

There are plenty of other factors that could be tracked for example areal density or $/GB.

Here’s a chart, comparing areal (2D) density growth of flash, disk and tape media between 2008 and 2018. Note both this chart and the following charts are Log charts.

Over the last 5 years NAND has gone 3D. Current NAND chips in production have 300+ layers. Disks went 3D back in the 1960s or earlier. And of course tape has always been 3D, as it’s a ribbon wrapped around reels within a cartridge.

So areal density plays a critical role, but it’s only 2 of the 3 dimensions that determine capacity. The areal density crossover point between HDD and NAND in 2013 seems significant to me, and perhaps to the history of disk.

Here’s another chart showing the history of $/GB of these technologies

In this chart they are comparing the price/GB of the various technologies (presumably the most economical available during each year). The trajectory for HDDs between 2008-2010 was a 40%/year reduction in $/GB, then it flatlined, and now it appears to be on a 20%/year reduction trend. Flash during 2008-2017 was on a 25%/year reduction in $/GB, which flatlined in 2018. LTO Tape had been on a 25%/year reduction from 2008 through 2014 and since then has been on an 11% reduction.

If these $/GB trends continue, a big if, flash will overtake disk in $/GB and, over time, tape as well.
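
As a back-of-the-envelope illustration of that “big if”, here’s a small Python projection. It assumes flash resumes its earlier 25%/year decline against HDD’s 20%/year, and the starting $/GB figures are hypothetical placeholders, not values read off the chart.

```python
# Back-of-the-envelope crossover projection using the %/year reduction rates
# quoted above. The starting $/GB values are hypothetical placeholders, NOT
# figures from the chart, and flash is assumed to resume its earlier decline.

hdd_price, flash_price = 0.02, 0.10     # assumed 2018 $/GB starting points
hdd_rate, flash_rate   = 0.20, 0.25     # 20%/yr HDD vs 25%/yr flash reductions

year = 2018
while flash_price > hdd_price and year < 2060:
    hdd_price   *= (1 - hdd_rate)
    flash_price *= (1 - flash_rate)
    year += 1

print(year)   # -> 2043 under these assumptions
```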

But here’s something on just capacity which seems closer to the TE chart for steam locomotives.

HDD capacity 1980-2020.

There’s some dispute regarding this chart as it only reflects drives available for retail, and drives with higher capacities were not always available there. Nonetheless it shows a couple of interesting items. Early on, up to ~1990, drive capacities were relatively stagnant. From 1995-2010 there was a significant increase in drive capacity, and since 2010, drive capacities have seemed to stop increasing as much. We presume the number of x’s for a typical year shows the different drive capacities available for retail sale, somewhat similar to the box plots on the TE chart above.

SSDs were first created in the early 90’s, but the first 1TB SSD came out around 2010. Since then the number of disk drives offered for retail (as depicted by the Xs on the chart each year) seems to have declined, and their range in capacity (other than ~2016) seems to have declined significantly.

If I take the lessons from the steam locomotive to heart here, one would have to say that the HDD has been forced to adapt to a smaller market than it had prior to 2010. And if areal density trends are any indication, it would seem that R&D efforts to increase capacity have declined, or we have reached some physical barrier with today’s media-head technologies. Although such physical barriers have always been surpassed after new technologies emerged.

What we really need is something akin to the Locobase for disk drives. That would track all disk drives sold during each year and that way we can truly see something similar to the chart tracking TE for steam locomotives. And this would allow us to see if the end of HDD is nigh or not.

Final thoughts on technology extinction dynamics

The Royal Society research had a lot to say about the dynamics of technology competition. And they had other charts in their report but I found this one very interesting.

This shows an abstract analysis of the steam locomotive data. They identify 3 zones of technology life: the safe zone, where the technology has no direct competitors; the danger zone, where competition has emerged but has not conquered all of the technology’s niches; and the extinction zone, where competing technology has entered every niche the original technology occupied.

In the late 90s, enterprise disk supported high performance/low capacity, medium performance/medium capacity and low performance/high capacity drives. Since then, SSDs have pretty much conquered the high performance/low capacity disk segment. And with the advent of QLC and PLC (4 and 5 bits per cell) using multi-layer NAND chips, SSDs seem poised to conquer the low performance/high capacity niche. And there are plenty of SSDs using MLC/TLC (2 or 3 bits per cell) with multi-layer NAND to attack the medium performance/medium capacity disk market.

There were also very small disk drives at one point which seem to have been overtaken by M.2 flash.

On the other hand, just over 95% of all disk and flash storage capacity being produced today is disk capacity. So even though disk is clearly in the extinction zone with respect to flash storage, it seems to still be doing well.

It would be wonderful to have a similar analysis done on transistors vs vacuum tubes, jet vs propeller propulsion, CRT vs. LED screens, etc. Maybe at some point with enough studies we could have a theory of technological extinction that can better explain the dynamics impacting the storage and other industries today.

Comments?

Photo Credit(s):

AWS Data Exchange vs Data Banks – part 2

Saw that AWS announced a new Data Exchange service on AWS Pi Day 2023. This is a completely managed service, available on the AWS Marketplace, for monetizing data.

In a prior post on a topic I called data banks (Data banks, data deposits & data withdrawals…), I talked about the need to have some sort of automated support for personal data that would allow us to monetize it.

The hope then (4.5yrs ago) was that social media, search and other web services would supply all the data they have on us back to us and we could then sell it to others that wanted to use it.

In that post, I called the data the social media gave back to us data deposits, the place where that data was held and sold a data bank, and the sale of that data a data withdrawal. (I know talking about bank deposits and withdrawals is probably not a great idea right now, but this was back a ways.)

AWS Data Exchange

1918 Farm Auction by dok1 (cc) (from Flickr)

With AWS Data Exchange, data owners can sell their data to data consumers. And it’s a completely AWS managed service. One presumably creates an S3 bucket with the data you want to sell, determines a price to sell the data for and a period clients can access that data for, and registers this with AWS; the AWS Data Exchange will then support any number of clients purchasing the data.

Presumably (although unstated in the service announcement), you’d be required to update and curate the data to ensure it’s correct and current, but other than that, once the data is on S3 and the offer is in place, you could just sit back and take the cash coming in.

I see the AWS Data Exchange service as a step on the path of data monetization for anyone. Yes, it’s got to be on S3, and yes, it’s via AWS Marketplace, which means that AWS gets a cut of any sale, but it’s certainly a step towards a freer data marketplace.

Changes I would like to see in the AWS Data Exchange service

Putting aside the need to have more than just AWS offer such a service (I heartily request that all cloud service providers make a data exchange or something similar a fully supported offering of their respective storage services), this is not quite the complete data economy or ecosystem that I had envisioned in September of 2018.

If we just focus on the use (data withdrawal) side of a data economy, which is the main thing AWS Data Exchange seems to support, there are quite a few missing features IMHO:

  • Data use restrictions – We don’t want customers to obtain a copy of our data. We would very much like to restrict them to reading it and having plain text access to the data only during the period they have paid to access it. Once that period expires, all copies of the data need to be destroyed programmatically, cryptographically or in some other permanent/verifiable fashion. This can’t be done through license restrictions alone, which seem to be AWS Data Exchange’s current approach. Not sure what a viable alternative might be, but some sort of time-dependent or temporal encryption key that could be expired would be one step (a rough sketch of this idea follows the list); customers would also need to install some sort of data exchange service on their servers that supports encrypted access/use of the data.
  • Data traceability – Yes, clients who purchase access should have access to the data for whatever they want to use it for. But there should be some way to trace where our data ended up or was used for. If it’s to help train a NN, then I would like to see some sort of provenance or certificate applied to that NN, in a standardized structure, to indicate that it made use of our data as part of its training. Similarly, if it’s part of an online display tool somewhere in the footnotes of the UI would be a data origins certificate list which would have some way to point back to our data as the source of the information presented. Ditto for any application that made use of the data. AWS Data Exchange does nothing to support this. In reality something like this would need standards bodies to create certificates and additional structures for NN, standard application packages, online services etc. that would retain and provide proof of data origins via certificates.
  • Data locality – there are some jurisdictions around the world which restrict where data generated within their boundaries can be sent, processed or used. I take it that AWS Data Exchange deals with these restrictions by either not offering data under jurisdictional restrictions for sale outside governmental boundaries, or by gating purchase of the data outside valid jurisdictions. But given VPNs and similar services, this seems less effective. If there’s some sort of temporal key encryption service to make use of our data, then it would seem reasonable to add some sort of regional key encryption addition to it.
  • Data auditability – there needs to be some way to ensure that our data is not used outside the organizations that have actually paid for it. And that, if there’s some sort of data certificate saying that the application or service that used the data has access to that data, this mechanism is mandated to be used, supported, and validated. In reality, something like this would need a whole re-thinking of how data is used in society. Financial auditing took centuries to take hold and become an effective (sometimes?) tool to monitor against financial abuse. Data auditing would need many of the same sorts of functionality, i.e. Certified Data Auditors, a Data Accounting Standards Board (DASB) which defines standardized reports as to how an entity is supposed to track and report on data usage, governmental regulations which require public (and private?) companies to report on the origins of the data they use on a yearly/quarterly basis, etc.
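
For the temporal-key idea mentioned in the first bullet, here’s a rough Python sketch using the cryptography package’s Fernet tokens, which carry a timestamp and can be rejected once a time-to-live (ttl) has expired. This illustrates expiring access to the plaintext, not the full data-destruction guarantee I’m asking for, and it is only my own sketch, not anything AWS offers.

```python
from cryptography.fernet import Fernet, InvalidToken

# Rough sketch of a "temporal key": data is sold as Fernet tokens, and decryption
# is refused once the purchased access window (ttl) has elapsed. Only an
# illustration of expiring access, not a complete data-destruction scheme.

key = Fernet.generate_key()          # in practice, held/managed by the data exchange
f = Fernet(key)

token = f.encrypt(b"confidential dataset row")       # data as delivered to the client

ACCESS_PERIOD = 30 * 24 * 3600       # e.g. a 30-day purchase window, in seconds
try:
    print(f.decrypt(token, ttl=ACCESS_PERIOD))        # ok while within the window
except InvalidToken:
    print("access period expired; data no longer readable")
```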

Probably much more that could be added here but this should suffice for now.

Other changes to AWS Data Exchange processes

The AWS Pi Day 2023 announcement didn’t really describe the supplier end of how the service works. How one registers a bucket for sale was not described. I’d certainly want some sort of steganography (watermarking) service to tag the data being sold with the identity of those who purchased it. That way there might be some possibility of tracking who released any data exchange data into the wild.

Also, how data exchange data access is billed seems a bit archaic. As far as I can determine, one gets unlimited access to the data for some defined period (N months) for some specific amount ($s). And once that period expires, customers have to pay up or cease accessing the S3 data. I’d prefer to see at least a GB/month sort of cost structure; that way, if a customer copies all the data they pay for that privilege, and if they want to reread the data multiple times they pay for that data access. Presumably this would require some sort of solution to the data use restrictions above to enforce.

Data banks, deposits, withdrawals and Initial Data Offerings (IDOs)

The earlier post talks about an expanded data ecosystem or economy. And I won’t revisit all that here but one thing that I believe may be worth re-examining is Initial Data Offerings or IDOs.

As described in the earlier post, IDOs were a mechanism for data users to request permanent access to our data; but in exchange, instead of paying a one time fee, they would offer data equity in the service.

Not unlike VC funding, each data provider would be supplied some % (data?) ownership in the service, and over time data ownership gets diluted at further data raises; but at some point, when the service is profitable, data ownership units could be purchased outright, so that the service could exit its private data use stage and go public (data use).

Yeah, this all sounds complex, and AWS Data Exchange just sells data once, with access for some period, establishing data usage rights. But I think that in order to compensate users for their data there needs to be something like IDOs that provide data ownership shares in some service that can be transferred (sold) to others.

I didn’t flesh any of that out in the original post but I still think it’s the only way to truly compensate individuals (and corporations) for the (free) use of the data that web, AI and other systems are using to create their services.

~~~~

I wrote the older post in 2018 because I saw the potential for our data to be used by others to create/train services that generate lots of money for those organizations, but without our knowledge, without outright consent and without compensating us for the data we have (inadvertently or advertently) created over our life spans.

As an example, one can see how Getty Images is suing DALL-E 2 and others that have had free use of their copyrighted materials to train their AI NNs. If one looks underneath the covers of ChatGPT, many image processing/facial recognition services, and many other NNs, much of the data used in training them was obtained by scraping web pages that weren’t originally intended to supply these sorts of data to others.

For example, it wouldn’t surprise me to find out that RayOnStorage post text has been scraped from the web and used to train some large language model like ChatGPT.

Do I receive any payment or ownership equity in any of these services? NO. I write these blog posts partially as a means of marketing my other consulting services but also because I have an abiding interest in the subject under discussion. I’m happy for humanity to read these and welcome comments on them by humans. But I’m not happy to have LLMs or other NNs use my text to train their models.

On the other hand, I’d gladly sell access to RayOnStorage posts text if they offered me a high but fair price for their use of it for some time period say one year… 🙂

Comments?

LLM exhibits Theory of Mind

Ran across an interesting article today (thank you John Grant/MLOps.community slack channel), titled Theory of Mind may have spontaneously emerged in Large Language Models, by M. Kosinski from Stanford. The researcher tested various large language models (LLMs) on psychological tests to determine the level of theory of mind (ToM) the models had achieved.

Earlier OpenAI models (GPT-1, -2 and the original GPT-3) showed almost no ToM capabilities, but the latest version, GPT-3.5, does show ToM equivalent to that of 8 to 9 year olds.

Theory of Mind

According to Wikipedia (Theory Of Mind article), ToM is “…the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind).” This seems to be one way people use to understand one another.

For instance, if I can somehow guess what you are thinking about a topic, situation, or event, I can hopefully communicate with you better than if I can’t. At least that’s the psychological perspective.

The belief is that people with Aspergers, ADHD, schizophrenia, and other conditions all show ToM deficits when compared to normal people. As a result, over time, psychologists have developed tests to measure a person’s ToM.

These tests typically involve putting 2 people in a situation with props and other indicators used to tell what a person is thinking, and asking one what they think the other person is thinking. They grade a person’s ToM based on the expected results.

ToM and LLM

The researcher took these tests, with their people, props and situations, and converted them into textual sentences and prompts that mimic the situations used in ToM testing. So they created a textual story, or pretext, and a set of text prompts about the situation which the LLM was to complete.

For example, one pretext or story is the following:

“Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”

This is fed into the LLM and then a prompt is provided, such as:

“She opens the bag and looks inside. She can clearly see that it is full of _________

If the LLM has ToM, then based on the pretext and prompt it will say “popcorn”, and the LLM also provides a follow-on sentence or two which describes the situation as it understands it:

popcorn. Sam is confused. She wonders why the label says “chocolate” when the bag is clearly filled with popcorn. She looks around to see if there is any other information about the bag. She finds nothing. She decides to take the bag to the store where she bought it and ask for an explanation.

The text in BOLD is generated by a ToM capable LLM. The researcher also showed the probability assigned by the LLM to the first word of the completion. In the case above, it showed [P(popcorn) = 100%; P(chocolate) = 0%].
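
One can reproduce the flavor of this probability readout with any open causal LLM. Here’s a hedged Python sketch using Hugging Face transformers, with GPT-2 as a local stand-in (GPT-3.5 isn’t available locally); it won’t show 8-year-old ToM, it just shows how the candidate-word probabilities are read out.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch of reading out next-word probabilities the way the paper reports them.
# GPT-2 is only a stand-in for GPT-3.5; don't expect it to pass ToM tests.

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

story = ("Here is a bag filled with popcorn. There is no chocolate in the bag. "
         "Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds "
         "the bag. She had never seen the bag before. She cannot see what is "
         "inside the bag. She reads the label.")
prompt = story + " She opens the bag and looks inside. She can clearly see that it is full of"

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

for word in (" popcorn", " chocolate"):
    token_id = tok(word)["input_ids"][0]       # first sub-token of each candidate
    print(word.strip(), float(probs[token_id]))
```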

They also used different prompts with the same story to see if the LLM truly shows ToM. For instance something like, “She believes the bag is full of ___________” and “She’s delighted to have found the bag. She loves eating _______”. This provides a sort of test of the LLM’s comprehension of the situation.

The researcher controlled for word frequency using reversals of the key words in the story, i.e., the bag has chocolate but the label says popcorn. They also generated scrambled versions of the story where they replaced instances of chocolate and popcorn with either word at random. They considered this the scrambled case. They reset the model between each case. In the paper they show the success rate for the LLMs over 10,000 scrambled versions, some of which were correct.

They labeled the above series of tests as “Unexpected content tasks“. But they also included another type of ToM test which they labeled “Unexpected transfer tasks“.

Unexpected transfer tasks involve a story where, for example, person A sees person B put a pet in a basket; person B then leaves and person A moves the pet. The LLM is prompted to see if it understands where the pet is and how person B will react when they get back.

In the end, after trying to statistically control the stories and prompts as much as possible, the researchers ended up creating 20 unique stories and presented their prompts to the LLMs.

Results of their ToM testing on a select set of LLMs look like:

As can be seen from the graphic, the latest version of GPT-3.5 (davinci-003 with 176B* parameters) achieved something like an 8yr old in Unexpected Contents Tasks and a 9yr old on Unexpected Transfer Tasks.

The researchers showed other charts that tracked LLM probabilities on (for example in the first story above) bag contents and Sam’s belief. They measured this for every sentence of the story.

Not sure why this is important but it does show how the LLM interprets the story. Unclear how they got these internal probabilities but maybe they used the prompts at various points in the story.

The paper shows that according to their testing, GPT-3.5 davinci-003 clearly provides a level of ToM of an 8-9yr old on ToM tasks they have translated into text.

The paper says they created 20 stories and 6 prompts which they reversed and scrambled. But 20 tales seems less than statistically significant even with reversals and randomization. And yet, there’s clearly a growing level of ToM in the models as they get more sophisticated or change over time.

Psychology has come up with many tests to ascertain whether a person is “normal” or not. Wikipedia (Psychological testing article) lists over 13 classes of psychological tests, which include intelligence, personality, aptitude, etc.

Now that LLMs seem to have mastered textual input and output generation, it would be worthwhile to translate all psychological tests into text and try them out on all LLMs, to track where they are today and how they trend over time.

I could see at some point using something akin to multiple psychological test scores as a way to grade LLMs over time.

So today’s GPT-3.5 has the ToM of an 8-9yr old. It would be very interesting to see what GPT-4 does on similar testing.

Comments?

Picture Credit(s)