DeepMind takes on poker & Scotland Yard

Read an article the other day (DeepMind makes bet on AI system that can play poker, chess, Go, and more) about a new DeepMind game playing program that used a new approach to taking on perfect and imperfect information games with the same algorithms.

As you may recall, DeepMind prior game playing programs, AlphaZero and MuZero played perfect information games chess, shoji, & Go and achieved top rankings in all of them. These were all based on reinforcement learning and advanced search. .

Perfect information games have no hidden information, that is all the information needed to play a game is visible to all players (see wikipedia Perfect information article). Imperfect information games have private or hidden information, only visible to one or a select set of players. In card playing, any card that’s not shown to all players, would represent hidden information. The other difference in imperfect games is that players attempt to keep their private information hidden as long as possible.

The latest DeepMind paper (see: Player of Games Arxiv paper) discusses a new approach to automated game playing that works for both perfect and imperfect information games. DeepMind’s latest game playing program is called Player of Games (PoG).

As many may know, Texas hold’em is a form of poker where everyone is dealt two cards down and five cards are dealt up, that everyone shares (see: Texas hold’em and Betting in poker articles on wikipedia). Betting happens after the two down cards are dealt, after the next 3 up cards (called the “flop”) are dealt, then after each of the remaining 2 up cards are dealt. Players select any of the (2 down and 5 up) cards to create the best 5 card poker hand. Betting is based on a blind (sort of minimal bet). PoG plays as a single player, performing all the betting as well as card playing for Texas hold’em. No limit betting says there’s no limit (maximum) to the amount of a bet in the game.

Scotland Yard is a board game where detectives chase down a criminal (Mr. X) on the run across the city of London (see wikipedia Scotland Yard (board game) article). Detectives each get 23 transportation tickets for taxies (11), busses (8), and underground trains (4). The game takes place on a board layout of London and starts with each detective and the criminal selecting a card with their hidden position on the board. The criminal gets (not quite, but almost) an unlimited amount of transportation tickets plus 5 (in USA) universal tickets (which can be used to take ferries as well as any other form of transport). Every time (except when using universal tickets) the criminal moves, he reveals his form of transportation. And 5 times during the game the criminal also reveals his current location. The detective that finds the criminal wins.

I assume all my readers know how to play chess and Go (or at least understand them).

While MuZero and AlphaZero used reinforcement learning for training and sophisticated search for in game play, PoG needed to do something different due to the imperfect (or hidden) information present in the hold’em and Scotland Yard games.

How PoG is different

In imperfect information games, it’s important to hide private information. In poker when I got a great hand, I raised my betting levels extensively. But this often caused my opponents to withdraw from betting unless they had a great hand as well. I sometimes think that if I were to bet more consistently and only at the last betting round, bet big when I have a good hand, I might win more $. No doubt, why I don’t play poker anymore.

Like AlphaZero and MuZero, PoG also uses reinforcement learning through self game play but adds something they call Counterfactual Regret (CFR) Minimization to their game trees.

In addition to normally selecting and computing a value (reward) for the optimal move as in reinforcement learning, PoG uses CFR minimization to compute values (rewards) for all moves not taken during every stage in a game, for each player. As such, PoG computes possible rewards for the optimal move at a stage (step, move) in a game plus the values for all the regret (counterfactual or other) moves for all players. CFR minimization attempts to minimize the regret move values and maximize move optimal values at each move, for each player in a game.

CFR minimization is used during training for a game in self-play as well as during actual game play to generate sub-trees from wherever the game happens to be. PoG uses a depth limited CFR minimization to generate game sub-trees during game play which helps to reduce the time it takes to determine the best move for all players. Read the ArXiv paper to learn more.

The challenge with this approach is that it will never be as good as pure reinforcement learning + advanced search for perfect games, such as chess and Go. For example, below we show Exploitability ratings for various PoG training levels for Leduc Poker and Scotland Yard. Exploitability levels are one way to measure how good the player is playing. Lower is better.

Perfect play (in an imperfect information game) would have an Exploitability of 0. The charts show that the more training done the better the game play by PoG for (Leduc) poker and Scotland Yard. (Leduc poker is a simplified poker game with 6 cards and limited betting).

On the other hand, for perfect games the results were ok, but not stellar. Scockfish is the current non-reinforcement learning, chess playing champion. Gnugo and Pachi are non-reinforcement learning, Go playing programs. In tables below, they use a relative ranking based on a 0 baseline for chess (Stockfish with 1 thread and 100msec think time) and Go (GnuGo). Higher is better.


So, yes PoG can do well in imperfect information games with decent training and ok (but much much better than I and probably the vast majority of humans), in perfect information games.

Why concern ourselves with imperfect games, The world is chock full of imperfect information games. They seem to occur everywhere, military strategy, sport play, finance, etc. In fact, perfect games are the exception in real world situations. Thus, any advance to play multiple imperfect information games better is yet another small step towards AGI.

Photo Credit(s):

BEHAVIOR, an in-home robot, benchmark

As my readers probably already know, I’m a long time benchmark geek. So when I recently read an article out of Stanford (AI Experts Establish the “North Star” for Domestic Robotics Field) where a research team there developed a new robotic benchmark, I was interested. The new robotics benchmark is called BEHAVIOR which was documented in an article (see: BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments). It essentially uses real world data to identify domestic work activities that any robot would need to perform in a home.

The problems with robot benchmarks

The problem with benchmarks are multi-faceted:

  • How realistic are the workloads used to evaluate the systems being measured?
  • How accurate are the metrics used to rank and judge benchmark submissions?
  • How costly/complex is it to run a benchmark?
  • How are submissions audited and are they reproducible?.
  • Where are benchmark results reported and are they public?

And of course robotics brings in it’s own issues that makes benchmarking more difficult:

  • What sensors does the robot have to understand how to complete tasks?
  • What manipulators does the robot have to perform the tasks required of it?
  • Do the robots move in the environment and if so, how do the robots move?
  • Does the robot perform the task in the real world on in a simulated environment.

And of course, when using a simulated environment, how realistic is it.

BEHAVIOR with iGibson (see below) seem to answer many of these concerns for an in home robot benchmarking.


First, BEHAVIOR’s home making tasks were selected from an American Time Use Survey maintained by the USA Bureau of Labor Statistics which identifies tasks Americans perform in their homes. With BEHAVIOR 1.0 there are 100 tasks ranging from building a fruit basket to cleaning a toilet, and just about everything in between. I didn’t see any cooking or mixing drinks tasks but maybe those will be added.

Second, BEHAVIOR uses a predicate logic, called BDDL (BEHAVIOR Domain Definition Language) to define initial conditions for tasks such as tables, chairs, books, etc located in the room, where objects need to be placed, and successful completion goals or what task completion should look like.

BEHAVIOR uses 15 different rooms or scenes in their benchmark, such as a kitchen, garage, study, etc. Each of the 100 tasks are performed in a specific room.

BEHAVIOR incorporates 1217 different objects in 391 categories. Once initial conditions are defined for a task, BEHAVIOR essentially randomly selects different object for the task and randomly locates them throughout the room.

In order to run the benchmark, one could conceivably create a real room, with all the objects and have them placed according to BEHAVIOR BDDL’s randomly assigned locations with a robot physically present in the room and have it perform the assigned task OR one could use a simulation engine and have the robot run the task in the simulation environment, with simulated room, objects and robot.

It appears as if BEHAVIOR could operate in any robotics simulation environment but has been currently implemented in Stanford’s open source robotics simulation engine called iGibson 2.0 (see: iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks and iGibson 2.0 website). iGibson uses the Bullet real time physics engine for realistic physical environment simulation.

A robot operating within iGibson is provided a 3D rendering of the room and objects in images or LIDAR sensor scans. It can then identify the objects that it needs to manipulate to perform the tasks. One can define the robot simulated sensors and manipulators in iGibnot 2.0 and it’s written in Python, is open source (GitHub Repo) and can be installed to run on (Ubuntu 16.04) Linux, Windows (10) or Mac (10.15) systems.

Finally, BEHAVIOR uses a set of metrics to determine how well a robot has performed its assigned task. Their first metric is success score defined as the fraction of goal conditions satisfied by the robot performing the task. Such as the number of dishes properly cleaned and placed in the drying rack divided by the total number of dishes for a “washing dishes” task. And their second metric is a set of efficiency metrics, like time to complete a task, sum total of object distance moved during the task, how well objects are arranged at task completion (is the toilet seat down…), etc.

Another feature of iGibson 2.0 is that it offers the ability to record a human (in VR) doing a task in its simulated environment. So if your robotic system is able to learn by example, then iGibson could be used to provide training data for an activity.


A couple of additions to the BEHAVIOR benchmark/iGibson simulation environment that I would like to see:

  • There ought to be a way to construct a house/apartment where multiple rooms are arranged in a hierarchy, i.e., rooms associated with floors with connections using hallways, doors, stairs, etc. between them. This way one could conceivably have a define a set of homes/apartments (let’s say 5) that a robot would perform its tasks in.
  • They need a task list to drive robot activities. Assume that there’s some amount of time let’s say 8-12 hours that a robot is active and construct a series of tasks that need to be accomplished during that period.
  • Robots should be placed in the rooms/apartments/homes at random with random orientation and then they would have to navigate through rooms/passageways to the rooms to perform the tasks.
  • They need to add pet/human avatars in the rooms throughout a home. These would represent real time obstacles to task completion/navigation as well as add more tasks associated with caring for pets/humans.
  • They need the ability to add non-home rooms that could encompass factory floors, emergency response debris fields, grocery stores, etc. and their own unique set of tasks for each of these so that it could be used as a benchmark for more than just domestic robots.

Aside from the above additions to BEHAVIOR/iGibson 2.0, there’s the question of the organization that manages the benchmark and submissions. There needs to be a website/place to publish benchmark results for a robot AND a mechanism to audit results for accuracy to insure fair play.

Typically this would be associated with an organization responsible for publishing and auditing submissions as well as guide further development of BEHAVIOR/iGibson 2.0. BEHAVIOR 1.0 is not the end but it’s a great start at providing realistic tasks that any domestic robot would need to perform. 

Benchmarks have always aided the development and assessment of new technologies. Having a in home robot benchmark like BEHAVIOR makes getting domestic robots that do what we want them to do a more likely possibility someday.

There’s a new benchmark in town and it signals the dawning of the domestic robot age.

Photo Credit(s):

Dell EMC PowerStore X and the Edge – TFDxDell

This past summer I attended a virtual TFDxDell event where there was a number of sessions discussing Dell EMC technologies for the enterprise. One session sort of struck a nerve, the Dell EMC PowerStore session and I have finally figured out what interested me most in their talk, their PowerStore X appliances and AppsON technologies

What is AppsON and PowerStore X appliance?

Essentially PowerStore X with AppsON has an onboard ESXi hypervisor which allows customers to run vSphere VMs inside the storage system with direct vVol (I assume) access to PowerStore data storage without having to go out over a (storage) network.

PowerStore X ESXi is a little behind the most recent VMware vSphere releases (at least 30 days) but it’s current enough for most shops. In non-PowerStore X appliances, PowerStoreOS runs as containers but in PowerStore X, PowerStoreOS storage functionality runs as VMs, just like any other VMs running on its ESXi hypervisor.

Moreover, PowerStore X can still service IOs from other non-PowerStore X resident VMs or bare metal applications running in the environment. In this way you get all the data services of an enterprise class storage system, that also run VMs.

With PowerStore OS 2.0 they have added scale out to AppsON. That is any PowerStore X (1000X, 3000X, 5000X or 7000X) appliance, in a PowerStore X cluster, can have their VMs move from one appliance to another using vSphere vMotion. This means that as your PowerStore X storage clusters grow, you can rebalance VM application workloads across the cluster. A PowerStore X cluster can contain up to 4 PowerStore X appliances.

PowerStore’s heritage goes back quite a ways at Dell and EMC. Prior versions of EMC Unity storage and some of its progenitors had the ability to run applications on the storage itself. But by running an ESXi hypervisor on PowerStore X appliances, it takes all this to a whole new level.

Why would anyone want AppsON?

It’s taken me sometime to understand why anyone would want to use AppsON and I have concluded that the edge might be the best environment to deploy it.

Recent VMware enhancements have reduced minimum node configurations for edge environments to 2 servers. It’s unclear to me whether a single PowerStore X appliance with AppsON is one server or two but, for the moment lets assume its just one. This means that a minimum VMware vSphere edge deployment could use 1 PowerStore X and 1 standalone, ESXi server.

In such an environment, customers could run their data intensive VMs directly on the PowerStore X and some of their non-data intensive VMs on the standalone server. But the flexibility exists to vMotion VMs from one to the other as demand dictates.

But does the edge need storage?

Yes, some do. For instance, take 5G. it enables a whole new class of mobile services and many of them can be quite data intensive. 5G is being deployed around the world as mini-data centers in cell towers. Unclear whether these data centers run vSphere but I’m sure VMware is trying their hardest to make that happen. With vSphere running your 5G mini-datacenter, PowerStore X could make a smart addition.

Then there’s all the smart cars, which are creating TBs of sensor data every time they take to the road. You’re probably not going to have a PowerStore appliance in your smart car (at least anytime soon) but they just might have one at the local service station.

And maybe given all the smart devices in your home, smart cars, smart appliances, smart robots, etc., there’s going to be a whole lot of data generated from your smart home. Having something like PowerStore X in your smart home’s mini-data center would offer a place to hold all that data and to do some processing (compressing maybe) before sending it up to the cloud.


We have just two more questions for Dell EMC,

  1. Shouldn’t the base PowerStore appliance be called PowerStore K?
  2. Shouldn’t customers be allowed to run their own K8s container apps on their PowerStore K just as easily as running VMs in their PowerStore X?

Legal Disclosure: TechFieldDay and Dell provided gifts to all participants (including me) for the TFDxDell event.

Photo credit(s):

  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event

Facebook’s (Meta) Kangaroo, a better cache for billions of small objects

Read an article this week in Blocks and Files, Facebook’s Kangaroo jumps over flash limitations which spiked my interest and I went and searched for more info on this and found a fb blog post, Kangaroo: A new flash cache optimized for tiny objects which sent me to an ACM SOSP (Syposium on O/S Principles) best paper of 2021, Caching billions of tiny objects on flash.

First, as you may recall flash has inherent limitations when it comes to writing. The more writes to a flash device the more NAND cells start to fail over time. Flash devices are only rated for some amount (of standard, ~4KB) block writes, For example, the Micron 5300 Max SSD only supports 3-5 (4KB blocks) DWPD (drive writes per day). So, a 2TB Micron Max 5300 SSD can only sustain from ~1.5 to 2.4B 4KB block writes per day. Now that seems more than sufficient for most work but when somebody like fb, using the SSD as a object cache, writes a few billion or more 100B(yte) objects and does this day in or day out, can consume an SSD in no time. Especially if they are writing one 100 B object per block

So there’s got to be a better way to cache small objects into bigger blocks. Their paper talks of two prior approaches:

  • Log structured storage – here multiple 100B objects are stored in a single a 4KB block and iwritten out with one IO rather than 40. This works fairly well but the index ,which maps an object key, to a log location, takes up a lot of memory space. If your caching ~3B 100B objects in a logs and each object index takes 16 bytes that’s a data space of 48GB.
  • Associative set storage – here each object is hashed into a set of (one or more) storage blocks and is stored there. In this case there’s no DRAM index but you do need a quick way to determine if an object is in the set storage or not. This can be done with bloom filters (see: wikipedia article on bloom filters). So if each associative set stores 400 objects and one needs to store 3B objects one needs a 30 MB of bloom filters (assuming 4bytes each). The only problem with associative sets is that when one adds an element to a set. the set has to be rewritten. So if over time you add 400 objects to a set you are writing that set 400 times. All of which eats into the DWPD budget for the flash storage.

In Kangaroo, fb engineers have combined the best of both of these together and added a small DRAM cache.

How does it work?

Their 1st tier is a DRAM cache, which is ~1% of the capacity of the whole object cache. Objects are inserted into the DRAM cache first and are evicted in a least recently used fashion, that is object’s that have not been used in the longest time are moved out of this cache and are written to the next layer (not quite but get to that in a moment).

Their 2nd tier is a log structured system, at ~5% of cache capacity. They call this a KLog and it consists of a ring of 4KB blocks on SSD, with a DRAM index telling where each object is located on the ring.. Objects come in and are buffered together into a 4KB block and are written to the next empty slot in the ring with its DRAM index updated accordingly. Objects are evicted from Klog in such a way that a group of them, that would be located in the same associative set and are LRU, can all be evicted at the same time. They have structured the Klog DRAM index so that it makes finding all these objects easy. Also any log structured system needs to deal with garbage collection, Let’s say you evict 5 objects in a 4K block, that leaves 35 that are still good. Garbage collection will read a number of these partially full blocks and mash all the good objects together leaving free space for new objects that need to be cached.

The 3rd and final tier is a set associative store, they call the Kset that uses bloom filters to show object presence. For this tier, an object’s key is hashed to find a block to put it in, the block is read and the object inserted and the block rewritten. Objects are evicted out of the set associative store based on LRU within a block. The bloom filters are used to determine if the object exists in an set associative block.

There are a few items missing from the above description. As can be seen in Figure 3B above, Kangaroo can jettison objects that are LRUed out of DRAM instead of adding them to the Klog. The paper suggests this can be done purely at random, say only admit, into the Klog, 95% of the objects at random being LRUed from DRAM. The jettison threshold for Klog to Kset is different. Here they will jettison single object sets. That is if there were only one object that would be evicted and written to a set, it’s jettisoned rather than saved in the Kset. The engineers call this a Kset threshold of 2 (indicating minimum number of objects in a single set that can be moved to Kset)..

While understanding an objects LRU is fairly easy if you have a DRAM index element for each block, it’s much harder when there’s no individual object index available, as in Kset.

To deal with tracking LRU in the Kset, fb engineers created a RRIParoo index with a DRAM index portion and a flash resident index portion.

  • RRIParoo’s DRAM index is effectively a 40 byte bit map which contains one bit per object, corresponding to its location in the block. A bit on in this DRAM bitmap indicates that the corresponding object has been referenced since the last time the flash resident index has been re-written. .
  • RRIParoo’s flash resident index contains 3 bit integers, each one corresponding to an object in the block. This integer represents how many clock ticks, it has been since the corresponding object has been referenced. When the need arises to add an object to a full block, the object clock counters in that block’s RRIP flash index are all incremented until one has gotten to the oldest time frame b’111′ or 7. It is this object that is evicted.

New objects are given an arbitrary clock tick count say b’001′ or 1 (as shown in Fig. 6, in the paper they use b’110′ or 6), which is not too high to be evicted right away but not too low to be considered highly referenced.

How well does Kangaroo perform

According to the paper using the same flash storage and DRAM, it can reduce cache miss ratio by 29% over set associative or log structured cache’s alone. They tested this by using simulations of real world activity on their fb social network trees.

The engineers did some sensitivity testing using various Kangaroo algorithm parameters to see how sensitive read miss rates were to Klog admission percentage, RRIParoo flash index element (clock tick counter) size, Klog capacity and Kset admission threshold.

Kangaroo performance read miss rate sensitivity to various algorithm parameters

Applications of the technology

Obviously this is great for Twitter and facebook/meta as both of these deal with vast volumes of small data objects. But databases, Kafka data streams, IoT data, etc all deal with small blocks of data and can benefit from better caching that Kangaroo offers.

Storage could also use something similar only in this case, a) the objects aren’t small and b) the cache is all in memory. DRAM indexes for storage caching, especially when we have TBs of DRAM cache, can be still be significant, especially if an index element is kept for each block in cache. So the technique could also be deployed for large storage caches as well.

Then again, similar techniques could be used to provide caching for multiple tiers of storage. Say DRAM cache, SSD Log cache and SSD associative set cache for data blocks with the blocks actually stored on large disks or QLC/PLC SSDs.

Photo credit(s):

NASA’s journey to the cloud – part 1

Read an article the other day, NASA Turns to the Cloud for Help With Next-Generation Earth Missions about how NASA was had started to migrate all their data to the cloud and intended to store all new data there as well. The hope is that researchers would no longer need to download NASA data but rather could access it directly using cloud compute resources.

It turns out that newer earth science satellites are generating so much data that hosting all this data is becoming a challenge and with the quantities being discussed, researchers downloading the data, to perform research in their own environments may take days.

Until recently, earth science data has been hosted and downloadable from NASA, ESA and other space organization sites. For example, see NASA’s GHCR DAAC (Global Hydrometerological Resource Center Distributed Active Archive Center), ESA EarthOnline, JAXA GPM website, etc. Generally one could download a time series of data from any of their prior and current earth/planetary science missions without too much trouble.

The Land Processes Distributed Active Archive Center (LP DAAC) archives and distributes Global Forest Cover Change (GFCC) data products through the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) ( Program….

But NASA’s newest earth science satellites will be generating lot’s of data. For instance, the SWOT (Surface Water and Ocean Topography) mission data load will be 20TB/day and the NISAR (NASA-Indian Synthetic Aperture Radar) mission data load will be 80TB/day. And it’s only getting worse as more missions with newer instruments come online.

NASA estimates that, over time, they will store 247PB of data in their EarthData Cloud. At the moment, they have already migrated some (all of ASF [Alaska Satellite Facility] DAAC and some of PO.DAAC [Physical Ocean]) of their Earth Science data to AWS (us-west-2) and over time all of it will migrate there.

NASA will eat any egress charges for EOSDIS data and are also paying any and all hosting fees to storage the data in AWS. Unclear whether they are using standard S3 or S3-Intelligent Tiering. And presumably they are using S3 replication to ensure they don’t lose DAAC data in the cloud, but I don’t see any evidence of that in the literature I’ve read. Of course this doubles the storage costs for their 247PB of DAAC data.

Access to all this data is available to anyone with an EarthData login. There you can register for a profile to access NASA earth sciences data.

NASA’s EarthData also offers a number of AWS cloud based services to help one access this data:

  • EarthData search – filtered search facility to access NASA EarthData by platform (e.g. satellite), instrument (e.g. camera/visual data), organization (e.g. NASA/JPL), etc.
  • EarthData Common Metadata Repository – API driven metadata repository that ” catalogs all data and service metadata records for NASA’s EOSDIS (Earth Observing System Data and Information System) system” data, that can be accessed by anyone, which includes programatic access to EarthData search.
  • EarthData Harmony – which is a EarthData Jupyter notebook examples and API documentation to perform research on earth science data in the EarthData cloud.

One reason to movie EOSDIS DAAC data to the cloud is to allow researchers to not have to download data to run their analysis. By using in cloud EC2 compute instances, they can run their research in AWS with direct , high speed access to the EarthData.

Of course, the researcher would need to purchase their EC2 compute facility directly from AWS. w. NASA publishes a sort of AWS pricing primer for researchers to use AWS EC2 compute to do research directly on the data in the cloud. Also NASA offers a series of tutorials on how to use the AWS cloud for doing research on NASA DAAC data.

Where to from here?

I find this all somewhat discouraging. Yes it’s the Gov’t but one needs to wonder what the overall costs of hosting NASA DAAC data on the AWS cloud will be over the long haul. Most organizations use the cloud to prototype and scale up services but once these services have stabilized, theymigrate them back to onprem/CoLoinfrastructure. See for example, Dropbox’s move away from the [AWS] cloud for ~600PB of data.

I get it, the public cloud allows for nearly infinite data scaleability. But cloud storage costs is not cheap, especially when you are talking about 100s of PBs. And in today’s world, with a whole bunch of open source solutions for object storage and services, one can almost recreate any cloud service in your own data center, at much lower price.

Sure it will still take IT infrastructure and personnel to put it all together. But NASA doesn’t seem to be lacking in infrastructure or IT personnel. Even if you are enamored with AWS services and software infrastructure, one can always run AWS Outpost in your data centers. And DAAC services seem to be pretty stable over time. Yes new satellites will generate more data, but the data load is understood and very predictable. So one should be able to anticipate all this and have infrastructure in place to deal with it.

Yes, having the ability to run analysis in the cloud directly on the data sitting also in the cloud is useful, especially not having to download TB of data. But these costs can also be significant and they are born by the researcher not NASA.

Another grip is why use AWS alone. The other cloud providers all have similar object storage and compute capabilities. It seems wiser to me to set up the EarthData service such that, different DAACs reside in different clouds. This would he more complex and harder to administer and use but I believe in the long run would lead to better more effective services at a more reasonable price.

Going to the cloud doesn’t have to be a one way endeavor. After using the cloud for a while, NASA should have a better idea of the costs of doing so and at that time understand better what it can and cannot afford to do on its own.

It will be interesting to see what ESA, JAXA, CERN and other big science organizations do as they are all in the same bind, data seems to be growing unbounded.

Picture Credit(s):

The problem with smarter robots

Read an article the other week about how Deepmind (at Google) is approaching the training of robotics using simulation, reinforcement learning, elastic weights, knowledge distillation and progressive learning.

It seems relatively easy to train a robot to handle some task like grabbing or walking. But doing so can take an awfully long time. If you want to try to train a robot to grab something and put it someplace. You can have it start out making some random movements of its arm, wrist and fingers (if they have such things) and then use reinforcement learning to help it improve its movements over time.

But if each grab attempt takes 10 seconds, using reinforcement learning may take 10,000 attempts before it starts to make any significant progress and perhaps another 20,000-50,000 more to get expert at it. Let’s see 60K *10 seconds is 10,000 minutes or ~170 hours. And that’s just one object pick and place. But then maybe you would like to grab different parts and maybe place them in different locations. All these combinations start adding up.

And of course doing 1000s of movements will wear out gears, motors, mechanisms etc. If only this could all be done in electronic simulations. Then assuming the simulations are accurate enough the whole thing could be done in a matter of hours without wearing anything out. Enter robot simulators such as NVIDIA Isaac Sim, OpenAI RoboSchool/PyBullet

But the problems with simulation are …

Simulations are getting more accurate but at some point their accuracy defeats its purpose because the real world is always noisy, windy and not as deterministic as any simulation. One researcher said you could conceivable have a two armed robot be trained to throw all of a cell phones components up into the air and they will all land in their proper places, proper orientations. But in the real world this could never actually happen, or if it did, it could only happen once.

Hurricane Ike - 2008/09/12 - 21:26 UTC by CoreBurn (cc) (from Flickr)
Hurricane Ike – 2008/09/12 – 21:26 UTC by CoreBurn (cc) (from Flickr)

Weather researchers have been dealing with this problem in spades for a long time. There appears to be a fundamental limit to how far in advance we can predict weather and it’s due to the accuracy with which sensors operate and the complexity of feedback loops between the atmosphere, oceans, landforms, etc. So at a fundamental level, simulations can never be completely accurate. But they can be better.

Today’s weather simulations we see on TV/radio use models that average a number of distinct simulations, where sensor information has been slightly and randomly modified. Something similar could be done for robotic simulation environments, to make them more realistic.

But there are other problems with training robots to do lots of tasks.

Forget me not…

AI deep learning and reinforcement learning algorithms are great when charged with learning a single task, but having it learn multiple tasks is much harder to do. Because each task requires its own deep neural network (DNN) and if you train a DNN on one task and then try to train in on a another task, it forgets all the learnings from the original task. Researchers call this catastrophic forgetting.

One way researchers have dealt with this problem is to effectively freeze certain DNN nodes from having their weights changed during subsequent training rounds and leave others flexible or changeable. One can see this when one trains an image recognition DNN to classify different objects by importing a well trained object classifier and freezing all of it’s layers except the top one or two and then training these layers to classify new objects.

This works well but you have effectively changed the DNN to forget the original object classification training and replaced it with a new one. One solution to this approach is to have multiple passes of training, after each one, certain nodes and connections (of importance to that particular task) are selectively frozen. This works well for a limited number of different tasks but over time all nodes become frozen which means that no more learning can take place. Researchers call this approach to the catastrophic forgetting problem elastic weights.

One way to get around the all nodes frozen issue in elastic weights is to have multiple NNs. One which is trained on a specific task and whose weights are frozen and then a DNN that exists alongside this one with it’s own initialized set of weights. But which uses the original DNN as part of the new DNN inputs. This effectively includes and incorporates all the previously learned knowledge into the new, combined DNN. This is called Progressive Neural Networks.

In this fashion one progressive DNN can be sequentially trained on any number of tasks each of which ends up providing input to all subsequent task training activity. Such a progressive network never forgets and can use previously learned knowledge on new tasks.

The problem with progressive DNNs is a proliferation of DNN column. one for each trained task. However there are a couple of approaches to shrinking an ensemble of DNN like progressive training creates into one that is simpler and just as effective. One way is to perturb weights in DNN nodes and see how model prediction accuracy is impacted on all its tasks. If accuracy isn’t impacted that much, then that node and all its connections could be deleted from the model with minimal impact on model accuracy.

Another approach is to use one DNN to train another. Sort of like a teacher-student. This is called Knowledge Distilation. Where one DNN is a large network (the teacher) and a smaller (student) network that is trained to mimic the teacher DNN to achieve similar accuracy. This is done by training the smaller student network to match the predictions/classifications of the larger one.

Google researchers have shown that knowledge distillation works best when the gap in the sizes of the two networks (teacher and student) aren’t that large. They have solved this problem by introducing an intermediate step (called teachers assistent). They train this TA first then use the TA to train the student.

In the above graphic, when using a teacher of size 110 and a student of size 8 the resulting accuracy suffers but if one uses an intermediate DNN, with a size 20 the resultant accuracy of the student is much closer to the teacher..


So with realistic simulation we can train a robot to do any specific task, all using only compute resources. And using progressive DNN training, a robot could conceivably be trained to do any number of tasks. And with appropriate knowledge distillation one can reduce the DNN from progressive training into something much smaller (<10%) than the original DNN.

Want a personal robot that can clean up around your place, do the wash, cook your food and do anything else needed. You know what to do.

For AGI, is reward enough – part 4

Last May, an article came out of DeepMind research titled Reward is enough. It was published in an artificial intelligence journal but PDFs of it are available free of charge.

The article points out that according to DeepMind researchers, using reinforcement learning and an appropriate reward signal is sufficient to attain AGI (artificial general intelligence). We have written about the perils and pitfalls of AGI before (see Existential event risks [-part-0]NVIDIA Triton GMI, a step to far[-part-1]The Myth of AGI [-part-2], and Towards a better AGI – part 3ish. (Sorry, I only started numbering them after part 3ish).

My last post on AGI inclined towards the belief that AGI was not possible without combining deduction, induction and abduction (probabilistic reasoning) together and that any such AGI was a distant dream at best.

Then I read the Reward is Enough article and it implied that they saw a realistic roadmap towards achieving AGI based solely on reward signals and Reinforcement Learning (wikipedia article on Reinforcement Learning ). To read the article was disheartening at best. After the article came out, I made it a hobby to understand everything I could about Reinforcement Learning to understand whether what they are talking is feasible or not.

Reinforcement learning, explained

Let’s just say that the text book, Reinforcement Learning, is not the easiest read I’ve seen. But I gave it a shot and although I’m no where near finished, (lost somewhere in chapter 4), I’ve come away with a better appreciation of reinforcement learning.

The premise of Reinforcement Learning, as I understand it, is to construct a program that performs a sequence of steps based on state or environment the program is working on, records that sequence and tags or values that sequence with a reward signal (i.e., +1 for good job, -1 for bad, etc.). Depending on whether the steps are finite, i.,e, always ends or infinite, never ends, the reward tagging could be cumulative (finite steps) or discounted (infinite steps).

The record of the program’s sequence of steps would include the state or the environment and the next step that was taken. Doing this until the program completes the task or if, infinite, whenever the discounted reward signal is minuscule enough to not matter anymore.

Once you have a log or record of the state, the step taken in that state and the reward for that step you have a policy used to take better steps. Over time, with sufficient state-step-reward sequences, one can build a policy that would work’s very well for the problem at hand.

Reinforcement learning, a chess playing example

Let’s say you want to create a chess playing program using reinforcement learning. If a sequence of moves ends the game, you can tag each move in that sequence with a reward (say +1 for wins, 0 for draws and -1 for losing), perhaps discounted by the number of moves it took to win. The “sequence of steps” would include the game board and the move chosen by the program for that board position.

Figure 2: Comparison with specialized programs. (A) Tournament evaluation of AlphaZero in chess, shogi, and Go in matches against respectively Stockfish, Elmo, and the previously published version of AlphaGo Zero (AG0) that was trained for 3 days. In the top bar, AlphaZero plays white; in the bottom bar AlphaZero plays black. Each bar shows the results from AlphaZero’s perspective: win (‘W’, green), draw (‘D’, grey), loss (‘L’, red). (B) Scalability of AlphaZero with thinking time, compared to Stockfish and Elmo. Stockfish and Elmo always receive full time (3 hours per game plus 15 seconds per move), time for AlphaZero is scaled down as indicated. (C) Extra evaluations of AlphaZero in chess against the most recent version of Stockfish at the time of writing, and against Stockfish with a strong opening book. Extra evaluations of AlphaZero in shogi were carried out against another strong shogi program Aperyqhapaq at full time controls and against Elmo under 2017 CSA world championship time controls (10 minutes per game plus 10 seconds per move). (D) Average result of chess matches starting from different opening positions: either common human positions, or the 2016 TCEC world championship opening positions . Average result of shogi matches starting from common human positions . CSA world
championship games start from the initial board position.

If your policy incorporates enough winning chess move sequences and the program encounters one of these in a game and if move recorded won, select that move, if lost, select another valid move at random. If the program runs across a board position its never seen before, choose a valid move at random.

Do this enough times and you can build a winning white playing chess policy. Doing something similar for black playing program would build a winning black playing chess policy.

The researchers at DeepMind explain their AlphaZero program which plays chess, shogi, and Go in another research article, A general reinforcement learning algorithm that masters chess, shogi and Go through self-play.

Reinforcement learning and AGI

So now what does all that have to do with creating AGI. The premise of the paper is that by using rewards and reinforcement learning, one could program a policy for any domain that one encounters in the world.

For example, using the above chart, if we were to construct reinforcement learning programs that mimicked perception (object classification/detection) abilities, memory ((image/verbal/emotional/?) abilities, motor control abilities, etc. Each subsystem could be trained to solve the arena needed. And over time, if we built up enough of these subsystems one could somehow construct an AGI system of subsystems, that would match human levels of intelligence.

The paper’s main hypothesis is “(Reward is enough) Intelligence, and its associated abilities, can be understood as subserving the maximization of reward by an agent acting in its environment.”

Given where I am today, I agree with the hypothesis. But the crux of the problem is in the details. Yes, for a game of multiple players and where a reward signal of some type can be computed, a reinforcement learning program can be crafted that plays better than any human but this is only because one can create programs that can play that game, one can create programs that understand whether the game is won or lost and use all this to improve the game playing policy over time and game iterations.

Does rewards and reinforcement learning provide a roadmap to AGI

To use reinforcement learning to achieve AGI implies that

  • One can identify all the arenas required for (human) intelligence
  • One can compute a proper reward signal for each arena involved in (human) intelligence,
  • One can programmatically compute appropriate steps to take to solve that arena’s activity,
  • One can save a sequence of state-steps taken to solve that arena’s problem, and
  • One can run sequences of steps enough times to produce a good policy for that arena.

There are a number of potential difficulties in the above. For instance, what’s the state the program operates in.

For a human, which has 500K(?) pressure, pain, cold, & heat sensors throughout the exterior and interior of the body, two eyes, ears, & nostrils, one tongue, two balance sensors, tired, anxious, hunger, sadness, happiness, and pleasure signals, and 600 muscles actuating the position of five fingers/hand, toes/foot, two eyes ears, feet, legs, hands, and arms, one head and torso. Such a “body state, becomes quite complex. Any state that records all this would be quite large. Ok it’s just data, just throw more storage at the problem – my kind of problem.

The compute power to create good policies for each subsystem would also be substantial and in the end determining the correct reward signal would be non-trivial for each and every subsystem. Yet, all it takes is money, time and effort and all this could be accomplished.

So, yes, given all the above creating an AGI, that matches human levels of intelligence, using reinforcement learning techniques and rewards is certainly possible. But given all the state information, action possibilities and reward signals inherent in a human interacting in the world today, any human level AGI, would seem unfeasible in the next year or so.

One item of interest, recent DeepMind researchers have create MuZero which learns how to play Go, Chess, Shogi and Atari games without any pre-programmed knowledge of the games (that is how to play the game, how to determine if the game is won or lost, etc.). It managed to come up with it’s own internal reward signal for each game and determined what the proper moves were for each game. This seemed to combine a deep learning neural network together with reinforcement learning techniques to craft a rewards signal and valid move policies.

Alternatives to full AGI

But who says you need AGI, for something that might be a useful to us. Let’s say you just want to construct an intelligent oracle that understood all human generated knowledge and science and could answer any question posed to it. With the only response capabilities being audio, video, images and text.

Even an intelligent oracle such as the above would need an extremely large state. Such a state would include all human and machine generated information at some point in time. And any reward signal needed to generate a good oracle policy would need to be very sophisticated, it would need to determine whether the oracle’s answer; was good or not. And of course the steps to take to answer a query are uncountable, 1st there’s understanding the query, next searching out and examining every piece of information in the state space for relevance, and finally using all that information to answer to the question.

I’m probably missing a few steps in the above, and it almost makes creating a human level AGI seem easier.

Perhaps the MuZero techniques might have an answer to some or all of the above.


Yes, reinforcement learning is a valid roadmap to achieving AGI, but can it be done today – no. Tomorrow, perhaps.

Photo credit(s):

Math’s war on Gerrymandering

Read an article the other day in MIT Technology Review, Mathematicians are deploying algorithms to stop gerrymandering, that discussed how a bunch of mathematicians had created tools (Python and R applications, links below) which can be used to test whether state redistricting maps are fair or not.

The best introduction to what these applications can do is in a 2021 I. E. Block Community Lecture: Jonathan Christopher Mattingly captured on YouTube video.

For those outside the states, US Congress House of Representatives are elected via state districts. The term gerrymandering was coined in 1812 over a district map created for Massachusetts that took the form of a dragon which all but assured that the district would go Democrat-Republican vs. Federalist in the next election, (source: Wikipedia entry on Gerrymandering). This re-districting process is done every 10 years in every stateafter the US Census Bureau releases their decennial census results which determines the number of representatives that each state elects to the US House of Representatives.

Funny thing about gerrymandering, both the Democrats and the Republicans have done this in the past and will most likely do so in the future. Once district maps are approved they typically stay that way until the next census.

Essentially, what the mathematicians have done is create a way togenerate a vast number of district maps, under a specific series of constraints/guidelines and then can use these series of maps to characterize the democrat-republican split of a trial election based on some recent election result.

One can see in the above show the # of Democrats that would have been elected using the data from the 2012 and 2016 election results under four distinct maps vs a histogram representing the distribution of all the maps the mathematician’s systems created. The four specific maps indicated on the histograms are.

  • NC2012- thrown out by the state courts as being unfair
  • NC2016– one that NC legislature came out with also thrown out as being unfair as it’s equivalent to the NC2012 map
  • Bipartisan Judges – one that a group of independent judges came out with, and
  • Remedial=NC2020 – one submitted by the mathematicians which they deemed “fairer”
These North Carolina congressional district maps illustrate how geometry is not a fail-safe indicator of gerrymandering. The NC 2012 map, with its bizarre district boundaries, was deemed by the courts to be a racial gerrymander. The replacement, the NC 2016 map, looks quite different and tame by comparison, but was deemed to be an unconstitutional political gerrymander. Analysis by Duke’s Jonathan Mattingly and his team showed that the 2012 and 2016 maps were politically equivalent in their partisan outcomes. A court-appointed expert drew the NC 2020 map.

The algorithms (in Python, GitHub repo for GerryChain and R, GitHub repo for redist) take as input an election map which is a combined US census blocks (physical groupings of population defined by US Census Bureau) and some prior election results (that provide the democrat-republican votes for a particular election within those census blocks). And using this data and a list of districting constraints, such as, compactness requirements, minimal breaks of counties (state political units), population equivalence, etc. and using these inputs, the apps generate a multitude or ensemble of district maps for a state. FYI, districting constraints differ from state to state.

Once they have this ensemble of district maps that adhere to the states specified districting constraints one can compare any sample districting map to the histogram and see if a specific map is fair or not. “Fairness” means that it would result in the same Democrat-Republican split that occurred from the highest number of ensemble maps.

The latest process by which districting maps can be created is documented in a research article, Recombination: A Family of Markov Chains for Redistricting. But most of the prior generations all seem to use a tree structure together with a markov chain approach. At the leaves of the tree are the census blocks and the tree hierarchy algorithmically represents the different districting hierarchies.

Presumably, a Markov Chain encodes a method to represent the state’s districting constraints. And what the algorithm does is traverse the tree of census blocks, using the markov chain and randomness, to create district maps by splitting a branch of the tree (=district) off somewhere in the hierarchy, above the census blocks.

Doing this randomly, over a number of iterations, provides a group or ensemble of proper districting maps that can be used to build the histogram for a specific election result. (Suggest reading the above research report for more information on how this works).

One can see the effect of different election results on the distribution of election results that would have occurred with the current ensemble of maps. For example, USH12 uses the census block voting results from the North Carolina, US (Congressional) House election of 2012 and the GOV16 uses the results from the North Carolina Governor election of 2016

It’s somewhat surprising that the US Supreme Court has ruled that districting is a states issue and not subject to constitutional oversight. Not sure I agree but I’m no constitutional scholar/lawyer. So, all of the legal disputes surrounding state’s re-districting maps have been accomplished in state courts.

But what the mathematicians have done is provide the tools needed to create a multitude of districting maps and when one uses prior election results at the census block level, one can see whether any new re-districting map is representative of what one would see if one drew 100s or 1000s of proper redistricting maps.

Let’s hope this all leads to fairer state and federal elections in the future.


Photo Credits: