BrianA Posted March 27, 2023 (edited) I hear what you're saying Guest, but I'm still not convinced that AI-based simulation methods can't progress significantly further, even if we lack all the myriad unknown low-level details. Why? Are you familiar with the concept of "The Bitter Lesson": http://www.incompleteideas.net/IncIdeas/BitterLesson.html Prior to the invention of deep learning AI techniques, in particular the Transformer architecture in 2017 (from Google ironically, which in a classic Innovator's Dilemma seems to be fumbling badly at making use of its own invention), many AI scientists believed that in order to ever get real intelligence from a machine we needed to understand all the myriad low-level details of how minds work, and would painstakingly spend their careers doing "feature engineering," trying to cobble together brittle, human-designed pieces of software, each handling one tiny task of the brain. Many, many years of scientists' work went into trying to hand-design systems for intelligence. Then Transformers and a bunch of FLOPs and data came along, you essentially mix it all together, and voila, you get intelligence. Brute force essentially wins, rather than clever, perfectly specified low-level simulations. A similar thing has already been accomplished in the bio world with AlphaFold 2. No human had to give it all the low-level details of how every single protein should fold. Yes, it had plenty of example data to train on, but from that it was eventually able to generalize, using some internal algorithm it learned, to predict how novel amino acid sequences will fold. So why not extend this to training an AI to simulate how an entire cell would behave when given novel substances we want to test on a virtual cell? With enough data and training compute, I believe the AI could similarly generalize some internal model of cells.
I would not be surprised if people out there are experimenting with this idea already. The main question, I would guess, is whether we have enough training data for it to generalize from; I'm not clear on that. You'd essentially feed the AI some kind of tokenized inputs representing molecular substances being introduced into a cell, and then train it to minimize its loss on the predicted behavior of the cell afterward. I don't know if we currently have enough of that kind of data; perhaps we do. Again, it doesn't have to come anywhere near covering all potential inputs or behaviors of cells, just enough to let the AI generalize. Edited March 27, 2023 by BrianA
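To make the idea concrete, here's a minimal, purely illustrative Python sketch of what "feed tokenized molecular inputs, minimize loss on predicted cell behavior" could look like. Everything here - the feature vectors, the hidden rule, the binary "apoptosis yes/no" outcome - is invented for illustration; real biology would need far richer representations and data:

```python
import math
import random

random.seed(0)

N_FEATURES = 8  # stand-in for a tokenized/fingerprinted molecular input

def make_example():
    # Fake "molecular fingerprint" plus a fake binary cell outcome generated
    # from a hidden rule the model has to recover from data alone.
    x = [random.random() for _ in range(N_FEATURES)]
    y = 1.0 if sum(x[:4]) > sum(x[4:]) else 0.0
    return x, y

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

def loss(w, b, data):
    # Cross-entropy loss on the predicted cell outcomes.
    eps = 1e-9
    return -sum(y * math.log(predict(w, b, x) + eps) +
                (1 - y) * math.log(1 - predict(w, b, x) + eps)
                for x, y in data) / len(data)

data = [make_example() for _ in range(200)]
w, b, lr = [0.0] * N_FEATURES, 0.0, 0.5

loss_before = loss(w, b, data)
for _ in range(300):  # plain batch gradient descent
    grad_w, grad_b = [0.0] * N_FEATURES, 0.0
    for x, y in data:
        err = predict(w, b, x) - y
        grad_b += err
        for i in range(N_FEATURES):
            grad_w[i] += err * x[i]
    b -= lr * grad_b / len(data)
    w = [wi - lr * gw / len(data) for wi, gw in zip(w, grad_w)]
loss_after = loss(w, b, data)

print(loss_after < loss_before)  # the model fit the toy "cell response" data
```

The point of the sketch is only the shape of the setup: inputs in, outcome labels out, loss minimized. Whether real cellular data would ever support this kind of generalization is exactly the open question in this thread.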
Guest Posted March 27, 2023 With all due respect: you would need to make a good argument that knowledge of the tens of thousands of different possible interactions within cells and between cells - the vast majority of which are not characterized or even known - is not needed to make a good prediction about the effects and side effects of a therapy. How is that equivalent to LLMs producing intelligent-seeming results, where some (not all) researchers were expecting that much more structured models would be needed? What do you mean by, quote: "training an AI to simulate how an entire cell would behave when given novel substances"? What does this even mean, if there are more unknowns about the internal workings of a cell than known interactions - any one of which is potentially able to interact with your therapy (or indirectly with upstream or downstream factors involved in it)? There is an entire field of post-translational modifications that was only recently understood to play a major role - O-GlcNAcylation - i.e. activating and de-activating proteins, similar to phosphorylation. That's not some minute detail you can gloss over; it can fundamentally alter any effects and side effects of your molecule. The fact that we are still discovering new major players in the inner workings of cells doesn't make it sound very plausible to just "simulate your way out of the details".
Dean Pomerleau Posted March 27, 2023 Author I'm with @Guest. Simulate this!? http://biochemical-pathways.com/#/map/1
BrianA Posted March 28, 2023 Report Share Posted March 28, 2023 I'm just spitballin' here, remember I'm not a bio expert at all. That being said, all I'm doing is trying to point out what might possibly be some similarities between AlphaFold 2 and then attempting to apply that process to other biological structures or processes. You're describing all the fine details and exact-level simulation bits, but I pointed out it might not be necessary to do that (the bitter lesson) if instead you can just dump a crap ton of data and compute into a Transformer network. Take a look at what exact data AlphaFold 2 was trained on, a quick google I found this: https://daleonai.com/how-alphafold-works but probably we should go look at the formal paper I assume DeepMind must have released at some point. But it looks like they had 2 kinds of data to train on: some properly labeled protein folding examples, and then a bunch more unlabeled data where no one had yet figured out how those protein sequences would fold. One of the main innovations appears to be that DM found a way to make use of that unlabeled data, turn it into embeddings and feed that into the AI training run. For attempting to do this with an entire cell, first you'd have to define what is going to be the loss function you're aiming to minimize/predict. Here I'm not suggesting the AI would be attempting to simulate or predict all those low level details, but more like could it predict some higher level but clinically relevant resulting effects on the cell? Like apoptosis, or improved/worsened mitochondrial function, etc. It would have to be cellular effects we have enough labeled data for, taken from thousands of studies. 
If you could compile a large dataset of labeled "cellular inputs" -> "cellular results" (similar in concept to AlphaFold 2's labeled data of input protein sequences -> folded protein shapes), then to replicate the AlphaFold 2 method you would need to combine it with an even larger unlabeled dataset containing a bunch of other "cellular inputs" for which we lack labeled data on how those substances affect a cell. Some method similar to AlphaFold 2's Multiple Sequence Alignment would need to be invented there to let you embed these unlabeled examples into the same embedding space as the labeled data. If that could be done (maybe by comparing the molecular shape or other characteristics of these input molecules), then you might be able to get an AI model that can predict something like "if you feed this shaped/charged/etc. molecule into a cell, it will likely have this particular set of resulting effects". The AI would not help one iota in understanding all the low-level details you both bring up, but it might still be able to predict resulting effects.
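As a toy illustration of the "shared embedding space" idea above: put labeled and unlabeled "molecules" into one feature space (here, invented descriptors like molecular weight and charge), then let unlabeled examples borrow predictions from their nearest labeled neighbors. All compound names, numbers, and effects below are made up; a real system would use learned embeddings, not two hand-picked descriptors:

```python
import math

labeled = {
    # name: ((molecular_weight, charge), observed cellular effect) - all invented
    "cmpd_A": ((180.0,  0.0), "apoptosis"),
    "cmpd_B": ((179.0,  0.1), "apoptosis"),
    "cmpd_C": ((500.0, -2.0), "improved_mito_function"),
}
unlabeled = {"cmpd_X": (181.0, 0.05), "cmpd_Y": (498.0, -1.9)}

def embed(features, scale=(100.0, 1.0)):
    # Crude normalization so both descriptors get comparable influence.
    return tuple(f / s for f, s in zip(features, scale))

def nearest_label(query):
    # Predict the effect of an unlabeled compound from its nearest
    # labeled neighbor in the shared embedding space.
    q = embed(query)
    best, best_d = None, float("inf")
    for name, (feats, effect) in labeled.items():
        d = math.dist(q, embed(feats))
        if d < best_d:
            best, best_d = effect, d
    return best

predictions = {name: nearest_label(feats) for name, feats in unlabeled.items()}
print(predictions)  # cmpd_X lands near A/B, cmpd_Y near C
```

This is obviously nothing like AlphaFold 2's actual machinery; it only shows why a shared embedding space lets unlabeled data inherit structure from labeled data at all.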
BrianA Posted March 28, 2023 Report Share Posted March 28, 2023 In other news, Richard Ngo of OpenAI predicts by 2025 AIs will have self awareness and be better planners than the best humans: Quote Link to comment Share on other sites More sharing options...
BrianA Posted March 28, 2023 Report Share Posted March 28, 2023 Another Transformer breakthrough, now in fine motor control for robot arms, with only 15 minutes of human examples prior to these videos. Link to their research site down lower in this thread: Quote Link to comment Share on other sites More sharing options...
Dean Pomerleau Posted March 28, 2023 Author That's crazy good manipulation! It clearly demonstrates how techniques from one branch of AI (machine learning) are really starting to catalyze other areas, like robotics. Exponential progress is starting to look a lot less like hype... Scary, IMO.
Guest Posted March 28, 2023 Report Share Posted March 28, 2023 19 hours ago, BrianA said: I hear what you're saying Guest, but I'm still not convinced that AI-based simulation methods can't progress significantly further, even if we lack all the myriad unknown low-level details. Why? Are you familiar with the concept of "The Bitter Lesson": http://www.incompleteideas.net/IncIdeas/BitterLesson.html Prior to the invention of deep learning AI techniques, in particular the Transformers architecture in 2017 (from Google ironically, who in a classic Innovator's Dilemma seems to be fumbling badly on making use of their invention), many AI scientists believed in order to ever get real intelligence from a machine we needed to understand all the myriad low-level details of how minds work, painstakingly spend their careers doing "feature engineering" to try and cobble together in human-designed brittle pieces of software to do one tiny task of the brain. Many many years of work from scientists went into trying to hand-design systems for intelligence. Then Transformers and a bunch of FLOPs and data come along and you just essentially mix it all together and voila you get intelligence. Brute force essentially wins, rather than clever low level perfectly specified simulations. A similar thing has already been accomplished in the bio world with AlphaFold 2. No human had to give it all the low level details on how every single protein should fold. Yes, it had plenty of example data to train on, but from that it was able to eventually generalize using some internal algorithm it learned that gives it the ability to then predict how novel amino acid sequences will fold. So why not extend this into training an AI to simulate how an entire cell would behave when given novel substances we want to test on a virtual cell? With enough data and training compute, I believe the AI could similarly generalize some internal model of cells. 
I would not be surprised if people out there are experimenting with this idea already. The main question I would guess is do we have enough training data for it to generalize from, I'm not clear on that. You'd essentially feed the AI some kind of tokenized inputs representing molecular substances being input into a cell, and then train it to minimize its loss on the predicted behavior of cells afterward. I don't know if we currently have enough of that kind of data, perhaps we do - again, it doesn't have to be anywhere near covering all potential inputs or behaviors of cells, just enough to let the AI generalize. But - again - we don't have that data. There is as little understanding of "input-output" data for the various cell types in your body as is for the inner workings of the cell. We are still discovering new fundamental principles like extra-cellular vesicles to cross-talk between cells in the last years - i.e. we now know, that this is a thing. But don't know almost any details of how much, what composition or what circumstances this is happening. And based on protein transcription we are fairly sure (those proteins have to go somehwere, doing something - we just don't know where and what), there is a lot more that we have no idea about. Quote Link to comment Share on other sites More sharing options...
Guest Posted March 28, 2023 Report Share Posted March 28, 2023 Just to give a fairly representative example of what kind of mess we're still in: There was an in-vitro study (isolated cells in a dish) in 2020, where cells exposed to glucosamine where infected with viruses. As a result the virus replication was amplified (possibly as a side effect of increasing autophagy). Conclusion of the autors: beware of glucosamine. There was a follow up study in 2021, where actual living mice (in-vivo), where supplemented with glucosamine and infected with various viruses. The result: a drastic survival advantage of mice receiving glucosamine. Why? It turned out, that glucosamine can by-pass the rate limiting step in the so called hexosamine biosynthetic pathway (there is a limited supply of a certain enzyme, that converts glucose to glucosamine). This in turn increases O-GlcNAcylation of certain proteins (so activating or de-activating them). There is a certain class of proteins called "mitochondrial anti-viral signaling proteins" (MAVS) - i.e. if a cell is infected, it can release a signaling cascade to tell immune cells "clean me up - I'm a risk for the organism". Those MAVS can be amplified if sufficient O-GlcNAcylation is available. Consequently the net-effect of glucosamine in-vivo was the complete opposite of what the in-vitro study initially suggested. There is no way for you to figure that out with a simulation that doesn't consider the inner workings of a cell AND the external interaction of the cell with other cells and tissues. BOTH of which we have very incomplete knowledge of. Quote Link to comment Share on other sites More sharing options...
BrianA Posted March 28, 2023 Report Share Posted March 28, 2023 (edited) Thank you Guest for sharing that example, clearly you're pretty deep into this area and know a lot more than myself when it comes to bio. It's good to hold our feet to the fire of reality when we speculate on what might be coming in the future. I'm coming at this from the computational side, where to me it still appears The Bitter Lesson that already happened in areas such as computer vision, linguistics, "intelligence" now with these latest LLMs may be happening in bio too. In each case there were an army of I guess what we might call traditionalists who said the only way forward to ultimately achieve these goals was with endless years of low level detailed research, and then careful replication in software via human designed hand-coded techniques. Then we got to the point where we had juuust enough data and compute, and the right deep learning algos, and suddenly the software can then train itself to accomplish these feats. I've listened to many podcasts at this point where people in these fields when asked why we didn't have the deep learning revolution sooner say it was due to simply not having enough data and compute. My hypothesis is the same lesson is now playing out in the bio realm. You may be correct that at the moment we don't have enough data in all the various bio-related datasets out there, but that data is growing rapidly. At some point it will be enough, just like it was for those other areas, and then someone will be able to train an AI to fairly accurately predict the various behaviors of a cell. Let me list off some interesting papers I ran across today while just reading things: Evolutionary-scale prediction of atomic level protein structure with a language model https://www.biorxiv.org/content/10.1101/2022.07.20.500902v3 This is a "pure" (no MSA required) LLM approach to protein structure prediction from Meta a few months ago. 
It does as well as AlphaFold 2 on a benchmark, is up to two orders of magnitude faster, and in some cases can predict the structure of completely novel proteins that AlphaFold 2 struggles with. They simply feed in protein sequence text, and what comes out is the structure. The LLM internally invented its own prediction system and structural representations - no one knows exactly how, it just did it. They point out the capabilities smoothly increase as more network parameters and compute are added. quote: "We posit that the task of filling in missing amino acids in protein sequences across evolution will require a language model to learn something about the underlying structure that creates the patterns in the sequences. As the representational capacity of the language model and the diversity of protein sequences seen in its training increase, we expect that deep information about the biological properties of the protein sequences could emerge, since those properties give rise to the patterns that are observed in the sequences. ... Although the training objective itself is simple and unsupervised, performing well on this task over millions of evolutionarily diverse protein sequences requires the model to internalize sequence patterns across evolution. ... Evolutionary scale models are also shown to perform unsupervised prediction of mutational effects (54, 55), and have recently been used in state-of-the-art applications, for example to predict the path of viral evolution (56, 57), and the clinical significance of gene variants (58)."
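The "filling in missing amino acids" objective that quote describes can be sketched in miniature. Here the "model" is just bigram statistics over a tiny invented corpus - a crude stand-in for what a real protein language model learns at scale, shown only to make the masked-prediction training objective concrete:

```python
from collections import Counter, defaultdict

# Tiny invented corpus of similar protein sequences (not real proteins).
corpus = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQR", "MKTAYIAKHR"]

# "Train": count which residue tends to follow which, across the corpus.
following = defaultdict(Counter)
for seq in corpus:
    for prev, cur in zip(seq, seq[1:]):
        following[prev][cur] += 1

def fill_mask(seq, pos):
    # Predict the hidden residue from the residue just before it -
    # the same fill-in-the-blank task, with a vastly dumber model.
    prev = seq[pos - 1]
    return following[prev].most_common(1)[0][0]

masked = "MKTAY_AKQR"  # one amino acid hidden
guess = fill_mask(masked, masked.index("_"))
print(guess)  # "I" - the residue most often seen after "Y" in the corpus
```

A real model has to do this over millions of evolutionarily diverse sequences, which (per the paper) is what forces it to internalize structure rather than surface statistics.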
The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2 Transformer models trained on DNA sequence inputs and some labeled examples where we know what portions of the DNA do, that then learn to predict the molecular phenotype of other DNA sequences. quote: "Evolution informs representations: models trained on genomes from different species top charts. We explored the impact of different datasets to pre-train equally sized transformer models. Both intra- (i.e. when training on multiple genomes of a single species) and inter-species (i.e., on genomes across different species) variability play an important factor driving accuracy across tasks (Fig. 2). Notably, the models trained on genomes coming from different species perform well on categorical human genomics downstream tasks, as well as on human variant prediction, even when compared to models trained exclusively on the human genome (Fig. 2). This could indicate that the genome LMs capture a signal of evolution so fundamental across species that it better generalises to shared functions. ... Genomics insights are captured during training despite no supervision. The Nucleotide Transformer models learned insights about key regulatory genomic elements through attention, as demonstrated through the analysis of attention maps, embedding spaces, and probability distributions. Elements such as enhancers and promoters were detected by all models, and at several heads and layers. We also observed that each model contained at least one layer that produced embeddings which clearly separated five of the genomic elements analyzed. As self-supervised training allowed for the detection of these elements, we expect that this approach can be leveraged to unravel new elements and effects in the future."
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences https://www.pnas.org/doi/10.1073/pnas.2016239118 An unsupervised Transformers based model that outperforms older supervised learning approaches. quote: "One of the goals for artificial intelligence in biology could be the creation of controllable predictive and generative models that can read and generate biology in its native language. Accordingly, research will be necessary into methods that can learn intrinsic biological properties directly from protein sequences, which can be transferred to prediction and generation. ... The space of representations learned from sequences by high-capacity networks reflects biological structure at multiple levels, including that of amino acids, proteins, and evolutionary homology. Information about secondary and tertiary structure is internalized and represented within the network. Knowledge of intrinsic biological properties emerges without supervision—no learning signal other than sequences is given during pretraining. We find that networks that have been trained across evolutionary data generalize: information can be extracted from representations by linear projections, deep neural networks, or by adapting the model using supervision. Fine-tuning produces results that match state of the art on variant activity prediction. Predictions are made directly from the sequence, using features that have been automatically learned by the language model rather than selected by domain knowledge. We find that pretraining discovers information that is not present in current state-of-the-art features. ... Combining high-capacity generative models with gene synthesis and high throughput characterization can enable generative biology. The models we have trained can be used to generate new sequences (79). 
If neural networks can transfer knowledge learned from protein sequences to design functional proteins, this could be coupled with predictive models to jointly generate and optimize sequences for desired functions. The size of current sequence data and its projected growth point toward the possibility of a general purpose generative model that can condense the totality of sequence statistics, internalizing and integrating fundamental chemical and biological concepts including structure, function, activity, localization, binding, and dynamics, to generate new sequences that have not been seen before in nature but that are biologically active." There are a large number of possibly interesting papers here: https://github.com/OmicsML/awesome-deep-learning-single-cell-papers Some from there: Predicting Cellular Responses with Variational Causal Inference and Refined Relational Information https://openreview.net/forum?id=ICYasJBlZNs Disease state prediction from single-cell data using graph attention networks https://dl.acm.org/doi/10.1145/3368555.3384449 Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution https://arxiv.org/pdf/2204.13545.pdf Learning interpretable cellular responses to complex perturbations in high-throughput screens https://www.biorxiv.org/content/10.1101/2021.04.14.439903v2 Well that was my long-winded way of trying to show some examples where new insights are increasingly being generated in the bio area from Transformers and related neural network technologies. It appears to me there is a clear trend that as more data and compute in this area of science become available, these techniques will become better and better. One thing I really want to focus on is how these LLMs and other models can learn and infer things that are not directly in the ground truth training data. This is because as they receive more data and compute, they are able to leverage that into deeper, more generalized understanding of reality. 
It's like Sherlock Holmes - if you eliminate all other possibilities, that which remains must be the truth. More data and compute allow these models to weed out failed predictive techniques and end up with ones whose outputs are closer to reality, even if they have to infer from the missing "holes" in the data what "must" be happening. This is how LLMs develop unexpected new capabilities. It's how the phenomenon of "grokking" (yes, that's a technical term now in AI) happens, where at some point during training the loss suddenly drops. The AI at that point in training literally had a breakthrough in its world-modeling capability. So - do we have enough data now to make an LLM "cell" AI? I don't know. Are we on the path to eventually having one? I think so. I'll leave you with this video from yesterday where, 6 minutes in, the Chief Scientist of OpenAI, Ilya Sutskever, says he expects these techniques will take us beyond human-level AGI, despite the fact the data we have to train it on is "only" human-level (although some data we have comes from superhuman software like the Stockfish chess playing program). He believes this will happen because the AGI will be smart enough to simply extrapolate (aka "invent new science", I would argue) things that we are currently missing from our world understanding: Edited March 28, 2023 by BrianA
Guest Posted March 28, 2023 (edited) I'm not saying that there can't be any value derived from AI for the bio-sciences. I'm saying that it can't really address the bottlenecks in creating therapies in ways that would be a game changer. That would ideally mean you could go from computer simulation straight to market, where potentially millions of people could take the therapy in the first month without any previous data in living humans. The by far (by an enormous margin) most costly and most time-consuming step is the human clinical trials. But even mouse trials can easily cost hundreds of thousands of USD. Both have high failure rates - i.e. in-vitro results or similar don't work out because of any of the tens of thousands of interactions (most of which are not characterized or even known). It's a moot point saying that eventually we will have enough data to permit AI to do proper simulation. Sure - as long as AI isn't wiping out the human race, there are centuries to gather the required information. But given the state of bioscience at the moment and for the foreseeable future (decades), this is just not going to be the case. Certainly not to the extent that it allows skipping over animal data or human pre-clinical trials, and definitely not clinical trials. AI can generate ideas for new targets, sure. But it can't simulate the potential interactions to figure out if something will work in living organisms - because most of those interactions are unknown. See the glucosamine example, where it may in one way increase viral replication, but in another way aid the body in combating viral infections (with the latter dominating in a living animal). Both results were based on previously unknown interactions.
Just to outline the example again: glucosamine enters a cell => it can enter the HBP, thereby skipping over the rate-limiting step in the HBP => after 4 more steps, the end result of the HBP is a molecule that can activate or de-activate proteins => following a process we don't know much about at the moment, this molecule targets MAVS exactly when needed and amplifies their action => MAVS induce a signaling cascade that results in external signaling. This is NOT a complicated process by the standards of human metabolism (and one step is still badly characterized!). But you would need quite a complete picture of the inner workings of the cell to derive this conclusion - i.e. to answer "What happens if I add glucosamine to an infected cell?". And even that's not a complete picture. There are 3 more pathways (that we know of) that glucosamine interacts with inside the cell (hence the initial in-vitro result that it may amplify viral replication). Without the in-vivo studies it would be anyone's guess which is dominating and what the end result is. This also holds true for interactions between cells, which are difficult to study in-vitro. Edited March 28, 2023 by Guest
BrianA Posted March 29, 2023 Interesting. Let me ask, because I believe you mentioned previously that you feel one of the major bottlenecks is funding: is there an equivalent in bio funding to how things work with software? Typically in software, some initial seed development is done to build a product to the "Minimum Viable Product" (MVP) stage, where you have barely enough working features to do some test marketing and see if customers will pay for it (test product/market fit). If it gets traction, then often you can take that to VCs and maybe get Series A funding. What I'm imagining is in bio, if we get some AI tools that aren't perfect by a long shot, but good enough to generate some predicted new drugs or other inputs to cells, and perhaps other AI tools that could then make rough (again, quite possibly wrong) predictions of how those drugs might affect the cell in desirable ways - would that be enough to trigger more funding? To get more projects at least to the point of testing them in vitro or in mice more quickly than current methods? Here's a company that seems to be hyping this approach; what do you think, is this useful or hot air? Harnessing Multi-Omic Data with Deep Learning An interview with Jonathan Baptista, Co-Founder & CEO, DeepLife https://www.tetrascience.com/blog/harnessing-multi-omic-data-with-deep-learning "DeepLife is a next-generation systems biology company harnessing omics data and the power of Deep Learning to model cells and efficiently engineer their behavior, accurately predicting cell reactions to various perturbations. ... the proliferation of multi-omic data has made it possible to infer missing parts of a comprehensive picture of an unhealthy cell."
Saul Posted March 29, 2023 8 hours ago, BrianA said: Here's a company that seems to be hyping this approach, what do you think, is this useful or hot air? [...] Looks very interesting. I'd read about the protein folding predictor, AlphaFold, previously - my understanding is that it's already being used extensively by researchers. Whether or not DeepLife is at least partly successful, IMO it seems likely that it and/or similar AI deep learning systems will prove invaluable for studying biological processes, probably with applications to human health. I'm not one of those who fear AI or "the singularity"; there have always been individuals (and groups) who feel afraid of, or even threatened by, progress. -- Saul
Guest Posted March 30, 2023 19 hours ago, BrianA said: What I'm imagining is in bio, if we get some AI tools that aren't perfect by a longshot, but if they get good enough to generate some predicted new drugs or other inputs to cells [...] would that be enough to trigger more funding? What you are describing is an in-vitro study. You're looking for a target in the cell or at the surface of a cell. You're asking yourself: "Would inhibiting or amplifying that target (directly or indirectly) lead to a good result, based on known mechanisms?" At the next stage you have to find a molecule that is capable of just that. If you have a blueprint, you can generally synthesize it with modern biotech quite easily.
However: the vast majority of drugs that pass the in-vitro stage eventually fail in the subsequent trials (because of lack of effect in animal models, lack of effect in humans, or too-problematic side effects). It would be a win (although not a game changer) if modern tools could better screen whether a therapy is likely to fail in in-vivo (living organism) trials. But here you're already running into the problem that we have insufficient knowledge of human metabolism to do a simulation of that. Just missing or incorrectly interpreting a tiny piece in a chain of metabolic reactions (see glucosamine) can and does lead to wrong predictions. It's not like pharma companies aren't doing extensive research on the known interactions of a therapy before they commit to a 100 million USD clinical trial. Still, there is a high failure rate. My guess is that "organs on a chip" and organoids will be able to more efficiently screen drugs before they make it into animal studies. But I don't see a convincing argument that AI will be able to do that just by simulation. It can help at the in-vitro stage in identifying potential targets and molecules to go after a target. That's the first, cheapest, and shortest stage. But I don't see how it could help much with the vast lack of knowledge of human metabolism in the next 20-30 years - let alone get regulatory approval to skip human pre-clinical trials. And let's not even talk about the full clinical trial. Not going to happen in the next 30 years. And I'm not a luddite. I would love AI (or anything, really) to be a game changer. But I'm sufficiently realistic not to buy into the singularity hype ("The world will be positively upended by 2040") in the field of bio-medicine.
Guest Posted March 30, 2023

To put it in computer science terms (not literally; there is no direct analogy, so see it more as a metaphor): imagine you know the general functioning of a programming language: declarations, syntax, commands, etc. Now you're confronted with an enormous system programmed in that language. Dozens of modules, each comprising thousands of lines of code, constantly executing parts of that code based on module-internal and external regulators, and constantly passing hundreds of commands to each other. You know, on average, 35% of the lines of code to some precision. But there are entire routines (e.g. a routine requesting input from other modules, where you have no idea this is even possible) not covered by that 35%. And you can observe about 35% of the commands exchanged. Again, there are certain classes of commands that you have no idea even exist. You want to alter/improve that program by inputting some code. But you have to hand this code over to a module that will decide what to do with it. Maybe there is only one place the module will allow it to be implemented. Congrats, that's a "clean" drug. It can still be useless because of interactions with other modules. Or maybe, as is the case for glucosamine, there are several ways the module can decide to implement it. And because of the limitations outlined above, it's very difficult to track what the code ends up doing. You know the final outcome of the system: some nice animation that messages you "everything is fine". Or the animation says "it's going pretty bad". And maybe, but not always, it gives you a list of modules that are particularly affected. That's the state of bioscience.
BrianA Posted March 30, 2023 (edited)

Interesting. Well, I can say as a programmer that if I were faced with such a situation, I'd definitely fall back on automated methods as much as I could. Attempting to understand the entire system would probably be impossible without the aid of some type of AI that could increase my own capabilities or act as a guide or assistant; there's simply too much there for a typical human mind to hold and reason about. I guess this is why so much of science is now specialized to the nth degree: too much to fit in one human brain. Your analogy reminds me of a technique used in cybersecurity called fuzzing: https://owasp.org/www-community/Fuzzing This is where you're dealing with a black-box piece of software, typically compiled software or an entire system or network where you don't have the source code or any map of how it works. So you use automated tools called fuzzers that start injecting random inputs into the software or system to see what happens. Sometimes you crash the software, so you've found a bug. Other times you get it to do something, so you've discovered part of how to control it. These and other methods can be used to deobfuscate, decompile, and reverse engineer closed-source software. Interestingly, from a quick Google search I find AI is now assisting in software reverse engineering/decompiling: How AI can help reverse-engineer malware: Predicting function names of code https://www.theregister.com/2022/03/26/machine_learning_malware/ When it comes to bio, companies are clearly starting to use AI and related automated techniques to figure out the remaining parts of the bio black box. The limitation seems, as you say, to be how much data still needs to be gathered, and the costs and time required (plus regulatory requirements, but I'm more interested in the AI question).
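A minimal sketch of the fuzzing loop described above, with a toy stand-in for the black box (the two trigger conditions are invented purely so the demo finds something; real fuzzers like AFL are far more sophisticated):

```python
import random

random.seed(42)

def black_box(data: bytes) -> str:
    """Stand-in for closed-source software we can't inspect."""
    if data and data[0] == 0xFF:
        raise RuntimeError("segfault")   # a "crash": we found a bug
    if len(data) >= 2 and data[0] == data[1]:
        return "hidden command mode"     # we found a way to control it
    return "ok"

def fuzz(trials: int = 10_000):
    crashes, discoveries = [], []
    for _ in range(trials):
        # Inject random inputs and just observe what happens.
        payload = bytes(random.getrandbits(8)
                        for _ in range(random.randint(1, 8)))
        try:
            if black_box(payload) != "ok":
                discoveries.append(payload)
        except RuntimeError:
            crashes.append(payload)
    return crashes, discoveries

crashes, discoveries = fuzz()
print(f"crashing inputs: {len(crashes)}, "
      f"control-revealing inputs: {len(discoveries)}")
```

With purely random inputs the fuzzer stumbles onto both the "bug" and the "hidden behavior" dozens of times in ten thousand tries, which is the whole point: no understanding of the internals required, just volume.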
I remain curious, as I've said previously, about just how much can be inferred by AI simply by training it on the current pile of data we have. Again I'll point to the short clip below (click it to go to Twitter; it's about a one-minute video) from Ilya Sutskever (taken from that longer video I posted earlier), and his hypothesis (and mine) that AI can learn things far beyond the data tokens that are given to it. To me, going back to your glucosamine example, the question isn't necessarily "how many more low-level unknown features of cells remain to be discovered before we could simulate everything?", but rather "what fraction of the overall features and behavior of cells would be just enough to let a really smart AI infer the rest of the missing bits?". In other words, I don't think we need 100% of the jigsaw puzzle pieces for an AI to infer what's likely on the missing pieces, and the smarter the AI, the more of the puzzle it could infer. This leads me to the conclusion that as AIs get smarter due to the scaling laws currently playing out (parameter counts, compute, training dataset sizes), the total dataset size needed for a "useful" cell prediction system should shrink. At some point in the near future, these two trends, increasing AI smartness and increasing bio dataset sizes, should reach an intersection point that enables this. Probably no single person knows exactly where that crossover is, and I'd predict we'll see initial attempts, like OpenAI's GPT-1 and all the other early AI attempts, that at first don't appear to be that great. But the scaling trends get us there eventually.

Edited March 30, 2023 by BrianA
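My crossover argument can be put in toy numbers (every figure below is invented purely for illustration; nothing here is a real measurement or forecast):

```python
def data_required(year: int) -> float:
    # Assume (hypothetically) that smarter models need 30% less bio
    # data each year as scaling improves sample efficiency.
    return 1e12 * 0.7 ** (year - 2023)

def data_available(year: int) -> float:
    # Assume (hypothetically) that bio datasets grow 40% per year.
    return 1e9 * 1.4 ** (year - 2023)

def crossover_year(start: int = 2023, horizon: int = 60):
    # First year in which available data meets what a model would need.
    for year in range(start, start + horizon):
        if data_available(year) >= data_required(year):
            return year
    return None

print(crossover_year())
```

Under these made-up rates the two curves cross after about a decade, despite starting a factor of 1,000 apart. The real point is just that two exponentials moving toward each other meet surprisingly quickly, and that the answer is very sensitive to both assumed rates, which is why nobody knows where the actual crossover is.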
BrianA Posted March 30, 2023

12 hours ago, Saul said:

I'm not one of those who fear AI or "the singularity"; there have always been individuals (and groups) who feel afraid of, or even threatened by, progress.

Sure, I'm also a tech accelerationist 99% of the time. However, I do feel AGI in particular is a special case. It's a new species we're on the way to creating (many, many species, probably; think punctuated equilibrium), one expected to result in entities smarter than us. A paper from yesterday: Natural Selection Favors AIs over Humans https://arxiv.org/abs/2303.16200
Guest Posted March 30, 2023

6 hours ago, BrianA said:

[...] Your analogy reminds me of a technique used in cybersecurity called fuzzing [...] So you basically use automated tools called fuzzers that start injecting random inputs into the software or system to see what happens. [...] the question isn't necessarily "how many more low-level unknown features from cells remain to be discovered before we could simulate everything?", but rather "what fractional part of the overall features and behavior of cells would be just enough to then let a really smart AI infer the rest of the missing bits?". [...] At some point in the near future, these 2 trends of increasing AI smartness and increasing bio dataset sizes should reach an intersection point to enable this. [...]

You have to remember that we're talking about a living being, e.g. a fly, mouse, or human. That's not some change in computer code that you can arrange in a second, run for an hour, and observe the outcome. Even in-vitro studies (cells in a dish) take weeks to figure out what and how to target, execute, and analyze. Ideally! In practice it can be months. AI can help, sure, but we will need much better lab robots to really speed up this process. That still doesn't address the challenge that interactions between cells/tissues in a living being are decisive; most in-vitro results don't translate even to the mouse/rat stage. And doing experiments in animals of that kind will take months for every individual "fuzzing"-style change. And there is an incredible number of possible changes. Systematically probing even 10% of what we don't know would eat up the entire US government budget. In mice! And most things that somewhat work in mice fail at the next step: human clinical trials.
And you are surely not proposing to find 100 million volunteers to test out the most basic areas of missing knowledge, much of which would require genetic engineering that can realistically only be done at the embryo stage, and thus requires waiting nine months to see the effect on a living, independent being. There is a project currently under development to create a "worm bot" that can assess, in an industrial manner, the effect of drugs on C. elegans (a tiny worm in a dish). This is based on the assumption that there are lessons to be learned from that simple organism in vivo. The analysis of the mechanisms, once there is a success on lifespan, still needs to be done by hand (they only look at whether a drug increases lifespan; there is no real bio-analysis of its impact on metabolism). But again: even mouse results mostly don't translate to humans. I leave it to you to assess the value of doing pre-screening in tiny worms. Also, we are nowhere near 100%. Just looking at the number of proteins transcribed in a cell that we have no real clue about (where they go and what they end up doing), we are heavily outmatched. And we are still discovering fundamental processes in the inner workings of the cell, and in the interactions between cells, that we didn't know 10 years ago to be fundamental processes controlling the metabolism! We still don't know much about extracellular vesicles or O-GlcNAcylation, just that there is a lot of it going on, steering fundamental processes of the cell. Why? Where? Regulating mechanisms? Very incomplete at best. You are saying: maybe AI is so smart that it just knows what to do, even with gigantic gaps in our knowledge of the fundamental workings of the cell. Then make that argument, please. Just making the statement is not an argument. It needs to be an argument about what extent of information is required to make somewhat accurate predictions about a complex system, even assuming a million-times scaling of current AI levels (whatever that means).
I'm not into mathematical theorems, so I can't guide you in making that argument. But unless this argument can be made, everything about AI being a game changer in biomedical science is pure speculation. As it stands, AI can't solve our fundamental questions about which kind of string theory, if any at all, is the correct model of our universe; there just isn't sufficient empirical data about the fundamental workings of the universe available. The same is true for human metabolism, though there at least the missing information can in theory be gathered.
BrianA Posted March 30, 2023 (edited)

Thanks, Guest, for elaborating further. My estimation of the difficulties involved in having an AI eventually form a mostly "complete" cell prediction/simulation system has definitely gone up a few notches based on your comments, not to mention, as you say, how cells interact with each other. I don't have enough knowledge or info to make the level of argument you want; mainly I'm left with analogies, intuitions, and trendlines/extrapolations. One thing that occurs to me historically is the Human Genome Project and the overall trend in sequencing speeds and cost reductions. IIRC, the HGP was using slower traditional sequencing approaches and, after almost a decade, was challenged by Craig Venter's Celera Genomics using faster, cheaper, more automated techniques. Since then, AFAIK, the speed and cost-reduction curves have continued far further. So the trends seem to point toward us gaining more bio data ever more quickly and cheaply; the market drives this, and so does tech. The problem, as you explain, is how you could possibly speed up the remaining metabolic and other research needed for an AI model. The robotic worms thing is one idea. I also wonder whether at some point we would have enough data on how worms or other small organisms react compared to mice (or humans) to allow for an AI-assisted translational prediction system. Could an AI learn to predict how a mouse is likely to react, based on smaller/cheaper organism research? Based on what you've said above, my guess is you'd say no to that.
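The sequencing cost trend can be put in rough numbers (the figures below are my approximate recollection of public estimates, roughly $100M per genome around 2001 versus about $1,000 by the mid-2010s; they are not exact data):

```python
import math

cost_2001 = 100_000_000  # ~USD per genome, circa 2001 (approximate)
cost_2015 = 1_000        # ~USD per genome, circa 2015 (approximate)
years = 2015 - 2001

# Implied constant annual decline factor and cost-halving time.
annual_factor = (cost_2015 / cost_2001) ** (1 / years)
halving_years = math.log(0.5) / math.log(annual_factor)

print(f"cost multiplied by ~{annual_factor:.2f} per year")
print(f"i.e. cost halved roughly every {halving_years:.1f} years")
```

If those inputs are even roughly right, sequencing cost was halving in under a year on average over that stretch, considerably faster than the roughly two-year Moore's-law cadence, which is the sense in which bio data generation has been outpacing computing trends.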
But I have to think that since higher organisms evolved from older/smaller organisms, they must share some metabolic and other similarities, such that if the smaller organisms could be fairly fully characterized, then that data, combined with examples of how mouse cells behaved differently plus everything else we have on mice (their DNA and all their "omic" data), might let an AI start to make some educated guesses and predictions. But again I can't say for sure; I have no hard math to back this up.

Edited March 30, 2023 by BrianA
BrianA Posted March 30, 2023

Here's an interesting podcast series, Cognitive Revolution. This episode was with Google's "Mother of Robots", Keerthana Gopalakrishnan. Her analogy is that robotics is somewhere between GPT-2 and GPT-3 in terms of how close robots are to becoming part of our everyday lives. She hints they are dropping a new paper soon that sounds like it's about getting additional training data for their AIs via translational learning from videos (I'm thinking YouTube videos may be a source). Interestingly, in that recent video with Ilya Sutskever of OpenAI, he also mentioned, when asked whether they were running out of data to train on, that they could move into using videos. Also interestingly, GitHub (owned by Microsoft, the partner of OpenAI) recently filed an amicus brief in a court case taking the position that scraping videos from YouTube should be allowed for scientific research... now I think I see why. Anyway, this podcast series has some other interesting episodes; in the previous one the host discussed his experience as part of the GPT-4 red team, testing an early "amoral" version of it last year. Check it out.
BrianA Posted March 30, 2023

There has been an explosion in the past 24 hours of calls to regulate/slow down/stop AGI development beyond GPT-4 levels. Here are some of them:

The Future of Life Institute released a petition, signed by Elon Musk and others, to pause high-end AI training for six months or more, to at least try to put more thought-out safety measures in place: https://futureoflife.org/open-letter/pause-giant-ai-experiments/
discussion: https://forum.effectivealtruism.org/posts/PcDW7LybkR468pb7N/fli-open-letter-pause-giant-ai-experiments
discussion: https://www.lesswrong.com/posts/6uKG2fjxApmdxeHNd/fli-open-letter-pause-giant-ai-experiments

Eliezer Yudkowsky has an article on the Time website suggesting a complete worldwide shutdown of all advanced work toward AGI: https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/
discussion: https://www.lesswrong.com/posts/Aq5X9tapacnk2QGY4/pausing-ai-developments-isn-t-enough-we-need-to-shut-it-all

LAION has a counter-petition for Europe to create a "CERN for AGI", to effectively speed up scientific work on AGI and potentially open-source it, or parts of it: https://laion.ai/blog/petition/

And there's an Eliezer Yudkowsky interview with Lex Fridman today.
BrianA Posted March 30, 2023

A couple of bio-related articles from the Next Big Future blog today:

Alphafold AI and Robotic Science Lab Identified a New Liver Cancer Treatment
https://www.nextbigfuture.com/2023/03/alphafold-ai-and-robotic-science-lab-identified-a-new-liver-cancer-treatment.html
"This paper demonstrates that for healthcare, AI developments are more than the sum of their parts," said Aspuru-Guzik. "If one uses a generative model targeting an AI-derived protein, one can substantially expand the range of diseases that we can target. If one adds self-driving labs to the mix, we will be in uncharted territory."

Ray Kurzweil Predicted Simulated Biology is a Path to Longevity Escape Velocity
https://www.nextbigfuture.com/2023/03/ray-kurzweil-talked-about-reaching-longevity-escape-velocity-using-simulated-biology.html
"Deepmind is also working on a number of other projects in chemistry and biology to expedite the drug discovery process. Hassabis envisions the development of a 'virtual cell' that models all cellular dynamics and can be used to perform in silico experiments. This would streamline the research process, requiring wet lab validation only at the final stage."
Kurzweil predicts: "Ultimately, we won't need to test on humans. We will be able to test on a million simulated humans which will be much better than testing on a few hundred real humans."
BrianA Posted March 30, 2023 Report Share Posted March 30, 2023 For now the LLM race to AGI continues to accelerate as DeepMind and Google Brain who have traditionally operated as separate entities within Google now are teaming up: “Gemini”: Google and Deepmind develop GPT-4 competition https://the-decoder.com/gemini-google-and-deepmind-develop-gpt-4-competition/ Quote Link to comment Share on other sites More sharing options...
Dean Pomerleau (thread author) Posted March 30, 2023

That is quite a stark opinion piece by Eliezer Yudkowsky (EY) in Time. Here is a representative quote:

We are not prepared. We are not on course to be prepared in any reasonable time window. There is no plan. Progress in AI capabilities is running vastly, vastly ahead of progress in AI alignment or even progress in understanding what the hell is going on inside those systems. If we actually do this, we are all going to die.

He sure isn't pulling any punches. From reading a lot of his previous posts on LessWrong, EY doesn't actually believe that a call to halt all AI research and shut down all clusters of GPUs is going to work, because it won't be heeded, and even if it were, it would be impossible to enforce without a superintelligent AI monitoring everything, which is what he's trying to avoid in the first place. He is almost maximally pessimistic. Here is the summary of his "death with dignity" post on LessWrong from almost exactly one year ago (note: it was posted on April 1st, but it subsequently became clear that he wasn't joking):

tl;dr: It's obvious at this point that humanity isn't going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.

Here is what he means by dignity:

It is more dignified for humanity - a better look on our tombstone - if we die after the management of the AGI project was heroically warned of the dangers but came up with totally reasonable reasons to go ahead anyways.

So this letter is EY's attempt to warn of the extreme danger he sincerely believes we are in and to outline the minimum that must be done to avoid it, with the full expectation that the AI developers will ignore his and others' warnings and proceed anyway. I'm not sure what to think.
There is definitely a chance he is completely wrong about the extreme risk of proceeding with AGI development. But even the developers (like Sam Altman and Ilya Sutskever at OpenAI) acknowledge that they don't know how these systems work or what they are capable of. They hope the efforts they are making to ensure these systems are safe will be effective, but they aren't sure about that either. So the chance that EY is right (i.e. that we're doomed if we proceed) is certainly not zero. He may be overestimating the danger. But the fact that EY has thought longer and harder about the risks of AGI than virtually anyone else on the planet, and sincerely believes that we are almost certainly doomed, makes me very uncomfortable. At the very least it should increase any rational (Bayesian) estimate of the likelihood of existential calamity. I agree with him that what it would take to completely eliminate the risk (a total global moratorium on AI development) isn't going to happen. So we'd better hope he's wrong about the degree of danger.
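Dean's "Bayesian estimate" point can be made concrete with a toy update (all probabilities below are invented purely for illustration; none are anyone's actual numbers):

```python
# P(doom): an observer's prior before hearing the warning.
prior = 0.05
# How likely a well-informed expert is to issue such a warning
# in each possible world (made-up likelihoods).
p_warn_if_doom = 0.9
p_warn_if_safe = 0.3

# Bayes' rule: P(doom | expert warns).
evidence = prior * p_warn_if_doom + (1 - prior) * p_warn_if_safe
posterior = prior * p_warn_if_doom / evidence

print(f"prior {prior:.2f} -> posterior {posterior:.2f}")
```

As long as the expert is more likely to warn in worlds where the risk is real (p_warn_if_doom > p_warn_if_safe), the posterior must exceed the prior. That inequality, not the particular numbers, is the whole of the argument.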
BrianA Posted March 31, 2023

There is of course a wide set of options between "completely prevent the risk" and "let's hope he's wrong", and I think Yudkowsky knows this. Perhaps he is attempting to shift the Overton window further in his direction, in the expectation that if he can psychologically anchor people's estimates of "how scary should I think this new thing is?", maybe he can at least get the various companies and people involved to spend more effort on safety. I think that's a good thing, and I'm in favor of him outputting extreme-sounding scenarios if it helps move the world a bit more in the safety direction. A big problem currently is that the typical new researcher just out of school has a decision to make: should I go to work for a company working on AI capabilities, which offers both high $$$ and high prestige? Or go to work somewhere "less prestigious" on AI safety that might pay less? I've heard Connor Leahy of Conjecture say he thinks there are maybe only 100 or so people in the world currently getting paid real money to do actual AI safety work (not counting people who work on AI ethics, aka "make the AI not say offensive words"). That number needs to get waaaay more in line with how many people are working on capabilities. I saw a tweet yesterday that around 2,600 new AI papers dropped in the past week alone. The capabilities side is currently exploding.