I have pondered this for more than 30 years.

According to military communication research (sorry, I don't recall the specifics), the theoretical maximum rate at which the human brain can absorb information is only about 45 bits per second. That is 10800 bits per 4 minutes of listening to the radio or watching TV. I don't know whether they know which bits are sufficient to reproduce that experience, but I agree that I probably can't send a review of a 4-minute video by telegram. A good typist can exceed that rate by reflex even without understanding the dictation.
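As a quick check of that arithmetic, here is a small sketch. The 45 bits per second figure is the one quoted above; the typist numbers (80 words per minute, 5 characters per word, 8 raw bits per character) are my own assumptions, picked only to illustrate the comparison.

```python
BITS_PER_SECOND = 45  # the figure quoted in the text


def bit_budget(seconds, rate=BITS_PER_SECOND):
    """Total bits absorbed at a constant rate over the given time."""
    return rate * seconds


def typist_bits_per_second(words_per_minute=80, chars_per_word=5, bits_per_char=8):
    """Raw keystroke rate in bits per second (all defaults are assumptions)."""
    return words_per_minute * chars_per_word * bits_per_char / 60


print(bit_budget(4 * 60))        # 10800
print(typist_bits_per_second())  # a little over 53 under these assumptions
```

Counting raw 8-bit characters overstates the information content of English text, so the typist comparison is only a rough one.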

An analog telephone line carries a frequency range of only about 3000 cycles per second. A modem can use QAM to transmit about 33600 bits per second over it (56000 needs a digital link at the far end), but phase modulation is involved, and human hearing is not very aware of the phase aspect of sound.

An analog telephone line or an AM radio station is capable of carrying a mechanical television show, while the old NTSC broadcasts required about 4 times as much bandwidth as all of the AM radio stations combined (600 radio stations could fit in one TV channel).
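The 600 figure can be reproduced with one division. This sketch assumes the US allocations, which the text does not state: 6 MHz for one NTSC channel and 10 kHz spacing between AM carriers.

```python
NTSC_CHANNEL_HZ = 6_000_000  # assumed: width of one NTSC broadcast channel (US)
AM_SPACING_HZ = 10_000       # assumed: spacing between AM carriers (US)


def am_stations_per_tv_channel(tv_hz=NTSC_CHANNEL_HZ, am_hz=AM_SPACING_HZ):
    """How many AM station slots fit side by side in one TV channel."""
    return tv_hz // am_hz


print(am_stations_per_tv_channel())  # 600
```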

30 years ago I began using a primitive extremely lossy binary representation of sound, by which voice and music could be processed experimentally with arithmetic within the limit of 64000 bytes of RAM, able to store one minute of sound at a time in the simplest case, and whole songs by using tricks.

First I made sounds by counting or by pseudo-random means, 3 years before using this technique for recording. The purpose was to build a synthesizer, which I knew nothing about and thought was an automatic music-generating machine. In my format, counting sounds musical, and pseudo-random data sounds like the hiss of the s and sh sounds. It may be important that my format is so lossy that it preserves neither the frequency, the phase, nor the amplitude aspects of sound; its recognizability and intelligibility are thus unexplained, and it is generally only useful for fast experimental calculation and exploration of "thumbnail" quality music.
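To make "counting" and "pseudo random" concrete, here is a minimal sketch. It is not my actual format, only an illustration that assumes one bit per sample: any single bit of a running counter is a square wave, and a shift register with feedback gives a hiss-like bit stream. The function names and the 16-bit LFSR taps are choices made for this example.

```python
def counter_bit(n_samples, bit):
    """1-bit samples taken from one bit of a running counter (a square wave)."""
    return [(i >> bit) & 1 for i in range(n_samples)]


def period(samples):
    """Smallest shift p at which the sequence matches itself on the overlap."""
    n = len(samples)
    for p in range(1, n):
        if all(samples[i] == samples[i + p] for i in range(n - p)):
            return p
    return n


def lfsr_bits(n_samples, state=0xACE1):
    """1-bit samples from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    out = []
    for _ in range(n_samples):
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        out.append(state & 1)
    return out


print(period(counter_bit(64, 2)))  # 8 samples per cycle
print(period(counter_bit(64, 3)))  # 16 samples per cycle, one octave lower
print(lfsr_bits(16))               # noise-like bits
```

Each higher counter bit doubles the period, which is why plain counting lands on octaves.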

Octave and musical note frequencies can be expressed in Base 12 Log2, from 0.0 for a low A to 11.11 for a high G#. Log2 explains how binary counting generates octaves but not the notes.
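A sketch of that octave-and-note numbering, assuming equal temperament and assuming the "low A" is 27.5 Hz (the reference pitch is not pinned down above, so treat it as a placeholder). The code writes the result as an (octave, note) pair rather than as a base-12 fraction.

```python
import math

NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]
LOW_A_HZ = 27.5  # assumption: reference frequency for octave 0, note 0


def note_code(freq_hz, low_a=LOW_A_HZ):
    """Return (octave, note number 0..11, note name) for a frequency."""
    semitones = round(12 * math.log2(freq_hz / low_a))
    octave, note = divmod(semitones, 12)
    return octave, note, NOTE_NAMES[note]


print(note_code(27.5))   # (0, 0, 'A')
print(note_code(440.0))  # (4, 0, 'A')
```

The whole-number part of log2 is the octave, which binary counting gives for free; the twelve steps inside each octave come from the fractional part.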

I now occasionally use 4 megabytes to represent experimental binary numeric sound, but I usually don't come anywhere close to exceeding 64K bytes. Most of my experiments do not exceed a screenful of significant data to generate musical sound.

4 megabytes can hold over 256^4,000,000 different numbers, from all 00 to all FF in hexadecimal. Assuming 8-bit PCM, there are constraints that bring the number of possible songs down, such as that the average byte value must be hex 80. The same song will be heard if all bits are flipped between 0 and 1, so that makes half as many possible songs. The same song will be heard if the bytes are shifted and the player is looping, and therefore the total number of songs must be divided by 4000000. The lowest bit of each byte is meaningless, so each byte has only half as many useful values. Removing all bits except the highest bit still plays recognizable sound, so each byte's 256 values must be divided by 128, leaving 2 per byte and 2^4000000 patterns in total.
So the number of possible songs must be around...
2^4000000 / 2 due to bit flipping / 4000000 due to shift equivalence = roughly 2^3,999,977 possibilities left (4000000 is about 2^22).
Now what if everyone in the world recorded the same song using a device that records in my format? Out of 6,000,000,000 people (fewer than 2^33), no two people will create an identical file.
There are fewer than ((2^3,999,978)-(2^33)) possible songs.
This should be an exponent but it is not. The reason is that all of the songs can be represented by a single bit in that much memory. We know the constant third dimension is the same number.
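Numbers this large are easier to check in logarithms. The sketch below follows the simple division used above for 1-bit samples (halve for bit flipping, divide by the number of samples for looping shifts). That division is only approximate, since some patterns repeat within the loop, and the function name is made up for this example.

```python
import math


def log2_song_count(n_samples):
    """log2 of 2**n_samples / 2 / n_samples (flip and shift equivalence)."""
    return n_samples - 1 - math.log2(n_samples)


exponent = log2_song_count(4_000_000)
print(exponent)  # just above 3,999,977

# Bits needed to give each of 6,000,000,000 people a distinct file.
print(math.ceil(math.log2(6_000_000_000)))  # 33
```

Subtracting 2^33 from a number with an exponent in the millions leaves the exponent unchanged for all practical purposes.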

Now consider that this number represents every possible song sung by every possible singer in every possible language and backed by every possible instrument and every possible parody version and every possible non musical speech or combination of sounds ...plus noise, and I forgot to try to calculate how many are physically impossible sounds that should be excluded.

I recently wrote a short song which has never been sung and programmed my unique synthesizer to create it so that it sounds like that old song that I thought was spontaneously generated by a synthesizer 30 years ago. It is called Viray's Inspiration, and it fits in 32000 bytes and plays for approximately 30 seconds. The voice sounds the same, but the lyrics are different. The music sounds similar but the melody is different. That shall be for replacing all the OLD good-feeling songs that we can't get copyright licenses for with NEW ones that sound similar but are ORIGINAL and may even sound like Elvis singing Michael Jackson's hits with new lyrics and old instruments.

I have partially squeezed (not copied) the Doctor Who intro, including video, into 64K. The audio will be like the original Derbyshire version, and the video of the vortex will be parametrically synthesized, like Farbrausch's 2000 64KB demoscene thriller called FR-08. I think that was about 15 minutes of music and video in 64K.
65536 bytes / 15? minutes / 60 = about 73 bytes (roughly 580 bits) per second, which is getting within about 13 times of 45 bits per second, perhaps?
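The division is in bytes, so it helps to convert to bits before comparing with 45 bits per second. A small sketch, taking the 15 minute running time as the same guess made above:

```python
def bits_per_second(size_bytes, minutes):
    """Average bit rate if size_bytes are spread evenly over the running time."""
    return size_bytes * 8 / (minutes * 60)


rate = bits_per_second(65536, 15)  # 15 minutes is a guess, not a measured length
print(rate)       # about 582.5
print(rate / 45)  # about 12.9 times the 45 bits per second figure
```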

Hmmm. My guess is any 4-minute song can fit in 64K, so there are fewer than 256^65536 = 2^524288 songs.
But if we can only hear the equivalent of 45 bits per second, then there are only 2^(45x60x4) = 2^10800 songs. With or without video.
Go Farbrausch!