Hiding the Magic

The hidden cost of ChatGPT, an Information Design Journal, and the MusicCaps dataset

Jan 30, 2023

Hi,

In preparation for this newsletter, I broke a small personal rule: I logged back into Twitter. Historically, Twitter was my main information feed for finding interesting links for Data Curious. After the Musk takeover I reached a breaking point and jumped ship to Mastodon (vis.social server). I’m grateful that a small but not insignificant number of people I admire from my Twitter-sphere did the same.

That being said, Mastodon hasn’t quite rivaled the level of discovery I had on Twitter. On the bright side, it’s not (as of now) a soul-sucking portal designed to encourage outrage and dunking on people. So pros and cons.

But every once in a while (like yesterday), I’ll pop my head back in to see if there’s anything in the data + tech world that might be inspiring. I’m not sure if I regret it yet or not. One thing that really stood out as different: the AI hype is everywhere.

I don’t typically write about AI. It’s not my field of interest or expertise. And I’m not going to offer much editorial on it in this letter (only a bit). But I will say that I recently finished reading Atlas of AI by Kate Crawford, which helped me understand the wider scope of where we are now and how we got here. Highly recommend it for a critical reading of machine learning technologies (i.e. AI, minus the marketing hype).

That being said, I will be featuring some AI-related content here (but not at all for promotional purposes…quite the opposite in fact).

If you have any other recommendations for critical analysis of AI, let me know in the comments below or by replying to this email.

Read

OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic

OK, I know I said that I wouldn’t write much about AI…but this story came across my feed last week and the timing felt eerily prescient. I had just been reading in Atlas of AI how AI is, fundamentally, a technology of extraction. Extracting data in the form of scraping huge amounts of information from the web. Extracting valuable minerals from the Earth to fuel the huge amount of computing power needed to run the models. And extracting human labor (as demonstrated in this investigative TIME piece) by outsourcing content moderation and data labeling at the lowest possible costs.

Users of ChatGPT see exactly what OpenAI want them to see: the instant magic of a machine answering their questions reasonably. But as the piece shows, much of the work is not magic: it’s human exploitation.

Explore (but mostly read)

https://www.jbe-platform.com/images/covers/1569979x.png — Online cover image from the Information Design Journal (Issue 1, Volume 27)

Information Design Journal (Issue 1, Volume 27) is now Open Access

I was unfamiliar with this publication until last week, when I saw someone announce that the first issue of this year was just made Open Access (all articles available online). In my experience, this is rare for academic-style journals. From a quick skim, I’m eager to have a closer look, especially with titles such as “A dynamic topography for visualizing time and space in fictional literary texts”.

Information+ @InfoPlusConf

We are very happy to announce our special issue of Information Design Journal with 10 contributions from #infoplus2021 is now fully published as #OpenAccess: doi.org/10.1075/idj.27… A big thank you again to our wonderful authors and reviewers! #DataVis #InformationDesign


Data

Tero Parviainen @teropa

The results are very impressive, but the dataset seems like the most immediately advantageous aspect here, given they've no plans to release the model. "5,521 music examples, each of which is labeled with an English aspect list and a free text caption written by musicians"

Aran Komatsuzaki @arankomatsuzaki

MusicLM: Generating Music From Text Presents MusicLM, a model for generating high-fidelity music from text. MusicLM generates music at 24 kHz that remains consistent over several minutes. proj: https://t.co/8vzBONkPe3 abs: https://t.co/vzW01q7VpH data: https://t.co/LERn2mZMtO https://t.co/u4L4ui0RwU

MusicLM: Generating Music From Text (Dataset)

Admittedly, I was impressed by MusicLM when I first heard it. MusicLM is the newest text-to-music generative model on the block from Google Research. But even more interesting to me was (as mentioned by Tero in the tweet above) the fact that the authors open-sourced a dataset alongside it: over 5,000 music examples with text labels attached. Could be some interesting source material to experiment with later, for both machine learning and visualization purposes. The MusicCaps dataset is on Kaggle here and the research paper as a PDF is here.

Data Curious