Data used for MLX fine-tuning

The WWDC25 session "Explore large language models on Apple silicon with MLX" talks about using your own data to fine-tune a large language model, but it doesn't explain what kind of data can be used; it only shows the command to run and how to point it at the data folder. Can I use PDFs, Word documents, or Markdown files to train the model? Are there any code examples on GitHub that demonstrate how to do this?

The process is explained here: https://github.com/ml-explore/mlx-examples/tree/main/lora#Custom-Data

and example JSON files are here: https://github.com/ml-explore/mlx-examples/tree/main/lora/data
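
In other words, the trainer doesn't consume PDFs, Word documents, or Markdown directly; you first convert your sources to plain text and reshape them into JSON Lines files (train.jsonl, valid.jsonl, test.jsonl), one example per line. As a rough sketch only (assuming Markdown sources, a hypothetical folder name, and the {"text": ...} schema used by the example data linked above), a conversion script could look like this:

```python
import json
import random
from pathlib import Path

# Hypothetical source folder of Markdown files to convert; adjust to your setup.
SOURCE_DIR = Path("my_markdown_docs")
OUTPUT_DIR = Path("data")
OUTPUT_DIR.mkdir(exist_ok=True)

# One training example per Markdown file.
# The {"text": ...} schema mirrors the example JSONL files in mlx-examples/lora/data.
examples = []
for md_file in sorted(SOURCE_DIR.glob("*.md")):
    text = md_file.read_text(encoding="utf-8").strip()
    if text:
        examples.append({"text": text})

random.seed(0)
random.shuffle(examples)

# Simple 80/10/10 split into the three files the LoRA example expects.
n = len(examples)
splits = {
    "train": examples[: int(0.8 * n)],
    "valid": examples[int(0.8 * n) : int(0.9 * n)],
    "test": examples[int(0.9 * n) :],
}

for name, rows in splits.items():
    with open(OUTPUT_DIR / f"{name}.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The resulting data folder is then what you point the fine-tuning command's data argument at, as shown in the video.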

Note the specific format, and imagine how much work goes into creating these files; it is not an easy feat. That is why companies like Scale AI exist: to put workers in the Global South (Venezuela and Chile, for example) to work remotely for around $5 a day doing this labeling. This is the dark underbelly of deep learning, along with the existential threat of global warming from its exponentially increasing energy requirements.
