1 min read · from Machine Learning
[P] Using YouTube as a data source (lessons from building a coffee domain dataset)
I started working on a small coffee coaching app recently - something that could answer questions about brew methods, grind size, extraction, and so on. While looking for good data, I realized most written sources are either shallow or scattered. YouTube, on the other hand, has insanely high-quality content (James Hoffmann, Lance Hedrick, etc.), but it isn't usable out of the box for RAG: transcripts are messy, chunking is inconsistent, and getting everything into a usable format took way more effort than expected. So I made a small CLI tool that:
It basically became the data layer for my app, and funnily enough it ended up getting way more traction than the actual coffee coaching app! Repo: youtube-rag-scraper
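The messy-transcript and inconsistent-chunking problems described above can be sketched in a few lines of plain Python. This is a hedged illustration, not the repo's actual code: the segment format (`text` plus `start` time in seconds) mimics what YouTube transcript scrapers typically return, and the greedy character-budget chunker with per-chunk timestamps is just one reasonable approach for a RAG data layer.

```python
import re

def clean(text: str) -> str:
    """Collapse whitespace and drop caption cues like [Music] or [Applause]."""
    text = re.sub(r"\[.*?\]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk_transcript(segments, max_chars=300):
    """Greedily merge cleaned caption segments into chunks of at most
    max_chars characters, keeping each chunk's start time so answers
    can link back to a timestamp in the video."""
    chunks, buf, start = [], [], None
    for seg in segments:
        text = clean(seg["text"])
        if not text:
            continue
        if start is None:
            start = seg["start"]
        # Flush the buffer if adding this segment would exceed the budget.
        if buf and sum(len(t) + 1 for t in buf) + len(text) > max_chars:
            chunks.append({"start": start, "text": " ".join(buf)})
            buf, start = [], seg["start"]
        buf.append(text)
    if buf:
        chunks.append({"start": start, "text": " ".join(buf)})
    return chunks
```

Keeping the start time per chunk is the detail that matters for a coaching app: a retrieved chunk can then cite "Hoffmann at 4:32" rather than an anonymous blob of text.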
Tagged with
#YouTube
#transcripts
#coffee coaching app
#data layer
#chunking
#data source