!PlainTextWikipedia
Convert Wikipedia database dumps into plain text files (JSON). This can parse literally all of Wikipedia with pretty high fidelity. There's a copy available on Kaggle Datasets.

https://github.com/daveshap/PlainTextWikipedia

!!QUICK START

* Download and unzip a Wikipedia dump (see Data Sources below); make sure you get a monolithic XML file
* Open wiki_to_text.py and edit the filename to point at your XML file. Also update the savedir location
* Run wiki_to_text.py - it should take about 2.5 days to run, with some variation based on your CPU and storage speed (a rough sketch of what this step does is under Example below)

!!Data Sources

There are two primary data sources:

* Simplified English Wikipedia: only about 1 GB, which makes it a great test set - https://dumps.wikimedia.org/simplewiki/
* English Wikipedia: all of Wikipedia, about 80 GB unpacked - https://dumps.wikimedia.org/enwiki/

Navigate into the latest dump. You're likely looking for the very first file in the download section. The files will look something like this:

* enwiki-20210401-pages-articles-multistream.xml.bz2 (18.1 GB)
* simplewiki-20210401-pages-articles-multistream.xml.bz2 (203.5 MB)

Download and extract these to a storage directory. I usually shorten the folder name and filename.

!!Legal

https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content

Wikipedia is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA). My script is published under the MIT license, but this does not confer the same privileges to the material you convert with it.
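!!Example

For a sense of what the conversion step involves, here is a minimal Python sketch that streams pages out of an extracted Wikipedia XML dump and writes each one as JSON. This is not the actual wiki_to_text.py from the repository: the file paths, the export namespace version, and the very rough markup stripping below are placeholder assumptions, and the real script handles wikitext far more thoroughly.

```python
import json
import os
import re
import xml.etree.ElementTree as ET

# Placeholder paths - point these at your extracted dump and an output directory
filename = "simplewiki-20210401-pages-articles-multistream.xml"
savedir = "output/"

# MediaWiki export XML puts a namespace on every tag; the version (0.10 here)
# can differ between dumps, so check the root element of your file
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def strip_markup(text):
    """Very rough wikitext cleanup (templates, links, quotes, headings)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # internal links
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quotes
    text = re.sub(r"={2,}", "", text)                               # heading markers
    return text.strip()

os.makedirs(savedir, exist_ok=True)

count = 0
# iterparse streams the file, so even the ~80 GB full dump never sits in memory
for event, elem in ET.iterparse(filename, events=("end",)):
    if elem.tag == NS + "page":
        title = elem.findtext(NS + "title", default="")
        text = elem.findtext(NS + "revision/" + NS + "text", default="") or ""
        record = {"title": title, "text": strip_markup(text)}
        with open(os.path.join(savedir, str(count) + ".json"), "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False)
        count += 1
        elem.clear()  # release the finished page so memory stays flat
```

Writing one JSON file per article keeps the sketch short; for the full English dump you would likely want to batch many articles into each output file instead.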