PlainTextWikipedia
Convert Wikipedia database dumps into plain text files (JSON). This can parse literally all of Wikipedia with pretty high fidelity. There's a copy of the converted data available on Kaggle Datasets.

https://github.com/daveshap/PlainTextWikipedia?tab=readme-ov-file

!!Quick Start

* Download and unzip a Wikipedia dump (see Data Sources below); make sure you get a monolithic XML file.
* Open up wiki_to_text.py and edit the filename so it points at your XML file, and update the savedir location (see the sketch at the end of this page).
* Run wiki_to_text.py. It should take about 2.5 days to finish, with some variation based on your CPU and storage speed.

!!Data Sources

There are two primary data sources:

* Simplified English Wikipedia: only about 1 GB, and therefore a great test set - https://dumps.wikimedia.org/simplewiki/
* English Wikipedia: all of Wikipedia, about 80 GB unpacked - https://dumps.wikimedia.org/enwiki/

Navigate into the latest dump. You're likely looking for the very first file in the download section. The files will look something like this:

* enwiki-20210401-pages-articles-multistream.xml.bz2 (18.1 GB)
* simplewiki-20210401-pages-articles-multistream.xml.bz2 (203.5 MB)

Download and extract these to a storage directory. I usually shorten the folder name and filename.

!!Legal

https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content

Wikipedia is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA). My script is published under the MIT license, but this does not confer the same privileges on the material you convert with it.
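!!Example Setup

As a concrete illustration of the second Quick Start step, the sketch below shows the kind of edit it describes. The variable names filename and savedir follow the wording above; the actual names, defaults, and layout inside wiki_to_text.py may differ, so treat this as a template rather than a copy of the script.

```python
# Sketch of the two values the Quick Start asks you to edit (names illustrative;
# check wiki_to_text.py itself, as its variable names and defaults may differ).
filename = 'D:/wikidumps/simplewiki-20210401-pages-articles-multistream.xml'  # your extracted XML dump
savedir = 'D:/wikidumps/simplewiki_json/'                                     # where the converted JSON files are written
```

With those two values set, the script is started the usual way:

```
python wiki_to_text.py
```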
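Once the conversion finishes, the output lives in savedir as JSON files. The reader below is a minimal sketch for sanity-checking that output; it assumes each file holds a list of article records with "title" and "text" fields, so verify the field names against the files the script actually produces before relying on it.

```python
import json
import os

savedir = 'D:/wikidumps/simplewiki_json/'  # same directory configured in wiki_to_text.py

# Peek at the first output file. The {"title": ..., "text": ...} record layout
# is an assumption; adjust the keys to whatever the script actually writes.
for name in sorted(os.listdir(savedir)):
    if not name.endswith('.json'):
        continue
    with open(os.path.join(savedir, name), encoding='utf-8') as infile:
        articles = json.load(infile)
    for article in articles[:5]:
        print(article['title'], '-', len(article['text']), 'characters')
    break  # only the first file, as a quick sanity check
```

Reading one file at a time like this keeps memory use bounded, which matters when working with the full English dump rather than the Simplified English test set.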