Python - Print Wikipedia Article Title from Gensim WikiCorpus
I believe this question is easy, but I'm new to Python, and I think that is blinding me a bit.
I've downloaded the Wikipedia dump as explained under "Preparing the corpus" here: https://radimrehurek.com/gensim/wiki.html. Then I ran the following lines of code:
import gensim
# these next 2 lines take around 16 hours
wikidocs = gensim.corpora.wikicorpus.WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
gensim.corpora.MmCorpus.serialize('wiki_en_vocab200k', wikidocs)
These lines of code are taken from the link above. Now, in a separate script, I've done some text analysis. The result of the text analysis is a number representing the index of a particular article in the wikidocs corpus. The problem is that I don't know how to print out the text of that article. The obvious thing to try is:
wikidocs[index_of_article]
but that returns the error
TypeError: 'WikiCorpus' object does not support indexing
I've tried a few other things, but I'm stuck. Thanks for the help.
It's not such an easy question. The reason that didn't work is that WikiCorpus isn't an indexable sequence; it's a class with a few functions, including ones for saving and loading. You can see the functions by typing WikiCorpus. and pressing Tab in IPython (this shows the options for tab completion):
In [8]: wikidocs = gensim.corpora.wikicorpus.WikiCorpus.
gensim.corpora.wikicorpus.WikiCorpus.get_texts
gensim.corpora.wikicorpus.WikiCorpus.load
gensim.corpora.wikicorpus.WikiCorpus.save_corpus
gensim.corpora.wikicorpus.WikiCorpus.getstream
gensim.corpora.wikicorpus.WikiCorpus.save
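To see why the indexing attempt fails, here is a minimal sketch using a hypothetical stand-in class, IterableOnly, which is not part of gensim. It assumes only that the object can be looped over but defines no `__getitem__`, which is the situation the error message describes. The exact wording of the TypeError differs between Python versions.

```python
class IterableOnly:
    """Hypothetical stand-in: supports iteration but not indexing."""

    def __init__(self, docs):
        self._docs = docs

    def __iter__(self):
        # Looping works because __iter__ is defined.
        return iter(self._docs)


corpus = IterableOnly([["first", "doc"], ["second", "doc"]])

# Iteration is fine.
for doc in corpus:
    print(doc)

# Indexing raises TypeError because there is no __getitem__.
indexing_failed = False
try:
    corpus[0]
except TypeError as err:
    indexing_failed = True
    print("indexing failed:", err)
```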
It looks like you want get_texts, which returns an iterator rather than a list (iterators don't directly support indexing either), so you'll have to use
list(wikidocs.get_texts())[i]
or
from itertools import islice
next(islice(wikidocs.get_texts(), i, i + 1))
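The islice approach can be tried without the 16-hour corpus build. The sketch below uses a hypothetical generator, fake_get_texts, as a stand-in for wikidocs.get_texts(); the token lists in it are made up for illustration. The helper nth is also an assumption of this example, not a gensim function. Unlike the list(...) version, it stops reading as soon as the requested item is reached and does not hold the whole corpus in memory.

```python
from itertools import islice


def nth(iterable, n, default=None):
    """Return the item at 0-based position n of any iterable, or default."""
    return next(islice(iterable, n, None), default)


def fake_get_texts():
    # Hypothetical stand-in for wikidocs.get_texts():
    # yields one list of tokens per article.
    yield ["anarchism", "political", "philosophy"]
    yield ["autism", "developmental", "disorder"]
    yield ["albedo", "reflection", "coefficient"]


article = nth(fake_get_texts(), 1)
print(article)

# An index past the end returns the default instead of raising StopIteration.
missing = nth(fake_get_texts(), 10)
print(missing)
```

With the real corpus, the call would look like nth(wikidocs.get_texts(), index_of_article). Each call starts a fresh pass over the dump, so looking up many articles this way is slow.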