API Reference

Modules:

class wikinet.dump.Dump(path_xml, path_idx)

Dump loads and parses Wikipedia dumps from path_xml, using the index at path_idx.

idx: dictionary

{'page_name': (byte offset, page id, block size)}. Cached and loaded lazily.
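As an illustration of how an index entry of this shape can be used, the sketch below builds a tiny two-block compressed file in memory and reads one block by its byte offset and block size. The sample data and the `read_block` helper are hypothetical, and the sketch assumes the dump uses a multistream bz2 layout; the actual Dump internals may differ.

```python
import bz2
import io

# Hypothetical sample data: two independently compressed blocks,
# concatenated the way a multistream bz2 file would be.
block_a = bz2.compress(b"<page><title>Alpha</title></page>")
block_b = bz2.compress(b"<page><title>Beta</title></page>")
dump_file = io.BytesIO(block_a + block_b)

# Index in the documented shape: {'page_name': (byte offset, page id, block size)}.
idx = {
    "Alpha": (0, 1, len(block_a)),
    "Beta": (len(block_a), 2, len(block_b)),
}

def read_block(fileobj, offset, block_size):
    """Seek to offset, read block_size bytes, and decompress that single block."""
    fileobj.seek(offset)
    return bz2.decompress(fileobj.read(block_size))

offset, page_id, block_size = idx["Beta"]
xml_bytes = read_block(dump_file, offset, block_size)
print(xml_bytes.decode("utf-8"))
```

Storing the block size next to the offset means a reader can decompress only the block that holds the requested page, without scanning the whole file.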

links: list of strings

All links.

article_links: list of strings

Article links (not files, categories, etc.)
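One way to separate article links from the rest is to drop titles that carry a non-article namespace prefix. This is only a sketch: the prefix list and the `filter_article_links` helper are assumptions made for illustration, not the rules wikinet applies.

```python
# Hypothetical namespace prefixes; the set wikinet actually excludes may differ.
NON_ARTICLE_PREFIXES = ("File:", "Category:", "Image:", "Template:", "Wikipedia:", "Help:")

def filter_article_links(links):
    """Keep only links whose title does not start with a non-article prefix."""
    return [link for link in links if not link.startswith(NON_ARTICLE_PREFIXES)]

links = ["Topology", "File:Torus.png", "Category:Mathematics", "Homology (mathematics)"]
article_links = filter_article_links(links)
print(article_links)
```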

years: list of int

Years in the History section of a Wikipedia page. BC years are denoted as negative values.
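The sign convention can be illustrated with a small regular-expression sketch. The pattern and the `extract_years` helper are hypothetical and deliberately naive (any number of up to four digits is treated as a year); the extraction wikinet performs may be more selective.

```python
import re

# A number of up to four digits, optionally followed by "BC" or "BCE".
YEAR_PATTERN = re.compile(r"\b(\d{1,4})(?:\s*(BCE|BC))?\b")

def extract_years(text):
    """Return years found in text as ints, with BC/BCE years negated."""
    years = []
    for match in YEAR_PATTERN.finditer(text):
        value = int(match.group(1))
        years.append(-value if match.group(2) else value)
    return years

text = "Euclid wrote the Elements around 300 BC; Riemann gave his lecture in 1854."
print(extract_years(text))
```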

page: mwparserfromhell.wikicode

Currently loaded wiki page.

path_xml: string

Path to the zipped XML dump file.

path_idx: string

Path to the zipped index file.

offset_max: int

Maximum offset. Set as the size of the zipped dump.

cache: xml.etree.ElementTree.Element

Cache of the XML tree for the current block.

load_page(page_name, filter_top=False)

Loads & returns the page (mwparserfromhell.wikicode) named page_name from the dump file. Returns only the top section if filter_top is True.
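To make the effect of filter_top concrete, the sketch below approximates "top section" on raw wikitext as everything before the first heading line. It uses only the standard library; the `top_section` helper and the heading pattern are assumptions for illustration, whereas load_page itself works on a parsed mwparserfromhell object.

```python
import re

# A wikitext heading line such as "== History ==" (two or more '=' on each side).
HEADING = re.compile(r"^==+[^=].*?==+\s*$", re.MULTILINE)

def top_section(wikitext):
    """Return the text that precedes the first heading, or all of it if none exists."""
    match = HEADING.search(wikitext)
    return wikitext if match is None else wikitext[:match.start()]

wikitext = "Lead paragraph.\n\n== History ==\nLater material.\n"
print(top_section(wikitext))
```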

class wikinet.corpus.Corpus(dump, output='doc', dct=None, load_index=True)

Corpus is an iterable & an iterator that uses Dump to iterate through articles.

corpus = wikinet.Corpus(dump)
print(corpus[100])
[c for c in corpus]

dump: wikinet.Dump

a Dump object

output: string

'doc' for an array of documents; 'tag' for TaggedDocument(doc, [self.i]); 'bow' for bag of words [(int, int)].
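The three output modes can be pictured with a standard-library sketch. Here `format_document` and the sample `token2id` mapping are hypothetical, and the namedtuple is a stand-in for gensim's TaggedDocument; Corpus itself relies on gensim for the 'tag' and 'bow' forms.

```python
from collections import Counter, namedtuple

# Stand-in for gensim's TaggedDocument, which pairs words with a list of tags.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def format_document(tokens, i, output="doc", token2id=None):
    """Return one document in the requested representation."""
    if output == "doc":
        return tokens
    if output == "tag":
        return TaggedDocument(tokens, [i])
    if output == "bow":
        # Count known tokens by id, giving a sorted list of (token id, count).
        counts = Counter(token2id[t] for t in tokens if t in token2id)
        return sorted(counts.items())
    raise ValueError("output must be 'doc', 'tag', or 'bow'")

tokens = ["graph", "node", "graph"]
token2id = {"graph": 0, "node": 1}  # assumed vocabulary for this example
print(format_document(tokens, 7, "doc"))
print(format_document(tokens, 7, "tag"))
print(format_document(tokens, 7, "bow", token2id))
```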

dct: gensim.corpora.Dictionary

used to create BoW representation

class wikinet.net.Net(path_graph='', path_barcodes='')

Net is a wrapper for networkx.DiGraph. Uses dionysus for persistent homology.

tfidf: scipy.sparse.csc.csc_matrix

sparse column matrix of tfidfs, ordered by nodes, also stored in self.graph.graph['tfidf'], lazy

MAX_YEAR: int

year = MAX_YEAR (2020) for nodes whose parents have no years

YEAR_FILLED_DELTA: int

year = year of parents + YEAR_FILLED_DELTA (1)
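The two constants suggest a fill rule along the following lines. This is a sketch under stated assumptions: the `fill_year` helper is hypothetical, and taking the latest parent year when a node has several parents is a guess, since the text does not say how multiple parents are combined.

```python
MAX_YEAR = 2020
YEAR_FILLED_DELTA = 1

def fill_year(node, parents, years):
    """Return a year for node, filling it from its parents when it has none."""
    if years.get(node) is not None:
        return years[node]
    parent_years = [years[p] for p in parents.get(node, []) if years.get(p) is not None]
    if not parent_years:
        # No parent carries a year, so fall back to MAX_YEAR.
        return MAX_YEAR
    # Assumption: use the latest parent year, then add the delta.
    return max(parent_years) + YEAR_FILLED_DELTA

years = {"A": 1900, "B": 1950, "C": None, "D": None, "E": None}
parents = {"C": ["A", "B"], "E": ["D"]}
print(fill_year("C", parents, years))
print(fill_year("E", parents, years))
```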

static assign_communities(graph)

Computes the modular communities of graph (nx.DiGraph). Assigns a community number to each node as community. Assigns modularity to the graph. See greedy_modularity_communities in networkx.

static assign_core_periphery(graph)

Compute core-periphery of graph (nx.DiGraph; converted to symmetric nx.Graph). Assign core as 1 or 0 to each node. Assign coreness to graph. See core_periphery_dir() in bctpy.

static build_graph(name='', dump=None, nodes=None, depth_goal=1, filter_top=True, remove_isolates=True, add_years=True, fill_empty_years=True, model=None, dct=None, compute_core_periphery=True, compute_communities=True, compute_community_cores=True)

Builds network.graph (networkx.Graph) from nodes (list of strings). Set model (from gensim) and dct (gensim.corpora.Dictionary) for weighted edges. Set filter_top to True if you want only the top “lead” section of each article.

load_barcodes(path)

Loads barcodes from pickle.

load_graph(path)

Loads graph from path. If the filename ends in .gexf, reads it as GEXF; otherwise uses pickle.

randomize(null_type, compute_core_periphery=True, compute_communities=True, compute_community_cores=True)

Returns a new wikinet.Net with a randomized copy of graph. Set null_type to one of 'year' or 'target'.

save_barcodes(path)

Saves barcodes as pickle.

save_graph(path)

Saves graph at path. If the filename ends in .gexf, saves it as GEXF; otherwise uses pickle.
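The extension-based choice between GEXF and pickle described for load_graph and save_graph could look roughly like this. The free functions below are hypothetical stand-ins for the Net methods; networkx is imported only inside the GEXF branch, so the pickle path shown in the usage runs with the standard library alone.

```python
import os
import pickle
import tempfile

def save_graph(graph, path):
    """Write graph as GEXF when path ends in .gexf, otherwise pickle it."""
    if path.endswith(".gexf"):
        import networkx as nx  # only needed for the GEXF branch
        nx.write_gexf(graph, path)
    else:
        with open(path, "wb") as f:
            pickle.dump(graph, f)

def load_graph(path):
    """Read a graph from GEXF when path ends in .gexf, otherwise unpickle it."""
    if path.endswith(".gexf"):
        import networkx as nx  # only needed for the GEXF branch
        return nx.read_gexf(path)
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage with the pickle branch; a plain dict stands in for a graph here.
graph = {"A": ["B"], "B": []}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph.pkl")
    save_graph(graph, path)
    restored = load_graph(path)
print(restored == graph)
```

GEXF is readable by other graph tools, while pickle preserves arbitrary Python attributes; pickle files should only be loaded from trusted sources.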

class wikinet.persistent_homology.PersistentHomology

Net is a child of PersistentHomology. So you can call any of the following with any wikinet.Net object.

cliques: list of lists

lazy

filtration: dionysus.Filtration

lazy

persistence: dionysus.ReducedMatrix

lazy

barcodes: pandas.DataFrame

lazy
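The attributes above are marked lazy, which presumably means each value is computed on first access and reused afterwards. A minimal sketch of that pattern follows; the `LazyExample` class is hypothetical and is not how wikinet is necessarily implemented.

```python
class LazyExample:
    """Computes total only when it is first requested, then reuses the result."""

    def __init__(self, values):
        self.values = values
        self._total = None       # nothing computed yet
        self.compute_count = 0   # tracks how often the computation runs

    @property
    def total(self):
        if self._total is None:
            self.compute_count += 1
            self._total = sum(self.values)
        return self._total

example = LazyExample([1, 2, 3])
print(example.compute_count)  # no computation has happened yet
print(example.total)
print(example.total)          # second access reuses the stored value
print(example.compute_count)
```

For expensive quantities such as cliques or a filtration, this avoids paying the cost for objects that never use them.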

static compute_barcodes(f, m, graph, names)

Uses dionysus filtration & persistence (in reduced matrix form) to compute barcodes.

f: dionysus.Filtration

filtration

m: dionysus.ReducedMatrix

(see homology_persistence)

names: list of strings

names of node indices

Returns

pandas.DataFrame