Input and output
The coclust.io
module provides functions to load data and check that it
is valid input for a clustering or co-clustering algorithm.
Data loading
The coclust.io.data_loading
module provides functions to load data
from files of different types.
-
coclust.io.data_loading.
load_doc_term_data
(data_filepath, term_labels_filepath=None, doc_labels_filepath=None) Load co-occurrence data from a .[…]sv or a .mat file.
The expected formats are:
(data_filepath).[...]sv: three […] separated columns:
- 1st line:
  - 1st column: number of documents
  - 2nd column: number of words
- Other lines:
  - 1st column: document index
  - 2nd column: word index
  - 3rd column: word count
(data_filepath).mat: MATLAB file with fields:
- 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
- 'doc_labels': list of int (len = #docs)
- 'term_labels': list of string (len = #terms)
If the key 'doc_term_matrix' is not found, data loading fails. If the key 'doc_labels' or 'term_labels' is missing, a warning message is displayed.
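The key handling described above can be sketched with SciPy's own .mat reader. This is a minimal illustration, not the coclust implementation: read_mat_fields is a hypothetical helper, and the toy file only contains the 'doc_term_matrix' field so that both label keys are reported as missing.

```python
import os
import tempfile
import warnings

import numpy as np
from scipy.io import loadmat, savemat
from scipy.sparse import csr_matrix, issparse


def read_mat_fields(path):
    """Hypothetical helper (not part of coclust) mirroring the key checks."""
    content = loadmat(path)
    # The matrix is mandatory: fail if its key is absent.
    if "doc_term_matrix" not in content:
        raise ValueError("key 'doc_term_matrix' not found")
    # Labels are optional: only warn when a key is absent.
    missing = [k for k in ("doc_labels", "term_labels") if k not in content]
    for key in missing:
        warnings.warn(f"key '{key}' not found")
    return content["doc_term_matrix"], missing


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mat")
    # Toy 2-document, 3-term matrix stored without any label field.
    toy = csr_matrix(np.array([[4.0, 0.0, 1.0], [0.0, 7.0, 0.0]]))
    savemat(path, {"doc_term_matrix": toy})
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # keep the demo output quiet
        matrix, missing = read_mat_fields(path)
```

Note that SciPy may return the matrix in a sparse format other than CSR, so a caller that needs CSR would convert it explicitly.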
Term and doc labels can be loaded separately from a one-column .[x]sv|.txt file:
- (term_labels_filepath).[x]sv|.txt:
- one column, one term label per row. The row index is assumed to correspond to the term index in the (columns of the) co-occurrence data matrix.
- (doc_labels_filepath).[x]sv|.txt:
- one column, one document label per row. The row index is assumed to correspond to the non zero value number read by row from the co-occurrence data matrix.
Parameters: data_filepath (string) – Path to the file that contains the co-occurrence data
Returns:
- 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
- 'doc_labels': list of int (#docs)
- 'term_labels': list of string (#terms)
Return type: a dictionary
Raises: ValueError – If the input file is not found or if its content is not correct.
Example
>>> dict = load_doc_term_data('../datasets/classic3.csv')
>>> dict['doc_term_matrix'].shape
(3891, 4303)
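To make the delimited text layout expected by load_doc_term_data concrete, here is a small standard-library sketch that parses a header line followed by (document, word, count) triplets. It is illustrative only: parse_triplets is a hypothetical helper rather than a coclust function, the comma delimiter is one possible choice, and the indices are assumed to be 0-based, which the documentation above does not state.

```python
import csv
import io


def parse_triplets(text, delimiter=","):
    """Parse 'header line + triplets' text into a sparse dict.

    Hypothetical helper, not part of coclust. Indices are assumed 0-based.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        raise ValueError("empty input")
    # 1st line: number of documents, number of words.
    n_docs, n_words = int(rows[0][0]), int(rows[0][1])
    counts = {}
    # Other lines: document index, word index, word count.
    for doc, word, count in rows[1:]:
        doc, word, count = int(doc), int(word), int(count)
        if not (0 <= doc < n_docs and 0 <= word < n_words):
            raise ValueError(f"index out of range: ({doc}, {word})")
        counts[(doc, word)] = counts.get((doc, word), 0) + count
    return n_docs, n_words, counts


sample = "2,3\n0,0,4\n0,2,1\n1,1,7\n"
n_docs, n_words, counts = parse_triplets(sample)
```

The resulting dictionary maps (document, word) pairs to counts, which is the information a sparse matrix of shape (#docs, #terms) would hold.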
Input checking
The coclust.io.input_checking
module provides functions to check
input matrices.
-
coclust.io.input_checking.
check_array
(a, pos=True) Check that an array contains numeric values and has no empty rows or columns.
Parameters: - a – The input array
- pos (bool) – If True, check that no value is negative
Raises:
- TypeError – If the array is not a NumPy/SciPy array or matrix, or if the values are not numeric.
- ValueError – If the array contains empty rows or columns, NaN values, or negative values (if pos is True).
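The checks listed for check_array can be approximated for dense NumPy arrays as follows. This is a sketch under stated assumptions, not the library's code: check_dense_array is a hypothetical name, only 2-D dense arrays are handled, and an "empty" row or column is taken to mean one that contains only zeros.

```python
import numpy as np


def check_dense_array(a, pos=True):
    """Hypothetical dense-only version of the checks described above."""
    if not isinstance(a, np.ndarray):
        raise TypeError("expected a NumPy array")
    if not np.issubdtype(a.dtype, np.number):
        raise TypeError("expected numeric values")
    if np.isnan(a).any():
        raise ValueError("NaN values found")
    if pos and (a < 0).any():
        raise ValueError("negative values found")
    # Assumption: a row or column is "empty" when all its entries are zero.
    nonzero = a != 0
    if not nonzero.any(axis=1).all() or not nonzero.any(axis=0).all():
        raise ValueError("empty rows or columns found")


check_dense_array(np.array([[1, 0], [0, 2]]))  # passes silently

try:
    check_dense_array(np.array([[1, 0], [0, 0]]))  # second row is all zeros
    message = ""
except ValueError as err:
    message = str(err)
```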
-
coclust.io.input_checking.
check_numbers
(matrix, n_clusters) Check if the given matrix has enough rows and columns for the given number of co-clusters.
Parameters: - matrix – The input matrix
- n_clusters (int) – Number of co-clusters
Raises: ValueError – If the data matrix does not have enough rows or columns.
-
coclust.io.input_checking.
check_numbers_clustering
(matrix, n_clusters) Check if the given matrix has enough rows and columns for the given number of clusters.
Parameters: - matrix – The input matrix
- n_clusters (int) – Number of clusters
Raises: ValueError – If the data matrix does not have enough rows or columns.
-
coclust.io.input_checking.
check_numbers_non_diago
(matrix, n_row_clusters, n_col_clusters) Check if the given matrix has enough rows and columns for the given number of row and column clusters.
Parameters: - matrix – The input matrix
- n_row_clusters (int) – Number of row clusters
- n_col_clusters (int) – Number of column clusters
Raises: ValueError – If the data matrix does not have enough rows or columns.
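The size checks above do not spell out what "enough" means. One plausible reading, shown here purely as an assumption, is that the matrix needs at least one row per row cluster and one column per column cluster. The helper name require_min_shape is hypothetical and is not part of coclust.

```python
import numpy as np


def require_min_shape(matrix, n_row_clusters, n_col_clusters):
    """Hypothetical check: one row per row cluster, one column per column cluster."""
    n_rows, n_cols = matrix.shape
    if n_rows < n_row_clusters or n_cols < n_col_clusters:
        raise ValueError(
            f"matrix of shape {matrix.shape} is too small for "
            f"{n_row_clusters} row clusters and {n_col_clusters} column clusters"
        )


matrix = np.ones((4, 3))
require_min_shape(matrix, 2, 3)  # passes silently

try:
    require_min_shape(matrix, 5, 3)  # more row clusters than rows
    too_small = False
except ValueError:
    too_small = True
```

Under the same assumption, the single n_clusters variants would correspond to passing the same number for both arguments.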
Jupyter and IPython Notebook utilities
The coclust.io.notebook
module provides functions to manage input and
output in the evaluation notebook.