Input and output

The coclust.io module provides functions to load data and to check that it is valid input for a clustering or co-clustering algorithm.

Data loading

The coclust.io.data_loading module provides functions to load data from files of different types.

coclust.io.data_loading.load_doc_term_data(data_filepath, term_labels_filepath=None, doc_labels_filepath=None)

Load co-occurrence data from a .[…]sv or a .mat file.

The expected formats are:

(data_filepath).[...]sv: three […] separated columns:
- 1st line:
  - 1st column: number of documents
  - 2nd column: number of words
- Other lines:
  - 1st column: document index
  - 2nd column: word index
  - 3rd column: word counts

(data_filepath).mat: MATLAB file with fields:
- 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
- 'doc_labels': list of int (len = #docs)
- 'term_labels': list of string (len = #terms)
If the key 'doc_term_matrix' is not found, data loading fails. If either of the keys 'doc_labels' or 'term_labels' is missing, a warning message is displayed.
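The delimited text layout described above can be illustrated with a small standalone parser. This is only a sketch of the file layout, not coclust's loader: the comma delimiter, the 0-based indices, the sample content, and the helper name `parse_triplets` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample in the layout described above, assuming a comma
# delimiter and 0-based indices (neither is confirmed by this page).
SAMPLE = """3,4
0,0,2
0,3,1
1,1,5
2,2,1
2,3,4
"""


def parse_triplets(text, delimiter=","):
    """Parse the header line and the (doc, word, count) rows of a triplet file."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # ignore blank lines
    # First line: number of documents, number of words.
    n_docs, n_words = int(rows[0][0]), int(rows[0][1])
    counts = {}
    # Other lines: document index, word index, word count.
    for doc, word, count in rows[1:]:
        doc, word, count = int(doc), int(word), int(count)
        if not (0 <= doc < n_docs and 0 <= word < n_words):
            raise ValueError(f"index out of range: ({doc}, {word})")
        counts[(doc, word)] = count
    return (n_docs, n_words), counts


shape, counts = parse_triplets(SAMPLE)
print(shape)           # (3, 4)
print(counts[(1, 1)])  # 5
```

Storing only the non-zero counts mirrors the sparse representation that the loader returns, without requiring SciPy for the illustration.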

Term and doc labels can be loaded separately from a one-column .[x]sv|.txt file:

(term_labels_filepath).[x]sv|.txt: one column, one term label per row. The row index is assumed to correspond to the term index in the (columns of the) co-occurrence data matrix.

(doc_labels_filepath).[x]sv|.txt: one column, one document label per row. The row index is assumed to correspond to the non-zero value number read by row from the co-occurrence data matrix.
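A one-column label file of the kind described above maps naturally onto a list whose positions are the row indices. The following sketch uses a hypothetical helper `read_labels` and invented sample labels; it is not the function coclust uses internally.

```python
import io


def read_labels(handle):
    """Return one label per row; the list position is the row index."""
    return handle.read().splitlines()


# Hypothetical content of a term labels file, one label per row.
term_file = io.StringIO("apple\nbanana\ncherry\n")
term_labels = read_labels(term_file)
print(term_labels[1])  # banana
```

Blank rows are deliberately kept rather than skipped, so that a label's position never drifts away from the index it is supposed to match.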

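Parameters:
- data_filepath (string) – Path to the file that contains the co-occurrence data
- term_labels_filepath (string, optional) – Path to a one-column file of term labels
- doc_labels_filepath (string, optional) – Path to a one-column file of document labels
Returns:	a dictionary with the keys:
- `'doc_term_matrix'`: `scipy.sparse.csr_matrix` of shape (#docs, #terms)
- `'doc_labels'`: list of int (#docs)
- `'term_labels'`: list of string (#terms)
Return type:	dict
Raises:	`ValueError` – If the input file is not found or if its content is not correct.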

Example

>>> from coclust.io.data_loading import load_doc_term_data
>>> data = load_doc_term_data('../datasets/classic3.csv')
>>> data['doc_term_matrix'].shape
(3891, 4303)

The coclust.io.input_checking module provides functions to check input matrices.

coclust.io.input_checking.check_array(a, pos=True)

Check that an array contains numeric values and has no empty rows or columns.

Parameters:
- a – The input array
- pos (bool) – If `True`, check that the values are positive
Raises:
- `TypeError` – If the array is not a NumPy/SciPy array or matrix, or if the values are not numeric.
- `ValueError` – If the array contains empty rows or columns, NaN values, or (if `pos` is `True`) negative values.
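The conditions listed above can be mirrored in plain Python on a list of lists. This is a sketch for illustration only: the real function works on NumPy/SciPy objects, the helper name `check_rows_and_columns` is hypothetical, and treating an "empty" row or column as one containing only zeros is an assumption.

```python
import math
from numbers import Real


def check_rows_and_columns(rows, pos=True):
    """Hypothetical pure-Python analogue of the checks described above."""
    for row in rows:
        for value in row:
            if not isinstance(value, Real):
                raise TypeError(f"non-numeric value: {value!r}")
            if math.isnan(value):
                raise ValueError("matrix contains NaN values")
            if pos and value < 0:
                raise ValueError("matrix contains negative values")
    # Assumption: an "empty" row or column is one that holds only zeros.
    if any(all(v == 0 for v in row) for row in rows):
        raise ValueError("matrix contains an empty (all-zero) row")
    if any(all(v == 0 for v in col) for col in zip(*rows)):
        raise ValueError("matrix contains an empty (all-zero) column")


check_rows_and_columns([[1, 0], [0, 2]])  # valid input, returns silently

try:
    check_rows_and_columns([[1, 0], [0, 0]])
except ValueError as err:
    print(err)  # matrix contains an empty (all-zero) row
```

Raising `TypeError` for the wrong kind of value and `ValueError` for bad content follows the split documented in the Raises field.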

coclust.io.input_checking.check_numbers(matrix, n_clusters)

Check if the given matrix has enough rows and columns for the given number of co-clusters.

Parameters:
- matrix – The input matrix
- n_clusters (int) – Number of co-clusters
Raises:	`ValueError` – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_clustering(matrix, n_clusters)

Check if the given matrix has enough rows and columns for the given number of clusters.

Parameters:
- matrix – The input matrix
- n_clusters (int) – Number of clusters
Raises:	`ValueError` – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_non_diago(matrix, n_row_clusters, n_col_clusters)

Check if the given matrix has enough rows and columns for the given number of row and column clusters.

Parameters:
- matrix – The input matrix
- n_row_clusters (int) – Number of row clusters
- n_col_clusters (int) – Number of column clusters
Raises:	`ValueError` – If the data matrix does not have enough rows or columns.
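The size checks above amount to comparing the matrix shape with the requested cluster counts. The sketch below assumes that "enough" means at least as many rows as row clusters and at least as many columns as column clusters; this page does not spell out the exact rule, and the helper name `check_enough` is hypothetical.

```python
def check_enough(shape, n_row_clusters, n_col_clusters=None):
    """Raise ValueError if `shape` is too small for the cluster counts.

    With a single count, the same number is used for rows and columns,
    as in the co-clustering case where both share one number of clusters.
    """
    if n_col_clusters is None:
        n_col_clusters = n_row_clusters
    n_rows, n_cols = shape
    # Assumption: each cluster needs at least one row (or column).
    if n_rows < n_row_clusters or n_cols < n_col_clusters:
        raise ValueError("data matrix does not have enough rows or columns")


check_enough((5, 4), 3)     # 3 co-clusters fit a 5 x 4 matrix
check_enough((5, 4), 5, 4)  # 5 row clusters and 4 column clusters fit

try:
    check_enough((5, 4), 5)  # 5 co-clusters need 5 columns, only 4 exist
except ValueError as err:
    print(err)
```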

coclust.io.input_checking.check_positive(X)

Check if all values are positive.

Parameters:	X (numpy array or scipy sparse matrix) – Matrix to be analyzed
Raises:	`ValueError` – If the matrix contains negative values.
Returns:	X
Return type:	numpy array or scipy sparse matrix

The coclust.io.notebook module provides functions to manage input and output in the evaluation notebook.

coclust.io.notebook.input_with_default_int(prompt, prefill)

Prompt the user for an int.

Parameters:
- prompt (string) – The message printed before the field.
- prefill (int) – The default value.
Returns:	The value entered by the user or the default value.
Return type:	int

coclust.io.notebook.input_with_default_str(prompt, prefill)

Prompt the user for a string.

Parameters:
- prompt (string) – The message printed before the field.
- prefill (string) – The default value.
Returns:	The value entered by the user or the default value.
Return type:	string
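The prompt-with-default behaviour of these two functions can be approximated with the built-in `input`. This is a simplified sketch, not the notebook module's code: the names `ask_int` and `ask_str` and the injectable `reader` argument are hypothetical, and the default is shown in the prompt instead of being prefilled in the input field.

```python
def ask_int(prompt, default, reader=input):
    """Return the int typed by the user, or `default` on an empty answer."""
    raw = reader(f"{prompt} [{default}]: ").strip()
    return int(raw) if raw else default


def ask_str(prompt, default, reader=input):
    """Return the string typed by the user, or `default` on an empty answer."""
    raw = reader(f"{prompt} [{default}]: ").strip()
    return raw if raw else default


# Simulated answers stand in for keyboard input so the sketch runs unattended.
n_clusters = ask_int("Number of clusters", 3, reader=lambda _: "")
path = ask_str("Data file", "data.csv", reader=lambda _: "other.csv")
print(n_clusters, path)  # 3 other.csv
```

Passing the reader as an argument keeps the functions testable; in interactive use the default `reader=input` reads from the keyboard.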