Input and output

The coclust.io module provides functions to load data and to check that it is valid input for a clustering or co-clustering algorithm.

Data loading

The coclust.io.data_loading module provides functions to load data from files of different types.

coclust.io.data_loading.load_doc_term_data(data_filepath, term_labels_filepath=None, doc_labels_filepath=None)[source]

Load co-occurrence data from a .[…]sv or a .mat file.

The expected formats are:

  • (data_filepath).[...]sv: three […] separated columns:

    1st line:
    • 1st column: number of documents
    • 2nd column: number of words
    Other lines:
    • 1st column: document index
    • 2nd column: word index
    • 3rd column: word counts
  • (data_filepath).mat: matlab file with fields:

    • 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
    • 'doc_labels': list of int (len = #docs)
    • 'term_labels': list of string (len = #terms)

    If the key 'doc_term_matrix' is not found, data loading fails. If the key 'doc_labels' or 'term_labels' is missing, a warning message is displayed.
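
As an illustration of the delimited text layout described above, the following sketch writes a tiny file with a header line followed by (document index, word index, count) rows, then reads it back with the standard library only. The comma separator, the 0-based indices, the file name and the helper names write_triplets and read_triplets are assumptions made for this example and are not part of coclust.

```python
import csv
import os
import tempfile


def write_triplets(path, n_docs, n_words, triplets):
    """Write a header line, then one (doc, word, count) row per entry."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)  # assumption: comma-separated columns
        writer.writerow([n_docs, n_words])
        writer.writerows(triplets)


def read_triplets(path):
    """Return ((n_docs, n_words), {(doc, word): count}) from such a file."""
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    n_docs, n_words = int(rows[0][0]), int(rows[0][1])
    counts = {(int(d), int(w)): int(c) for d, w, c in rows[1:]}
    return (n_docs, n_words), counts


# Two documents, three words, three non-zero counts (0-based indices assumed).
example = [(0, 0, 2), (0, 2, 1), (1, 1, 4)]
path = os.path.join(tempfile.mkdtemp(), "toy_doc_term.csv")
write_triplets(path, 2, 3, example)
shape, counts = read_triplets(path)
print(shape, counts)
```

A file laid out this way is the kind of input that would be passed as data_filepath.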

Term and document labels can be loaded separately from a one-column .[x]sv|.txt file:

  • (term_labels_filepath).[x]sv|.txt:
    one column, one term label per row. The row index is assumed to correspond to the term index in the (columns of the) co-occurrence data matrix.
  • (doc_labels_filepath).[x]sv|.txt:
    one column, one document label per row. The row index is assumed to correspond to the non zero value number read by row from the co-occurrence data matrix.
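
To make the row-to-index convention for label files concrete, here is a minimal sketch that writes a one-column label file and reads it back in row order. The file name and the read_labels helper are hypothetical and only serve the illustration; this is not how coclust itself reads the file.

```python
import os
import tempfile


def read_labels(path):
    """Return the labels of a one-column file, one per row, in row order."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


path = os.path.join(tempfile.mkdtemp(), "toy_term_labels.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("apple\nbanana\ncherry\n")

term_labels = read_labels(path)
# Row i of the label file describes column i of the document-term matrix.
print(term_labels[2])
```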
Parameters:data_filepath (string) – Path to the file that contains the co-occurrence data
Returns:
  • 'doc_term_matrix': scipy.sparse.csr_matrix of shape (#docs, #terms)
  • 'doc_labels': list of int (#docs)
  • 'term_labels': list of string (#terms)
Return type:a dictionary
Raises:ValueError – If the input file is not found or if its content is not correct.

Example

>>> from coclust.io.data_loading import load_doc_term_data
>>> data = load_doc_term_data('../datasets/classic3.csv')
>>> data['doc_term_matrix'].shape
(3891, 4303)

Input checking

The coclust.io.input_checking module provides functions to check input matrices.

coclust.io.input_checking.check_array(a, pos=True)[source]

Check that an array contains numeric values and has no empty rows or columns.

Parameters:
  • a – The input array
  • pos (bool) – If True, check that the values are non-negative
Raises:
  • TypeError – If the array is not a NumPy/SciPy array or matrix or if the values are not numeric.
  • ValueError – If the array contains empty rows or columns or contains NaN values, or negative values (if pos is True).
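
The checks listed above can be approximated for a dense two-dimensional NumPy array as in the sketch below. It is not the coclust implementation: the name check_array_sketch is hypothetical, SciPy sparse input is left out, and treating an "empty" row or column as one whose entries are all zero is an assumption.

```python
import numpy as np


def check_array_sketch(a, pos=True):
    """Approximate the documented checks on a dense 2-D NumPy array."""
    if not isinstance(a, np.ndarray):
        raise TypeError("expected a NumPy array")
    if not np.issubdtype(a.dtype, np.number):
        raise TypeError("expected numeric values")
    if np.issubdtype(a.dtype, np.floating) and np.isnan(a).any():
        raise ValueError("array contains NaN values")
    if pos and (a < 0).any():
        raise ValueError("array contains negative values")
    # Assumption: an "empty" row or column is one whose entries are all zero.
    nonzero = a != 0
    if not nonzero.any(axis=1).all() or not nonzero.any(axis=0).all():
        raise ValueError("array contains empty rows or columns")


check_array_sketch(np.array([[1, 0], [0, 2]]))  # passes silently

try:
    check_array_sketch(np.array([[1, 0], [0, 0]]))  # second row is all zeros
except ValueError as error:
    message = str(error)
    print(message)
```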
coclust.io.input_checking.check_numbers(matrix, n_clusters)[source]

Check that the given matrix has enough rows and columns for the given number of co-clusters.

Parameters:
  • matrix – The input matrix
  • n_clusters (int) – Number of co-clusters
Raises:ValueError – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_clustering(matrix, n_clusters)[source]

Check that the given matrix has enough rows and columns for the given number of clusters.

Parameters:
  • matrix – The input matrix
  • n_clusters (int) – Number of clusters
Raises:ValueError – If the data matrix does not have enough rows or columns.

coclust.io.input_checking.check_numbers_non_diago(matrix, n_row_clusters, n_col_clusters)[source]

Check that the given matrix has enough rows and columns for the given number of row and column clusters.

Parameters:
  • matrix – The input matrix
  • n_row_clusters (int) – Number of row clusters
  • n_col_clusters (int) – Number of column clusters
Raises:ValueError – If the data matrix does not have enough rows or columns.
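
A size check of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions rather than the library's code: the name check_numbers_sketch is hypothetical, and "enough" is taken to mean at least as many rows as row clusters and at least as many columns as column clusters.

```python
import numpy as np


def check_numbers_sketch(matrix, n_row_clusters, n_col_clusters):
    """Raise ValueError if there are fewer rows or columns than clusters."""
    n_rows, n_cols = matrix.shape
    # Assumption: "enough" means at least one row (column) per cluster.
    if n_rows < n_row_clusters or n_cols < n_col_clusters:
        raise ValueError("the data matrix does not have enough rows or columns")


matrix = np.ones((5, 3))
check_numbers_sketch(matrix, n_row_clusters=4, n_col_clusters=3)  # accepted

try:
    check_numbers_sketch(matrix, n_row_clusters=4, n_col_clusters=4)
except ValueError as error:
    print(error)
```

Passing the same value for both cluster counts gives the single n_clusters case.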

coclust.io.input_checking.check_positive(X)[source]

Check that all values are non-negative.

Parameters:X (numpy array or scipy sparse matrix) – Matrix to be analyzed
Raises:ValueError – If the matrix contains negative values.
Returns:X
Return type:numpy array or scipy sparse matrix

Jupyter and IPython Notebook utilities

The coclust.io.notebook module provides functions to manage input and output in the evaluation notebook.

coclust.io.notebook.input_with_default_int(prompt, prefill)[source]

Prompt the user for an int.

Parameters:
  • prompt (string) – The message printed before the field.
  • prefill (int) – The default value.
Returns:The value entered by the user or the default value.
Return type:int

coclust.io.notebook.input_with_default_str(prompt, prefill)[source]

Prompt the user for a string.

Parameters:
  • prompt (string) – The message printed before the field.
  • prefill (string) – The default value.
Returns:The value entered by the user or the default value.
Return type:string
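
The prompt-with-default behaviour described for these two functions can be sketched with the built-in input function. This is only an approximation: the name input_with_default and its cast and reader parameters are hypothetical, and showing the default in brackets (rather than prefilling the input field, as the prefill parameter name suggests) is an assumption.

```python
def input_with_default(prompt, default, cast=str, reader=input):
    """Return cast(user input), or the default when the input is blank."""
    raw = reader(f"{prompt} [{default}]: ").strip()
    return cast(raw) if raw else default


# Simulated answers stand in for keyboard input so the sketch runs unattended.
n_clusters = input_with_default("Number of clusters", 3, cast=int,
                                reader=lambda _: "")
name = input_with_default("Dataset", "classic3", reader=lambda _: "cstr")
print(n_clusters, name)
```

Leaving reader at its default makes the function read from the keyboard; injecting it keeps the example testable without user interaction.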