Data addition & Settings

Data addition from GUI

Upon accessing the project dashboard, your initial task is to upload pertinent data sets. These may encompass data utilized for training, testing, validation, production, or any other data integral to your project's scope and requirements.

Note: The first dataset in a project must be uploaded from the dashboard, even if you plan to use the API afterwards. Once this sample dataset is uploaded, define the required data settings. Additional datasets can then be uploaded through the API.

To start the upload, select the upload type from the dropdown menu: 'Data', 'Data Description', or 'Feature mapping'. With 'Data Description', users can add descriptions for the data columns.

For data uploads, select the 'Upload Tag' from the dropdown to specify the data type: Training, Testing, or Validation. You can also add a custom tag.

Note: You can only upload one file at a time, and the file can only be in CSV format.

Users can upload a file either by dragging and dropping it or by selecting the CSV file directly. After adding the file, select 'Upload File' to start the upload.

Note: CSV files are limited to 200 MB on the default server and 1 GB on a custom server for the workspace or project.
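
The format and size limits above can be checked locally before uploading. The following is a minimal sketch and not part of AryaXAI: the helper name `check_upload_file` is hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import os

# Size limits from the note above; treating 1 MB as 1024 * 1024 bytes is an assumption.
DEFAULT_SERVER_LIMIT = 200 * 1024 * 1024
CUSTOM_SERVER_LIMIT = 1024 * 1024 * 1024


def check_upload_file(path, custom_server=False):
    """Hypothetical helper: return a list of problems with the file (empty if none)."""
    problems = []
    if not path.lower().endswith(".csv"):
        problems.append("file is not a CSV")
    limit = CUSTOM_SERVER_LIMIT if custom_server else DEFAULT_SERVER_LIMIT
    size = os.path.getsize(path)
    if size > limit:
        problems.append(f"file is {size} bytes, above the {limit}-byte limit")
    return problems
```

An empty list means the file passes both checks; the platform itself may apply further validation.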

Tip: If you receive an error message stating that the file already exists, go to the 'File Info' section and delete the existing file if its processing has not completed.

Once the upload is complete, you will be directed to ‘Data Config’ to configure the details.  

Data Config

Data Config holds the high-level details that all subsequent operations in the project depend on. It cannot be changed once set.

  1. Specify the project type: classification or regression.
  2. Define the ‘Unique identifier’ - Choose the identifier that is unique to each data point in the dataset. It distinguishes individual data entries and aids in data management and analysis.
  3. Select the true label - Identify the true label, which represents the target variable to be predicted. For instance, in a real estate dataset, the true label could be the 'Sale Price' of a property.
  4. If applicable, choose the predicted label from the dropdown menu. This step is important when evaluating predictions generated by an existing model.
  5. Feature exclusion: Select the features (columns) to exclude. Your data might contain features that are not relevant to your project; you can exclude them with the ‘Features exclude’ option. All included and excluded features are listed on the right.

Note: If your data contains duplicate unique identifiers, you can drop them by selecting the checkbox.
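
To see in advance whether a file contains duplicate unique identifiers, a check along the following lines can be run locally. This is only an illustrative sketch using the Python standard library; the function name `duplicate_ids` and the sample columns are hypothetical, and it does not describe how the platform detects duplicates.

```python
import csv
import io
from collections import Counter


def duplicate_ids(csv_text, id_column):
    """Return the identifier values that appear more than once, in first-seen order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column] for row in reader)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample: the identifier "1" appears twice.
sample = "Id,SalePrice\n1,200000\n2,150000\n1,210000\n"
print(duplicate_ids(sample, "Id"))  # prints ['1']
```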

True and Predicted label

The predicted label is essential if you intend for the XAI model to explain the predictions generated by your model. If the predicted label is not explicitly defined, AryaXAI will automatically select the true label to construct the XAI model.

Note: When defining data features, specifically the data settings, it should be noted that these settings serve as the foundation for training the explainable model. The feature selection conducted during this stage should closely align with the final set of features utilized in your model. This alignment ensures consistency and accuracy in the interpretability analysis provided by AryaXAI.

The ‘Features’ section displays the data type. The platform starts analyzing the data and creates an explainability model for you.  

Note: Until the XAI model is trained, the explanations (Feature importance) show ‘nan’ as values. In the meantime, you can upload new files or open the case view pages. Once the model is trained, the XAI model results appear on all these pages.

Once you submit the Data Config, you will be directed to the 'Project Summary' page, which is your project homepage. This page displays 3 tabs - Summary, Data Diagnostics and Model Diagnostics.

Summary

The Summary tab displays:

  • Total data volume and Unique features
  • Overview of data uploaded, which you can filter based on tag
  • Volume graph, which gives an overview of data upload activity over time. Users can plot data activity by parameters such as user tag, feature name, date feature name, range, and plot type.
  • Model Info

Data diagnostics

Once you upload data, AryaXAI automatically performs a comprehensive analysis of the added datasets for your initial file. This analysis includes data profiling, data modeling, and explainability. You can easily perform these tasks manually at a later stage if needed.

Selecting the 'Refresh Data' tab located at the top right corner of the interface invokes the data profiling task. This will display a dropdown menu where you can select the desired compute server. After choosing the appropriate compute server, click to refresh your data. 

Within the Data Diagnostics tab, the Data Summary table offers an overview of the total data volume and unique features.

The Data Warning section highlights any inconsistencies detected in the uploaded analytical data. These warnings encompass various issues, including missing data, high feature correlation, high cardinality, and more.
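
To make the kinds of warnings concrete, the sketch below flags columns with missing data or high cardinality in a small CSV. It is an illustration only: the thresholds and the function name `column_warnings` are assumptions, and the checks AryaXAI actually runs are not specified here.

```python
import csv
import io


def column_warnings(csv_text, max_missing=0.2, max_cardinality=0.9):
    """Flag columns with many missing values or a high share of distinct values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    warnings = {}
    if not rows:
        return warnings
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        missing_ratio = 1 - len(present) / len(values)
        distinct_ratio = len(set(present)) / len(present) if present else 0.0
        flags = []
        if missing_ratio > max_missing:  # assumed threshold
            flags.append("missing data")
        if distinct_ratio > max_cardinality:  # assumed threshold
            flags.append("high cardinality")
        if flags:
            warnings[column] = flags
    return warnings
```

Note that a unique identifier column is flagged as high cardinality by construction, which is expected.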

Model Performance

This section displays a benchmarking for the different models you have created. 

The ‘Model stability’ table in the Model performance tab displays the same model details as seen in the AutoML section. This section displays the performance of all models that are currently in production or staged for production. 

In the 'Data Stability' section, users can assess data drift between two datasets. After uploading your initial training data, if you upload a second file containing test data, a data drift report is generated automatically. This report provides insights into the differences and stability between the training and test datasets.

By selecting the Baseline and Current tags, users can conduct a detailed comparison, which includes features, detected drift, method, feature type, drift score, and more.
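
As an illustration of what a drift score between a baseline and a current dataset can look like, the sketch below computes the Population Stability Index (PSI) for one numeric feature. PSI is used here only as a common example; the drift methods AryaXAI applies are not specified in this guide, and the binning choices are assumptions.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples (higher means more drift)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

Identical samples give a score of zero, and the score grows as the current distribution moves away from the baseline.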

Data Settings

In the data settings section, there are four tabs - Upload data, Model upload, File info, and Data settings. Here, users can upload, delete and manage the uploaded data files efficiently. 

Data upload

Users can upload data from this section as well. When defining data features, specifically the data settings, it should be noted that these become the base for explainable model training. Therefore, the feature selection process should align with the final features employed in the model.

Model upload

To upload your own model, you need to:

  • Define the model name: Provide a name for your model.
  • Specify Model Architecture: Indicate whether the model is based on machine learning or deep learning (deep learning support is coming soon).
  • Specify Training and Testing Data: Provide the datasets to be used for training and testing the model.
Note: If testing data is not provided, AryaXAI automatically selects a random sub-sample of the training data to use as testing data. This allows AryaXAI to benchmark your uploaded model against a subset of the training data.
Note: Ensure that all features used in your model are present in the data you upload, i.e., the features in the model and in the uploaded file need to be the same.
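
The random sub-sampling mentioned in the note can be pictured as a simple hold-out split. The following is a sketch under stated assumptions: the 20% fraction, the fixed seed, and the function name `holdout_split` are illustrative and do not describe AryaXAI's internal procedure.

```python
import random


def holdout_split(row_ids, test_fraction=0.2, seed=0):
    """Randomly set aside a fraction of training rows to act as testing data."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```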

For details on uploading and creating models, refer to the Modeling section.

File Info

The File Info tab presents a comprehensive list of uploaded files, including additional upload details such as the user responsible for the upload, data file type, tag, creation date and time. Users can also delete uploaded files directly from this tab.

Data settings

The Data settings tab showcases the data configuration details set while uploading the data. Here, users can see details like base model being used, features used and excluded in the project, etc.

To modify the data settings, select the ‘Update config’ option in ‘Data settings’. Whenever the settings are modified, the explainability model is retrained.

Data addition from API

First, get the API token for the project. It is available at Workspace > Projects > Documentation.

The project token (and Client Id) is accessible only to the Admins of the particular project. You can refresh your API token through the ‘Refresh token’ option provided beside the Client Id.

Below this, the project URL for uploading the data is displayed, along with a Python script that you can run directly in your own environment.

The header XAI token needs to be defined, whereas the Client Id and project name are automatically defined. 

 
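# Request header carrying the project's secret access token
headers = {
    "x-api-token": "<your project access secret token>",
}
# Client Id and project name are filled in automatically on the Documentation page
base_format = {
    "client_id": "test_user_arya",
    "project_name": "Risk-monitoring_FW6FSKJQRE",
}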
 

Next, prepare the data as a dictionary (you can upload multiple data points as a list of dictionaries).

Define the unique identifier for the data:

 
 "unique_identifier": 
 

For a single data point, the unique identifier can be passed as a string. For multiple data points, you need to pass a list of unique identifiers.

Similarly, a single data point (with one unique identifier and 3-4 columns) can be passed directly through the API. For multiple data points, however, a list of unique identifiers and columns needs to be created.
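
The single-versus-multiple distinction can be handled with a small normalizing step before building the request. This is a hedged sketch: the function name `as_data_points` and the sample column names are hypothetical, and how the data points are attached to the request body is not shown in this guide, so follow the script on the Documentation page for that part.

```python
def as_data_points(data, id_column):
    """Accept one data point (a dict) or many (a list of dicts) and return a validated list."""
    points = [data] if isinstance(data, dict) else list(data)
    seen = set()
    for point in points:
        if id_column not in point:
            raise ValueError(f"data point is missing {id_column!r}")
        if point[id_column] in seen:
            raise ValueError(f"duplicate identifier {point[id_column]!r}")
        seen.add(point[id_column])
    return points


# Hypothetical data points with one identifier and a few columns each.
single = {"Id": "A-1", "LotArea": 8450, "YearBuilt": 2003}
batch = [single, {"Id": "A-2", "LotArea": 9600, "YearBuilt": 1976}]
print(len(as_data_points(single, "Id")), len(as_data_points(batch, "Id")))  # prints 1 2
```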

Finally, you will need to give the post request:

 
import requests
resp = requests.post(url, headers=headers, json=base_format)

Every POST request returns a response that acknowledges whether the data was uploaded successfully, so you are kept updated on the status.
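
When many data points are uploaded, it can help to send them in batches and record the status of each request. The sketch below is one possible pattern, not an AryaXAI feature: `post_in_batches`, the batch size, and the injected `send` callable are all assumptions.

```python
def post_in_batches(records, send, batch_size=100):
    """Send records in fixed-size batches and collect one status entry per request."""
    statuses = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        status_code = send(batch)  # `send` performs the actual POST and returns the HTTP status code
        statuses.append({"first_row": start, "rows": len(batch), "ok": 200 <= status_code < 300})
    return statuses
```

Here `send` would wrap the `requests.post` call from this section and return `resp.status_code`; keeping it as a parameter lets the batching logic be tested without network access.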

Data addition from SDK

To upload data, pass the file path and the tag.

Note: If you are uploading data for the first time, you need to pass Config as well.

Data can be uploaded to the project either directly with a file or by passing the Pandas DataFrame. 

To configure the details in ‘Project config’ and upload data through our SDK, you can use the following commands:


config = {
    "project_type": "classification",  # prediction type of the project: "classification" or "regression"
    "unique_identifier": "Id",  # unique identifier for your project
    "true_label": "SaleCondition",  # target label
    "pred_label": "",  # predicted label, in case you have one
    "feature_exclude": [],  # features you don't want the AryaXAI surrogate model to use for modelling
}

# Data is differentiated using the tag
tag = 'Training'

# Upload the data into the project. This will also build the initial ML model.
project.upload_data('file_path', tag, config)
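
Before calling `upload_data` for the first time, the config can be checked against the column names of the file. The following is a minimal sketch and not part of the SDK: `validate_config` is a hypothetical helper, and the choice of which keys are mandatory is an assumption based on the Data Config steps described earlier.

```python
# Keys assumed to be mandatory; "pred_label" is treated as optional.
REQUIRED_KEYS = ("project_type", "unique_identifier", "true_label")


def validate_config(config, columns):
    """Return a list of problems with a data config, given the dataset's column names."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
    project_type = config.get("project_type")
    if project_type and project_type not in ("classification", "regression"):
        problems.append("project_type must be 'classification' or 'regression'")
    for key in ("unique_identifier", "true_label", "pred_label"):
        name = config.get(key)
        if name and name not in columns:
            problems.append(f"{key} column not found: {name}")
    for name in config.get("feature_exclude", []):
        if name not in columns:
            problems.append(f"excluded feature not found: {name}")
    return problems
```

An empty list means the config refers only to columns that exist in the file.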

Once the data is uploaded, you can also view the files and file info through the SDK.


#Check the files that are uploaded in the project.
project.files()

Additional functions:


#You can get the summary for the specific file: Missing values, Max/Min, Data type.
project.file_summary()

#To know all the settings: Data, Data Encoding & Model params
project.config()

project.all_tags()


Additionally, you can also delete the uploaded file:


#project.delete_file('file_name')

To fetch all tags that the user has uploaded:


project.tags()

SDK operations: Data Summary and Diagnostics

Accessing the Data Summary table:

To view the summary of data features and the data types through SDK:


# data summary
project.data_observations('Training') # Can pass any Tag for getting Summary

# data diagnosis
project.data_warnings('Training')

To fetch the default data drift report generated after uploading training and test data:


project.data_drift_diagnosis()

To view the data drift report between tags:


project.data_drift_diagnosis(['Training'],['Testing'])

SDK operations: Model upload

Before uploading a model, ensure that the corresponding features have already been uploaded through the data upload process.


project.upload_model(model_path='/content/xgb_sample_model.pkl',
                     model_name = 'XgbCustom',
                     model_data_tags = ['Training'],
                     model_type='Xgboost',
                     model_architecture='machine_learning')
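
Because the model's features must already exist in the uploaded data, a quick comparison of feature names before calling `upload_model` can catch mismatches early. This is an illustrative sketch only: `missing_model_features` is a hypothetical helper, and how you obtain the model's feature names depends on your modelling library.

```python
def missing_model_features(model_features, data_columns, excluded=()):
    """Return model features that are absent from the uploaded data or were excluded."""
    available = set(data_columns) - set(excluded)
    return [name for name in model_features if name not in available]
```

An empty result means every feature the model expects is available in the uploaded data.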

To see the help for the upload_model method (in module aryaxai.core.project):


help(project.upload_model)

View uploaded model info


project.models()

Delete uploaded file:


project.delete_file('file name')
