CloudCIX ML API

This page describes a number of stateless ML APIs that can be called from within any CloudCIX project. These API calls are protected by an IP-address-based access list and cannot be used from outside CloudCIX projects.

An Address-based API Key, available from the CloudCIX Membership App, is required to use these APIs so that usage can be billed.

Large Language Models

We support the following Large Language Models (LLMs) for text generation tasks such as summarization and question answering.

  • UCCIX-v2-Llama3.1-70B-Instruct: Our flagship Irish-English bilingual chatbot model, capable of understanding both languages, with state-of-the-art performance on Irish. The model is also 50% more efficient on Irish tokens than other models, reducing computing time and cost. Suitable for most tasks that require advanced performance in Irish or across both languages. Free preview available here.
  • UCCIX-Instruct: The initial version of our Irish-English bilingual chatbot model. Lightweight and requires fewer computing resources. Free preview available here.
  • GPT-4o: Most advanced model from OpenAI, advertised as useful for complex, multi-step tasks.
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B: Large Reasoning Model distilled from DeepSeek-R1 based on Llama-3.3-70B-Instruct.

Sample Usage

Our models are accessible via the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating three lines of code:

  • Point base_url at CloudCIX's endpoint when initializing OpenAI's client.
  • Pass your CloudCIX API key as api_key when initializing OpenAI's client.
  • Pass the corresponding model_id, as listed above.

from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com"
)

chat_completion = client.chat.completions.create(
    model="UCCIX-v2-Llama3.1-70B-Instruct", # or "UCCIX-Instruct" or "GPT-4o" or "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?",
        }
    ],
    stream=True,
    max_tokens=100,
    temperature=0.7,
)

for chunk in chat_completion:
    # The final streamed chunk may carry no content, so guard against None.
    print(chunk.choices[0].delta.content or "", end="")


Recommendations

  • For Irish-related tasks, including knowledge about Ireland and its culture and generating and understanding Irish, we recommend our UCCIX series of models.
  • For the DeepSeek reasoning model, we recommend a temperature between 0.5 and 0.7 and avoiding a system prompt. For more details, see here.
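
The DeepSeek recommendation above can be captured directly in the request parameters. A minimal sketch (the dict simply collects the recommended settings and unpacks into client.chat.completions.create(**request); the prompt text is illustrative):

```python
# Recommended settings for the DeepSeek reasoning model:
# temperature in the 0.5-0.7 range, and no "system" message.
request = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    "messages": [
        # No system prompt: put all instructions in the user turn instead.
        {"role": "user", "content": "Think step by step: what is 17 * 24?"},
    ],
    "temperature": 0.6,  # within the recommended 0.5-0.7 range
    "max_tokens": 512,
}

# Unpacks straight into the client call:
# chat_completion = client.chat.completions.create(**request)
```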

Embedding Models

Encode text as a vector (sequence of numbers) that represents the meaningful concepts within the input content. These vectors can be used for a variety of tasks, such as semantic search, clustering, and classification.

We provide access to the following models:

  • cix_question_encoder: Our flagship Irish-English embedding model, the first model with support for the Irish language, while retaining state-of-the-art performance on English. Use this version to embed questions.
  • cix_chunk_encoder: The same flagship Irish-English embedding model. Use this version to embed longer chunks of text.
  • dragonplus_question: Smaller than our flagship model and good for efficiency. Use this version to embed questions.
  • dragonplus_vector: Smaller than our flagship model and good for efficiency. Use this version to embed longer chunks of text.
  • gte-large-en-v1.5_question_encoder: An open-weight model available here. We recommend cix_question_encoder over this model. Use this version to embed questions.
  • gte-large-en-v1.5_chunk_encoder: An open-weight model available here. We recommend cix_chunk_encoder over this model. Use this version to embed longer chunks of text.
  • use4: An embedding model from Google. Not recommended, as our benchmarks show superior results for our flagship embedding model.

Sample Usage

As with the LLM API, our embedding models are accessible via the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating the same three lines of code and using your CloudCIX API key.


from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com"
)

response = client.embeddings.create(
    model="cix_question_encoder",
    input=["What is PodNet in CloudCIX?", "What is Membership App?"],
    encoding_format="float"
)

print(response.data[0].embedding)
print(response.data[1].embedding)
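
The returned vectors can be compared directly, for example with cosine similarity between a question embedding and a chunk embedding. A minimal sketch in plain Python (the toy vectors stand in for real response.data[...].embedding values):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for response.data[0].embedding, etc.
query_vec = [0.1, 0.3, 0.5]
chunk_vec = [0.2, 0.1, 0.4]
print(cosine_similarity(query_vec, chunk_vec))
```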
        


Embedding Database

The Embedding Database API allows you to store your corpus data for use with retrieval-augmented generation (RAG). You provide text chunks to the API, and the API converts them into vectors and stores them in the database.

Methods Available

  • Create Corpus: create_embeddings
  • Get Corpus: get_corpus
  • Get Sources: get_source
  • Get Chunks: get_chunks
  • Delete Corpus: delete_corpus
  • Delete Source: delete_source
  • Vector Similarity: vector_search
  • Rerank: rerank

    Create Corpus

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: create_embeddings
  • content: A list of objects, each with:
      • chunks: A list of text chunks
      • source: The URL or filename the chunks come from
      • encoder_name: The encoder model used for embeddings

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'create_embeddings',
        'content': [
            {
                'chunks': ['put chunks here'],
                'source': 'Put the url or filename where the chunks are from',
                'encoder_name': 'The encoder model for embeddings',
            },
        ],
    }
            

    Output:

    A success message with status code 200
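
Before calling create_embeddings you need to split your source text into chunks. A naive word-count chunker might look like this (a sketch; the chunk size is an arbitrary choice, not an API requirement):

```python
def chunk_text(text, words_per_chunk=200):
    """Split text into chunks of roughly `words_per_chunk` words each."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# The resulting list can go into the 'chunks' field of the payload above.
chunks = chunk_text("some long document text to be embedded", words_per_chunk=3)
print(chunks)
```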

    Get Corpus

    Input:

  • api_key: The Address API Key
  • method: get_corpus
    data = {
        'api_key': 'put your apikey here',                
        'method': 'get_corpus',
    }
            

    Output:

  • content: A list of lists of Corpus names

    Get Source

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: get_source
    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'get_source',                
    }
            

    Output:

  • content: A list of lists of sources

    Delete Corpus

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: delete_corpus
    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'delete_corpus',                
    }
            

    Output:

    A success message with status code 200

    Delete Source

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: delete_source
  • sources: A list of sources to delete from the Corpus

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'delete_source',
        'sources': ['put sources related to corpus'],
    }
            

    Output:

    A success message with status code 200

    Vector Similarity

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: vector_search
  • encoder_name: The encoder model used for embeddings
  • query: The query text to be matched against stored chunks
  • order_by: The vector similarity calculation method. We support Cosine Similarity, Dot Product and Euclidean Distance
  • threshold: A cutoff such that only sources within the specified (integer value) Euclidean Distance are retrieved. Note: the threshold works only for Euclidean Distance
  • limit: Number of sources to be retrieved

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'vector_search',
        'encoder_name': 'Name of the Encoder Model used for embeddings, supported names are dragon_plus, cix_encoder, test_encoder, use4',
        'query': 'Put your query that needs to be matched with source chunks in the Embedding Database',
        'order_by': 'The method of vector search calculation, supported methods are cosine_similarity, dot_product, euclidean_distance',
        'threshold': 'An integer value, used only with the euclidean_distance order_by method',
        'limit': 'The number of matched sources and respective chunks to be retrieved'
    }
            

    Output:

  • content: A list of lists of sources and chunks

    Rerank

    Input:

  • api_key: The Address API Key
  • method: rerank
  • chunks: A list of chunks to rerank
  • reranker: The name of the LLM used to determine the order of the chunks
  • query: The query against which the chunks are ranked

    data = {
        'api_key': 'put your apikey here',
        'method': 'rerank',
        'chunks': ['put your chunks here'],
        'reranker': 'The name of the LLM used for determining the order of the chunks',
        'query': 'Put your query here',
    }
            

    Output:

  • content: A list of reranked chunks, ordered by similarity to the provided query

    Embedding Database API

    Sample Usage

    
    import requests
    
    # 'data' is any of the method payloads shown in the sections above.
    url = 'https://ml.cloudcix.com/embedding_db/'
    
    response = requests.post(
        url=url,
        json=data,
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    Scraping

    This scraping API allows you to collect data from any public website. At the moment, we support scraping of websites and PDFs (note that the PDF file needs to be available through a public web URL).

    Input:

  • api_key: your CloudCIX API key.
  • list: a list of URLs to PDF documents or webpages.
  • document_type: one of 'pdf', 'pdf_hi_res', or 'html'.

    Output:

    A list of dictionaries with the fields:

    • 'source': the URL of the document.
    • 'error': an error message if the document could not be loaded. Optional.
    • 'page_content': a list of dictionaries with the fields:
      • 'page_number': the page number of the current page for pdf, or 0 for webpages.
      • 'text': the text content of the current page.
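
Given the output shape above, stitching a scraped document back into a single string is straightforward. A sketch over a hand-made sample result (not a live response):

```python
# Sample result in the shape documented above.
result = {
    "source": "https://docs.cloudcix.com",
    "page_content": [
        {"page_number": 0, "text": "CloudCIX documentation home."},
        {"page_number": 0, "text": "Getting started guide."},
    ],
}

def full_text(doc):
    """Concatenate page texts, returning '' for documents that failed to load."""
    if "error" in doc:
        return ""
    return "\n".join(page["text"] for page in doc["page_content"])

print(full_text(result))
```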

    HTML Scraping

    For HTML scraping, we also support filtering out unwanted tags and classes in the HTML through the 'exclusions' variable.

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://docs.cloudcix.com']
    exclusions = {'exclusion_tags': ['script', 'style'], 'exclusion_classes': ['footer', 'header']}
    document_type = 'html'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'exclusions': exclusions,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    Basic PDF Scraping

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
    document_type = 'pdf'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    High Resolution PDF Scraping

    High-resolution PDF scraping uses an advanced algorithm that preserves the layout and structure of complex documents. Unlike basic PDF scraping, this approach supports multi-column formats, tables, and images by generating HTML with <tr>, <td>, and other relevant tags for structural fidelity. Although image data is detected, it is excluded from the final output to focus on text and tabular content for streamlined processing.
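
Because pdf_hi_res represents tables as HTML <tr>/<td> markup, a table can be recovered from the returned text with the standard library parser. A sketch over a hand-made fragment (not real API output):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect HTML table rows as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

# Hand-made fragment in the style of pdf_hi_res table output.
parser = TableExtractor()
parser.feed("<table><tr><td>Model</td><td>Score</td></tr>"
            "<tr><td>UCCIX</td><td>0.91</td></tr></table>")
print(parser.rows)
```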

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
    document_type = 'pdf_hi_res'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)


    Parsing

    This parsing API allows you to parse and convert data from files into a machine-readable format (HTML).

    Input:

    • data: a dictionary with 2 keys:
      • api_key: your CloudCIX API key.
      • filenames: list of file names that you are sending to the API.
    • files: list of files' byte-objects.

    Output:

    A list of dictionaries with the fields:

    • 'source': the file name of the document.
    • 'error': an error message if the document could not be loaded. Optional.
    • 'page_content': a list of dictionaries with the fields:
      • 'page_number': the page number of the current page for pdf, or None for other files.
      • 'text': the text content of the current page.
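
The parsing output mirrors the scraping output, so per-file error handling follows the same pattern. A sketch over a hand-made response body:

```python
# Hand-made response body in the documented shape.
results = [
    {"source": "test.html",
     "page_content": [{"page_number": None, "text": "Hello from HTML."}]},
    {"source": "broken.pdf", "error": "could not load document"},
]

texts = {}
for doc in results:
    if "error" in doc:
        # Report failures instead of silently dropping them.
        print(f"failed: {doc['source']}: {doc['error']}")
        continue
    texts[doc["source"]] = " ".join(p["text"] for p in doc["page_content"])

print(texts)
```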

    Supported File Types

    Sample Usage

    
    import requests
    import json
    import os
    
    url = 'https://ml.cloudcix.com/parsing/'
    api_key = 'API_KEY_HERE'
    
    data = {
        'data': json.dumps({'api_key': api_key, 'filenames': ['test.html', '2405.13010v1.pdf']}),
    }
    
    files = [
        ('file', open('test.html', 'rb')),
        ('file', open('2405.13010v1.pdf', 'rb')),
    ]
    
    response = requests.post(url, data=data, files=files)
    
    print(response.status_code)
    print(response.text)