CloudCIX ML API

This page describes a number of stateless ML APIs that can be called from within any CloudCIX project. These API calls are protected by an IP-address-based access list and cannot be used from outside CloudCIX projects.

An Address-based API Key, available from the CloudCIX Membership App, is required to use these APIs so that usage can be billed.

Large Language Models

We support the following Large Language Models (LLMs) for text generation tasks such as summarization and question answering.

  • UCCIX-v2-Llama3.1-70B-Instruct: Our flagship Irish-English bilingual chatbot model, capable of understanding both languages, with state-of-the-art performance on Irish. The model is also 50% more efficient on Irish tokens than other models, reducing computing time and cost. Suitable for most tasks that require advanced performance in Irish or across both languages. Free preview available here.
  • UCCIX-Instruct: The initial version of our Irish-English bilingual chatbot model. Lightweight and requires fewer computing resources. Free preview available here.
  • GPT-4o: Most advanced model from OpenAI, advertised as useful for complex, multi-step tasks.
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B: Large Reasoning Model distilled from DeepSeek-R1 based on Llama-3.3-70B-Instruct.

Sample Usage

Our models are accessible via the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating three lines of code:

  • Point base_url at CloudCIX's endpoint when initializing OpenAI's client.
  • Pass your CloudCIX API key as api_key when initializing OpenAI's client.
  • Pass the corresponding model_id, as listed above.

from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com"
)

chat_completion = client.chat.completions.create(
    model="UCCIX-v2-Llama3.1-70B-Instruct", # or "UCCIX-Instruct" or "GPT-4o" or "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?",
        }
    ],
    stream=True,
    max_tokens=100,
    temperature=0.7,
)

for chunk in chat_completion:
    # The final streamed chunk may carry no content, so guard against None.
    print(chunk.choices[0].delta.content or "", end="")


Recommendations

  • For Irish-related tasks, including knowledge about Ireland and its culture and generating and understanding Irish, we recommend our UCCIX series of models.
  • For the DeepSeek reasoning model, we recommend a temperature between 0.5 and 0.7 and avoiding a system prompt. For more details, see here.
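
The DeepSeek recommendation above can be captured directly in the request parameters. A minimal sketch (the dict simply collects the recommended settings and unpacks into client.chat.completions.create(**request); the prompt text is illustrative):

```python
# Recommended settings for the DeepSeek reasoning model:
# temperature in the 0.5-0.7 range, and no "system" message.
request = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    "messages": [
        # No system prompt: put all instructions in the user turn instead.
        {"role": "user", "content": "Think step by step: what is 17 * 24?"},
    ],
    "temperature": 0.6,  # within the recommended 0.5-0.7 range
    "max_tokens": 512,
}

# Unpacks straight into the client call:
# chat_completion = client.chat.completions.create(**request)
```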

Embedding Models

Encode text as a vector (sequence of numbers) that represents the meaningful concepts within the input content. These vectors can be used for a variety of tasks, such as semantic search, clustering, and classification.

We provide access to the following models:

  • cix_question_encoder: Our flagship Irish-English embedding model, the first model with support for the Irish language, while retaining state-of-the-art performance on English. Use this version to embed questions.
  • cix_chunk_encoder: The same flagship Irish-English embedding model. Use this version to embed longer chunks of text.
  • dragonplus_question: Smaller than our flagship model and good for efficiency. Use this version to embed questions.
  • dragonplus_vector: Smaller than our flagship model and good for efficiency. Use this version to embed longer chunks of text.
  • gte-large-en-v1.5_question_encoder: An open-weight model available here. We recommend cix_question_encoder over this model. Use this version to embed questions.
  • gte-large-en-v1.5_chunk_encoder: An open-weight model available here. We recommend cix_chunk_encoder over this model. Use this version to embed longer chunks of text.
  • use4: An embedding model from Google. Not recommended, as our benchmarks show superior results for our flagship embedding model.

Sample Usage

As with the LLM API, our embedding models are accessible via the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating the same three lines of code and using your CloudCIX API key.


from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com"
)

response = client.embeddings.create(
    model="cix_question_encoder",
    input=["What is PodNet in CloudCIX?", "What is Membership App?"],
    encoding_format="float"
)

print(response.data[0].embedding)
print(response.data[1].embedding)
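
The returned vectors can be compared directly, for example with cosine similarity between a question embedding and a chunk embedding. A minimal sketch in plain Python (the toy vectors stand in for real response.data[...].embedding values):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for response.data[0].embedding, etc.
query_vec = [0.1, 0.3, 0.5]
chunk_vec = [0.2, 0.1, 0.4]
print(cosine_similarity(query_vec, chunk_vec))
```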
        


Embedding Database

The Embedding Database API allows you to store your corpus data for use with retrieval-augmented generation (RAG). You provide text chunks to the API, and the API converts them into vectors and stores them in the database.

Methods Available

  • Create Corpus: create_embeddings
  • Get Corpus: get_corpus
  • Get Sources: get_source
  • Get Chunks: get_chunks
  • Delete Corpus: delete_corpus
  • Delete Source: delete_source
  • Vector Similarity: vector_search
  • Rerank: rerank

    Create Corpus

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: create_embeddings
  • content: A list of objects, each with:
      • chunks: A list of text chunks
      • source: The URL or filename the chunks come from
      • encoder_name: The encoder model used for embeddings

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'create_embeddings',
        'content': [
            {
                'chunks': ['put chunks here'],
                'source': 'Put the url or filename where the chunks are from',
                'encoder_name': 'The encoder model for embeddings',
            },
        ],
    }
            

    Output:

    A success message with status code 200
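
Before calling create_embeddings you need to split your source text into chunks. A naive word-count chunker might look like this (a sketch; the chunk size is an arbitrary choice, not an API requirement):

```python
def chunk_text(text, words_per_chunk=200):
    """Split text into chunks of roughly `words_per_chunk` words each."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# The resulting list can go into the 'chunks' field of the payload above.
chunks = chunk_text("some long document text to be embedded", words_per_chunk=3)
print(chunks)
```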

    Get Corpus

    Input:

  • api_key: The Address API Key
  • method: get_corpus
    data = {
        'api_key': 'put your apikey here',                
        'method': 'get_corpus',
    }
            

    Output:

  • content: A list of lists of Corpus names

    Get Source

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: get_source
    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'get_source',                
    }
            

    Output:

  • content: A list of lists of sources

    Delete Corpus

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: delete_corpus
    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'delete_corpus',                
    }
            

    Output:

    A success message with status code 200

    Delete Source

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: delete_source
  • sources: A list of sources to delete from the Corpus

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'delete_source',
        'sources': ['put sources related to corpus'],
    }
            

    Output:

    A success message with status code 200

    Vector Similarity

    Input:

  • api_key: The Address API Key
  • name: Name of the Corpus
  • method: vector_search
  • encoder_name: The encoder model used for embeddings
  • query: The query text to be matched against stored chunks
  • order_by: The vector similarity calculation method. We support Cosine Similarity, Dot Product and Euclidean Distance
  • threshold: A cutoff such that only sources within the specified (integer value) Euclidean Distance are retrieved. Note: the threshold works only for Euclidean Distance
  • limit: Number of sources to be retrieved

    data = {
        'api_key': 'put your apikey here',
        'name': 'Name of the Corpus',
        'method': 'vector_search',
        'encoder_name': 'Name of the Encoder Model used for embeddings, supported names are dragon_plus, cix_encoder, test_encoder, use4',
        'query': 'Put your query that needs to be matched with source chunks in the Embedding Database',
        'order_by': 'The method of vector search calculation, supported methods are cosine_similarity, dot_product, euclidean_distance',
        'threshold': 'An integer value, used only with the euclidean_distance order_by method',
        'limit': 'The number of matched sources and respective chunks to be retrieved'
    }
            

    Output:

  • content: A list of lists of sources and chunks

    Rerank

    Input:

  • api_key: The Address API Key
  • method: rerank
  • chunks: A list of chunks to rerank
  • reranker: The name of the LLM used to determine the order of the chunks
  • query: The query against which the chunks are ranked

    data = {
        'api_key': 'put your apikey here',
        'method': 'rerank',
        'chunks': ['put your chunks here'],
        'reranker': 'The name of the LLM used for determining the order of the chunks',
        'query': 'Put your query here',
    }
            

    Output:

  • content: A list of reranked chunks, ordered by similarity to the provided query

    Embedding Database API

    Sample Usage

    
    import requests
    
    # 'data' is any of the method payloads shown in the sections above.
    url = 'https://ml.cloudcix.com/embedding_db/'
    
    response = requests.post(
        url=url,
        json=data,
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    Scraping

    This scraping API allows you to collect data from any public website. At the moment, we support scraping of websites and PDFs (note that the PDF file needs to be available through a public web URL).

    Input:

  • api_key: your CloudCIX API key.
  • list: a list of URLs to PDF documents or webpages.
  • document_type: one of 'pdf', 'pdf_hi_res', or 'html'.

    Output:

    A list of dictionaries with the fields:

    • 'source': the URL of the document.
    • 'error': an error message if the document could not be loaded. Optional.
    • 'page_content': a list of dictionaries with the fields:
      • 'page_number': the page number of the current page for pdf, or 0 for webpages.
      • 'text': the text content of the current page.
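
Given the output shape above, stitching a scraped document back into a single string is straightforward. A sketch over a hand-made sample result (not a live response):

```python
# Sample result in the shape documented above.
result = {
    "source": "https://docs.cloudcix.com",
    "page_content": [
        {"page_number": 0, "text": "CloudCIX documentation home."},
        {"page_number": 0, "text": "Getting started guide."},
    ],
}

def full_text(doc):
    """Concatenate page texts, returning '' for documents that failed to load."""
    if "error" in doc:
        return ""
    return "\n".join(page["text"] for page in doc["page_content"])

print(full_text(result))
```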

    HTML Scraping

    For HTML scraping, we also support filtering out unwanted tags and classes in the HTML through the 'exclusions' variable.

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://docs.cloudcix.com']
    exclusions = {'exclusion_tags': ['script', 'style'], 'exclusion_classes': ['footer', 'header']}
    document_type = 'html'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'exclusions': exclusions,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    Basic PDF Scraping

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
    document_type = 'pdf'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)
            


    High Resolution PDF Scraping

    High-resolution PDF scraping uses an advanced algorithm that preserves the layout and structure of complex documents. Unlike basic PDF scraping, this approach supports multi-column formats, tables, and images by generating HTML with <tr>, <td>, and other relevant tags for structural fidelity. Although image data is detected, it is excluded from the final output to focus on text and tabular content for streamlined processing.
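
Because pdf_hi_res represents tables as HTML <tr>/<td> markup, a table can be recovered from the returned text with the standard library parser. A sketch over a hand-made fragment (not real API output):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect HTML table rows as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

# Hand-made fragment in the style of pdf_hi_res table output.
parser = TableExtractor()
parser.feed("<table><tr><td>Model</td><td>Score</td></tr>"
            "<tr><td>UCCIX</td><td>0.91</td></tr></table>")
print(parser.rows)
```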

    Sample Usage

    
    import requests
    
    url = 'https://ml.cloudcix.com/scraping/'
    
    websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
    document_type = 'pdf_hi_res'
    
    response = requests.post(
        url=url,
        json={
            'list': websites_to_scrape,
            'document_type': document_type,
            'api_key': 'Put your API key here',
        },
    )
    
    if response.status_code == 200:
        print(response.json())
    else:
        print('message:', response.text, 'status_code:', response.status_code)


    Parsing

    This parsing API allows you to parse and convert data from files into a machine-readable format (HTML).

    Input:

    • data: a dictionary with 2 keys:
      • api_key: your CloudCIX API key.
      • filenames: list of file names that you are sending to the API.
    • files: list of files' byte-objects.

    Output:

    A list of dictionaries with the fields:

    • 'source': the file name of the document.
    • 'error': an error message if the document could not be loaded. Optional.
    • 'page_content': a list of dictionaries with the fields:
      • 'page_number': the page number of the current page for pdf, or None for other files.
      • 'text': the text content of the current page.
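
The parsing output mirrors the scraping output, so per-file error handling follows the same pattern. A sketch over a hand-made response body:

```python
# Hand-made response body in the documented shape.
results = [
    {"source": "test.html",
     "page_content": [{"page_number": None, "text": "Hello from HTML."}]},
    {"source": "broken.pdf", "error": "could not load document"},
]

texts = {}
for doc in results:
    if "error" in doc:
        # Report failures instead of silently dropping them.
        print(f"failed: {doc['source']}: {doc['error']}")
        continue
    texts[doc["source"]] = " ".join(p["text"] for p in doc["page_content"])

print(texts)
```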

    Supported File Types

    Sample Usage

    
    import requests
    import json
    import os
    
    url = 'https://ml.cloudcix.com/parsing/'
    api_key = 'API_KEY_HERE'
    
    data = {
        'data': json.dumps({'api_key': api_key, 'filenames': ['test.html', '2405.13010v1.pdf']}),
    }
    
    files = [
        ('file', open('test.html', 'rb')),
        ('file', open('2405.13010v1.pdf', 'rb')),
    ]
    
    response = requests.post(url, data=data, files=files)
    
    print(response.status_code)
    print(response.text)