CloudCIX ML API
This page describes a number of stateless ML APIs that can be called from within any CloudCIX project. These API calls are protected by an IP-address-based access list and cannot be used from outside CloudCIX projects.
An Address-based API Key, available from within the CloudCIX Membership App, is required to use these APIs so that usage can be billed.
Large Language Models
We support the following Large Language Models (LLMs) for text generation tasks such as summarization and question answering.

UCCIX-v2-Llama3.1-70B-Instruct
: Our flagship Irish-English bilingual chatbot model, capable of understanding both languages, with state-of-the-art performance on Irish. Additionally, the model is 50% more efficient on Irish tokens than other models, reducing computing time and cost. Suitable for most tasks that require advanced performance in Irish or across both languages. Free preview available here.

UCCIX-Instruct
: The initial version of our Irish-English bilingual chatbot model. Lightweight and requires fewer computing resources. Free preview available here.

GPT-4o
: The most advanced model from OpenAI, advertised as useful for complex, multi-step tasks.

deepseek-ai/DeepSeek-R1-Distill-Llama-70B
: A Large Reasoning Model distilled from DeepSeek-R1, based on Llama-3.3-70B-Instruct.
Sample Usage
Our models are accessible through the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating three lines of code:

- Point base_url to CloudCIX's endpoint when initializing the OpenAI client.
- Use your CloudCIX API key as api_key when initializing the OpenAI client.
- Pass the corresponding model_id from the list above.
from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com",
)

chat_completion = client.chat.completions.create(
    model="UCCIX-v2-Llama3.1-70B-Instruct",  # or "UCCIX-Instruct", "GPT-4o", or "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?",
        }
    ],
    stream=True,
    max_tokens=100,
    temperature=0.7,
)

for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")  # the final chunk's delta may be empty
Recommendations
- For Irish-related tasks, including knowledge of Ireland and its culture and generating and understanding Irish, we recommend our UCCIX series of models.
- For the deepseek reasoning model, we recommend using a temperature between 0.5 and 0.7 and avoiding a system prompt. For more details, see here.
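The DeepSeek recommendation above can be captured directly in the request parameters. A minimal sketch (the model ID comes from the list above; the prompt and the 0.6 temperature are illustrative choices within the recommended range):

```python
# Request parameters following the DeepSeek-R1 guidance above:
# temperature inside the 0.5-0.7 range and no "system" message.
params = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    "temperature": 0.6,  # mid-point of the recommended 0.5-0.7 range
    "messages": [
        # All instructions go in the user turn; no system prompt is added.
        {"role": "user", "content": "Think step by step: what is 17 * 24?"},
    ],
}

# The dict can be passed straight to the OpenAI client shown above:
# client.chat.completions.create(**params)
```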
Embedding Models
Embedding models encode text as a vector (a sequence of numbers) that represents the meaningful concepts within the input content. These vectors can be used for a variety of tasks, such as semantic search, clustering, and classification.
We provide access to the following models:
cix_question_encoder
: Our flagship Irish-English embedding model, the first with support for the Irish language, while retaining state-of-the-art performance on English. Use this version to embed questions.

cix_chunk_encoder
: Our flagship Irish-English embedding model, the first with support for the Irish language, while retaining state-of-the-art performance on English. Use this version to embed longer chunks of text.

dragonplus_question
: Smaller than our flagship model; good for efficiency. Use this version to embed questions.

dragonplus_vector
: Smaller than our flagship model; good for efficiency. Use this version to embed longer chunks of text.

gte-large-en-v1.5_question_encoder
: Open-weight model, available here. We recommend cix_question_encoder over this model. Use this version to embed questions.

gte-large-en-v1.5_chunk_encoder
: Open-weight model, available here. We recommend cix_chunk_encoder over this model. Use this version to embed longer chunks of text.

use4
: Embedding model from Google. Not recommended, as our benchmarks show superior results for our flagship embedding models.
Sample Usage
As with the LLM API, our models are accessible through the OpenAI libraries (Python and TypeScript / JavaScript) as well as the REST API, by updating three lines of code and using your CloudCIX API key.
from openai import OpenAI

client = OpenAI(
    api_key="Put your API key here",
    base_url="https://ml-openai.cloudcix.com",
)

response = client.embeddings.create(
    model="cix_question_encoder",
    input=["What is PodNet in CloudCIX?", "What is Membership App?"],
    encoding_format="float",
)

print(response.data[0].embedding)
print(response.data[1].embedding)
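The vectors returned by the embeddings endpoint can be compared directly, for example with cosine similarity for semantic search. A minimal, self-contained sketch (the toy vectors below stand in for real `response.data[i].embedding` values):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings returned by the API.
query_vec = [0.1, 0.9, 0.2]
chunk_vecs = {
    "PodNet chunk": [0.12, 0.88, 0.18],
    "unrelated chunk": [0.9, 0.05, 0.1],
}

# Rank chunks by similarity to the query vector.
ranked = sorted(chunk_vecs,
                key=lambda k: cosine_similarity(query_vec, chunk_vecs[k]),
                reverse=True)
print(ranked[0])  # prints "PodNet chunk", the semantically closest chunk
```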
Embedding Database
The Embedding Database API allows you to store corpus data for use with retrieval-augmented generation (RAG). You provide chunks to the API, which converts them into vectors and stores them in the database.
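Before calling the API, a corpus document has to be split into chunks. A minimal sketch of fixed-size chunking with overlap (the sizes, the URL, and the helper name are illustrative choices, not API requirements):

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap,
    so context straddling a boundary appears in both neighbours."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "CloudCIX is a cloud computing platform. " * 20
chunks = chunk_text(document)

# The chunks slot into the 'content' entry of the Create Corpus payload below:
content = [{
    'chunks': chunks,
    'source': 'https://docs.cloudcix.com',
    'encoder_name': 'cix_chunk_encoder',
}]
```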
Methods Available
Create Corpus
Input:
data = {
    'api_key': 'put your apikey here',
    'name': 'Name of the Corpus',
    'method': 'create_embeddings',
    'content': [
        {
            'chunks': ['put chunks here'],
            'source': 'Put the url or filename where the chunks are from',
            'encoder_name': 'The encoder model for embeddings',
        },
    ],
}
Output:
A success message with status code 200
Get Corpus
Input:
data = {
    'api_key': 'put your apikey here',
    'method': 'get_corpus',
}
Output:
Get Source
Input:
data = {
    'api_key': 'put your apikey here',
    'name': 'Name of the Corpus',
    'method': 'get_source',
}
Output:
Delete Corpus
Input:
data = {
    'api_key': 'put your apikey here',
    'name': 'Name of the Corpus',
    'method': 'delete_corpus',
}
Output:
A success message with status code 200
Delete Source
Input:
data = {
    'api_key': 'put your apikey here',
    'name': 'Name of the Corpus',
    'method': 'delete_source',
    'sources': ['put sources related to corpus'],
}
Output:
A success message with status code 200
Vector Similarity
Input:
data = {
    'api_key': 'put your apikey here',
    'name': 'Name of the Corpus',
    'method': 'create_embeddings',
    'encoder_name': 'Name of the Encoder Model used for embeddings; supported names are dragon_plus, cix_encoder, test_encoder, use4',
    'query': 'Put your query that needs to be matched with source chunks in the Embedding Database',
    'order_by': 'The method of vector search calculation; supported methods are cosine_similarity, dot_product, euclidean_distance',
    'threshold': 'An integer value; applies only to the euclidean_distance order_by method',
    'limit': 'The number of matched sources and respective chunks to be retrieved',
}
Output:
Rerank
Input:
data = {
    'api_key': 'put your apikey here',
    'method': 'rerank',
    'chunks': ['put your chunks here'],
    'reranker': 'The name of the LLM used for determining the order of the chunks',
    'query': 'Put your query here',
}
Output:
Embedding Database API
Sample Usage
import requests

url = 'https://ml.cloudcix.com/embedding_db/'
response = requests.post(
    url=url,
    json=data,
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
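The three order_by options accepted by Vector Similarity correspond to standard vector comparisons. A plain-Python sketch of what each one computes (illustrative only; the actual search runs server-side):

```python
import math

def dot_product(a, b):
    # Unnormalised similarity; larger means more aligned.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product normalised by both magnitudes; range [-1, 1].
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    # Lower means closer; this is the metric the 'threshold' field filters on.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(dot_product(a, b))         # 0.0
print(cosine_similarity(a, b))   # 0.0 (orthogonal vectors)
print(euclidean_distance(a, b))  # sqrt(2), about 1.414
```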
Scraping
This scraping API allows you to collect data from any public website. At the moment we support scraping of websites and PDFs (note that the PDF file needs to be available through a public web URL).
Input:
Output:
A list of dictionaries with the fields:
- 'source': the URL of the document.
- 'error': an error message if the document could not be loaded. Optional.
- 'page_content': a list of dictionaries with the fields:
  - 'page_number': the page number of the current page for PDFs, or 0 for webpages.
  - 'text': the text content of the current page.
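The output schema above can be flattened into per-source plain text. A minimal sketch over sample data shaped like the documented response (the sample entries are illustrative):

```python
# Sample data shaped like the documented scraping response.
documents = [
    {
        'source': 'https://docs.cloudcix.com',
        'page_content': [
            {'page_number': 0, 'text': 'CloudCIX documentation home.'},
        ],
    },
    {
        'source': 'https://example.com/broken',
        'error': '404 Not Found',  # 'error' is optional and marks a failed load
    },
]

texts = {}
for doc in documents:
    if 'error' in doc:
        print(f"skipping {doc['source']}: {doc['error']}")
        continue
    # Join all pages of the document into one string.
    texts[doc['source']] = '\n'.join(page['text'] for page in doc['page_content'])
```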
HTML Scraping
For html scraping, we also support filtering out unwanted tags and classes in the html through the variable 'exclusions'.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://docs.cloudcix.com']
exclusions = {'exclusion_tags': ['script', 'style'], 'exclusion_classes': ['footer', 'header']}
document_type = 'html'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'exclusions': exclusions,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
Basic PDF Scraping
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
document_type = 'pdf'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
High Resolution PDF Scraping
High-resolution PDF scraping uses an advanced algorithm that preserves the layout and structure of complex documents. Unlike basic PDF scraping, this approach supports multi-column formats, tables, and images by generating HTML with <tr>, <td>, and other relevant tags for structural fidelity. Although image data is detected, it is excluded from the final output to focus on text and tabular content for streamlined processing.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
document_type = 'pdf_hi_res'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
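Since high-resolution scraping emits table structure as HTML `<tr>`/`<td>` tags, rows can be recovered with the standard library's html.parser. A minimal sketch on an illustrative fragment (the class name and the sample table are assumptions, not part of the API):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

# Illustrative fragment of the kind of HTML pdf_hi_res returns for a table.
parser = TableRows()
parser.feed('<table><tr><td>Model</td><td>Params</td></tr>'
            '<tr><td>UCCIX-v2</td><td>70B</td></tr></table>')
print(parser.rows)  # [['Model', 'Params'], ['UCCIX-v2', '70B']]
```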
Parsing
This parsing API allows you to parse and convert data from files into a machine-readable format (HTML).
Input:
- data: a dictionary with 2 keys:
  - api_key: your CloudCIX API key.
  - filenames: a list of the file names that you are sending to the API.
- files: a list of the files' byte objects.
Output:
A list of dictionaries with the fields:
- 'source': the file name of the document.
- 'error': an error message if the document could not be loaded. Optional.
- 'page_content': a list of dictionaries with the fields:
  - 'page_number': the page number of the current page for PDFs, or None for other file types.
  - 'text': the text content of the current page.
Supported File Types
- bibtex (BibTeX bibliography)
- biblatex (BibLaTeX bibliography)
- bits (BITS XML, alias for jats)
- commonmark (CommonMark Markdown)
- commonmark_x (CommonMark Markdown with extensions)
- creole (Creole 1.0)
- csljson (CSL JSON bibliography)
- csv (CSV table)
- tsv (TSV table)
- djot (Djot markup)
- docbook (DocBook)
- docx (Word docx)
- dokuwiki (DokuWiki markup)
- endnotexml (EndNote XML bibliography)
- epub (EPUB)
- fb2 (FictionBook2 e-book)
- gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm
- haddock (Haddock markup)
- html (HTML)
- ipynb (Jupyter notebook)
- jats (JATS XML)
- jira (Jira/Confluence wiki markup)
- json (JSON version of native AST)
- latex (LaTeX)
- markdown (Pandoc's Markdown)
- markdown_mmd (MultiMarkdown)
- markdown_phpextra (PHP Markdown Extra)
- markdown_strict (original unextended Markdown)
- mediawiki (MediaWiki markup)
- man (groff man)
- mdoc (mdoc manual page markup)
- muse (Muse)
- native (native Haskell)
- odt (OpenOffice text document)
- opml (OPML)
- org (Emacs Org mode)
- ris (RIS bibliography)
- rtf (Rich Text Format)
- rst (reStructuredText)
- t2t (txt2tags)
- textile (Textile)
- tikiwiki (TikiWiki markup)
- twiki (TWiki markup)
- typst (Typst)
- vimwiki (Vimwiki)
Sample Usage
import requests
import json

url = 'https://ml.cloudcix.com/parsing/'
api_key = 'API_KEY_HERE'
data = {
    'data': json.dumps({'api_key': api_key, 'filenames': ['test.html', '2405.13010v1.pdf']}),
}
files = [
    ('file', open('test.html', 'rb')),
    ('file', open('2405.13010v1.pdf', 'rb')),
]
response = requests.post(url, data=data, files=files)
print(response.status_code)
print(response.text)