CloudCIX ML API
This page describes the stateless ML APIs that can be called from within any CloudCIX project. These API calls are protected by an IP-address-based access list and cannot be used from outside CloudCIX projects.
An address-based API key, available from within the CloudCIX Membership App, is required to use these APIs so that usage can be billed.
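Since every request carries the API key in its JSON body, it can be convenient to keep the key out of source code. The following is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The variable name `CLOUDCIX_ML_API_KEY` and the helper `with_api_key` are hypothetical and are not part of the CloudCIX API.

```python
import os


def with_api_key(payload, env_var='CLOUDCIX_ML_API_KEY'):
    """Return a copy of payload with 'api_key' filled in from the environment.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    # Build a new dict so the caller's payload is left untouched
    return {**payload, 'api_key': api_key}
```

The returned dictionary can be passed as the `json` argument of `requests.post`, in the same way as the `data` dictionaries in the samples on this page.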
Large Language Models
UCCIX-Instruct
Our flagship Irish-English bilingual chatbot model. It understands both languages and outperforms much larger models on Irish language tasks (by up to 12% compared to models 10 times larger in size). Additionally, the model is 50% more efficient on Irish tokens than other models, which reduces computing time and cost.
Free preview available here.
Sample Usage
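import requests

url = 'https://ml.cloudcix.com/uccix_instruct/'
data = {
    'api_key': 'Put your API key here',
    'max_tokens': 100,  # Optional, default value is 100
    'prompt': 'Put any text here that you want to use to prompt UCCIX-Instruct',
    'temperature': 0.0,  # Optional, default is 0.0, max is 1.0
}
response = requests.post(
    url=url,
    json=data,
    stream=True,  # Read the reply as it arrives instead of buffering it
)
if response.status_code == 200:
    # Decode incrementally so that multi-byte characters (such as Irish
    # accented vowels) split across chunk boundaries are not corrupted
    response.encoding = 'utf-8'
    for item in response.iter_content(chunk_size=None, decode_unicode=True):
        print(item, end='')
else:
    print('message:', response.text, 'status_code:', response.status_code)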
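The comments in the sample above give a default of 100 for `max_tokens` and a range of 0.0 to 1.0 for `temperature`. A small client-side check can catch out-of-range values before a request is sent. This is only a sketch: the function name `build_uccix_payload` is hypothetical, the lower bound on `max_tokens` is an assumption, and how the server itself treats invalid values is not documented here.

```python
def build_uccix_payload(api_key, prompt, max_tokens=100, temperature=0.0):
    """Build the JSON body for the uccix_instruct endpoint."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError('temperature must be between 0.0 and 1.0')
    if max_tokens < 1:
        # Assumption: a non-positive token budget is never useful
        raise ValueError('max_tokens must be a positive integer')
    return {
        'api_key': api_key,
        'max_tokens': max_tokens,
        'prompt': prompt,
        'temperature': temperature,
    }
```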
GPT-4o
An advanced model from OpenAI, advertised as useful for complex, multi-step tasks.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/chatgpt4/'
prompt = [
    {'role': 'user', 'content': 'Put any text here that you want to use to prompt ChatGPT'},
]
data = {
    'api_key': 'Put your API key here',
    'max_tokens': 100,  # Optional, default value is 100
    'prompt': prompt,
    'temperature': 0.0,  # Optional, default is 0.0, max is 1.0
}
response = requests.post(
    url=url,
    json=data,
    stream=True,  # Read the reply as it arrives instead of buffering it
)
if response.status_code == 200:
    # Decode incrementally and print chunks back to back; printing a space
    # between chunks would insert spaces in the middle of words
    response.encoding = 'utf-8'
    for answer in response.iter_content(chunk_size=None, decode_unicode=True):
        print(answer, end='')
else:
    print('message:', response.text, 'status_code:', response.status_code)
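The `prompt` in the sample above is a list of role/content messages. If the endpoint follows the usual chat convention of also accepting `system` and `assistant` messages (an assumption, since only the `user` role appears on this page), a multi-turn conversation could be assembled as sketched below. The helper name `add_turn` is hypothetical.

```python
def add_turn(history, role, content):
    """Return a new message list with one more message appended."""
    # Assumption: these three roles are accepted; only 'user' is documented
    if role not in ('system', 'user', 'assistant'):
        raise ValueError(f'unexpected role: {role}')
    return history + [{'role': role, 'content': content}]


prompt = []
prompt = add_turn(prompt, 'user', 'What is CloudCIX?')
prompt = add_turn(prompt, 'assistant', 'Put the previous reply from the model here')
prompt = add_turn(prompt, 'user', 'Summarise that in one sentence.')
```

The resulting list would take the place of `prompt` in the request body.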
Embedding Models
Encode text as a vector (sequence of numbers) that represents the meaningful concepts within the input content. These vectors can be used for a variety of tasks, such as semantic search, clustering, and classification.
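As an illustration of semantic search, the sketch below ranks chunk vectors by their cosine similarity to a query vector. Cosine similarity is one common choice; this page does not state which similarity measure each model was trained for, so treat that as an assumption. The vectors are toy values rather than real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors):
    """Return chunk indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
chunks = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_chunks(query, chunks))  # prints [1, 2, 0]
```

In practice the query vector would come from a query encoder and the chunk vectors from the matching chunk encoder.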
CIX Paragraph Encoder (chunk/context encoder)
Our flagship Irish-English embedding model, used here to encode paragraphs or chunks. It is the first-ever model with support for the Irish language and retains state-of-the-art performance on English.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/cix_chunk_encoder/'
prompt = 'Put any paragraph/chunk here that you want to convert to a vector'
response = requests.post(
    url=url,
    json={
        'list': [prompt],
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
CIX Sentence Encoder (query encoder)
Our flagship Irish-English embedding model, used here to encode sentences or queries. It is the first-ever model with support for the Irish language and retains state-of-the-art performance on English.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/cix_question_encoder/'
prompt = 'Put any text here that you want to convert to a vector'
response = requests.post(
    url=url,
    json={
        'list': [prompt],
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
Google Universal Sentence Encoder (USE) 4
Embedding model from Google. Not recommended, because our flagship embedding model yields superior results in our benchmarks.
Cost: €0.005 per API call.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/use4/'
prompt = 'Put any text here that you want to convert to a Use4 vector'
response = requests.post(
    url=url,
    json={
        'list': [prompt],
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json()['embeddings'][0])
else:
    print('message:', response.text, 'status_code:', response.status_code)
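Given the price quoted above of €0.005 per API call, the cost of encoding a collection of texts can be estimated with simple arithmetic. The sketch below defaults to one text per call, as in the sample. Sending several texts in one `list`, and such a request being billed as a single call, are both assumptions that this page does not confirm. The function name is hypothetical.

```python
from decimal import Decimal

USE4_COST_PER_CALL = Decimal('0.005')  # Euro per API call, as quoted above


def estimate_use4_cost(num_texts, texts_per_call=1):
    """Estimate the cost in euro of encoding num_texts texts.

    texts_per_call > 1 assumes the 'list' field may hold several texts and
    that each request is billed as one call; neither is confirmed here.
    """
    if texts_per_call < 1:
        raise ValueError('texts_per_call must be at least 1')
    num_calls = -(-num_texts // texts_per_call)  # Ceiling division
    return USE4_COST_PER_CALL * num_calls
```

Using `Decimal` rather than floating point keeps the currency arithmetic exact.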
Dragon+ Sentence Encoder (query encoder)
Smaller than our flagship model, which makes it a good choice when efficiency matters.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/dragonplus_question/'
prompt = 'Put any text here that you want to convert to a Dragon+ vector'
response = requests.post(
    url=url,
    json={
        'list': [prompt],
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
Dragon+ Paragraph Encoder (chunk/context encoder)
Smaller than our flagship model, which makes it a good choice when efficiency matters.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/dragonplus_vector/'
prompt = 'Put any paragraph/chunk here that you want to convert to a Dragon+ vector'
response = requests.post(
    url=url,
    json={
        'list': [prompt],
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
Scraping
This scraping API allows you to collect data from any public website. At the moment, it supports scraping of websites and PDFs (note that the PDF file needs to be available through a public web URL).
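Input:
- api_key: your CloudCIX API key.
- list: list of URLs to scrape.
- document_type: 'html', 'pdf' or 'pdf_hi_res'.
- exclusions: tags and classes to filter out, used for HTML scraping.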
Output:
A list of dictionaries with the fields:
- 'source': the URL of the document.
- 'error': an error message if the document could not be loaded. Optional.
- 'page_content': a list of dictionaries with the fields:
  - 'page_number': the page number of the current page for PDFs, or 0 for webpages.
  - 'text': the text content of the current page.
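The output structure described above can be walked as in the following sketch, which joins the page texts of each document and collects any errors separately. The function name `collect_text` is hypothetical, the sample data is hand-written in the documented shape rather than real API output, and the sketch assumes that 'page_content' may be absent when 'error' is present.

```python
def collect_text(results):
    """Split scraping results into {source: text} and {source: error}."""
    texts, errors = {}, {}
    for document in results:
        source = document['source']
        if document.get('error'):
            errors[source] = document['error']
            continue
        # Assumption: 'page_content' may be missing for failed documents
        pages = document.get('page_content', [])
        texts[source] = '\n'.join(page['text'] for page in pages)
    return texts, errors


# Hand-written sample in the documented shape, not real API output
sample = [
    {
        'source': 'https://example.com',
        'page_content': [{'page_number': 0, 'text': 'Hello'}],
    },
    {'source': 'https://example.com/missing.pdf', 'error': 'could not load'},
]
texts, errors = collect_text(sample)
```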
HTML Scraping
For HTML scraping, unwanted tags and classes in the HTML can also be filtered out through the variable 'exclusions'.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://docs.cloudcix.com']
exclusions = {
    'exclusion_tags': ['script', 'style'],
    'exclusion_classes': ['footer', 'header'],
}
document_type = 'html'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'exclusions': exclusions,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
Basic PDF Scraping
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
document_type = 'pdf'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
High Resolution PDF Scraping
High-resolution PDF scraping uses an advanced algorithm that preserves the layout and structure of complex documents. Unlike basic PDF scraping, this approach supports multi-column formats, tables, and images by generating HTML with <tr>, <td>, and other relevant tags for structural fidelity. Although image data is detected, it is excluded from the final output so that processing can focus on text and tabular content.
Sample Usage
import requests

url = 'https://ml.cloudcix.com/scraping/'
websites_to_scrape = ['https://arxiv.org/pdf/2405.13010.pdf']
document_type = 'pdf_hi_res'
response = requests.post(
    url=url,
    json={
        'list': websites_to_scrape,
        'document_type': document_type,
        'api_key': 'Put your API key here',
    },
)
if response.status_code == 200:
    print(response.json())
else:
    print('message:', response.text, 'status_code:', response.status_code)
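Because high-resolution scraping returns tables as HTML with <tr> and <td> tags, the rows can be recovered with the standard library's HTML parser. The sketch below assumes that the 'text' field of a page holds such HTML. The sample string is hand-written, not real API output, and the names `TableRowParser` and `extract_rows` are hypothetical. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # Cells of the row being read, or None
        self._cell = None  # Text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return all table rows found in html_text as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hand-written sample, not real API output
sample = ('<table><tr><th>Model</th><th>Score</th></tr>'
          '<tr><td>A</td><td>0.9</td></tr></table>')
print(extract_rows(sample))  # prints [['Model', 'Score'], ['A', '0.9']]
```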
Parsing
This parsing API allows you to parse files and convert their data into a machine-readable format (HTML).
Input:
- api_key: your CloudCIX API key.
- filenames: list of file names that you are sending to the API.
Output:
A list of dictionaries with the fields:
- 'source': the file name of the document.
- 'error': an error message if the document could not be loaded. Optional.
- 'page_content': a list of dictionaries with the fields:
  - 'page_number': the page number of the current page for PDFs, or None for other files.
  - 'text': the text content of the current page.
Supported File Types
- bibtex (BibTeX bibliography)
- biblatex (BibLaTeX bibliography)
- bits (BITS XML, alias for jats)
- commonmark (CommonMark Markdown)
- commonmark_x (CommonMark Markdown with extensions)
- creole (Creole 1.0)
- csljson (CSL JSON bibliography)
- csv (CSV table)
- tsv (TSV table)
- djot (Djot markup)
- docbook (DocBook)
- docx (Word docx)
- dokuwiki (DokuWiki markup)
- endnotexml (EndNote XML bibliography)
- epub (EPUB)
- fb2 (FictionBook2 e-book)
- gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm.
- haddock (Haddock markup)
- html (HTML)
- ipynb (Jupyter notebook)
- jats (JATS XML)
- jira (Jira/Confluence wiki markup)
- json (JSON version of native AST)
- latex (LaTeX)
- markdown (Pandoc’s Markdown)
- markdown_mmd (MultiMarkdown)
- markdown_phpextra (PHP Markdown Extra)
- markdown_strict (original unextended Markdown)
- mediawiki (MediaWiki markup)
- man (roff man)
- mdoc (mdoc manual page markup)
- muse (Muse)
- native (native Haskell)
- odt (OpenOffice text document)
- opml (OPML)
- org (Emacs Org mode)
- ris (RIS bibliography)
- rtf (Rich Text Format)
- rst (reStructuredText)
- t2t (txt2tags)
- textile (Textile)
- tikiwiki (TikiWiki markup)
- twiki (TWiki markup)
- typst (typst)
- vimwiki (Vimwiki)
Sample Usage
import json

import requests

url = 'https://ml.cloudcix.com/parsing/'
api_key = 'API_KEY_HERE'
filenames = ['test.html', '2405.13010v1.pdf']
data = {
    'data': json.dumps({'api_key': api_key, 'filenames': filenames}),
}
# Open every file in binary mode and close the handles once the upload is done
handles = [open(name, 'rb') for name in filenames]
try:
    files = [('file', handle) for handle in handles]
    response = requests.post(url, data=data, files=files)
finally:
    for handle in handles:
        handle.close()
print(response.status_code)
print(response.text)
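When the files to parse live in different directories, the 'filenames' list has to be derived from their paths. The sketch below builds the 'data' form field from a list of paths and uses the base name of each path, which is also the file name that `requests` puts in the multipart upload for an open file. That the server matches files to 'filenames' by base name is an assumption, and the helper name `build_parsing_form` is hypothetical.

```python
import json
import os


def build_parsing_form(api_key, paths):
    """Return the 'data' form field for the parsing endpoint.

    Assumption: the server matches uploads to 'filenames' by base name.
    """
    filenames = [os.path.basename(path) for path in paths]
    return {'data': json.dumps({'api_key': api_key, 'filenames': filenames})}
```

The returned dictionary would be passed as the `data` argument of `requests.post`, alongside the opened files in `files`.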