Cross-Language Embedding Generation: Bringing Hugging Face Models to C# and Java with ONNX
πSource code with README: https://github.com/yuniko-software/tokenizer-to-onnx-model
The Challenge of Working with Embedding Models
Text embeddings have become essential components in modern AI applications, from semantic search and RAG applications to recommendation systems. However, implementing them across different programming languages presents significant challenges:
- Python-centric ecosystem: Most embedding models and tokenizers are built for Python
- Limited language support: C# and Java developers face significant implementation hurdles
- Integration overhead: Python interop adds complexity and dependencies
ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models. It enables models trained in one framework (like PyTorch or TensorFlow) to be deployed in another framework with minimal changes. Popular AI libraries like Hugging Face Transformers and FastEmbed use ONNX to inference models. ONNX Runtime, the companion execution engine, provides optimized inference across various hardware platforms and languages including Python, C#, and Java.
Current Solutions for C# and Java Developers
Currently, if you want to host embedding models with C# or Java, you have limited options:
- External services like Ollama (requires separate hosting)
- Cloud APIs (OpenAI, Cohere) requiring internet connectivity and ongoing costs
- Python interop adding complexity and dependencies
While you can download Hugging Face models and run inference with ONNX Runtime in C# or Java, the critical missing piece is tokenization - an essential preprocessing step not typically implemented for these languages.
The Solution: Converting Tokenizers to ONNX
Our implementation leverages ONNX Runtime Extensions to convert Hugging Face tokenizers to ONNX format. The repository's tokenizer_to_onnx_model.ipynb notebook demonstrates this process in detail.
The solution follows a straightforward process:
- Export the tokenizer: Convert the Hugging Face tokenizer to ONNX format using ONNX Runtime Extensions in Python
- Process tokenizer outputs: Create functions to transform tokenizer outputs into model inputs
- Create language-specific implementations: Implement the same pipeline in C# and Java
- Validate consistency: Ensure identical results across all languages
This approach creates a fully portable embedding pipeline that works across languages:
βββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββ
β Input Text β --> β ONNX Tokenizerβ --> β Token Processing β
βββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββ
β
βΌ
βββββββββββββββ ββββββββββββββββββ βββββββββββββββββ
β Embedding β <-- β ONNX Model β <-- β Model Inputs β
βββββββββββββββ ββββββββββββββββββ βββββββββββββββββ
Embedding pipeline scheme
Once we've exported the tokenizer to ONNX, the entire pipeline can run natively in C# and Java with no Python dependencies. All componentsβthe tokenizer, the processing logic, and the embedding modelβoperate within the same language environment, creating a unified and efficient solution.
Why This Approach Matters
By exporting the tokenizer to ONNX format, we create an absolutely portable and native embedding pipeline for .NET and Java applications. This approach:
- Enables running embedding models completely natively in C# and Java using only the ONNX Runtime library
- Eliminates the Python and third party services dependency
- Creates a unified solution where both tokenizer and model run in the same environment
Tokenizer Conversion Process
Step 1. Initialize tokenizer
Initialize a Hugging Face tokenizer (using BAAI/bge-m3 as an example) and convert it to ONNX format. The ONNX tokenizer can be deployed in any language that supports ONNX Extensions.
def initialize_tokenizer(model_type="BAAI/bge-m3", tokenizer_path="tokenizer.onnx"):
"""Initialize and export the tokenizer if needed."""
hf_tokenizer = AutoTokenizer.from_pretrained(model_type)
if not os.path.exists(tokenizer_path):
print(f"Generating ONNX tokenizer at {tokenizer_path}")
tokenizer_model = gen_processing_models(
hf_tokenizer,
pre_kwargs={},
post_kwargs={})[0]
with open(tokenizer_path, "wb") as f:
f.write(tokenizer_model.SerializeToString())
sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession(
tokenizer_path,
sess_options=sess_options,
providers=['CPUExecutionProvider']
)
return hf_tokenizer, tokenizer_session
Step 2. Convert Tokenizer Outputs to Embedding Model Inputs
Convert ONNX tokenizer outputs (tokens, token_indices) to model inputs (input_ids, attention_mask). This step is crucial because the ONNX tokenizer outputs need to be properly formatted before they can be fed into the embedding model.
def convert_tokenizer_outputs(tokens, token_indices):
"""Convert tokenizer outputs to the format expected by the model."""
token_pairs = []
for i in range(len(tokens)):
if i < len(token_indices):
token_pairs.append((token_indices[i], tokens[i]))
token_pairs.sort()
ordered_tokens = [pair[1] for pair in token_pairs]
input_ids = np.array([ordered_tokens], dtype=np.int64)
attention_mask = np.ones_like(input_ids, dtype=np.int64)
return input_ids, attention_mask
Step 3. Generate Embedding
Use both the tokenizer and model to generate embeddings from text input:
def generate_embedding(text, tokenizer_session, model_session):
"""Generate embedding for a single text."""
tokenizer_outputs = tokenizer_session.run(None, {"inputs": np.array([text])})
tokens, _, token_indices = tokenizer_outputs
input_ids, attention_mask = convert_tokenizer_outputs(
tokens, token_indices
)
outputs = model_session.run(None, {
"input_ids": input_ids,
"attention_mask": attention_mask
})
return outputs[1]
Step 4. Run All Together
# Initialize tokenizer and model
hf_tokenizer, tokenizer_session = initialize_tokenizer(
tokenizer_path="onnx/tokenizer.onnx")
model_session = ort.InferenceSession(
"onnx/model.onnx",
providers=['CPUExecutionProvider'])
# Test with a sample text
sample_text = "A test text! Texto de prueba! Π’Π΅ΠΊΡΡ Π΄Π»Ρ ΡΠ΅ΡΡΠ°! 測試ζε! Testtext!"
embedding = generate_embedding(sample_text, tokenizer_session, model_session)
print(f"Generated embedding shape: {embedding.shape}")
print(f"Sample values: {embedding.flatten()[:5]}")
In the notebook, we also compare our results with the original Hugging Face tokenizer to ensure accuracy. The comparison shows that embeddings generated using our ONNX tokenizer are identical to those created using the original Hugging Face implementation:
similarity = compare_with_hf_tokenizer(
sample_text,
hf_tokenizer,
tokenizer_session,
model_session)
print(f"Embedding cosine similarity: {similarity}")
# Output: Embedding cosine similarity: ~1.0
This perfect similarity score (~1.0) confirms that our ONNX-based approach produces exactly the same results as the original Python implementation, validating that the tokenizer conversion process maintains complete fidelity to the original model. This ensures that developers can confidently use this approach in production environments, knowing they'll get consistent results across Python, C#, and Java implementations.
Implementing in C# and Java
After exporting the tokenizer to ONNX format, we can use it directly in C# and Java applications. Here are simplified examples for both languages.
C# Implementation
The C# implementation uses the ONNX Runtime .NET API to run both the tokenizer and embedding model:
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
namespace OnnxTokenizer.Sample;
/// <summary>
/// Provides functionality to generate embeddings using ONNX tokenizer and embedding models.
/// </summary>
public class OnnxEmbeddingGenerator : IDisposable
{
private readonly InferenceSession _tokenizerSession;
private readonly InferenceSession _modelSession;
private bool _disposed = false;
/// <summary>
/// Initializes a new instance of the OnnxEmbeddingGenerator class.
/// </summary>
/// <param name="tokenizerPath">Path to the ONNX tokenizer model.</param>
/// <param name="modelPath">Path to the ONNX embedding model.</param>
public OnnxEmbeddingGenerator(string tokenizerPath, string modelPath)
{
// Initialize tokenizer session with ONNX Extensions
var tokenizerOptions = new SessionOptions();
tokenizerOptions.RegisterOrtExtensions();
_tokenizerSession = new InferenceSession(
tokenizerPath,
tokenizerOptions);
// Initialize model session
_modelSession = new InferenceSession(modelPath);
}
/// <summary>
/// Generates embedding for the input text.
/// </summary>
/// <param name="text">The input text.</param>
/// <returns>The embedding vector as a float array.</returns>
public float[] GenerateEmbedding(string text)
{
// Create input tensor for tokenizer
var stringTensor = new DenseTensor<string>([1]);
stringTensor[0] = text;
// Create input for tokenizer using CreateFromTensor
var tokenizerInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("inputs", stringTensor)
};
// Run tokenizer
using var tokenizerResults = _tokenizerSession.Run(tokenizerInputs);
var tokenizerResultsList = tokenizerResults.ToList();
// Extract tokens and token_indices
// (order: tokens, instance_indices, token_indices)
var tokens = tokenizerResultsList[0].AsTensor<int>().ToArray();
var tokenIndices = tokenizerResultsList[2].AsTensor<int>().ToArray();
// Convert to input_ids by sorting tokens based on token_indices
var tokenPairs = tokens.Zip(tokenIndices, (t, i) => (token: t, index: i))
.OrderBy(p => p.index)
.Select(p => p.token)
.ToArray();
// Create input_ids tensor with shape [1, tokenPairs.Length]
var inputIdsTensor = new DenseTensor<long>([1, tokenPairs.Length]);
for (int i = 0; i < tokenPairs.Length; i++)
{
inputIdsTensor[0, i] = tokenPairs[i];
}
// Create attention_mask as all 1s with same shape as input_ids
var attentionMaskTensor = new DenseTensor<long>([1, tokenPairs.Length]);
for (int i = 0; i < tokenPairs.Length; i++)
{
attentionMaskTensor[0, i] = 1;
}
// Run the model with the prepared inputs
var modelInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor(
"input_ids",
inputIdsTensor),
NamedOnnxValue.CreateFromTensor(
"attention_mask",
attentionMaskTensor)
};
using var modelResults = _modelSession.Run(modelInputs);
var modelResultsList = modelResults.ToList();
// Extract the sentence embedding
var sentenceEmbedding = modelResultsList[1].AsTensor<float>().ToArray();
return sentenceEmbedding;
}
... Disposing logic
}
Java Implementation
The Java implementation uses the ONNX Runtime Java API for the same functionality:
package com.yunikosoftware.onnxtokenizer;
import ai.onnxruntime.*;
import ai.onnxruntime.extensions.OrtxPackage;
import java.util.*;
public class OnnxEmbeddingGenerator implements AutoCloseable {
private final OrtSession tokenizerSession;
private final OrtSession modelSession;
/**
* Initializes a new instance of the OnnxEmbeddingGenerator class.
*
* @param tokenizerPath Path to the ONNX tokenizer model.
* @param modelPath Path to the ONNX embedding model.
* @throws OrtException If there is an error initializing the ONNX sessions.
*/
public OnnxEmbeddingGenerator(
String tokenizerPath,
String modelPath) throws OrtException
{
// Initialize tokenizer session with ONNX Extensions
OrtEnvironment environment = OrtEnvironment.getEnvironment();
// Register the ONNX Runtime Extensions library
OrtSession.SessionOptions tokenizerOptions = new OrtSession.SessionOptions();
tokenizerOptions.registerCustomOpLibrary(OrtxPackage.getLibraryPath());
tokenizerSession = environment.createSession(
tokenizerPath,
tokenizerOptions);
modelSession = environment.createSession(modelPath);
}
/**
* Generates embedding for the input text.
*
* @param text The input text.
* @return The embedding vector as a float array.
* @throws OrtException If there is an error during inference.
*/
public float[] generateEmbedding(String text) throws OrtException {
OrtEnvironment env = OrtEnvironment.getEnvironment();
// Create input tensor for tokenizer
Map<String, OnnxTensor> tokenizerInputs = new HashMap<>();
String[] inputArray = new String[]{text};
try (OnnxTensor inputTensor = OnnxTensor.createTensor(env, inputArray)) {
tokenizerInputs.put("inputs", inputTensor);
// Run tokenizer
try (OrtSession.Result tokenizerResults = tokenizerSession.run(tokenizerInputs)) {
// Extract tokens and token_indices
// (order: tokens, instance_indices, token_indices)
int[] tokens = ((int[]) tokenizerResults.get(0).getValue());
int[] tokenIndices = ((int[]) tokenizerResults.get(2).getValue());
// Convert to input_ids by sorting tokens based on token_indices
List<TokenIndexPair> tokenPairs = new ArrayList<>();
for (int i = 0; i < tokens.length; i++) {
if (i < tokenIndices.length) {
tokenPairs.add(new TokenIndexPair(
tokens[i], tokenIndices[i]));
}
}
// Sort by index
tokenPairs.sort(Comparator.comparing(pair -> pair.index));
// Extract sorted tokens
long[] orderedTokens = tokenPairs.stream()
.mapToLong(pair -> pair.token)
.toArray();
// Create input_ids tensor with shape [1, orderedTokens.length]
long[][] inputIds = new long[1][orderedTokens.length];
for (int i = 0; i < orderedTokens.length; i++) {
inputIds[0][i] = orderedTokens[i];
}
// Create attention_mask as all 1s with same shape as input_ids
long[][] attentionMask = new long[1][orderedTokens.length];
for (int i = 0; i < orderedTokens.length; i++) {
attentionMask[0][i] = 1;
}
// Run the model with the prepared inputs
Map<String, OnnxTensor> modelInputs = new HashMap<>();
try (OnnxTensor inputIdsTensor = OnnxTensor.createTensor(
env,
inputIds);
OnnxTensor attentionMaskTensor = OnnxTensor.createTensor(
env,
attentionMask))
{
modelInputs.put("input_ids", inputIdsTensor);
modelInputs.put("attention_mask", attentionMaskTensor);
try (OrtSession.Result modelResults = modelSession.run(modelInputs)) {
// Extract the sentence embedding (second output)
float[][] sentenceEmbedding = (float[][]) modelResults
.get(1).getValue();
// Return the flattened embedding vector
return sentenceEmbedding[0];
}
}
}
}
}
private static class TokenIndexPair {
public final long token;
public final long index;
public TokenIndexPair(long token, long index) {
this.token = token;
this.index = index;
}
}
... AutoCloseable logic
}
Conclusion and Cross-Language Validation
Our approach to converting Hugging Face tokenizers to ONNX format offers a powerful solution for C# and Java developers who need to work with embedding models without Python dependencies or external services.
Validation of Cross-Language Results
An important aspect of our implementation is ensuring consistency across languages. The repository includes test suites that verify that all implementations match the original Hugging Face embeddings.
Benefits of This Approach
The solution presented in this article delivers several key advantages:
- True language independence: Run embedding models natively in C#, Java, or Python
- Production-ready performance: ONNX models can be deployed on GPUs for significant speedups
- Simplified architecture: No need for Python interop or external services
- Consistent results: Identical embeddings across all languages
- Deployment flexibility: Works in environments where Python isn't available
β If you find this article useful, please consider giving us a star on GitHub! β