Merge pull request #108 from Encamina/@rliberoff/improve_document_connectors

Improve document connectors.
rliberoff authored Apr 22, 2024
2 parents 5c81211 + 5501082 commit 30d6f74
Showing 13 changed files with 120 additions and 72 deletions.
14 changes: 11 additions & 3 deletions CHANGELOG.md
@@ -20,8 +20,10 @@ Previous classification is not required if changes are simple or all belong to t

### Breaking Changes

-- Renamed `UserId` to `IndexerId` in `ChatMessageHistoryRecord`. This change requires consumers to update their database to match the new property name.
-- In case of using Cosmos DB, `IndexerId` should be the new partition key of the collection. You can learn how to change the partition key and do the data migration [here](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-partition-key).
+- Class `IDocumentConnectorUtils` has been removed. Please use an instance of `IDocumentConnectorProvider` and the method `SupportedFileExtension` to check if the file extension is supported and the method `GetDocumentConnector` to get the appropriate document connector.
+- The method `GetDocumentConnector` from interface type `IDocumentConnectorProvider` now throws `InvalidOperationException` if a connector for the specified file extension is not found.
+- Renamed `UserId` to `IndexerId` in `ChatMessageHistoryRecord`. This change requires consumers to update their database to match the new property name.
+- In case of using Cosmos DB, `IndexerId` should be the new partition key of the collection. You can learn how to change the partition key and do the data migration [here](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/change-partition-key).

### Major Changes

@@ -58,11 +60,17 @@ Previous classification is not required if changes are simple or all belong to t
- Updated `xunit.analyzers` from `1.11.0` to `1.12.0`.
- Updated `xunit.extensibility.core` from `2.7.0` to `2.7.1`.
- Updated `xunit.runner.visualstudio` from `2.5.7` to `2.5.8`.
+- Added new methods to interface type `IDocumentConnectorProvider`:
+- New overload of `GetDocumentConnector` that receives a boolean value to throw an exception if a connector for the specified file extension is not found.
+- New method `SupportedFileExtension` to check if a file extension is supported by the current instance of the `IDocumentConnectorProvider`.
+- New method `AddDocumentConnector` to add (or replace) a document connector in the current instance of the `IDocumentConnectorProvider` for a specific file extension.
+- Added new class `DocumentConnectorProviderBase` which provides a default base implementation of `IDocumentConnectorProvider`.

### Minor Changes

- Added `CosineStringSimilarityComparer` in `Encamina.Enmarcha.SemanticKernel` to compare two strings using cosine similarity algorithm.

+- Class `SlidePptxDocumentConnector` is now `public` instead of `internal`.

## [8.1.5]

### Breaking Changes
2 changes: 1 addition & 1 deletion Directory.Build.props
@@ -17,7 +17,7 @@

<PropertyGroup>
<VersionPrefix>8.1.6</VersionPrefix>
-<VersionSuffix>preview-03</VersionSuffix>
+<VersionSuffix>preview-04</VersionSuffix>
</PropertyGroup>

<!--
@@ -19,7 +19,7 @@ namespace Encamina.Enmarcha.Bot.Adapters;

/// <summary>
/// Base class for bot adapters with custom error handling that implements the Bot Framework Protocol and can
-/// be hosted in different cloud environmens, both public and private. <b>This class is abstract.</b>
+/// be hosted in different cloud environments, both public and private. <b>This class is abstract.</b>
/// </summary>
public class BotCloudAdapterWithErrorHandlerBase : CloudAdapter
{
@@ -53,7 +53,7 @@ protected BotCloudAdapterWithErrorHandlerBase(IBotAdapterOptions<BotCloudAdapter
/// An error handler that can catch exceptions in the middleware or application.
/// </summary>
/// <param name="turnContext">The current turn context.</param>
-/// <param name="exception">The catched excetion.</param>
+/// <param name="exception">The caught exception.</param>
/// <returns>A task that represents the asynchronous error handling operation.</returns>
protected virtual async Task ErrorHandlerAsync(ITurnContext turnContext, Exception exception)
{
@@ -11,7 +11,7 @@ namespace Encamina.Enmarcha.SemanticKernel.Connectors.Document.Connectors;
/// <summary>
/// Extracts the text from a Microsoft PowerPoint (<c>.pptx</c>) file, just one line for each slide found.
/// </summary>
-internal sealed class SlidePptxDocumentConnector : BasePptxDocumentConnector
+public sealed class SlidePptxDocumentConnector : BasePptxDocumentConnector
{
/// <inheritdoc/>
protected override IEnumerable<string> GetAllTextInSlide(SlidePart slidePart)
@@ -30,6 +30,6 @@ protected override IEnumerable<string> GetAllTextInSlide(SlidePart slidePart)
}
}

-return new[] { slideText.ToString().Trim() };
+return [slideText.ToString().Trim()];
}
}
@@ -28,10 +28,4 @@ internal sealed class DefaultDocumentContentExtractor : DocumentContentExtractor
public DefaultDocumentContentExtractor(ITextSplitter textSplitter, Func<string, int> lengthFunction) : base(textSplitter, lengthFunction)
{
}

-/// <inheritdoc/>
-public override IDocumentConnector GetDocumentConnector(string fileExtension)
-{
-return IDocumentConnectorUtils.GetDefaultDocumentConnector(fileExtension);
-}
}
@@ -28,10 +28,4 @@ internal sealed class DefaultDocumentContentSemanticExtractor : DocumentContentS
public DefaultDocumentContentSemanticExtractor(ISemanticTextSplitter semanticTextSplitter, Func<IList<string>, CancellationToken, Task<IList<ReadOnlyMemory<float>>>> embeddingsGeneratorFunction) : base(semanticTextSplitter, embeddingsGeneratorFunction)
{
}

-/// <inheritdoc/>
-public override IDocumentConnector GetDocumentConnector(string fileExtension)
-{
-return IDocumentConnectorUtils.GetDefaultDocumentConnector(fileExtension);
-}
}
@@ -0,0 +1,60 @@
+using System.Text;
+
+using Encamina.Enmarcha.Core.Extensions;
+using Encamina.Enmarcha.SemanticKernel.Connectors.Document.Connectors;
+using Encamina.Enmarcha.SemanticKernel.Connectors.Document.Resources;
+
+using Microsoft.SemanticKernel.Plugins.Document;
+using Microsoft.SemanticKernel.Plugins.Document.OpenXml;
+
+namespace Encamina.Enmarcha.SemanticKernel.Connectors.Document;
+
+/// <summary>
+/// Base class to provide instances of <see cref="IDocumentConnector"/>s.
+/// </summary>
+public class DocumentConnectorProviderBase : IDocumentConnectorProvider
+{
+    private readonly Dictionary<string, IDocumentConnector> documentConnectors = new()
+    {
+        { @".DOCX", new WordDocumentConnector() },
+        { @".PDF", new CleanPdfDocumentConnector() },
+        { @".PPTX", new ParagraphPptxDocumentConnector() },
+        { @".TXT", new TxtDocumentConnector(Encoding.UTF8) },
+        { @".MD", new TxtDocumentConnector(Encoding.UTF8) },
+        { @".VTT", new VttDocumentConnector(Encoding.UTF8) },
+    };
+
+    /// <inheritdoc/>
+    public virtual void AddDocumentConnector(string fileExtension, IDocumentConnector documentConnector)
+    {
+        documentConnectors[fileExtension.ToUpperInvariant()] = documentConnector;
+    }
+
+    /// <inheritdoc/>
+    public virtual IDocumentConnector GetDocumentConnector(string fileExtension)
+    {
+        return GetDocumentConnector(fileExtension, true);
+    }
+
+    /// <inheritdoc/>
+    public IDocumentConnector GetDocumentConnector(string fileExtension, bool throwException)
+    {
+        if (documentConnectors.TryGetValue(fileExtension.ToUpperInvariant(), out var value))
+        {
+            return value;
+        }
+
+        if (throwException)
+        {
+            throw new InvalidOperationException(ExceptionMessages.ResourceManager.GetFormattedStringByCurrentUICulture(nameof(ExceptionMessages.FileExtensionNotSupported), fileExtension));
+        }
+
+        return null;
+    }
+
+    /// <inheritdoc/>
+    public virtual bool SupportedFileExtension(string fileExtension)
+    {
+        return documentConnectors.ContainsKey(fileExtension.ToUpperInvariant());
+    }
+}
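The provider above keys its dictionary on the upper-cased extension, so lookups are case-insensitive, and its two-argument `GetDocumentConnector` either throws or returns `null` for an unknown extension. The following is a minimal, self-contained sketch of that lookup pattern, not the library's code. As assumptions for illustration only: connectors are modeled as hypothetical `Func<string, string>` delegates instead of `IDocumentConnector` instances, the two overloads are collapsed into one local function with an optional parameter, and the exception message is made up.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in: a "connector" is a delegate that turns raw text into extracted text.
// The real provider stores IDocumentConnector instances instead.
var connectors = new Dictionary<string, Func<string, string>>
{
    [".TXT"] = raw => raw.Trim(),
};

// Keys are normalized with ToUpperInvariant, so ".txt" and ".TXT" hit the same entry.
// Assigning through the indexer adds a new entry or replaces an existing one.
void AddDocumentConnector(string fileExtension, Func<string, string> connector) =>
    connectors[fileExtension.ToUpperInvariant()] = connector;

bool SupportedFileExtension(string fileExtension) =>
    connectors.ContainsKey(fileExtension.ToUpperInvariant());

// One local function stands in for both overloads (local functions cannot be overloaded).
Func<string, string>? GetDocumentConnector(string fileExtension, bool throwException = true)
{
    if (connectors.TryGetValue(fileExtension.ToUpperInvariant(), out var connector))
    {
        return connector;
    }

    if (throwException)
    {
        // Illustrative message; the real one comes from a localized resource.
        throw new InvalidOperationException($"File extension '{fileExtension}' is not supported.");
    }

    return null;
}

// Probe first, then register a connector for an extension that is not there yet.
if (!SupportedFileExtension(".csv"))
{
    AddDocumentConnector(".csv", raw => raw.Replace(',', ' '));
}

Console.WriteLine(GetDocumentConnector(".CSV")!("a,b,c"));                        // prints "a b c"
Console.WriteLine(GetDocumentConnector(".xyz", throwException: false) is null);   // prints "True"
```

Callers that prefer a check over exception handling can combine `SupportedFileExtension` with the throwing form, or use the non-throwing form and test for `null`.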
@@ -1,13 +1,11 @@
using Encamina.Enmarcha.AI.Abstractions;

-using Microsoft.SemanticKernel.Plugins.Document;

namespace Encamina.Enmarcha.SemanticKernel.Connectors.Document;

/// <summary>
/// Base class for document content extractors.
/// </summary>
-public abstract class DocumentContentExtractorBase : IDocumentConnectorProvider, IDocumentContentExtractor
+public class DocumentContentExtractorBase : DocumentConnectorProviderBase, IDocumentContentExtractor
{
/// <summary>
/// Initializes a new instance of the <see cref="DocumentContentExtractorBase"/> class.
@@ -41,13 +39,10 @@ public virtual IEnumerable<string> GetDocumentContent(Stream stream, string file
}

/// <inheritdoc/>
-public Task<IEnumerable<string>> GetDocumentContentAsync(Stream stream, string fileExtension, CancellationToken cancellationToken)
+public virtual Task<IEnumerable<string>> GetDocumentContentAsync(Stream stream, string fileExtension, CancellationToken cancellationToken)
{
// Using Task.Run instead of Task.FromResult because the operation in GetDocumentContent is potentially slow,
// and Task.Run ensures it is executed on a separate thread, maintaining responsiveness.
return Task.Run(() => GetDocumentContent(stream, fileExtension), cancellationToken);
}

-/// <inheritdoc/>
-public abstract IDocumentConnector GetDocumentConnector(string fileExtension);
}
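The comment in `GetDocumentContentAsync` explains the choice of `Task.Run` over `Task.FromResult`. A small sketch of the difference, assuming a hypothetical `ExtractSlowly` function as a stand-in for `GetDocumentContent` (the name, the 200 ms delay, and the returned chunks are all invented for illustration): `Task.Run` hands the work to the thread pool and returns before it finishes, and a token that is already canceled prevents the work from starting.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int runs = 0;

// Hypothetical stand-in for a slow, synchronous GetDocumentContent call.
string[] ExtractSlowly()
{
    Interlocked.Increment(ref runs);
    Thread.Sleep(200);
    return new[] { "chunk 1", "chunk 2" };
}

// Task.Run queues the delegate on the thread pool and returns immediately,
// so the caller gets a pending task instead of blocking for the full duration.
Task<string[]> pending = Task.Run(() => ExtractSlowly(), CancellationToken.None);
bool returnedBeforeCompletion = !pending.IsCompleted;
string[] chunks = await pending;

// A token that is already canceled yields a canceled task; the delegate never runs.
using var cts = new CancellationTokenSource();
cts.Cancel();
Task<string[]> canceled = Task.Run(() => ExtractSlowly(), cts.Token);

Console.WriteLine($"returned before completion: {returnedBeforeCompletion}");
Console.WriteLine($"chunks: {chunks.Length}, canceled: {canceled.IsCanceled}, runs: {runs}");
```

The token passed to `Task.Run` is only observed before the delegate starts. It does not interrupt synchronous work that is already running, so a long extraction still runs to completion once it has begun.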
@@ -7,7 +7,7 @@ namespace Encamina.Enmarcha.SemanticKernel.Connectors.Document;
/// <summary>
/// Base class for document content semantic extractors.
/// </summary>
-public abstract class DocumentContentSemanticExtractorBase : IDocumentConnectorProvider, IDocumentContentExtractor
+public abstract class DocumentContentSemanticExtractorBase : DocumentConnectorProviderBase, IDocumentContentExtractor
{
/// <summary>
/// Initializes a new instance of the <see cref="DocumentContentSemanticExtractorBase"/> class.
@@ -46,7 +46,4 @@ public virtual Task<IEnumerable<string>> GetDocumentContentAsync(Stream stream,

return SemanticTextSplitter.SplitAsync(content, EmbeddingsGeneratorFunction, cancellationToken);
}

-/// <inheritdoc/>
-public abstract IDocumentConnector GetDocumentConnector(string fileExtension);
}
@@ -17,7 +17,7 @@
<PropertyGroup>
<PackageReadmeFile>README.md</PackageReadmeFile>
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Microsoft.SemanticKernel.Plugins.Document" Version="1.7.1-alpha" />
<PackageReference Include="PdfPig" Version="0.1.8" />
@@ -12,5 +12,43 @@ public interface IDocumentConnectorProvider
     /// </summary>
     /// <param name="fileExtension">The file extension.</param>
     /// <returns>A valid instance of <see cref="IDocumentConnector"/> that could handle documents from the given file extension.</returns>
+    /// <exception cref="InvalidOperationException">
+    /// If the <paramref name="fileExtension"/> is not supported or no suitable <see cref="IDocumentConnector"/> instance for it can be found.
+    /// </exception>
     IDocumentConnector GetDocumentConnector(string fileExtension);
+
+    /// <summary>
+    /// Determines the most appropriate document connector from a specified file extension.
+    /// </summary>
+    /// <param name="fileExtension">The file extension.</param>
+    /// <param name="throwException">
+    /// If <see langword="true"/> an <see cref="InvalidOperationException"/> is thrown if the <paramref name="fileExtension"/> is not supported
+    /// or no suitable <see cref="IDocumentConnector"/> instance for it can be found.
+    /// </param>
+    /// <returns>A valid instance of <see cref="IDocumentConnector"/> that could handle documents from the given file extension.</returns>
+    /// <exception cref="InvalidOperationException">
+    /// If the <paramref name="fileExtension"/> is not supported or no suitable <see cref="IDocumentConnector"/> instance for it can be found.
+    /// </exception>
+    IDocumentConnector GetDocumentConnector(string fileExtension, bool throwException);
+
+    /// <summary>
+    /// Determines whether a specified file extension is supported.
+    /// </summary>
+    /// <param name="fileExtension">The file extension to check.</param>
+    /// <returns>
+    /// Returns <see langword="true"/> if the file extension is supported; otherwise, <see langword="false"/>.
+    /// </returns>
+    bool SupportedFileExtension(string fileExtension);
+
+    /// <summary>
+    /// Adds a new document connector for a specified file extension.
+    /// </summary>
+    /// <remarks>
+    /// If the file extension already has a document connector associated with it, the existing connector is replaced.
+    /// </remarks>
+    /// <param name="fileExtension">The file extension.</param>
+    /// <param name="documentConnector">
+    /// A valid instance of <see cref="IDocumentConnector"/> to handle documents with the specified file extension.
+    /// </param>
+    void AddDocumentConnector(string fileExtension, IDocumentConnector documentConnector);
 }

This file was deleted.

6 changes: 2 additions & 4 deletions src/Encamina.Enmarcha.Testing.Smtp/SmtpProcessor.cs
@@ -210,15 +210,15 @@ private void ProcessCommands()
}
else
{
-logger.LogError(@"Socket exception different than code `10060`!", socketException);
+logger.LogError(socketException, @"Socket exception different than code `10060`!");
}

isRunning = false;
context.Socket.Dispose();
}
catch (Exception exception)
{
-logger.LogError(@"Unexpected exception processing commands!", exception);
+logger.LogError(exception, @"Unexpected exception processing commands!");

isRunning = false;
context.Socket.Dispose();
@@ -282,8 +282,6 @@ private void Data()

rawSmtpMessage.Raw.Append(header.ToString());

-////header.Length = 0;

var line = context.ReadLine();
while (line is not null && !line.Equals(@"."))
{