Angelves

PdfAndExcelLogicalExtractor Description

PdfAndExcelLogicalExtractor is a software component that extracts information from PDF and Excel documents in a logical, orderly manner, so that it can later be processed by the systems it integrates with.

The system is based on adaptable logic implemented in a template system that can process all documents of a certain type.

A template can adapt to the variations established in its definition, such as months with different numbers of days, displaced document areas, or differences between pages within the same document, and in general to any extraction logic that is needed.

Template extraction logic can clean results according to predefined rules, producing typed values or removing insignificant text.

Template extraction logic can perform calculations based on results obtained from different extractions or predefined values, such as calculating totals in a table based on price per unit.

Likewise, any functionality or exception that the template logic of a document type needs in order to be more effective can be programmed specifically.

Installation

The system is contained in a dynamic-link library (DLL) distributed as a NuGet package, which can be added to any platform or software project in the .NET ecosystem.

To integrate it directly into a .NET project, add the NuGet package to the project and use the standard calls described in the documentation or surfaced by Visual Studio's IntelliSense.

Open the NuGet Package Manager in Visual Studio and search for 'Angelves' to find our packages.

Manual installation from the Package Manager Console:

PM> NuGet\Install-Package Angelves.PdfAndExcelLogicalExtractor -Version 1.1.0

Interface

The integration exposes a simple software interface with a few overloaded calls and a directly accessible response data model, which can easily be processed as JSON.

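using System.IO;

namespace Angelves.PdfAndExcelLogicalExtractor.PublicInterface
{
    public interface ILetsGoToExtraction
    {
        // Extract from a document on disk using an already loaded template.
        ExtractionResult Start(Template template, string filePath);

        // Extract from a document on disk using a template file.
        ExtractionResult Start(string templatePath, string filePath);

        // Extract from a document stream using a template file.
        ExtractionResult Start(string templatePath, Stream fileStream);

        // Extract from a document stream using an already loaded template.
        ExtractionResult Start(Template template, Stream fileStream);

        // Returns the words found in a PDF as a string.
        string GetWordsInPdf(string filePath, string? filter1 = null, int round = 0);
    }
}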

Templates

A Template is a JSON file that implements extraction logic against a PDF or Excel document.

The goal of these templates is to obtain an organized result from plain-text reading that an application or system can process.

The following table describes the sections of a template, in which you define the extraction boxes.

Section	Description
Name	Sets the name of the template.
Settings	Holds the template's configuration settings.
Offsets [pro]	Defines control points used to detect text that has shifted relative to the original layout.
Metaboxes [pro]	Defines the Metaboxes. These are extraction boxes in their own right, but they do not appear in the output; they are used in formula calculations.
Boxes	Defines the Boxes (extraction boxes): the data extracted from the document, calculations, tables, and so on.
Renames	Defines renamings of output fields. This is useful for replacing the names generated automatically by extraction operations.
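
As a rough orientation, the sections above might map onto a template file as sketched below. The key names are taken from the full example later in this document, which uses "templatename" and "config" for the Name and Settings sections. The template name, the single box, and its coordinates are illustrative placeholders, and leaving unused sections as empty arrays is an assumption rather than documented behavior.

```json
{
  "templatename": "SKELETON TEMPLATE",
  "config": {
    "decimalseparator": ","
  },
  "offsets": [],
  "metaboxes": [],
  "boxes": [
    {
      "name": "station",
      "required": true,
      "x1": 600,
      "y1": 60,
      "x2": 800,
      "y2": 80,
      "comments": "Placeholder coordinates; replace them with values measured on your own document."
    }
  ],
  "renames": []
}
```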

The following table contains a description of the commands that you can use within a Box in the template definition.

Functionality	Description
Name	Box name. This property is important for calculations.
id (*1) [pro]	Offset identifier.
type	Defines the type of the Box:

Type	Description	Parameters
Literal	Exact value taken from the template.
Empty	Empty field.
Text	Extracted text.
Number	Whole number.
Decimal	Decimal number.
DateTime	Date or date and time.	The format is taken from Format.
TextSplit	Text divided by a break character.	Break character in Parameters[0].
Table	Table in the document.	Definition in Header: array of JSON Box objects that define the header.
Array	List of values from a start position to an end position.	additionaldata: Text or Number. w: the width of the columns.
ArraySwap (*2) [pro]	Array with flipped axes integrated into a Table. This type generates a row for each valid entry.	additionaldata: Text or Number. deletenullsinarray: deletes null values from the original array. arrayindex: JSON object with y1 and y2, which define the row from which the index is extracted (for example, the days in the top row). w: the width of the columns.

Functionality	Description
idoffset	Links the Box to the offset control point with this id.
required	Whether the Box is required.
ConvertToDateIfExcelNumber	Converts the value to a DateTime if the value obtained from an Excel file is a number indicating the number of days since 1/1/1900.
metabox [pro]	Converts a normal Box into a Metabox that will not be reflected in the output.
x1, x2, y1 and y2	Text extraction coordinates in PDF documents.
e1, e2	Spreadsheet-style coordinates of the start and (optional) end cells, for extracting text from Excel documents.
deviationbetweenpages [pro]	Deviation of the Y coordinate between pages.
master (*2)	Boolean that marks the column that drives the Table extraction.
additionaldata	Additional data required by some Box types.
parameters	Array of parameters required by some Box types.
alternativecoords [pro]	Alternative coordinates based on conditions. JSON array of conditions, each with a "condition" field that holds a formula compared against a result, and with four optional fields x1, x2, y1 and y2 that set the new coordinates if the condition is met.

Functionality	Description
extractionrules	Rules applied during data extraction. JSON array of objects with the fields "action", "target" and "parameters[]":

action	Description	target	parameters
Erase	Deletes the target text from the extraction result.	Text to delete
EraseFrom	Clears the result from the first occurrence of the target character to the end.	Character
EraseTo	Clears the result from the beginning to the first occurrence of the target character.	Character
MaxRightCharsFromChar	Extracts a number of characters to the right of a given character.	Character
QuitSpaces	Removes all spaces.
QuitLineReturns	Removes all line breaks and carriage returns.
Replace	Replaces the searched text with the replacement text.	Searched text	[0] = Replacement text, [1] = "exact" if exact.
ToLower	Converts to lowercase.
ToUpper	Converts to uppercase.
ToRoundInteger	Converts text to rounded-integer text, if possible.
ToDecimal	Converts text to decimal text with two decimal places, if possible.
ToDate	Converts text to DateTime-formatted text, if possible. The format is taken from Format.

Functionality	Description
formula [pro]	Calculation formulas using numbers in the template and/or Metaboxes or Boxes. Formula syntax:

Element	Description
[SELF]	Original value extracted without modification.
[BOX: name]	Value of a Box or Metabox identified by "name".
[BOXROW: name] (*2)	Value of a Box or a Metabox belonging to the same row of a table identified by "name".
[SWAPVALUE] (*2)	The value extracted for each row of the ArraySwap.
+, -, * and /	Supported mathematical operators. If no mathematical operator is used, the values are treated as literals and the result is text.

Functionality	Description
comments	Comments, not used by the process.

(*1): Applicable to the offsets section
(*2): Applicable only to Tables.
(*3): Applicable only to Renames.
[pro]: Professional version only.
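
To make the Box properties above more concrete, here is a minimal sketch of a "boxes" fragment that combines extraction rules with a formula. It only uses property names, actions and formula elements listed in the tables above. The box names, coordinates and target strings are hypothetical, and the "formula" property requires the professional version. The "comments" fields are used to flag what is assumed.

```json
{
  "boxes": [
    {
      "name": "quantity",
      "type": "Number",
      "x1": 300,
      "y1": 200,
      "x2": 340,
      "y2": 210,
      "extractionrules": [
        { "action": "EraseFrom", "target": "(" },
        { "action": "QuitSpaces" },
        { "action": "ToRoundInteger" }
      ],
      "comments": "Hypothetical coordinates. The rules drop anything from the first '(' onwards, remove spaces, then convert to an integer."
    },
    {
      "name": "unit",
      "x1": 340,
      "y1": 200,
      "x2": 380,
      "y2": 210,
      "extractionrules": [
        { "action": "Replace", "target": "pcs", "parameters": [ "pieces" ] },
        { "action": "ToLower" }
      ],
      "comments": "Replace takes the searched text in target and the replacement text in parameters[0]."
    },
    {
      "name": "unitprice",
      "type": "Decimal",
      "x1": 380,
      "y1": 200,
      "x2": 420,
      "y2": 210,
      "extractionrules": [
        { "action": "Erase", "target": "€" },
        { "action": "QuitSpaces" }
      ]
    },
    {
      "name": "linetotal",
      "type": "Decimal",
      "formula": "[BOX:unitprice] * [BOX:quantity]",
      "comments": "Calculated box without coordinates, following the pattern of the unitprice box in the example below."
    }
  ]
}
```

Rules are listed here in the order they are meant to apply, which mirrors how the full example chains "QuitSpaces" and "Erase".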

Example

{
  "templatename": "EXAMPLE TEMPLATE",
  "config": {
    "externalsfieldsintables": false,
    "decimalseparator": ","
  },
  "offsets": [
    {
      "id": 1,
      "text": "PROGRAM",
      "x": 82.92,
      "y": 153.95
    },
    {
      "id": 2,
      "text": "ADVERTISER:",
      "x": 82.92,
      "y": 101.99
    }
  ],
  "metaboxes": [
    {
      "name": "end_month_1",
      "type": "Number",
      "x1": 714,
      "y1": 146,
      "x2": 723,
      "y2": 153
    },
    {
      "name": "end_month_2",
      "type": "Number",
      "x1": 705,
      "y1": 146,
      "x2": 714,
      "y2": 153
    },
    {
      "name": "end_month_3",
      "type": "Number",
      "x1": 696,
      "y1": 146,
      "x2": 705,
      "y2": 153
    }
  ],
  "boxes": [
    {
      "name": "station",
      "required": true,
      "x1": 600,
      "y1": 60,
      "x2": 800,
      "y2": 80
    },
    {
      "name": "advertiser",
      "idoffset": 2,
      "required": true,
      "x1": 160,
      "y1": 90,
      "x2": 350,
      "y2": 110
    },
    {
      "name": "product",
      "idoffset": 2,
      "required": true,
      "x1": 160,
      "y1": 106,
      "x2": 350,
      "y2": 116
    },
    {
      "name": "campaign",
      "type": "Empty"
    },
    {
      "name": "reference",
      "idoffset": 2,
      "required": true,
      "x1": 600,
      "y1": 90,
      "x2": 800,
      "y2": 110,
      "extractionrules": [
        {
          "action": "QuitSpaces"
        },
        {
          "action": "Erase",
          "target": "N.ORDER:"
        }
      ]
    },
    {
      "name": "invoicedate",
      "type": "DateTime",
      "idoffset": 2,
      "required": true,
      "format": "dd/MM/yyyy",
      "x1": 600,
      "y1": 106,
      "x2": 800,
      "y2": 125,
      "extractionrules": [
        {
          "action": "Erase",
          "target": "DATE:"
        },
        {
          "action": "QuitSpaces"
        }
      ]
    },
    {
      "name": "table1",
      "type": "Table",
      "header": [
        {
          "name": "format",
          "idoffset": 1,
          "required": true,
          "master": true,
          "x1": 235,
          "y1": 171.10,
          "x2": 273,
          "y2": 178.41,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "20"
            },
            {
              "action": "Erase",
              "target": "\""
            }
          ]
        },
        {
          "name": "duration",
          "idoffset": 1,
          "required": true,
          "x1": 235,
          "x2": 280,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "CRADLE"
            },
            {
              "action": "Erase",
              "target": "\""
            }
          ]
        },
        {
          "name": "program",
          "idoffset": 1,
          "required": true,
          "x1": 70,
          "x2": 180
        },
        {
          "name": "startend_hour",
          "type": "TextSplit",
          "idoffset": 1,
          "parameters": [ "-" ],
          "x1": 180,
          "x2": 235,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "("
            },
            {
              "action": "Erase",
              "target": "MF"
            },
            {
              "action": "Erase",
              "target": "S-U"
            },
            {
              "action": "Erase",
              "target": ")"
            }
          ]
        },
        {
          "name": "swap",
          "type": "ArraySwap",
          "idoffset": 1,
          "additionaldata": "Number",
          "deletenullsinarray": true,
          "x1": 444.21,
          "x2": 723.43,
          "w": 9.007,
          "arrayindex": {
            "y1": 146,
            "y2": 153
          },
          "alternativecoords": [
            {
              "condition": "[BOX:end_month_1] ! 31",
              "x2": 714.42
            },
            {
              "condition": "[BOX:end_month_2] ! 30",
              "x2": 705.41
            },
            {
              "condition": "[BOX:end_month_3] ! 29",
              "x2": 696.41
            }
          ]
        },
        {
          "name": "totalpasses",
          "type": "Decimal",
          "idoffset": 1,
          "x1": 310,
          "x2": 350
        },
        {
          "name": "unitprice",
          "type": "Decimal",
          "formula": "[BOXROW:totalprice] / [BOXROW:totalpasses]"
        },
        {
          "name": "discount",
          "type": "Decimal",
          "idoffset": 1,
          "x1": 380,
          "x2": 410,
          "extractionrules": [
            {
              "action": "Erase",
              "target": "%"
            },
            {
              "action": "QuitSpaces"
            }
          ]
        },
        {
          "name": "agencydiscount",
          "type": "Empty"
        },
        {
          "name": "totalprice",
          "type": "Decimal",
          "idoffset": 1,
          "required": true,
          "x1": 410,
          "x2": 445,
          "extractionrules": [
            {
              "action": "Erase",
              "target": "€"
            },
            {
              "action": "QuitSpaces"
            }
          ]
        }
      ]
    },
    {
      "name": "comments",
      "idoffset": 2,
      "x1": 60,
      "y1": 230,
      "x2": 720,
      "y2": 260,
      "extractionrules": [
        {
          "action": "QuitSpaces"
        },
        {
          "action": "Erase",
          "target": "COMMENTS:"
        }
      ]
    }
  ],
  "renames": [
    {
      "name": "startend_hour.INDEX[0]",
      "rename": "starthour",
      "exact": true,
      "casesensitive": false
    },
    {
      "name": "startend_hour.INDEX[1]",
      "rename": "endhour",
      "exact": true,
      "casesensitive": false
    },
    {
      "name": "swap.SWAP[",
      "rename": "passes",
      "exact": false,
      "casesensitive": false
    }
  ]
}
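
The example above targets a PDF document. For an Excel source, a template might instead use the e1 and e2 properties described earlier, as in the sketch below. This is an assumption-heavy illustration: the cell references ("B2", "B3", "D3"), the lowercase spelling of the ConvertToDateIfExcelNumber key, and the omission of the other sections are not confirmed by this document.

```json
{
  "templatename": "EXCEL EXAMPLE TEMPLATE",
  "boxes": [
    {
      "name": "invoicedate",
      "type": "DateTime",
      "format": "dd/MM/yyyy",
      "e1": "B2",
      "converttodateifexcelnumber": true,
      "comments": "Assumed key spelling. Intended to convert an Excel day count since 1/1/1900 into a date."
    },
    {
      "name": "advertiser",
      "required": true,
      "e1": "B3",
      "e2": "D3",
      "comments": "Hypothetical range from a start cell to an optional end cell."
    }
  ]
}
```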

Commercial Use License

This software is provided under the terms of this Commercial Use License ("License"). By downloading, installing, or using this software, you agree to the terms and conditions of this License.

License Grant:
This software is a NuGet that can be freely downloaded. The purpose of the software is to provide functionality for logically ordering the reading of a PDF in plain text provided by third-party software. The user is granted a limited, non-exclusive, non-transferable license to use this software for evaluation purposes during a 30-day trial period. After this period, the user must acquire a commercial license to continue using this software.
Use Restrictions:
The user may not decompile, modify, or resell this software. However, the user may redistribute the software when integrated into their own software under a commercial license. The user must acquire a valid license to use this software for commercial purposes.
Disclaimer:
The third-party software provided for PDF reading may not be able to read some types of PDFs, such as those from scanned images. The license holder shall not be liable for any loss or damage arising from the third-party software's inability to read a specific PDF.
Ownership Rights:
All ownership and intellectual property rights of this software are owned by the license holder. This software is protected by copyright laws and other applicable laws.
Disclaimer of Warranties:
This software is provided "as is," without warranties of any kind, whether express or implied. The license holder shall not be liable for any direct, indirect, incidental, special, exemplary, or consequential damages arising out of the use or inability to use this software.
Governing Law:
This License shall be governed and construed in accordance with the laws of the State of Spain without regard to its conflict of law principles.
Additional Terms:
The terms and conditions of this License may be subject to change without prior notice. It is the user's responsibility to periodically review the terms of this License.
Third-party software license:
PdfPig License
Newtonsoft Json License
OpenXml & ClosedXml License

By downloading, installing, or using this software, you acknowledge that you have read and understood the terms and conditions of this License and agree to comply with them.
